pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 13:44:15 +08:00

Author	SHA1	Message	Date
Zachary DeVito	45b564766d	[memory snapshots] removed chained history (#106079 ) For free blocks of memory in the allocator, we previously kept a linked list of the stack frames of previous allocations that lived there. This was only ever used in one flamegraph visualization and never proved useful at understanding what was going on. When memory history tracing was added, it became redundant, since we can see the history of the free space from recording the previous actions anyway. This patch removes this functionality and simplifies the snapshot format: allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history. Previously the memory history tracked the real size of allocations before rounding. Since history was added, 'requested_size' has been added directly to the block which records the same information, so this patch also removes that redundancy. None of this functionality has been part of a PyTorch release with BC guarentees, so it should be safe to alter this part of the format. This patch also updates our visualization tools to work with the simplified format. Visualization tools keep support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079 Approved by: https://github.com/eellison	2023-07-28 06:45:48 +00:00
Justin Chu	4cc1745b13	[BE] f-stringify torch/ and scripts (#105538 ) This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`. - https://docs.python.org/3/reference/lexical_analysis.html#f-strings - https://pypi.org/project/flynt/ Command used: ``` flynt torch/ -ll 120 flynt scripts/ -ll 120 flynt tools/ -ll 120 ``` and excluded `collect_env.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-07-21 19:35:24 +00:00
Justin Chu	73e1455327	[BE] Enable ruff's UP rules and autoformat test/ (#105434 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434 Approved by: https://github.com/albanD	2023-07-19 20:36:06 +00:00
Nikita Shulga	c3e4a67905	Refactor multigpu tests to `test_cuda_multigpu` (#104059 ) Mostly refactor, that moves all the tests from `test_cuda` that benefit from multiGPU environment into its own file. - Add `TestCudaMallocAsync` class for Async tests ( to separate them from `TestCudaComm`) - Move individual tests from `TestCuda` to `TestCudaMultiGPU` - Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda` - Add newly created `test_cuda_multigpu` to the multigpu periodic test <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at f4d46fa</samp> This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059 Approved by: https://github.com/huydhn	2023-06-27 05:32:05 +00:00
Zachary DeVito	afc788a99c	Re-land _cycleviz.py: visualize reference cycles holding cuda memory (#104051 ) Reference cycles are freed by the cycle collector rather than being cleaned up when the objects in the cycle first become unreachable. If a cycle points to a tensor, the CUDA memory for that tensor will not be freed until garbage collection runs. Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as non-deterministic allocation behavior which is harder to debug. This visualizer installs a garbage collection hook to look for cycles containing CUDA tensors and saves a visualization of the garbage: ``` from torch.cuda._cycleviz import warn_tensor_cycles warn_tensor_cycles() # do some work that results in a cycle getting garbage collected # ... > WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html ``` Reland to make windows skip the test. This reverts commit 7b3b6dd4262337c5289d64dd3e824b0614cf68e3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104051 Approved by: https://github.com/aaronenyeshi, https://github.com/malfet	2023-06-23 13:44:58 +00:00
PyTorch MergeBot	7b3b6dd426	Revert "_cycleviz.py: visualize reference cycles holding cuda memory (#102656 )" This reverts commit dba67f71c9b5abbdca5aa64913c50f9aa5ac6f51. Reverted https://github.com/pytorch/pytorch/pull/102656 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I think the change is failing on Windows CUDA https://github.com/pytorch/pytorch/actions/runs/5341701630/jobs/9683293600 ([comment](https://github.com/pytorch/pytorch/pull/102656#issuecomment-1603035364))	2023-06-22 17:16:47 +00:00
Zachary DeVito	dba67f71c9	_cycleviz.py: visualize reference cycles holding cuda memory (#102656 ) Reference cycles are freed by the cycle collector rather than being cleaned up when the objects in the cycle first become unreachable. If a cycle points to a tensor, the CUDA memory for that tensor will not be freed until garbage collection runs. Accumulatin of CUDA allocations can lead to out of memory errors (OOMs), as well as non-deterministic allocation behavior which is harder to debug. This visualizer installs a garbage collection hook to look for cycles containing CUDA tensors and saves a visualization of the garbage: ``` from torch.cuda._cycleviz import warn_tensor_cycles warn_tensor_cycles() # do some work that results in a cycle getting garbage collected # ... > WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/102656 Approved by: https://github.com/aaronenyeshi	2023-06-22 04:00:28 +00:00
Nikita Shulga	cd05c3b98c	[BE] Use `TEST_MULTIGPU` from `common_cuda.py` (#103982 ) Comment about `TEST_CUDNN` called over and over has long been alleviated by wrapping the check with `LazyVal`, that caches the results. Also, delete unused `TEST_MAGMA`. Prep change for https://github.com/pytorch/pytorch/issues/100006 <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at e3a5b39</samp> > _`common_cuda.py`_ > _Refactored for dynamo tests_ > _Winter code cleanup_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/103982 Approved by: https://github.com/atalman, https://github.com/janeyx99	2023-06-22 00:07:44 +00:00
Zachary DeVito	19b3e07fe0	[memory_viz] Unified viewer (#103565 ) This replaces the invidual visualization routines in _memory_viz.py with a single javascript application. The javascript application can load pickled snapshot dumps directly using drag/drop, requesting them via fetch, or by embedding them in a webpage. The _memory_viz.py commands use the embedding approach. We can also host MemoryViz.js on a webpage to use the drag/drop approach, e.g. https://zdevito.github.io/assets/viz/ (eventually this should be hosted with the pytorch docs). All views/multiple cuda devices are supported on one page. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103565 Approved by: https://github.com/eellison, https://github.com/albanD	2023-06-16 03:49:48 +00:00
Xiao Wang	39f3514fa3	Add an env PYTORCH_TEST_SKIP_CUDAGRAPH to skip all cuda graph-related unit tests (#103032 ) Skip all cuda graph-related unit tests by setting env var `PYTORCH_TEST_SKIP_CUDAGRAPH=1` This PR refactors the `TEST_CUDA` python variable in test_cuda.py into common_utils.py. This PR also creates a new python variable `TEST_CUDA_GRAPH` in common_utils.py, which has an env var switch to turn off all cuda graph-related tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103032 Approved by: https://github.com/malfet	2023-06-06 07:51:57 +00:00
Nikita Shulga	ca470fc59f	[BE] Make `test_no_triton_on_import` simple (#102674 ) Do not try to parse raised exception for no good reason Add short description Reduce script to a single line <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at ea4164e</samp> > _`test_no_triton_on_import`_ > _Cleans up the code, adds docs_ > _No hidden errors_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/102674 Approved by: https://github.com/cpuhrsch, https://github.com/albanD	2023-06-01 20:31:18 +00:00
Nikita Vedeneev	d80d3b18d0	nn.Linear with BSR inputs: spare the user from explicit Triton kernel registrations (#98403 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 08f7a6a</samp> This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403 Approved by: https://github.com/malfet, https://github.com/cpuhrsch	2023-05-31 13:09:45 +00:00
Masaki Kozuki	c8579b7374	Run `test_cpp_memory_snapshot_pickle` only when linux and x86_64 (#101366 ) On Arm, I got ``` Traceback (most recent call last): File "/opt/pytorch/pytorch/test/test_cuda.py", line 5260, in test_cpp_memory_snapshot_pickle mem = run() File "/opt/pytorch/pytorch/test/test_cuda.py", line 5257, in run t = the_script_fn() File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 496, in prof_func_call return prof_callable(func_call, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 493, in prof_callable return callable(args, **kwargs) RuntimeError: The following operation failed in the TorchScript interpreter. Traceback of TorchScript (most recent call last): File "/opt/pytorch/pytorch/test/test_cuda.py", line 5254, in the_script_fn @torch.jit.script def the_script_fn(): return torch.rand(311, 411, device='cuda') ~~~~~~~~~~ <--- HERE RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms ``` `dfe484a3b3/torch/csrc/profiler/unwind/unwind.cpp (L4-L24)` seems related Pull Request resolved: https://github.com/pytorch/pytorch/pull/101366 Approved by: https://github.com/zdevito	2023-05-17 19:44:21 +00:00
Elias Ellison	3edff6b6ec	Improve detection of workspace/non-output allocations in cudagraphs (#99985 ) When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cublas because we might need to reclaim that memory for a previous cudagraph recording, and it is memory that is not accounted for in output weakrefs so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting. This was extremely slow so I had to turn it off. This PR first does the quick checking to see if we are in an error state, then if we are does the slow logic of creating snapshot. Also turns on history recording so we get a stacktrace of where the bad allocation came from. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985 Approved by: https://github.com/zdevito	2023-05-01 15:58:45 +00:00
Jane Xu	808267767c	Prevent grad scale from overflowing (#98876 ) Fixes #98828 by capping the growth in the kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/98876 Approved by: https://github.com/ngimel	2023-04-25 20:59:44 +00:00
Aaron Gokaslan	e2a3817dfd	[BE] Enable C419 rule for any all shortcircuiting (#99890 ) Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890 Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet	2023-04-25 15:02:13 +00:00
Masaki Kozuki	b87c7ab6d6	Remove redundant `found_inf` recompute from `_step_supports_amp_unscaling` path (#98620 ) following https://github.com/pytorch/pytorch/pull/97415#issuecomment-1499787115. Rel: https://github.com/pytorch/pytorch/pull/98613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98620 Approved by: https://github.com/janeyx99	2023-04-20 19:24:09 +00:00
Animesh Jain	971df458db	Reland of "Python binding to set/get CUDA rng state offset" (#99565 ) Why? * To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377 Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way. ~~~~ import torch torch.cuda.manual_seed(123) print(torch.cuda.get_rng_state()) torch.cuda.set_rng_state_offset(40) print(torch.cuda.get_rng_state()) tensor([123, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8) tensor([123, 0, 0, 0, 0, 0, 0, 0, 40, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8) ~~~~ Reland of https://github.com/pytorch/pytorch/pull/98965 (cherry picked from commit 8214fe07e8a200e0fe9ca4264bb6fca985c4911e) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99565 Approved by: https://github.com/anijain2305	2023-04-20 15:42:25 +00:00
PyTorch MergeBot	bb2cd4a107	Revert "Python binding to set/get CUDA rng state offset (#98965 )" This reverts commit 8214fe07e8a200e0fe9ca4264bb6fca985c4911e. Reverted https://github.com/pytorch/pytorch/pull/98965 on behalf of https://github.com/DanilBaibak due to Break internal build	2023-04-19 11:23:32 +00:00
Animesh Jain	8214fe07e8	Python binding to set/get CUDA rng state offset (#98965 ) Why? * To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377 Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way. ~~~~ import torch torch.cuda.manual_seed(123) print(torch.cuda.get_rng_state()) torch.cuda.set_rng_state_offset(40) print(torch.cuda.get_rng_state()) tensor([123, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8) tensor([123, 0, 0, 0, 0, 0, 0, 0, 40, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8) ~~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/98965 Approved by: https://github.com/kulinseth, https://github.com/ezyang	2023-04-18 07:52:21 +00:00
Zachary DeVito	7ff1f3f3f6	Revert "Revert "Expandable blocks in allocator (#96995 )"" (#99275 ) This reverts commit 851e89c8e817f28270e0fc21d74ced9446bea747. Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275 Approved by: https://github.com/eellison	2023-04-17 23:46:08 +00:00
PyTorch MergeBot	851e89c8e8	Revert "Expandable blocks in allocator (#96995 )" This reverts commit 6a50b83b739c2d37d0f518f98b8e624eca0ea153. Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests	2023-04-16 19:23:37 +00:00
Zachary DeVito	6a50b83b73	Expandable blocks in allocator (#96995 ) Common advice we give for handling memory fragmentation issues is to allocate a big block upfront to reserve memory which will get split up later. For programs with changing tensor sizes this can be especially helpful to avoid OOMs that happen the first time we see a new largest input and would otherwise have to allocate new segments. However the issue with allocating a block upfront is that is nearly impossible to correctly estimate the size of that block. If too small, space in the block will run out and the allocator will allocate separate blocks anyway. Too large, and other non-PyTorch libraries might stop working because they cannot allocate any memory. This patch provides the same benefits as using a pre-allocating block but without having to choose its size upfront. Using the cuMemMap-style APIs, it adds the ability to expand the last block in a segment when more memory is needed. Compared to universally using cudaMallocAsync to avoid fragmentation, this patch can fix this common fragmentation issue while preserving most of the existing allocator behavior. This behavior can be enabled and disabled dynamically. This should allow users to, for instance, allocate long-lived parameters and state in individual buffers, and put temporary state into the large expandable blocks, further reducing fragmentation. See inline comments for information about the implementation and its limitations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995 Approved by: https://github.com/eellison	2023-04-14 09:49:11 +00:00
Peeyush Agarwal	ebd4c165ff	Back out "`GradScaler` recomputes `optimizer_state["found_inf_per_device"]` before `optimizer.step` (#97415 )" (#98613 ) Summary: This change causes multi-GPU job from XI team to hang after 8K steps. Differential Revision: D44797248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98613 Approved by: https://github.com/ngimel	2023-04-07 23:31:31 +00:00
Zachary DeVito	b1a83c4da4	[memory history] cleanup recording API (#97406 ) This makes the options for recording memory history easier to understand and makes the default to record the most information. <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 4706acf</samp> This pull request enhances the memory profiling and debugging capabilities of PyTorch on CUDA devices. It introduces a new API for memory history recording in `torch/cuda/memory.py` and `test/test_cuda.py`, and adds new functions for memory snapshot management and visualization in `torch/cuda/memory.py`. Also adds a quick _dump_snapshot function to make it easier to look at the common visualizations. <!-- copilot:walkthrough --> ### <samp>🤖 Generated by Copilot at 4706acf</samp> * Modify the `_record_memory_history` function to use a new API that accepts a string argument for the `enabled` parameter and more parameters to control the stack trace collection and memory event history ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L620-R696)) * Add a new function `_dump_snapshot` that allows users to dump a memory snapshot to a directory with HTML plots of the memory segments and events ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377R703-R713)) * Update the test cases in `test/test_cuda.py` to use the new API for memory history recording and check the expected output of the memory plots ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4946-R4946), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4984-R4984), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5000-R5000), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5015-R5015), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5035-R5038), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R5045-R5046), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5060-R5059), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5068-R5065), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5088-R5085)) * Add missing imports and types to the `torch/cuda/memory.py` module ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L5-R15)) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97406 Approved by: https://github.com/ezyang	2023-03-28 16:31:10 +00:00
soulitzer	51c3fd39a5	Modify all calls to checkpoint pass use_reentrant explicitly (#97376 ) Fixes #ISSUE_NUMBER This is the first step toward making use_reentrant=False the default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97376 Approved by: https://github.com/albanD	2023-03-27 13:37:42 +00:00
Masaki Kozuki	b5edf18334	`GradScaler` recomputes `optimizer_state["found_inf_per_device"]` before `optimizer.step` (#97415 ) I found a discrepancy between non-fused and fused optimizers, which is to use `optimizer_state["found_inf"]` or to recompute `found_inf`. - non fused: `e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L289)` - fused: `e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353)` - where `_check_inf_per_device` is `e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L564-L573)` The other way to align the behavior is to use the existing `found_inf` in `e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353)`. I'd say this PR is for the sake of "safety" and the alternative is to keep the existing behavior. I honestly have no idea if it's expected to double-check the sanity of gradients in `GradScaler.step`. --- what I've observed in huggingface/transformers T5-base example so far seems like that non-fused optimizers lead to invalid parameters while the fused not. The cause seems to be that `gradients` become inf/nan before `GradScaler.step(optimizer)` after `GradScaler._unscale_grads_` (more precicely, the call of `torch._amp_foreach_non_finite_check_and_unscale_`) in the script of the issue linked below, i.e. the gradient clipping and/or unscaling lead to inf/nan as these happen after the grad check. See `788300cc2a/aten/src/ATen/native/cuda/AmpKernels.cu (L165-L174)`. Fixes #96755 🙏 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97415 Approved by: https://github.com/ngimel, https://github.com/janeyx99	2023-03-24 17:36:47 +00:00
Tailing Yuan	63e1f12b49	Speedup bincount and histc on CUDA (#97090 ) This is to speed up torch.bincount and torch.histc on CUDA. 1. Speed up int64_t gpuAtomicAdd, 2. and optimize the histogram kernel. # Fixes #96626 After speedup, time cost in #96626 would be ``` ... (run 2 times and ignore the first run) case 1 CPU 0.0003631114959716797 seconds case 1 CUDA 0.0005860328674316406 seconds case 2 CPU 0.0013742446899414062 seconds case 2 CUDA 0.0008623600006103516 seconds ``` Note that in "case 1 CUDA", the max op takes the most time, i.e., `5ee5a164ff/aten/src/ATen/native/cuda/SummaryOps.cu (L334-L335)`, which is not to be optimized in this PR. # Benchmark Time is measured on i7-10700 + RTX 3080, Ubuntu 22.04 (in WSL). The baseline is PyTorch 2.0.0+cu117. My dev version of PyTorch is compiled with CUDA 11.8. Each case is measured 15 times to take the median. ## torch.bincount #elem \| nbins \| distribution \| CPU \| PyTorch 2.0.0 \| this PR \| speedup -- \| -- \| -- \| -- \| -- \| -- \| -- 220 \| 80 \| random.uniform \| 0.000834 \| 0.005783 \| 0.000266 \| 21.8x 220 \| 80 \| narrow in 1 bin \| 0.001576 \| 0.003967 \| 0.000563 \| 7.0x 220 \| 500 \| random.uniform \| 0.000852 \| 0.003641 \| 0.000334 \| 10.9x 220 \| 500 \| narrow in 1% bins \| 0.000894 \| 0.001878 \| 0.000349 \| 5.4x 220 \| 2048 \| random.uniform \| 0.000891 \| 0.000820 \| 0.000298 \| 2.8x 220 \| 2048 \| narrow in 1% bins \| 0.000958 \| 1.043251 \| 0.000335 \| 3,116.6x 226 \| 80 \| random.uniform \| 0.067715 \| 0.322409 \| 0.003032 \| 106.3x 226 \| 80 \| narrow in 1 bin \| 0.110940 \| 0.194644 \| 0.017651 \| 11.0x 226 \| 500 \| random.uniform \| 0.066666 \| 0.192302 \| 0.002535 \| 75.8x 226 \| 500 \| narrow in 1% bins \| 0.066130 \| 0.092237 \| 0.005462 \| 16.9x 226 \| 2048 \| random.uniform \| 0.066371 \| 0.035308 \| 0.002476 \| 14.3x 226 \| 2048 \| narrow in 1% bins \| 0.068453 \| 72.122858 \| 0.003185 \| 22,644.3x ## torch.histc (float32) #elem \| nbins \| distribution \| CPU \| PyTorch 2.0.0 \| this PR \| speedup -- \| -- \| -- \| -- \| -- \| -- \| -- 220 \| 80 \| random.uniform \| 0.001261 \| 0.000145 \| 9.47E-05 \| 1.5x 220 \| 80 \| narrow in 1 bin \| 0.001074 \| 0.000356 \| 0.000311 \| 1.1x 220 \| 500 \| random.uniform \| 0.001162 \| 0.000227 \| 9.18E-05 \| 2.5x 220 \| 500 \| narrow in 1% bins \| 0.001082 \| 0.000201 \| 0.000152 \| 1.3x 220 \| 2048 \| random.uniform \| 0.001100 \| 0.000203 \| 0.000118 \| 1.7x 220 \| 2048 \| narrow in 1% bins \| 0.001089 \| 0.000396 \| 0.000107 \| 3.7x 226 \| 80 \| random.uniform \| 0.064219 \| 0.001170 \| 0.000786 \| 1.5x 226 \| 80 \| narrow in 1 bin \| 0.056471 \| 0.013283 \| 0.011939 \| 1.1x 226 \| 500 \| random.uniform \| 0.078183 \| 0.003411 \| 0.000562 \| 6.1x 226 \| 500 \| narrow in 1% bins \| 0.056711 \| 0.002763 \| 0.002738 \| 1.0x 226 \| 2048 \| random.uniform \| 0.059296 \| 0.003503 \| 0.000533 \| 6.6x 226 \| 2048 \| narrow in 1% bins \| 0.061754 \| 0.015703 \| 0.000962 \| 16.3x ## torch.histc (int64) #elem \| nbins \| distribution \| CPU \| PyTorch 2.0.0 \| this PR \| speedup -- \| -- \| -- \| -- \| -- \| -- \| -- 220 \| 80 \| random.uniform \| N/A \| 0.005614 \| 9.47E-05 \| 59.3x 220 \| 80 \| narrow in 1 bin \| N/A \| 0.003799 \| 0.000395 \| 9.6x 220 \| 500 \| random.uniform \| N/A \| 0.003665 \| 9.58E-05 \| 38.2x 220 \| 500 \| narrow in 1% bins \| N/A \| 0.001760 \| 0.000178 \| 9.9x 220 \| 2048 \| random.uniform \| N/A \| 0.000693 \| 0.000111 \| 6.2x 220 \| 2048 \| narrow in 1% bins \| N/A \| 1.082904 \| 0.000123 \| 8,802.4x 226 \| 80 \| random.uniform \| N/A \| 0.320400 \| 0.001145 \| 279.9x 226 \| 80 \| narrow in 1 bin \| N/A \| 0.193668 \| 0.015229 \| 12.7x 226 \| 500 \| random.uniform \| N/A \| 0.182897 \| 0.000823 \| 222.2x 226 \| 500 \| narrow in 1% bins \| N/A \| 0.089363 \| 0.00376 \| 23.8x 226 \| 2048 \| random.uniform \| N/A \| 0.033190 \| 0.000832 \| 39.9x 226 \| 2048 \| narrow in 1% bins \| N/A \| 71.721012 \| 0.001525 \| 47,017.8x ## Banchmark code Here is the benchmark code: ```python3 import time import torch cases = [ ("bincount bins=80 wide ", torch.randint(80, [220]), lambda x: torch.bincount(x, minlength=80)), ("bincount bins=80 narrow", torch.randint(1, [220]), lambda x: torch.bincount(x, minlength=80)), ("bincount bins=500 wide ", torch.randint(500, [220]), lambda x: torch.bincount(x, minlength=500)), ("bincount bins=500 narrow", torch.randint(5, [220]), lambda x: torch.bincount(x, minlength=500)), ("bincount bins=2048 wide ", torch.randint(2048, [220]), lambda x: torch.bincount(x, minlength=2048)), ("bincount bins=2048 narrow", torch.randint(20, [220]), lambda x: torch.bincount(x, minlength=2048)), ("histc_float bins=80 wide ", torch.rand(220), lambda x: torch.histc(x, bins=80, min=0., max=1.)), ("histc_float bins=80 narrow", torch.rand(220).01, lambda x: torch.histc(x, bins=80, min=0., max=1.)), ("histc_float bins=500 wide ", torch.rand(220), lambda x: torch.histc(x, bins=500, min=0., max=1.)), ("histc_float bins=500 narrow", torch.rand(220).01, lambda x: torch.histc(x, bins=500, min=0., max=1.)), ("histc_float bins=2048 wide ", torch.rand(220), lambda x: torch.histc(x, bins=2048, min=0., max=1.)), ("histc_float bins=2048 narrow", torch.rand(220).01, lambda x: torch.histc(x, bins=2048, min=0., max=1.)), ("histc_int bins=80 wide ", torch.randint(80, [220]), lambda x: torch.histc(x, bins=80, min=0., max=80.)), ("histc_int bins=80 narrow", torch.randint(1, [220]), lambda x: torch.histc(x, bins=80, min=0., max=80.)), ("histc_int bins=500 wide ", torch.randint(500, [220]), lambda x: torch.histc(x, bins=500, min=0., max=500.)), ("histc_int bins=500 narrow", torch.randint(5, [220]), lambda x: torch.histc(x, bins=500, min=0., max=500.)), ("histc_int bins=2048 wide ", torch.randint(2048, [220]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)), ("histc_int bins=2048 narrow", torch.randint(20, [2*20]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)), ] def test(case, device): name, x, func = case x = x.to(device) time_samples = [] for _ in range(15): torch.cuda.synchronize() t1 = time.time() func(x) torch.cuda.synchronize() t2 = time.time() time_samples.append(t2 - t1) median = sorted(time_samples)[len(time_samples) // 2] print(device, name, median) for case in cases: test(case, device="cuda") # for case in cases: # test(case, device="cpu") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/97090 Approved by: https://github.com/ngimel	2023-03-24 00:25:34 +00:00
Masaki Kozuki	22ea21da3d	Change 1D Tensor of 1 element to 0D Tensor (#96994 ) add 0d tensor to graph adam/adamw test Affected: - `torch.cuda.amp.GradScaler`'s `found_inf`, `_scale`, and `_growth_tracker` - `step` of Adam & AdamW of `capturable` Fixes #96776 🤞 Pull Request resolved: https://github.com/pytorch/pytorch/pull/96994 Approved by: https://github.com/janeyx99	2023-03-21 18:24:19 +00:00
Elias Ellison	571f96bf59	cudagraph trees (#89146 ) CUDA Graph Trees Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit Not currently implemented : - Right now, we are using weak tensor refs from outputs to check if a tensor has dies. This doesn't work because a) aliasing, and b) aot_autograd detaches tensors (see note [Detaching saved tensors in AOTAutograd]). Would need either https://github.com/pytorch/pytorch/issues/91395 to land to use storage weak refs or manually add a deleter fn that does what I want. This is doable but theres some interactions with the caching allocator checkpointing so saving for a stacked pr. - Reclaiming memory from the inputs during model recording. This isn't terribly difficult but deferring to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to cpu. Saving for a stacked pr. - Warning on overwriting previous generation outputs. and handling nested torch.compile() calls in generation tracking Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146 Approved by: https://github.com/ezyang	2023-03-17 02:47:03 +00:00
Elias Ellison	ea7415087a	Expose Stream Recording Apis in python (#96384 ) Differential Revision: [D43999891](https://our.internmc.facebook.com/intern/diff/D43999891) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96384 Approved by: https://github.com/zdevito	2023-03-16 23:45:43 +00:00
Zachary DeVito	e74f70d212	Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 )"" (#96878 ) This reverts commit e1ea584b1caf9c50de25ce69396dfeb523a452c0. Adds __has_include check to fix fbcode build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878 Approved by: https://github.com/ezyang	2023-03-16 04:12:54 +00:00
PyTorch MergeBot	e1ea584b1c	Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 )" This reverts commit 4e1060c609c094fd5f58041ebed803f74410ee36. Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds	2023-03-15 13:28:41 +00:00
Zachary DeVito	4e1060c609	[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541 ) This refactors the stack trace facility specific to memory profiling in python+cuda to make a generic facility to generate combined stack traces. The generic facility (combined_traceback.h) does not require python to be around to work, but will return python stacks if it is present. This facility is then used to add support for stack trace gathering in memory profiling that happens directly from C++. It is also used to expose a python API for gathering and symbolizing combineds stacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541 Approved by: https://github.com/ezyang	2023-03-14 18:26:05 +00:00
Elias Ellison	da265652d6	Return Live Data Pointers from Checkpoint, swap onto tensors (#95020 ) When we checkpoint the state of the private pool allocator, we will need to make sure that its current live allocated blocks will get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these new allocated blocks that the callee can swap onto live Tensors. The exact api for setting the checkpoint can be manipulated after this as the cudagraph implementation is built out, but this at least shows its sufficiently general. This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration. Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888) Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020 Approved by: https://github.com/zdevito	2023-03-14 01:22:19 +00:00
Elias Ellison	1cc32aedb0	Handle additional live allocations not in checkpointed state (#94943 ) We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out the deleter function of their corresponding blocks so that a deallocation is not triggered when they die. Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943 Approved by: https://github.com/zdevito	2023-03-14 01:00:47 +00:00
Elias Ellison	d798de2b05	Checkpoint CUDA Allocator Private Pool State (#94653 ) Copying note from cuda caching allocator: ``` * Note [Checkpointing PrivatePoolState] * * Refer above to Note [Interaction with CUDA graph capture]. Allocations made * during graph capture are made from a separate private pool. During graph * capture allocations behave as usual. During graph replay the allocator * state does not change even as new tensors are created. The private pool * will not free its blocks to the main caching allocator until cuda graph use * is finished to prevent an allocation from eager clobbering the memory from * a live but unaccounted for tensor that was created during replay. * * `make_graphed_callables`, a series of separate callables chained in * successive cuda graphs, can share a memory pool because after a cuda graph * recording the allocations in the shared private pool exactly reflect the * tensors that are allocated. * * We would like to extend callable chaining to support a graphed callable * tree. In this scenario, we have a tree of callable chains which will be * captured with cuda graphs. In the diagram below, we have a tree with four * callables, A, B, C, and D. Suppose we have captured, and subsequently * replayed, A, B, and C. Then on a new invocation, we replay A and B, but * would now like to record D. At this point the private pool will not reflect * any of the live tensors created during graph replay. Allocations made * during a new recording with the pool could overwrite those live tensors. * * In order to record a new graph capture after replaying prior callables in * the tree, we need the allocator to reflect the state of the live tensors. * We checkpoint the state of the private after each recording, and then * reapply it when we are starting a new recording chain. Additionally, we * must free the allocations for any tensors that died between the end of our * previous graph replaying and our new recording (TODO). All of the allocated * segments that existed in the checkpointed state must still exist in the * pool. There may also exist new segments, which we will free (TODO : link * note [live tensors between iterations] when it exists). * * * ---------------> A ---------------> B ---------------> C * \| * \| * \| * \| * ---------------> D ``` A few TODOs: - need to add logic for freeing tensors that have died between a last replay and current new recording - Add logic for free that might be called on a pointer multiple times (because we are manually freeing live tensors) The two scenarios above have not been exercised in the tests yet. Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653 Approved by: https://github.com/zdevito	2023-03-14 00:47:30 +00:00
Zachary DeVito	4b372e3958	[memory profiling] C++ tracing support (#95357 ) Adds the ability to quickly generate stack traces for C++, and combine Python, TorchScript, and C++ frames into a single trace. This makes it possible for the memory tracer to record allocations inside C++ code (e.g. convolution temporaries, backward operators). The unwinder code is ~10x faster than execinfo.h's backward because it cache fast unwinder routines for instruction pointers that have already been seen. It is also only 1.2--2x slower than copying the entire stack (the approach perf takes), while using 2 orders of magnitude less space per stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357 Approved by: https://github.com/bertmaher	2023-03-12 07:24:14 +00:00
Zachary DeVito	266089a3fe	[memory snapshots] record scripted stack traces (#95356 ) Adds support for seeing both python and script stack traces in memory debugging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95356 Approved by: https://github.com/aaronenyeshi	2023-03-12 07:24:14 +00:00
Zachary DeVito	d6d8d3484e	_memory_viz.py: Visualize how blocks fit into segments. (#91336 ) Add a segment_plot command that visualizes how blocks are allocated into segments. This is similar to the 'stats' command but produces an interactive html viewer rather than text dump, allowing exploration of stack traces. It also adds the ability to see the layout at any point in the trace by starting from the snapshot and then apply the events backwards to reconstruct what memory would have looked like. Example: ![Screen Shot 2022-12-22 at 3 32 49 PM](https://user-images.githubusercontent.com/370202/209242650-b952372e-37ac-400a-a01c-13be2b5426fa.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91336 Approved by: https://github.com/bhosmer	2023-03-07 21:07:18 +00:00
Zachary DeVito	71f369092d	Revert "Revert "memory viz: Add colors for categories and a legend (#90587 )"" (#96133 ) This reverts commit b38b39c441f12be90fd5d7eafe74246d050665c8. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96133 Approved by: https://github.com/bhosmer	2023-03-07 21:07:18 +00:00
Catherine Lee	eea0733045	Reduce pytest blocklist (#96016 ) `TestCase = object` or variations of it get switched to `TestCase = NoTest`. unittest collects test based on subclassing unittest.TestCase, so setting TestCase = object removes it from unittest test collection. pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn	2023-03-07 18:30:27 +00:00
Eli Uriegas	b38b39c441	Revert "memory viz: Add colors for categories and a legend (#90587 )" This reverts commit ee4384250589f870f24e4d24894a03824ed1c49e.	2023-03-06 11:38:58 -08:00
Zachary DeVito	ee43842505	memory viz: Add colors for categories and a legend (#90587 ) Adds a category legend to memory trace plots that colors allocations by their role (activation, parameter, gradient, etc.) as captured by kineto. Differential Revision: [D43757381](https://our.internmc.facebook.com/intern/diff/D43757381) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90587 Approved by: https://github.com/aaronenyeshi	2023-03-03 20:42:22 +00:00
Mark Saroufim	9f707f164e	Add more GPU metric instrumentation (#91717 ) Fixes https://github.com/pytorch/serve/issues/1937 A fairly common query I see folks running while using pytorch is `nvidia-smi --format=csv,noheader,nounits --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw,clocks.current.sm,clocks.current.memory -l 10` Existing metrics we have * For kernel utilization`torch.cuda.utilization()` * For memory utilization we have them under `torch.cuda.memory` the memory allocated with `torch.cuda.memory.memory_allocated()` * For total available memory we have `torch.cuda.get_device_properties(0).total_memory` Which means the only metrics we're missing are * Temperature: now in `torch.cuda.temperature()` * Power draw: now in `torch.cuda.power()` * Clock speed: now in `torch.cuda.clock_speed()` With some important details on each * Clock speed settings: I picked the SM clock domain which is documented here https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g805c0647be9996589fc5e3f6ff680c64 * Temperature: I use `pynvml.nvmlDeviceGetTemperature(handle, 0)` where 0 refers to the GPU die temperature Pull Request resolved: https://github.com/pytorch/pytorch/pull/91717 Approved by: https://github.com/ngimel	2023-02-24 00:38:03 +00:00
Pearu Peterson	cece63f197	Add warn-once deprecation warning to legacy sparse constructors (#94850 ) Addresses https://github.com/pytorch/pytorch/issues/68323#issuecomment-1425174341 Pull Request resolved: https://github.com/pytorch/pytorch/pull/94850 Approved by: https://github.com/amjames, https://github.com/cpuhrsch	2023-02-23 15:05:12 +00:00
puririshi98	8aa34602f7	Jetson Update for CI Redo (#94549 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-02-21 17:13:38 +00:00
dllehr-amd	98012e4a59	[ROCm] hipGraph support for pytorch mainline (#88202 ) With the release of ROCm 5.3 hip now supports a hipGraph implementation. All necessary backend work and hipification is done to support the same functionality as cudaGraph. Unit tests are modified to support a new TEST_GRAPH feature which allows us to create a single check for graph support instead of attempted to gather the CUDA level in annotations for every graph test Pull Request resolved: https://github.com/pytorch/pytorch/pull/88202 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet	2023-02-14 22:18:56 +00:00
Xuehai Pan	b005ec62b9	[BE] Remove dependency on `six` and `future` (#94709 ) Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709 Approved by: https://github.com/malfet, https://github.com/Skylion007	2023-02-14 09:14:14 +00:00
Xuehai Pan	046e88a291	[BE] [3/3] Rewrite `super()` calls in test (#94592 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-12 22:20:53 +00:00

... 5 6 7 8 9 ...

914 Commits