The test
f508d3564c/test/test_cuda_multigpu.py (L1282-L1290)
The PyTorch CUDA caching allocator may cache the allocation, which can cause "new_alloc" to be the same as "old_alloc".
```python
self.assertGreater(memory_allocated(0), current_alloc[0])
```
I suggest that we use `assertGreaterEqual` instead of `assertGreater` in the test.
Running this test on its own does not make it fail, but running it together with other tests from the same test module does.
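For illustration, a minimal sketch of why the strict assertion fails when the two readings are equal. The byte counts are hypothetical stand-ins for the `memory_allocated(0)` readings, not values taken from the actual test, and no GPU is needed to run it.

```python
import unittest

# Hypothetical byte counts standing in for two memory_allocated(0) readings.
# They are equal, which is the failure mode described above.
old_alloc = 1024
new_alloc = 1024

case = unittest.TestCase()

# The strict comparison rejects equal readings.
try:
    case.assertGreater(new_alloc, old_alloc)
    strict_passed = True
except AssertionError:
    strict_passed = False

# The non-strict comparison accepts them; it would raise otherwise.
case.assertGreaterEqual(new_alloc, old_alloc)
relaxed_passed = True

print(strict_passed, relaxed_passed)
```

With equal readings, only the `assertGreaterEqual` check passes, which is the behavior the suggested change relies on.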
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105501
Approved by: https://github.com/zou3519
Mostly a refactor that moves all the tests from `test_cuda` that benefit from a multi-GPU environment into their own file.
- Add `TestCudaMallocAsync` class for Async tests (to separate them from `TestCudaComm`)
- Move individual tests from `TestCuda` to `TestCudaMultiGPU`
- Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda`
- Add newly created `test_cuda_multigpu` to the multigpu periodic test
### <samp>🤖 Generated by Copilot at f4d46fa</samp>
This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059
Approved by: https://github.com/huydhn