Helion relies on torch/fx/experimental's fake_tensor tracing but does its own dtype checking, which conflicts with the dtype checking that some meta kernels already perform. This PR adds a config so that those dtype checks in meta kernels can be skipped, relying on the calling system to do the dtype checking instead.
Currently it only applies to `baddbmm`, but I expect similar changes will need to be made to other meta kernels in the future.
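A minimal sketch of the idea, assuming a hypothetical toggle named `skip_meta_dtype_checks` (the actual config name and meta-kernel wiring live in the PR):

```python
import torch

# Assumed toggle for illustration; the real config in the PR may differ.
skip_meta_dtype_checks = False

def baddbmm_meta(self, batch1, batch2, *, beta=1, alpha=1):
    if not skip_meta_dtype_checks:
        # Default path: the meta kernel enforces dtype agreement itself.
        assert self.dtype == batch1.dtype == batch2.dtype, "dtype mismatch"
    # Either way, shape propagation still happens; callers like Helion
    # perform their own dtype checking.
    b, m, _ = batch1.shape
    _, _, n = batch2.shape
    return self.new_empty((b, m, n))
```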
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153513
Approved by: https://github.com/jansel
As titled, this PR improves the device selection logic when the user has not set a device before calling the DeviceMesh constructor; as a device manager, DeviceMesh should try to set the device for users in a sensible way.
The set_device behavior before this PR:
* If the user calls init_process_group to init a world process group, we assume the user has already called set_device, and we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device.

This is OK, but sometimes the set_device heuristic doesn't work well (e.g., if the user sets CUDA_VISIBLE_DEVICES).
So this PR improves the device selection logic to the following (sketched after this list):
* If the default CUDA context is already initialized by the time we init DeviceMesh, we assume the user must have run some CUDA operation earlier and therefore has already selected the device themselves.
* Otherwise, we check whether the environment has "LOCAL_RANK" and "WORLD_SIZE" set by the launcher (i.e., torchrun); if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the CUDA_VISIBLE_DEVICES issue.)
* Otherwise, we warn the user about the situation and fall back to the old heuristic.
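A minimal sketch of that decision order, assuming `fallback_heuristic` stands in for the pre-existing heuristic (the real DeviceMesh code differs in structure and detail):

```python
import os
import warnings
import torch

def _select_device(fallback_heuristic):
    if torch.cuda.is_initialized():
        # A CUDA context already exists, so the user has run a CUDA op and,
        # we assume, already picked their device: do nothing.
        return
    if "LOCAL_RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Launched via torchrun (or similar): bind this process to its
        # local rank, which also sidesteps CUDA_VISIBLE_DEVICES issues.
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        return
    # No signal either way: warn and fall back to the old heuristic.
    warnings.warn(
        "No device was set and no launcher env vars were found; "
        "falling back to the old device-selection heuristic."
    )
    fallback_heuristic()
```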
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
Previously, we launched the a2av kernel with at most 8 blocks for intra-node cases, which turned out to saturate only 57 GB/s of bandwidth.
This PR adds more blocks for the intra-node case, up to 8 per peer, pumping up data parallelism. The kernel now achieves 350 GB/s SOL on Hopper.
It also uses a simple input-size-based tuning to avoid jumping to 8 CTAs directly (i.e., stepping through 1, 2, 4, then 8).
For inter-node, we keep the cap at 8 blocks, since 57 GB/s (456 Gb/s) already exceeds regular NIC bandwidths (400 Gb/s).
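As a rough illustration of the tuning idea (the thresholds below are made-up placeholders, not the values used in the PR):

```python
def blocks_per_peer(bytes_per_peer, intra_node=True,
                    thresholds=(1 << 20, 4 << 20, 16 << 20)):
    # Inter-node traffic is NIC-bound, so more blocks would not help;
    # the total block count stays capped at 8 elsewhere.
    if not intra_node:
        return 1
    # Intra-node: double the block count (1 -> 2 -> 4 -> 8) as the
    # per-peer input size crosses each (assumed) threshold.
    blocks = 1
    for t in thresholds:
        if bytes_per_peer >= t:
            blocks *= 2
    return blocks
```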

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153509
Approved by: https://github.com/ngimel
ghstack dependencies: #153483
This PR adds a tensor LR variant for the CPU Adagrad(fused=True).
I copied the behavior from the tensor LR variant of CPU Adam(fused=True), where `lr.item()` is cast to a double and passed to the underlying function.
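A usage sketch of what this enables (mirroring the existing fused Adam behavior):

```python
import torch

model = torch.nn.Linear(4, 4)
# With this change, a 0-dim tensor LR works with the fused CPU Adagrad.
opt = torch.optim.Adagrad(model.parameters(), lr=torch.tensor(0.01), fused=True)

loss = model(torch.randn(2, 4)).sum()
loss.backward()
opt.step()  # internally, lr.item() is cast to a double and passed along
```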
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153078
Approved by: https://github.com/janeyx99
Updates the heuristic for bmm/baddbmm and consolidates all heuristic logic in a single location.
- The goal of the consolidation is to improve the maintainability and readability of the heuristic logic. Instead of having parts scattered across two files, this patch centralizes everything inside `Matmul.cpp`, where heuristic-based selection for mkldnn already exists.
- The logic of the check itself doesn't change (existing code is reused where possible), but a separate heuristic threshold for bmm/baddbmm is introduced based on newer benchmarking data. Use the script below to see the performance improvement for bmm from the new heuristic:
```
import torch
import time

# Set below to True to use cases selected by only one of the heuristics.
USE_ONLY_DIVERGENT_TEST_CASES = True

BATCH_SIZES = [1, 8, 32, 64, 128, 256]
M_DIMS = [4, 8, 16, 32, 64, 256, 512]
N_DIMS = [4, 8, 16, 32, 64, 256, 512]
K_DIMS = [4, 8, 16, 32, 64, 256, 512]
ITERS = 50

def old_heuristic(m, n, k):
    is_above_min_dims = m > 8 and n > 8 and k > 8
    is_above_min_size = m * n * k > 8_192
    return is_above_min_dims and is_above_min_size

def new_heuristic(b, m, n, k):
    return b * b * m * n * k >= 4_194_304

def generate_test_cases():
    test_cases = []
    for b in BATCH_SIZES:
        for m in M_DIMS:
            for n in N_DIMS:
                for k in K_DIMS:
                    if USE_ONLY_DIVERGENT_TEST_CASES:
                        if old_heuristic(m, n, k) != new_heuristic(b, m, n, k):
                            test_cases.append([b, m, n, k])
                    else:
                        test_cases.append([b, m, n, k])
    return test_cases

def test(x, y):
    # Warm up before timing.
    for _ in range(5):
        torch.bmm(x, y)
    perf = 0.0
    for _ in range(ITERS):
        start = time.time()
        torch.bmm(x, y)
        end = time.time()
        perf += (end - start) / ITERS
    return perf

def main():
    print(f"{'b':<10}{'m':<10}{'n':<10}{'k':<10}{'time (s)':10}")
    cumulative_mean_time = 0.0
    for b, m, n, k in generate_test_cases():
        mean_time = test(torch.rand(b, m, n), torch.rand(b, n, k))
        cumulative_mean_time += mean_time
        print(f"{b:<10}{m:<10}{n:<10}{k:<10}{mean_time:10.3e}")
    print(f"Cumulative mean time = {cumulative_mean_time:.4f} s")

if __name__ == "__main__":
    main()
```
From the script we see that the cumulative mean time across all test cases (at 16 threads) is:
- 1.6195 s for the old heuristic
- 0.7012 s for the new heuristic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149122
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
In #117066, shutdown of the rendezvous was added when a worker shuts down. This is incorrect, because the rendezvous is actually shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)) but should not be shut down when a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).
#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case, but that fix only takes effect when the agent restarts the training, not when the rendezvous was already shut down beforehand.
Removing both of these changes restores the original behavior: the rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.
Fixes #150916
Fixes #147064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
Summary: Use pybind11::gil_scoped_acquire instead of the old implementation, since it automatically takes care of error handling. In the original implementation we missed releasing the GIL on every possible error path, which could put the program in a deadlock.
Test Plan: Induced error manually and saw that GIL was released
Differential Revision: D74593564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153415
Approved by: https://github.com/Skylion007, https://github.com/cyyever
Summary:
This diff adds a justknobs check for the static cuda launcher. In particular, it supports a fractional rollout where each mast job/version is consistently enrolled with the config on or off.
It also adds a set_feature_use call so we can track whether the static cuda launcher is enabled on a given dynamo compile.
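As a hedged sketch of what "consistently enrolled" means here (the hashing scheme and names below are illustrative, not Meta's justknobs internals):

```python
import hashlib

def enrolled(job_name: str, job_version: str, rollout_fraction: float) -> bool:
    # Hash a stable job identity into [0, 10_000) so the same mast
    # job/version always lands on the same side of the rollout fraction.
    key = f"{job_name}:{job_version}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 10_000
    return bucket < rollout_fraction * 10_000
```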
Test Plan: Existing unit tests. The justknobs in question are currently set to disabled, so this diff does not yet launch the feature.
Differential Revision: D74599203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
This appears to be slow in production (potentially a quadratic explosion), and logging it explicitly in pt2_compile_events and wait_counters makes it much easier to see how bad the issue is.
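For illustration, a hedged sketch of how such a region typically gets surfaced (the wrapped function name is hypothetical, and `dynamo_timed`'s keyword arguments vary across PyTorch versions):

```python
from torch._dynamo.utils import dynamo_timed

def update_node_users(graph):
    # Time the suspect (potentially quadratic) region so it shows up in
    # compile-time telemetry rather than being buried in total compile time.
    with dynamo_timed("update_node_users", log_pt2_compile_event=True):
        ...  # the work being measured
```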
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153270
Approved by: https://github.com/masnesral
Summary: I forgot to remove this unused field in D73809989.
Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w