Fixes #160520
Summary:
When running Inductor with cpp_wrapper under a DeviceContext, non-tensor arguments were being wrapped with torch.tensor(arg) without specifying a device. This created the tensor on the currently active device (e.g. CUDA), and the value was later fetched back to the CPU via .item(), causing unnecessary host-device-host memory transfers.
This PR fixes the issue by explicitly creating scalar tensors on the CPU:
```
input_tensors = [
    arg if isinstance(arg, torch.Tensor) else torch.tensor(arg, device="cpu")
    for arg in args
]
```
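For illustration, a minimal standalone sketch of the round trip this avoids (assuming a CUDA build; the DeviceContext is modeled here with the torch.device context manager):
```
import torch

if torch.cuda.is_available():
    with torch.device("cuda"):
        t = torch.tensor(3.0)                # allocated on the active device (CUDA)
        assert t.device.type == "cuda"
        _ = t.item()                         # device-to-host copy back to CPU

        t = torch.tensor(3.0, device="cpu")  # pinned to CPU: no round trip
        assert t.device.type == "cpu"
        _ = t.item()                         # plain host read
```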
impact: inductor, codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160584
Approved by: https://github.com/benjaminglass1, https://github.com/desertfire, https://github.com/mlazos, https://github.com/jeffdaily
The environment variable PYTORCH_TESTING_DEVICE_ONLY_FOR controls the devices used in get_desired_device_type_test_bases, so we add RUN_CPU and RUN_GPU flags to make sure cases are only enabled for the devices specified by PYTORCH_TESTING_DEVICE_ONLY_FOR. For example, enable only the GPU cases and skip the CPU cases even when HAS_CPU is true.
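Roughly, the gating behaves like the following sketch (the helper name and flag wiring are illustrative, not the exact code in common_device_type.py):
```
import os

# Devices explicitly requested, e.g. PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
only_for = os.environ.get("PYTORCH_TESTING_DEVICE_ONLY_FOR", "")
requested = {d.strip() for d in only_for.split(",") if d.strip()}

def device_enabled(device: str, available: bool) -> bool:
    # A device's cases run only if the device is available on the machine
    # and, when the env var is set, it appears in the requested list.
    return available and (not requested or device in requested)

HAS_CPU, HAS_GPU = True, True                # stand-ins for the real capability checks
RUN_CPU = device_enabled("cpu", HAS_CPU)     # False when only "cuda" is requested
RUN_GPU = device_enabled("cuda", HAS_GPU)
```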
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149023
Approved by: https://github.com/jansel, https://github.com/cyyever
Longer term it would be good to add this as a feature to cpp_wrapper, but this change makes sure it doesn't fail on main.
Not sure if this needs a test, since it's not meant to compose, but I will add one if necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145538
Approved by: https://github.com/desertfire
This PR adds initial max-autotune support for XPU. The current Triton templates and configurations are not well optimized for XPU, so performance is not ready yet, and the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in follow-up PRs.
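For context, max-autotune is enabled through the standard torch.compile mode flag; a minimal sketch of exercising it on an XPU device (assuming a PyTorch build with XPU support; substitute "cuda" or "cpu" on other builds):
```
import torch

device = "xpu" if torch.xpu.is_available() else "cpu"

def f(a, b):
    return a @ b

# mode="max-autotune" lets Inductor benchmark Triton templates/configs
# and pick the fastest kernel for each op.
compiled = torch.compile(f, mode="max-autotune")
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)
out = compiled(a, b)
```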
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel