2074 Commits

Author SHA1 Message Date
19a9de114a Forbid subclassing _TensorBase directly (#125558)
As per title.
This ensures that the methods defined in _tensor.py exist everywhere we assume they do.

BC-Breaking: This is BC-breaking, as users can no longer subclass this private class.
You should replace any use of _TensorBase with Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125558
Approved by: https://github.com/ezyang
2024-05-08 20:29:29 +00:00
a0e2f62edd Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit 9e24c263f998819f849bb8293323213101e9aefc.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2091751002))
2024-05-02 21:36:18 +00:00
9e24c263f9 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support for complex numbers in the scatter/gather-related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/mikaylagawarecki
2024-05-01 23:58:35 +00:00
4d410155b2 Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit e09f98c705e4851414cd8ddf21949177af2b13aa.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/clee2000 due to windows build failure is real, https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612#step:11:11236 is the correct failure line, ignore the statement saying build passed, batch is errorcodes arent propagating again ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2088680371))
2024-05-01 16:02:02 +00:00
e09f98c705 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support for complex numbers in the scatter/gather-related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
2024-05-01 14:31:31 +00:00
91a4740e72 Disable the CUDA fast path for split_with_sizes_copy when capturing (#125052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125052
Approved by: https://github.com/awgu, https://github.com/eellison, https://github.com/eqy
2024-04-27 07:59:39 +00:00
a8574a9719 Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-26 15:35:53 +00:00
1ac60484c1 Revert "Fix global flake8 issues (#124771)"
This reverts commit f01275934bfa1ff358b1c01d3754f2807cd04ee2.

Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
f01275934b Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-25 14:25:00 +00:00
edcd968b51 Add out wrappers to some decompositions (#115437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115437
Approved by: https://github.com/lezcano
2024-04-23 06:26:11 +00:00
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Uses set mutation methods instead of manually reimplementing (update, set_difference etc).
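
As a minimal illustration of this kind of rewrite (the variable names below are made up for the example and are not taken from the PR), a manual loop over `add`/`discard` can be replaced with the built-in `update` and `difference_update` methods:

```python
# Hypothetical data, chosen only for illustration.
new_items = ["b", "c"]
stale_items = ["a"]

# Before: mutate the set one element at a time.
manual = {"a", "b"}
for item in new_items:
    manual.add(item)
for item in stale_items:
    manual.discard(item)

# After: the built-in bulk methods do the same work in one call each.
bulk = {"a", "b"}
bulk.update(new_items)
bulk.difference_update(stale_items)

print(manual == bulk)  # both sets end up with the same members
```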

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
b412b75b42 [optim] add fused_adam/adamw_kernel support for CPU device (#123074)
On par with `CUDA` implementation.

The `autocast` logic is the same as for `CUDA` + `Fused Adam`:
 - check inf in `gradscalar.step`
 - In the fused kernel, if there is an `inf`, do nothing. Otherwise, unscale the grad (also writing it back) and update the param.
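
The control flow in the two bullets above can be sketched in pure Python. This is only a toy model under stated assumptions: `toy_fused_adam_step` and its parameters are hypothetical names, it operates on plain lists rather than tensors, and it implements plain Adam (no weight decay, no amsgrad). It is not the kernel added by this PR.

```python
import math


def toy_fused_adam_step(params, grads, exp_avg, exp_avg_sq, step,
                        lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
                        grad_scale=1.0, found_inf=False):
    """Toy stand-in for one fused Adam step (hypothetical helper)."""
    if found_inf:
        # An inf was found in the scaled grads: skip the whole update.
        return
    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step
    for i in range(len(params)):
        g = grads[i] / grad_scale  # unscale the gradient
        grads[i] = g               # write the unscaled gradient back
        exp_avg[i] = beta1 * exp_avg[i] + (1 - beta1) * g
        exp_avg_sq[i] = beta2 * exp_avg_sq[i] + (1 - beta2) * g * g
        denom = math.sqrt(exp_avg_sq[i] / bias2) + eps
        params[i] -= lr * (exp_avg[i] / bias1) / denom


params, grads, m, v = [1.0], [65536.0], [0.0], [0.0]

# With found_inf set, nothing is touched.
toy_fused_adam_step(params, grads, m, v, step=1,
                    grad_scale=65536.0, found_inf=True)
print(params, grads)

# Without it, the grad is unscaled in place and the param is updated.
toy_fused_adam_step(params, grads, m, v, step=1, grad_scale=65536.0)
print(params, grads)
```

Doing the inf check, the unscale, and the update in one pass is what makes the step "fused": the gradient is read once instead of once per stage.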

**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused

# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict

# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```

**Benchmark**:
**5.1x** on 56 core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)

```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```

**Note: Fused kernel accuracy**
The accuracy failure in CI shows a difference slightly above the default tolerance:
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debugged it step by step, and unfortunately we may not be able to make the `fused kernel` exactly match the `non fused` one due to compiler optimizations.
For example, in the non-fused impl
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in fused impl
```
  exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
  //  std::cout << "exp_avg_sq " <<   exp_avg_sq_ptr[d] << std::endl;
  exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
      scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep `std::cout`, I get exactly the same results in the UT
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment it out, there is a difference
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than the default one.
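
The message attributes the mismatch only to "compiler optimizations". One mechanism that commonly produces last-bit differences of this kind is contraction of a multiply and an add into a single fused multiply-add, which rounds once instead of twice. Whether that is what happens in this kernel is an assumption, not something the PR states. The sketch below shows the effect in float64 with values picked to make the difference visible. The helper names are hypothetical.

```python
from fractions import Fraction


def two_roundings(a, b, c):
    # Plain float arithmetic: a * b is rounded, then the sum is rounded again.
    return a * b + c


def one_rounding(a, b, c):
    # Exact rational arithmetic, converted to a float only once at the end.
    # This is the result a fused multiply-add would produce.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a float.
print(two_roundings(a, b, c))  # the 2**-60 term is lost
print(one_rounding(a, b, c))   # the 2**-60 term survives
```

Both answers are valid floating-point results. Which one a compiled kernel produces can depend on how the compiler schedules the arithmetic, so a slightly looser tolerance is a reasonable response.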

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-19 11:14:04 +00:00
6502c888cf Enable fx graph cache in torch_test.py when using PYTORCH_TEST_WITH_INDUCTOR=1 (#122010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122010
Approved by: https://github.com/eellison
2024-03-19 02:17:10 +00:00
13a54ce279 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-03-01 05:05:28 +00:00
86ff31c4a0 Revert "Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)"
This reverts commit cabc09a5f259f1cc1e3bad1d80b5e5274838bced.

Reverted https://github.com/pytorch/pytorch/pull/120455 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/120455#issuecomment-1970026100))
2024-02-28 22:30:18 +00:00
cabc09a5f2 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-02-28 00:37:33 +00:00
bd9db6a9c7 Update to TorchFix 0.4.0 (#119424)
`torch.library.Library` updated to `torch.library._scoped_library` in files with many tests where it seems obvious to do, otherwise `noqa: TOR901` added - see https://github.com/pytorch/pytorch/pull/118318 for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119424
Approved by: https://github.com/zou3519
2024-02-12 23:30:12 +00:00
02c24b0b5e Add Python binding resizable to class {Untyped,Typed}Storage (#119286)
This PR exposes `resizable` method of `StorageImpl` to Python frontend to make it accessible for users.

Fixes #119233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-02-07 19:15:55 +00:00
113138aa55 add test cases for GradScaler on CPU (#109994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-02-02 21:49:07 +00:00
0f7e63620f CUDA fast path for split_with_sizes_copy.out (#117203)
### Motivation
In per-parameter sharding FSDP, each rank holds one shard of every parameter. Before a bucket of parameters is used, FSDP performs all-gather to reconstruct the full parameters. The following example demonstrates the process for `world_size=2`, `num_params=3` (`A`, `B`, `C` stand for values in params `A`, `B`, `C`):

All-gather output:
```
AAAABBBCCAAAABBBCC
```

After all-gather-copy-out:
```
AAAAAAAA  BBBBBB  CCCC
```
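
The regrouping shown in the two diagrams above can be written out as a small pure-Python sketch. It works on strings so the output can be compared with the diagrams directly. The function name `all_gather_copy_out` and its signature are illustrative only and say nothing about how the CUDA kernel is implemented.

```python
def all_gather_copy_out(gathered, shard_sizes, world_size):
    """Regroup a flat all-gather output so each parameter is contiguous.

    `gathered` is the concatenation of every rank's buffer. Each rank's
    buffer holds one shard of every parameter, in parameter order.
    """
    per_rank = sum(shard_sizes)
    assert len(gathered) == per_rank * world_size
    outputs = []
    offset = 0  # start of the current parameter's shard inside a rank buffer
    for size in shard_sizes:
        pieces = [
            gathered[rank * per_rank + offset: rank * per_rank + offset + size]
            for rank in range(world_size)
        ]
        outputs.append("".join(pieces))
        offset += size
    return outputs


print(all_gather_copy_out("AAAABBBCCAAAABBBCC", [4, 3, 2], world_size=2))
# ['AAAAAAAA', 'BBBBBB', 'CCCC']
```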

The performance of all-gather-copy-out is crucial for the viability of per-parameter sharding FSDP. After thorough experiments, we believe that acceptable performance for this op is not achievable via composing existing ATen ops today.

We have proven that ideal performance is achievable with a [custom kernel](https://github.com/pytorch/pytorch/pull/115515). This PR aims to incorporate the optimizations into the appropriate ATen ops (as suggested by @albanD).

### all-gather-copy-out via Composing ATen Ops

Carrying out the op via composing ATen ops involves a combination of view ops and copy ops. After thorough experiments, we found that the most natural/performant way to express the op is via `split_with_sizes` + `_foreach_copy_`, which works as follows:

Reshape all-gather output as (world_size, -1):
```
AAAABBBCC
AAAABBBCC
```

`split_with_sizes` + `_foreach_copy_`:
```
AAAA BBB CC
AAAA BBB CC
```

However, the performance of this approach is still far below that of the custom kernel. We've identified the following reasons:
- The approach requires materializing `O(num_params)` intermediate views, which induces a large amount of CPU overhead when `num_params` is high.
- `_foreach_copy_` uses the same block size for all tensors, leading to waste for small tensors and insufficient thread count for large tensors. This means low effective occupancy.
- `_foreach_copy_` dispatches multiple kernels for typical problem sizes for all-gather-copy-out. This further lowers the effective occupancy.
- Due to the nature of the workload, the underlying copies are unaligned. `_foreach_copy_` isn't aggressive enough in exploiting vectorization opportunities in such workloads.

### PR
Introduces a CUDA backend for `split_with_sizes_copy.out` that addresses the above inefficiencies. See code for details.

### Benchmarks
The benchmarks are conducted on a set of representative problem sizes on an A100. CPU overhead and GPU execution time are measured separately, as reasonable CPU overhead doesn't directly affect e2e throughput. The reported copy bandwidth is calculated with GPU execution time.

We observe 3x-10x higher throughput than the baseline, depending on the problem size. We also observe lower CPU overhead across the board.

Baseline:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 260.373 GB/s (gpu ms/iter: 5.582, cpu ms/iter 0.572)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 239.585 GB/s (gpu ms/iter: 2.135, cpu ms/iter 0.587)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 205.361 GB/s (gpu ms/iter: 0.976, cpu ms/iter 0.534)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 268.397 GB/s (gpu ms/iter: 3.663, cpu ms/iter 0.084)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 265.240 GB/s (gpu ms/iter: 3.024, cpu ms/iter 0.154)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.918 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.087)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.141 GB/s (gpu ms/iter: 8.384, cpu ms/iter 0.151)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 73.237 GB/s (gpu ms/iter: 0.874, cpu ms/iter 10.664)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 259.902 GB/s (gpu ms/iter: 5.609, cpu ms/iter 0.584)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 238.703 GB/s (gpu ms/iter: 2.158, cpu ms/iter 0.612)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 205.144 GB/s (gpu ms/iter: 0.987, cpu ms/iter 0.559)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 270.467 GB/s (gpu ms/iter: 3.635, cpu ms/iter 0.073)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 267.700 GB/s (gpu ms/iter: 2.997, cpu ms/iter 0.133)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.913 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.093)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 266.589 GB/s (gpu ms/iter: 8.433, cpu ms/iter 0.207)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 135.107 GB/s (gpu ms/iter: 1.495, cpu ms/iter 10.904)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 258.675 GB/s (gpu ms/iter: 5.890, cpu ms/iter 0.996)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 238.919 GB/s (gpu ms/iter: 2.408, cpu ms/iter 0.765)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 209.836 GB/s (gpu ms/iter: 1.172, cpu ms/iter 0.611)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 270.607 GB/s (gpu ms/iter: 3.720, cpu ms/iter 0.100)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 266.375 GB/s (gpu ms/iter: 3.071, cpu ms/iter 0.176)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 270.601 GB/s (gpu ms/iter: 5.952, cpu ms/iter 0.099)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.558 GB/s (gpu ms/iter: 8.371, cpu ms/iter 0.207)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 43.749 GB/s (gpu ms/iter: 0.797, cpu ms/iter 10.531)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 254.084 GB/s (gpu ms/iter: 3.781, cpu ms/iter 0.752)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 216.792 GB/s (gpu ms/iter: 1.299, cpu ms/iter 0.717)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 188.025 GB/s (gpu ms/iter: 0.793, cpu ms/iter 0.633)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 267.793 GB/s (gpu ms/iter: 2.447, cpu ms/iter 0.107)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.232 GB/s (gpu ms/iter: 2.401, cpu ms/iter 0.182)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.455 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.089)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.633 GB/s (gpu ms/iter: 6.394, cpu ms/iter 0.177)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 46.698 GB/s (gpu ms/iter: 0.807, cpu ms/iter 10.488)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 253.450 GB/s (gpu ms/iter: 3.799, cpu ms/iter 0.655)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 216.857 GB/s (gpu ms/iter: 1.307, cpu ms/iter 0.671)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 189.059 GB/s (gpu ms/iter: 0.799, cpu ms/iter 0.572)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 269.849 GB/s (gpu ms/iter: 2.429, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.501 GB/s (gpu ms/iter: 2.399, cpu ms/iter 0.149)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.426 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.086)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.495 GB/s (gpu ms/iter: 6.398, cpu ms/iter 0.170)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 101.151 GB/s (gpu ms/iter: 1.211, cpu ms/iter 10.476)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 252.323 GB/s (gpu ms/iter: 3.963, cpu ms/iter 0.633)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 218.322 GB/s (gpu ms/iter: 1.455, cpu ms/iter 0.622)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 196.369 GB/s (gpu ms/iter: 0.944, cpu ms/iter 0.576)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 269.369 GB/s (gpu ms/iter: 2.491, cpu ms/iter 0.076)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 264.441 GB/s (gpu ms/iter: 2.439, cpu ms/iter 0.140)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 269.955 GB/s (gpu ms/iter: 3.978, cpu ms/iter 0.073)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.168 GB/s (gpu ms/iter: 6.405, cpu ms/iter 0.147)
```
New kernel:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 732.657 GB/s (gpu ms/iter: 1.984, cpu ms/iter 0.417)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 753.514 GB/s (gpu ms/iter: 0.679, cpu ms/iter 0.419)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 719.400 GB/s (gpu ms/iter: 0.279, cpu ms/iter 0.410)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 782.121 GB/s (gpu ms/iter: 1.257, cpu ms/iter 0.098)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 766.458 GB/s (gpu ms/iter: 1.047, cpu ms/iter 0.134)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 790.611 GB/s (gpu ms/iter: 1.989, cpu ms/iter 0.099)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 789.754 GB/s (gpu ms/iter: 2.847, cpu ms/iter 0.138)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 565.667 GB/s (gpu ms/iter: 0.113, cpu ms/iter 0.996)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 670.681 GB/s (gpu ms/iter: 2.174, cpu ms/iter 0.289)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 676.135 GB/s (gpu ms/iter: 0.762, cpu ms/iter 0.264)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 662.603 GB/s (gpu ms/iter: 0.306, cpu ms/iter 0.249)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 769.283 GB/s (gpu ms/iter: 1.278, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 761.057 GB/s (gpu ms/iter: 1.054, cpu ms/iter 0.104)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 774.325 GB/s (gpu ms/iter: 2.031, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 773.048 GB/s (gpu ms/iter: 2.908, cpu ms/iter 0.099)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 641.405 GB/s (gpu ms/iter: 0.315, cpu ms/iter 0.616)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 646.772 GB/s (gpu ms/iter: 2.356, cpu ms/iter 0.276)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 658.157 GB/s (gpu ms/iter: 0.874, cpu ms/iter 0.278)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 642.032 GB/s (gpu ms/iter: 0.383, cpu ms/iter 0.245)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 728.990 GB/s (gpu ms/iter: 1.381, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 689.763 GB/s (gpu ms/iter: 1.186, cpu ms/iter 0.102)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 765.507 GB/s (gpu ms/iter: 2.104, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 757.626 GB/s (gpu ms/iter: 2.967, cpu ms/iter 0.106)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 584.272 GB/s (gpu ms/iter: 0.060, cpu ms/iter 0.656)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 728.234 GB/s (gpu ms/iter: 1.319, cpu ms/iter 0.264)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 730.059 GB/s (gpu ms/iter: 0.386, cpu ms/iter 0.279)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 670.899 GB/s (gpu ms/iter: 0.222, cpu ms/iter 0.274)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 775.699 GB/s (gpu ms/iter: 0.845, cpu ms/iter 0.077)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 773.612 GB/s (gpu ms/iter: 0.820, cpu ms/iter 0.112)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 781.395 GB/s (gpu ms/iter: 1.342, cpu ms/iter 0.081)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 789.156 GB/s (gpu ms/iter: 2.169, cpu ms/iter 0.116)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 517.056 GB/s (gpu ms/iter: 0.073, cpu ms/iter 0.632)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 684.246 GB/s (gpu ms/iter: 1.407, cpu ms/iter 0.294)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 680.593 GB/s (gpu ms/iter: 0.416, cpu ms/iter 0.286)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 682.197 GB/s (gpu ms/iter: 0.221, cpu ms/iter 0.255)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 759.470 GB/s (gpu ms/iter: 0.863, cpu ms/iter 0.074)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 765.694 GB/s (gpu ms/iter: 0.829, cpu ms/iter 0.094)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 766.535 GB/s (gpu ms/iter: 1.368, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 787.608 GB/s (gpu ms/iter: 2.173, cpu ms/iter 0.105)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 640.203 GB/s (gpu ms/iter: 0.191, cpu ms/iter 0.668)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 713.947 GB/s (gpu ms/iter: 1.401, cpu ms/iter 0.274)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 642.855 GB/s (gpu ms/iter: 0.494, cpu ms/iter 0.276)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 643.297 GB/s (gpu ms/iter: 0.288, cpu ms/iter 0.262)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 690.626 GB/s (gpu ms/iter: 0.972, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 754.431 GB/s (gpu ms/iter: 0.855, cpu ms/iter 0.109)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 769.985 GB/s (gpu ms/iter: 1.395, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 766.337 GB/s (gpu ms/iter: 2.233, cpu ms/iter 0.103)
```
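
To read the two logs side by side, a small parser helps. The sketch below takes the first row of each log (copied verbatim from above) and computes the throughput and CPU-overhead ratios. It also checks the relation `bandwidth = param size / GPU time`, which is an inference from the statement that bandwidth is calculated with GPU execution time. That relation holds here only to within the rounding of the printed values. The regular expression and helper name are made up for this example.

```python
import re

# Pattern for one benchmark row; written for the log format shown above.
ROW = re.compile(
    r"Param size: (?P<gb>[\d.]+) GB\s+Copy bandwidth: (?P<bw>[\d.]+) GB/s "
    r"\(gpu ms/iter: (?P<gpu>[\d.]+), cpu ms/iter (?P<cpu>[\d.]+)\)"
)


def parse_row(line):
    """Return the numeric fields of one benchmark row as floats."""
    match = ROW.search(line)
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return {key: float(value) for key, value in match.groupdict().items()}


baseline = parse_row(
    "num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    "
    "Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460)"
)
new_kernel = parse_row(
    "num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    "
    "Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066)"
)

# Inferred relation: GB divided by seconds of GPU time.
approx_bw = baseline["gb"] / (baseline["gpu"] / 1000.0)
throughput_ratio = new_kernel["bw"] / baseline["bw"]
cpu_ratio = baseline["cpu"] / new_kernel["cpu"]

print(f"recomputed baseline bandwidth: {approx_bw:.1f} GB/s")
print(f"throughput ratio: {throughput_ratio:.2f}x")
print(f"CPU overhead ratio: {cpu_ratio:.2f}x")
```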
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117203
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #118512
2024-02-01 18:23:01 +00:00
bacbad5bc9 add GradScaler on CPU (#109993)
Step 2 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-29 23:42:35 +00:00
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
41a56f7828 Fix swap_tensors to swap PyObjects associated with TensorImpl (#116955)
This PR intends to fix the following issue when swapping two tensors

```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581,  0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581,  0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```

What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.

57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)

When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.

`swap_tensors` should also swap the PyObjects stored in the TensorImpls' PyObject slots.
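
The stale back-pointer described above can be reproduced with a toy model in pure Python. `ToyImpl`, `ToyTensor`, `wrap`, and the two swap functions are hypothetical stand-ins for `TensorImpl`, the Python tensor object, `THPVariable_Wrap`, and `swap_tensors`. They model only the slot lookup, not the real C++ code.

```python
class ToyImpl:
    """Toy stand-in for a TensorImpl: owns the data and a PyObject slot."""

    def __init__(self, data):
        self.data = data
        self.pyobj = None  # back-pointer to the Python-side wrapper


class ToyTensor:
    """Toy stand-in for the Python-side tensor object."""

    def __init__(self, data):
        self.impl = ToyImpl(data)
        self.impl.pyobj = self


def wrap(impl):
    # Mirrors the behaviour described above: if the slot is already
    # populated, hand back that object instead of creating a new one.
    return impl.pyobj


def inplace_op(tensor):
    # Any op that returns the same impl goes back to Python through wrap().
    return wrap(tensor.impl)


def swap_impl_only(a, b):
    # The buggy swap: the impls trade places, the slots are left stale.
    a.impl, b.impl = b.impl, a.impl


def swap_impl_and_slots(a, b):
    # The fix described above: also repoint each impl's slot.
    a.impl, b.impl = b.impl, a.impl
    a.impl.pyobj, b.impl.pyobj = a, b


t1, t2 = ToyTensor([1, 2]), ToyTensor([3, 4, 5])
swap_impl_only(t1, t2)
print(inplace_op(t1) is t2)  # True: the op on t1 hands back t2

u1, u2 = ToyTensor([1, 2]), ToyTensor([3, 4, 5])
swap_impl_and_slots(u1, u2)
print(inplace_op(u1) is u1)  # True: the slot now points at u1
```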

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
2024-01-24 01:40:18 +00:00
cd084c4909 Add TensorIteratorConfig::add_const_input to avoid COW materialize (#118053)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118053
Approved by: https://github.com/ezyang
2024-01-23 22:32:39 +00:00
3b38f7b266 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 16:11:38 +00:00
bb28965924 Revert "Remove skips for passing tests (#118000)"
This reverts commit 3c339b5b21fdbd530f82765f84bcabde8266d3e0.

Reverted https://github.com/pytorch/pytorch/pull/118000 on behalf of https://github.com/oulgen due to test passing on diff but failing on hud... ([comment](https://github.com/pytorch/pytorch/pull/118000#issuecomment-1905351752))
2024-01-23 06:10:25 +00:00
3c339b5b21 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 03:41:23 +00:00
0ae952db76 enable mkldnn bf32 matmul (#116015)
### Testing
FP32 matmul vs. mkldnn BF32 matmul  on SPR

single core:

Input | BF32   / ms | FP32  /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041

56 cores:
Input | BF32   / ms | FP32 /   ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True |1129.204 | 1708.912 |  1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915  | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 |  8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128  | 1296.419 | 1308.411 | 1.009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-20 09:30:23 +00:00
29516bd2a0 add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281)
Step 1 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-16 15:25:08 +00:00
2200118f59 Enable some uint{16,32,64} tests that are working (#116809)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116809
Approved by: https://github.com/albanD
2024-01-15 02:25:21 +00:00
edec54b9de Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #113397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
2024-01-11 01:32:44 +00:00
8bcdde5058 Support uint{16,32,64} deterministic empty fill and scalar Python binding handling (#116807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116807
Approved by: https://github.com/albanD
ghstack dependencies: #116805, #116806
2024-01-10 02:17:23 +00:00
43a23a704a Support uint{16,32,64} copy (#116806)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116806
Approved by: https://github.com/albanD
ghstack dependencies: #116805
2024-01-10 02:17:23 +00:00
2e983fcfd3 Support unsigned int for randint, item, equality, fill, iinfo, tensor (#116805)
These are some basic utilities that are often used for testing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116805
Approved by: https://github.com/albanD
2024-01-10 02:17:23 +00:00
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parentheses, i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`
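
The rewrites listed above do not change behaviour; they only drop parentheses that make statements look like function calls. A minimal sketch, using made-up function names (`old_style`, `new_style`, `unreachable`) rather than code from the PR:

```python
def old_style(x, y, z):
    # Spelling before the cleanup: statements written as if they were calls.
    assert(x != z)
    if(x > y or y < z):
        return('in range')
    return('out of range')


def new_style(x, y, z):
    # Spelling after the cleanup: identical behaviour, no redundant parentheses.
    assert x != z
    if x > y or y < z:
        return 'in range'
    return 'out of range'


def unreachable():
    # `assert(0)` is stripped when Python runs with -O; an explicit raise is not.
    raise AssertionError()


print(old_style(3, 1, 2), new_style(3, 1, 2))
```

One common reason for preferring `raise AssertionError()` over `assert(0)` is the `-O` behaviour noted in the comment; whether that was the motivation here is an assumption.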

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could. For a few of them I could not figure out the proper fix, so I left them with `noqa` comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
f657b2b1f8 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on custom functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-27 18:47:05 +00:00
3b709d7c1e Revert "[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)"
This reverts commit 015bd0e0a189f929e469c6bc75fe1541c18a014d.

Reverted https://github.com/pytorch/pytorch/pull/116312 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/116312#issuecomment-1869825506))
2023-12-26 23:47:15 +00:00
015bd0e0a1 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on custom functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-23 09:44:09 +00:00
f206e31e2f Swap slots if slots match in swap_tensor (#116128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116128
Approved by: https://github.com/albanD
2023-12-21 00:43:30 +00:00
8a8d0adc0b Fix torch.gradient check for spacing arg list length (#115686)
Fixes #114207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115686
Approved by: https://github.com/albanD
2023-12-13 20:17:20 +00:00
d521857411 Terminate handler (#101332)
Fixes #50051.
This PR is based on #50320 and addresses the last round of feedback.
The handler is enabled by default on Windows. It can be enabled or disabled via the USE_CUSTOM_TERMINATE environment variable.

This PR adds support for overriding the terminate handler in order to log uncaught exceptions in threads.
If an exception is thrown and not caught, it prints <Unhandled exception caught in c10/util/AbortHandler.h>
In issue #50051, exceptions were thrown but not logged. With this logging in place, such failures should be easier to debug in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-12 17:55:27 +00:00
65651d970b Optimize the copy of Half to Float and Float to Half on CPU (#103148)
### Description
Optimize the copy of Half to Float and Float to Half on CPU.

### Testing

Single core:

shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (1, 777) | 0.00345 | 0.00344 | 0.00411 | 0.00410
size: (2, 512) | 0.00355 | 0.00344 | 0.00431 | 0.00400
size: (10, 555) | 0.00473 | 0.00391 | 0.00562 | 0.00477
size: (1, 2048, 1024) | 0.488 | 0.480 | 0.498 | 0.499
size: (32, 100, 777) | 0.584 | 0.568 | 0.571 | 0.587

28 cores:

shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (10, 555) | 0.00472 | 0.00369 | 0.00576 | 0.00481
size: (1, 2048, 1024) | 0.0189 | 0.0188 | 0.0173 | 0.0251
size: (64, 512, 1024) | 3.159 | 2.375 | 3.152 | 2.358
size: (32, 100, 777) | 0.0225 | 0.0195 | 0.0193 | 0.0261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103148
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-12 05:57:52 +00:00
3361496f96 Fix the corner case of index_add (#114929)
Fixes #114864

As the title states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114929
Approved by: https://github.com/mikaylagawarecki
2023-12-09 01:57:25 +00:00
a2b89154bf New swap function (#111747)
This PR proposes a new approach to the problem that nn and optim are linked only by Python object identity.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap `model.weight` with a new Tensor (which can be any subclass of Tensor and any TensorImpl; xla, sparse and nested TensorImpls would all work). The use within nn will be done in a follow-up.

This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython versions) or due to slots.

The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the use of slots in subclasses.
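
To illustrate the idea of swapping content while keeping old references valid, here is a pure-Python analogy. It is not the PR's implementation, which works on the PyObject memory directly. This sketch only swaps `__class__` and `__dict__`, and the names `Dense`, `Sparse` and `swap_contents` are hypothetical.

```python
import weakref


class Dense:
    def __init__(self, values):
        self.values = list(values)


class Sparse:
    def __init__(self, entries):
        self.entries = dict(entries)


def swap_contents(a, b):
    """Swap the type and attributes of two objects without rebinding any name."""
    a.__class__, b.__class__ = b.__class__, a.__class__
    a.__dict__, b.__dict__ = b.__dict__, a.__dict__


weight = Dense([1.0, 2.0])
replacement = Sparse({0: 1.0})
held_by_optimizer = [weight]   # an "old reference", as an optimizer might hold
weak = weakref.ref(weight)     # weak references stay attached to the same object

swap_contents(weight, replacement)

# The list entry and the weak reference still point at the same object,
# but that object now has the other object's type and attributes.
print(type(held_by_optimizer[0]).__name__, held_by_optimizer[0].entries)
print(weak() is weight)
```

In this analogy, identity, reference count and weak references stay with each object because only the type pointer and attribute dictionary move. Plain Python has a comparable restriction to the slot limitation above: assigning `__class__` raises `TypeError` when the two classes have incompatible object layouts, for example because of differing `__slots__`.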

This is a draft for now, to see what @colesbury thinks of this approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
2023-12-08 18:49:35 +00:00
6f32eb7eef Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-12-01 18:56:09 +00:00
013675ff59 Revert "Add decomp for replication_pad2d and use for CUDA deterministic (#111590)"
This reverts commit f1286161a637e9fc0797a22a7b7d90eaa04ddc4f.

Reverted https://github.com/pytorch/pytorch/pull/111590 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing the XLA job. The job is also failing on the PR, but the log classifier failed to find the failed test, which led to it being wrongly marked as flaky ([comment](https://github.com/pytorch/pytorch/pull/111590#issuecomment-1833004794))
2023-11-30 02:28:14 +00:00
f1286161a6 Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-11-29 21:50:46 +00:00
fe428a284b Revert "Add torch._lazy_clone to create COW tensors (#113397)"
This reverts commit 9916d8a9eaaf2c05c131f2a2dbe9eabeeaa9dffc.

Reverted https://github.com/pytorch/pytorch/pull/113397 on behalf of https://github.com/DanilBaibak due to Unfortunately, I need to revert your PR because the lower [PR in the stack](https://github.com/pytorch/pytorch/pull/113396) is failing a bunch of internal build jobs. ([comment](https://github.com/pytorch/pytorch/pull/113397#issuecomment-1818761224))
2023-11-20 10:21:09 +00:00