Summary:
By default, performance tests (speedup experiments) run the baseline and the test backend alternately.
However, this does not work for the torchao backend, which changes the model in place: the baseline run would also use the torchao backend, since the model has already been quantized.
Add a new experiment, "latency_experiment", that runs performance tests non-alternately (first run the baseline for a few iterations, then run the test backend).
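A rough sketch of the non-alternating measurement (helper and parameter names here are hypothetical, not the actual benchmark code):
```
import torch

def bench(fn, iters):
    # median wall-clock time (ms) over `iters` runs, timed with CUDA events
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]

def latency_experiment(model, example_inputs, backend_fn, iters=10):
    # run ALL baseline iterations first, before the backend can mutate
    # the model (torchao quantizes the model in place)
    baseline_ms = bench(lambda: model(*example_inputs), iters)
    test_model = backend_fn(model)  # may modify `model` in place
    test_ms = bench(lambda: test_model(*example_inputs), iters)
    return baseline_ms / test_ms  # speedup
```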
Other changes:
- Add `torch.compiler.cudagraph_mark_step_begin()` to avoid the slowdown from "Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards".
- Update the torchao APIs to the current versions.
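For the cudagraphs fix, the call goes at the top of each measured iteration; a minimal usage sketch (illustrative, not the benchmark code):
```
import torch

model = torch.nn.Linear(64, 64).cuda()
compiled = torch.compile(model, mode="reduce-overhead")  # uses cudagraphs
x = torch.randn(8, 64, device="cuda")

for _ in range(5):
    # without this, outputs held alive across iterations can knock
    # cudagraphs off its fast path ("pending, uninvoked backwards")
    torch.compiler.cudagraph_mark_step_begin()
    out = compiled(x)
```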
X-link: https://github.com/pytorch/benchmark/pull/2394
Test Plan:
```
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```
(should all be ~1.0x)
- 0.997x
- 1.006x
- 0.994x
Reviewed By: xuzhao9
Differential Revision: D60252821
Pulled By: HDCharles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
Summary:
By default, performance tests (speedup experiments) run the baseline and the test backend alternately.
However, this does not work for the torchao backend, which changes the model in place: the baseline run would also use the torchao backend, since the model has already been quantized.
Add a new experiment, "latency_experiment", that runs performance tests non-alternately (first run the baseline for a few iterations, then run the test backend).
Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```
Differential Revision: D59332736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models that I believe fail for reasons other than a real issue.
## sebotnet33ts_256
The accuracy test for this model started to fail around June 05 ([link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256)).
I cannot repro locally, but the log from the dashboard shows:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
Raising the tolerance should fix it.
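For context, a sketch of the pass/fail rule these logs imply (an assumption about `torch._dynamo.utils.same`, not its exact code): the result fails only when its fp64 RMSE exceeds both the absolute tolerance and `multiplier` times the reference's fp64 RMSE.
```
def rmse_check(res_rmse, ref_rmse, multiplier=3.0, tol=0.04):
    # passes if the compiled result's error is within `tol`, or within
    # `multiplier` times the eager reference's own fp64 error
    return res_rmse <= tol or res_rmse <= multiplier * ref_rmse

# sebotnet33ts_256 numbers from the dashboard log above:
print(rmse_check(0.09441, 0.02971))            # False -> accuracy failure
print(rmse_check(0.09441, 0.02971, tol=0.10))  # True once tol is raised
```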
## DebertaForQuestionAnswering
This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```
From the error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```
A tolerance of 0.02 should suppress this error.
## gluon_inception_v3
This model fails on the dashboard in max-autotune mode. I cannot repro locally with:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```
From the error message on the dashboard:
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.
## mobilenetv3_large_100
Fails on the dashboard in max-autotune mode. I cannot repro locally with:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only mobilenetv3_large_100
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in `torch._dynamo.utils.same`.
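A sketch of the size-dependent multiplier idea (illustrative; the real logic lives in `torch._dynamo.utils.same`):
```
import torch

def multiplier_for(ref: torch.Tensor, base: float = 3.0) -> float:
    # tiny tensors (e.g. a scalar loss or a small running stat) average
    # away less noise, so allow them a proportionally looser bound
    if ref.numel() <= 10:
        return base * 10.0
    return base
```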
## yolov3
Fails on the dashboard with error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
Fix it by using a larger multiplier for smaller tensors and raising the tolerance.
## timm_efficientdet
Fails on the dashboard with error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro locally with:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training
```
Raising the tolerance should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
**Performance mode issue**: when a dynamo benchmark's performance warm-up fails, no result is written to the csv file, but in accuracy mode the model is still recorded as `fail_to_run` even when the dynamo pass fails. As a result, the model counts in the accuracy and performance csv files do not line up.

- **Fix**: models that fail warm-up are now recorded in the csv file, as shown below:

**Accuracy mode issue**: the `detectron2_fasterrcnn_r` models fail in accuracy mode but run successfully in performance mode. The accuracy failure is the same as in PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
  File "benchmarks/dynamo/torchbench.py", line 449, in <module>
    torchbench_main()
  File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
    process_entry(0, runner, original_dir, args)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
    return run(runner, args, original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

- **Fix**: same as PR ee557d8f61, skip setting batch_size to 4 when testing dynamic shapes.
The dynamic shapes pass rate improved from 89% to **95%**:
| Comp Item | Compiler | suite | before | After fix |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel
As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` failed with dynamic shape testing, we propose to skip the dynamic batch size testing of this model in this PR.
* The error message is:
```
File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV
guards = output_graph.shape_env.produce_guards(
File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4).
```
* The root cause is:
This error happens while creating a guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked`.
I ran it with TORCH_LOGS="+dynamic" and got the key line: `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)`
The batch dimension of `inputs_tensor` in the dynamic batch size test is marked as the dynamic shape `s0`, so the batch dimension of `scores`, produced by a series of operations on `inputs_tensor`, is also `s0`. However, the function that creates `attention_mask` runs not in Dynamo but in plain Python, so the batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked`, produced by a series of operations on `attention_mask`, is also the real shape `4`, not the dynamic shape `s0`. The line `scores += position_bias_masked` then creates a guard checking that the batch dimension of `scores` always equals that of `position_bias_masked`, i.e. Eq(s0, 4), and the error occurs.
So the root cause is that the function creating `attention_mask` runs outside Dynamo. That happens because Dynamo graph-breaks on this function (at the [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error:
`torch._dynamo.exc.Unsupported: Tensor.item`
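A minimal analogue of the failure (illustrative, not the T5 code): mark one input's batch dim dynamic, build the other tensor with a concrete batch size outside the compiled region, and the broadcast forces the Eq(s0, 4) guard:
```
import torch

def f(scores, position_bias_masked):
    return scores + position_bias_masked  # broadcasting guards Eq(s0, 4)

scores = torch.randn(4, 8)
torch._dynamo.mark_dynamic(scores, 0)    # batch dim traced as symbolic s0
position_bias_masked = torch.ones(4, 8)  # built outside the graph: concrete 4
# specializing s0 to 4 violates the mark_dynamic constraint and raises
# torch.fx.experimental.symbolic_shapes.ConstraintViolationError
torch.compile(f)(scores, position_bias_masked)
```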
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129
Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang
Fixes
> ERROR: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE
This error would occur when composing PT2 FSDP and cudagraphs. Cudagraphs caches output tensor impls on the fast path, so we were inadvertently accumulating multiple hooks on what should have been fresh allocations.
From the code comment:
```
# this output represents a fresh allocated tensor.
# We return the same TensorImpl from run to run to avoid overhead.
# autograd.Function will reset the Autograd meta of output tensors
# as part of aot_autograd, but _backward_hooks are stored on tensors separately,
# so we need to manually reset hooks.
```
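A minimal illustration of the hazard (hypothetical, not the actual cudagraphs code): if the same output TensorImpl is handed back run after run, each run's hook registration piles up on it instead of landing on a fresh tensor:
```
import torch

# stands in for a cached cudagraphs output tensor
cached_out = torch.randn(3, requires_grad=True)

def run_step():
    # FSDP-style code assumes this is a fresh allocation and registers
    # a new backward hook on it every iteration
    cached_out.register_hook(lambda g: g)
    return cached_out

run_step()
run_step()
print(len(cached_out._backward_hooks))  # 2 -- hooks accumulated on the reused impl
```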
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126914
Approved by: https://github.com/awgu, https://github.com/xmfan
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes were generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes were generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
# Motivation
## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.
So, we intend to deprecate them and **strongly recommend** developers use `torch.amp.GradScaler`.
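A minimal usage sketch of the device-generic API (assuming a CUDA device is available):
```
import torch

scaler = torch.amp.GradScaler("cuda")  # replaces torch.cuda.amp.GradScaler()

model = torch.nn.Linear(8, 8).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(4, 8, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).sum()
scaler.scale(loss).backward()  # scale the loss to avoid fp16 grad underflow
scaler.step(opt)               # unscales grads; skips the step on inf/nan
scaler.update()
```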
## for `custom_fwd` and `custom_bwd`,
this is a good solution that lets a custom function run, with or without effect, even in an autocast-enabled region, and it can be shared by other backends such as CPU and XPU.
So we generalize it to be device-agnostic, move it into `torch/amp/autocast_mode.py`, and re-expose it as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.
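A sketch of the generalized decorators on a custom autograd function (a minimal example, not code from the PR):
```
import torch
from torch.amp import custom_bwd, custom_fwd

class ScaledMul(torch.autograd.Function):
    @staticmethod
    @custom_fwd(device_type="cuda", cast_inputs=torch.float32)
    def forward(ctx, a, b):
        # runs in float32 even inside an autocast-enabled region
        ctx.save_for_backward(a, b)
        return a * b

    @staticmethod
    @custom_bwd(device_type="cuda")
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        return grad_out * b, grad_out * a
```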
# Additional Context
Add a UT to cover the deprecation warning.
No further UTs are needed to cover the functionality of `torch.amp.custom_f/bwd`; the existing UTs that previously covered `torch.cuda.amp.custom_f/bwd` cover them.
To facilitate review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`; the follow-up covers `custom_fwd` and `custom_bwd`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. As the --warm-start-latency flag is currently implemented, we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125953
Approved by: https://github.com/desertfire
ghstack dependencies: #125917
Summary: This change introduces a new flag to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second time in a fresh process with the cache available (i.e., a "warm start"). We can later add this mode to CI runs to collect compile times for warm start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353
Approved by: https://github.com/eellison, https://github.com/desertfire
# Motivation
As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` cannot be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast`, since `torch.amp.autocast` does NOT have the default `dtype` that CPU (`torch.bfloat16` by default) and CUDA (`torch.float16` by default) each have. We would like `torch.amp.autocast` to be more generic, to help developers/customers write device-agnostic code, because there are not enough reasons to add a device-specific autocast `torch.xxx.amp.autocast` for each device backend.
# Solution
When `None` is passed as `dtype`, we use `torch.get_autocast_dtype` to get the related dtype for each backend. Meanwhile, `torch.get_autocast_dtype` must also be supported on the JIT path for BC.
# Additional Context
With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`.
Add two new UTs to cover this change in eager and jit path respectively.
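For illustration, with this PR:
```
import torch

# dtype defaults per backend: torch.float16 on CUDA, torch.bfloat16 on CPU
with torch.amp.autocast(device_type="cuda"):  # == torch.cuda.amp.autocast()
    pass
with torch.amp.autocast(device_type="cpu"):   # == torch.cpu.amp.autocast()
    pass
```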
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui
A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or work around it to see the next one.
This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.
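A minimal way to try the mode (the config knob is the one named above; the toy function is illustrative):
```
import torch
import torch._functorch.config as functorch_config

functorch_config.fake_tensor_propagate_real_tensors = True

@torch.compile(fullgraph=True)
def f(x):
    n = x.nonzero().size(0)  # unbacked, data-dependent SymInt
    if n > 2:                # would normally raise GuardOnDataDependentSymNode
        return x * 2
    return x * 3

print(f(torch.tensor([1, 0, 1, 1])))  # guard resolved from the real value, with a warning
```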
I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this:
```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```
Potential later follow ups:
* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, as much as if your GC were broken and every intermediate tensor were kept live
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
Biggest movement is 4% on HF inference and 9% on TIMM inference. Note, this is max-autotune mode, so we are more tolerant of compile-time increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```
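That knob lives in the inductor config, so (assuming the name stays current) the trade-off could be made like:
```
import torch._inductor.config as inductor_config

# benchmark epilogue fusion for fewer of the top Triton GEMM candidates,
# trading a little peak performance for faster max-autotune compiles
inductor_config.max_epilogue_benchmarked_choices = 2
```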
There is an hf_Whisper failure, which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. Turning off epilogue fusion fixes the accuracy. I bisected the failure to an epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.
Inference:
<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">
Training:
<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825