Commit Graph

500 Commits

e1ed5ad5a5 Add a timeout to benchmark script (#90634)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90634
Approved by: https://github.com/voznesenskym
2022-12-11 23:12:29 +00:00
181d37475d Simple fix: add missing positional arg in init_optimizer() call (#90641)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90641
Approved by: https://github.com/kit1980
2022-12-11 13:18:05 +00:00
fd3f5d7bf7 [inductor] Update TIMM skip list (#90188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90188
Approved by: https://github.com/anijain2305
2022-12-09 21:30:23 +00:00
d91d7a3221 [reland][dynamo] use optimizers correctly in benchmarking (#87492)
Reland https://github.com/pytorch/pytorch/pull/87311

mlazos: updated to use SGD so that the benchmark does not add a bunch of additional memory allocations (unlike Adam)
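
A minimal sketch of the idea in a benchmark iteration (the `model`/`inputs` names are placeholders, not the actual harness code):

```
import torch

def run_one_iteration(model, inputs):
    # SGD keeps per-parameter state minimal, unlike Adam, which allocates
    # exp_avg/exp_avg_sq buffers for every parameter.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    optimizer.zero_grad(set_to_none=True)
    loss = model(*inputs).sum()
    loss.backward()
    optimizer.step()
```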

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87492
Approved by: https://github.com/desertfire
2022-12-09 20:32:53 +00:00
351d73b97f Fix exception causes all over the codebase (#90271)
This is the continuation of #90134 and hopefully the final PR in this series.
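
Illustrative example of the kind of change this series makes (a hypothetical function, not code from the PR): converting a re-raise inside an `except` block into one that chains the original cause.

```
def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        # Before: `raise RuntimeError(f"could not load {path}")` -- cause lost.
        # After: the original exception is preserved as __cause__.
        raise RuntimeError(f"could not load {path}") from e
```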

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271
Approved by: https://github.com/kit1980
2022-12-07 04:29:00 +00:00
8f079b895b [Dynamo+FSDP] Update benchmarks with use_orig_params=True (#90100)
After https://github.com/pytorch/pytorch/pull/89523, we now need to assert use_orig_params=True, even in the non-recursive case where (I think) it would not otherwise be required.
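
A minimal sketch of the wrapping being described, assuming a generic module (not the benchmark's actual FSDP setup):

```
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_for_dynamo(model: torch.nn.Module) -> FSDP:
    # Assumes torch.distributed is already initialized.
    # use_orig_params=True keeps the original parameters visible to dynamo,
    # which is what the dynamo+FSDP path now asserts on.
    return FSDP(model, use_orig_params=True)
```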

Tested with `python benchmarks/dynamo/torchbench.py --training --accuracy --only hf_T5 --fsdp`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90100
Approved by: https://github.com/wconstab
2022-12-07 03:33:58 +00:00
4068c5467d [Reland] Move functorch/_src to torch/_functorch (#88756) (#90091)
This will be the last disruptive functorch internals change.

Why are we moving these files?
- As a part of rationalizing functorch we are moving the code in
functorch/_src to torch/_functorch
- This is so that we can offer the functorch APIs as native PyTorch APIs
(coming soon) and resolve some internal build issues.

Why are we moving all of these files at once?
- It's better to break developers all at once rather than many times

Test Plan:
- wait for tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90091
Approved by: https://github.com/anijain2305, https://github.com/ezyang
2022-12-03 14:17:15 +00:00
0bde810572 Add more debug information for Inductor (#90008)
- Add the graph index to the profile information of Inductor kernels for better debuggability.

  The generated code for different graphs can produce kernels with the same name. The side effect is that it is hard to attribute E2E performance to these kernels, because the profiler aggregates results by kernel name regardless of which graph they came from. Hence, this PR adds the graph index to the profile information to address this limitation.

- Label arbitrary code ranges for `eager` and `opt` modes for better debuggability.

  The profile information of dynamo benchmarks mixes the eager mode and opt mode, making it hard to separate the ranges belonging to each mode. This PR adds eager and opt marks to the profile information to address this limitation (a sketch of the idea follows).
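
A rough sketch of the labeling idea using the public profiler API (illustrative names, not the exact marks the PR emits):

```
from torch.profiler import profile, record_function

def profile_eager_vs_opt(eager_model, opt_model, inputs):
    with profile() as prof:
        with record_function("eager"):
            eager_model(*inputs)
        with record_function("opt"):
            opt_model(*inputs)
    print(prof.key_averages().table(sort_by="cpu_time_total"))
```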

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90008
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 09:34:48 +00:00
3162a48a77 [dynamo][benchmarks] Call zero grad (#90026)
Hoping that it might reduce some flakiness

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90026
Approved by: https://github.com/williamwen42
2022-12-02 04:05:57 +00:00
68805b08d1 [benchmarks][dynamo] Trying CI - Set train() for TIMM models accuracy tests (#89780)
Moving to train mode for TIMM models and also raising the batch size for accuracy testing.

Raising the batch size seems to remove a lot of noise/instability coming from the batch_norm decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89780
Approved by: https://github.com/ngimel
2022-11-30 12:57:35 +00:00
218d9c6e09 Revert "Move functorch/_src to torch/_functorch (#88756)"
This reverts commit 52bc5c1cfe098fd4b4b13902b4fea83b455b9773.

Reverted https://github.com/pytorch/pytorch/pull/88756 on behalf of https://github.com/clee2000 due to broke imports in tests 52bc5c1cfe https://github.com/pytorch/pytorch/actions/runs/3574742513/jobs/6010814968 probably a landrace
2022-11-29 17:17:11 +00:00
52bc5c1cfe Move functorch/_src to torch/_functorch (#88756)
This will be the last disruptive functorch internals change.

Why are we moving these files?
- As a part of rationalizing functorch we are moving the code in
functorch/_src to torch/_functorch
- This is so that we can offer the functorch APIs as native PyTorch APIs
(coming soon) and resolve some internal build issues.

Why are we moving all of these files at once?
- It's better to break developers all at once rather than many times

Test Plan:
- wait for tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88756
Approved by: https://github.com/ezyang
2022-11-29 13:55:42 +00:00
465ee7bc09 [inductor] skip dm_nfnet_f0 in TIMM model test (#89768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89768
Approved by: https://github.com/clee2000
2022-11-28 20:08:41 +00:00
cdf4087597 [benchmarks] Disabling gradscaler (#89741)
Disabling GradScaler because:
 1) The benchmark setup runs only 2 iterations of fwd-bwd, so scaling is not useful.
 2) The current setup shares one grad_scaler between the eager and dynamo models,
    which is bad because GradScaler has state and can adjust the scaling factor
    between the eager and dynamo runs, making the accuracy check harder
    (illustrated in the sketch below).
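
An illustrative sketch of the state issue (hypothetical names, not the benchmark code): `GradScaler.update()` mutates the scale factor, so a scaler shared between the eager and dynamo runs lets one run perturb the other.

```
import torch

def amp_step(model, optimizer, scaler, inputs):
    with torch.cuda.amp.autocast():
        loss = model(*inputs).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()  # mutates scaler state; a shared scaler carries this over

# Give each run its own scaler so the accuracy comparison stays independent.
eager_scaler = torch.cuda.amp.GradScaler()
dynamo_scaler = torch.cuda.amp.GradScaler()
```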

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89741
Approved by: https://github.com/ngimel
2022-11-28 20:08:37 +00:00
049a0f2cd5 [inductor] Update CI model tests (#89499)
Summary:
1) Add model inference test
2) Switch model training test to use AMP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89499
Approved by: https://github.com/bertmaher
2022-11-23 18:30:51 +00:00
ed32511974 Don't use explain() for --explain; instead read it off the counters (#89518)
Fixes a huggingface problem where example_inputs is not actually the args.
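
A rough sketch of the "read it off the counters" approach, assuming the dynamo counters layout of the time (the keys here are best-effort, not verified against the PR):

```
import torch._dynamo as dynamo
from torch._dynamo.utils import counters

def run_and_report(model, inputs):
    dynamo.reset()
    counters.clear()
    opt_model = dynamo.optimize("eager")(model)
    opt_model(*inputs)
    graphs = counters["stats"]["unique_graphs"]
    graph_breaks = sum(counters["graph_break"].values())
    print(f"{graphs} graphs, {graph_breaks} graph breaks")
```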

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89518
Approved by: https://github.com/albanD
2022-11-23 02:43:53 +00:00
26322544b8 Add limited FSDP correctness to torchdynamo benchmark (#89469)
- Does not do recursive wrapping
- Only supports accuracy bench
- Mainly useful for sweeping over models for correctness, in part
  to evaluate whether dynamo support for FSDP is breaking anywhere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89469
Approved by: https://github.com/davidberard98, https://github.com/aazzolini
2022-11-23 00:19:36 +00:00
f281f435a8 Fix benchmarks - xla tensor test (#89509)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89509
Approved by: https://github.com/ngimel, https://github.com/shunting314
2022-11-22 18:42:13 +00:00
e545caa50f dynamo/torchxla integration: trace on xla rather than eager (#88904)
In #87741 we added inference support for the dynamo/torchxla integration. Later on, in #88449, we attempted to add training support. That attempt was not smooth because
- we tried 2 things together:
   1. let dynamo trace the model on xla rather than eager
   2. enable training
- It turns out neither of these two tasks is trivial.

Furthermore, item 2 (enable training) depends on item 1 (tracing on xla). We enable training via AOTAutograd, which lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to the xla device. That hurts performance a lot. Having a cache that maps eager parameters to XLA parameters does not solve the problem, since an update to either one will not sync automatically to the other; they easily go out of sync.

This PR lets dynamo trace the model on XLA rather than eager. This is a preparation step toward enabling training.

Also, tracing on XLA makes the data movement more efficient. We see a 1.5x geomean speedup, compared to the previous 1.38x.
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |  XLA (trace every time) |
+=========================+====================+=========================+
| resnet18                |            1.38    |                 1.008   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.227   |                 0.998   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.544   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.085   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            2.028   |                 1.013   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.516   |                 0.995   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            0.868   |                 1.01    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.099   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            3.26    |                 1.027   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            2.182   |                 1.015   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.50389 |                 1.01261 |
+-------------------------+--------------------+-------------------------+
```

Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```
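
A minimal sketch of "trace on XLA rather than eager" (assumes torch_xla is installed; the names are illustrative, not the benchmark's internals):

```
import torch._dynamo as dynamo
import torch_xla.core.xla_model as xm

def run_on_xla(model, example_inputs):
    device = xm.xla_device()
    # Move the model and inputs to the XLA device *before* dynamo traces, so the
    # captured parameters/buffers already live on XLA and no eager->XLA copies
    # are needed at call time.
    model = model.to(device)
    example_inputs = tuple(t.to(device) for t in example_inputs)
    opt_model = dynamo.optimize("torchxla_trace_once")(model)
    return opt_model(*example_inputs)
```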

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88904
Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
2022-11-22 03:57:04 +00:00
e4d9dbd7d2 Port torchdynamo's torchbench script to userbenchmark (#89239)
Summary:
This Diff ports the torchbench.py script from torchdynamo to torchbench to support the development of internal models.

Currently, it only works with the `--only` option and can only test one model at a time.

Note that the noisy logs are from upstream model code, not the benchmark code.
In the internal environment, `torch._dynamo.config.base_dir` is not writable, so we add an option to specify the output directory.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only ads_dhen_5x --part over --output-directory /tmp/tb-test/
cuda eval  ads_dhen_5x
  1/  1 +0 frames   2s  1 graphs  1 graph calls  412/ 411 = 100% ops 100% time
```

```
$  buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only cmf_10x --part over --output-directory /tmp/tb-test/
cuda eval  cmf_10x
  1/  1 +0 frames   1s  1 graphs  1 graph calls  306/ 305 = 100% ops 100% time
```

Reviewed By: jansel

Differential Revision: D41294311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89239
Approved by: https://github.com/jansel
2022-11-21 17:25:28 +00:00
631baecbcd Add --explain flag to bench (#89316)
TORCHDYNAMO_DYNAMIC_SHAPES=1 AOT_DYNAMIC_SHAPES=1 time python benchmarks/dynamo/torchbench.py  --accuracy --explain  --backend aot_eager --train --only BERT_pytorch

Dynamo produced 76 graphs with 75 graph break and 198 ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89316
Approved by: https://github.com/ezyang
2022-11-19 03:35:09 +00:00
19fcb80551 [inductor] Skip DALLE2_pytorch in torchbench (#89288)
Summary: DALLE2_pytorch fails in eager as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89288
Approved by: https://github.com/Krovatkin
2022-11-18 16:21:17 +00:00
1f7c0ff6e7 [inductor] Temporarily disable functorch_dp_cifar10 test in TorchBench (#89281)
Summary: The failure wasn't caught because of a land race. Skip the test
for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89281
Approved by: https://github.com/Krovatkin
2022-11-18 16:07:44 +00:00
31b10e7d40 Enable inductor CI for TorchBench (#87465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87465
Approved by: https://github.com/malfet
2022-11-17 23:16:21 +00:00
74610a1ced [dynamo][benchmarks] HF - Fix seq len and batch sizes (#89165)
Fixes many models in https://github.com/pytorch/torchdynamo/issues/1842
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89165
Approved by: https://github.com/ngimel
2022-11-17 06:14:24 +00:00
a13433940c allow loading model from a path in torchbench (#89028)
Sometimes it's really convenient to run simple models through the torchbench.py script rather than those from pytorch/benchmark. This PR adds the ability to run any model from a specified path by overloading the --only argument.

This PR is split out from #88904

Here is the usage:

        Specify the path and class name of the model in format like:
        --only=path:<MODEL_FILE_PATH>,class:<CLASS_NAME>

        Due to the fact that dynamo changes current working directory,
        the path should be an absolute path.

        The class should have a method get_example_inputs to return the inputs
        for the model. An example looks like
        ```
        class LinearModel(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(10, 10)

            def forward(self, x):
                return self.linear(x)

            def get_example_inputs(self):
                return (torch.randn(2, 10),)
        ```

Test command:
```
# python benchmarks/dynamo/torchbench.py --performance --only=path:/pytorch/myscripts/model_collection.py,class:LinearModel --backend=eager
WARNING:common:torch.cuda.is_available() == False, using CPU
cpu  eval  LinearModel                        0.824x p=0.00
```

Content of model_collection.py
```
from torch import nn
import torch

class LinearModel(nn.Module):
    """
    AotAutogradStrategy.compile_fn ignores graphs with at most 1 call node.
    Make sure this model calls 2 linear layers to avoid being skipped.
    """
    def __init__(self, nlayer=2):
        super().__init__()
        layers = []
        for _ in range(nlayer):
            layers.append(nn.Linear(10, 10))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89028
Approved by: https://github.com/jansel
2022-11-16 00:29:08 +00:00
4108367123 Exclude poolformer_m36 from the inductor model test (#88908)
Summary: The root cause is still to be investigated. Issue tracked at
https://github.com/pytorch/torchdynamo/issues/1856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88908
Approved by: https://github.com/malfet
2022-11-12 03:10:25 +00:00
6e3555edea Add absolute latency to dashboard (#88790)
Add absolute latency to dashboard, as requested by https://github.com/pytorch/torchdynamo/issues/1833#issuecomment-1302742914

Tested by setting `run.sh` to
```
# Setup the output directory
rm -rf ../test-dynamo-runner-logs-7/
mkdir ../test-dynamo-runner-logs-7/

# Commands for torchbench for device=cuda, dtype=float32 for training and for performance testing
python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_performance.csv --training --inductor   --no-skip --dashboard --only mobilenet_v2 --cold_start_latency

# Commands for torchbench for device=cuda, dtype=float32 for training and for accuracy testing
python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_accuracy.csv --training --inductor   --no-skip --dashboard --only mobilenet_v2
```
and running `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-7/ --dashboard-archive-path /data/home/williamwen/dynamo-runner-logs-copy --training --run --compilers inductor --flag-compilers inductor --suites torchbench --update-dashboard`  (need to comment out the `generate_commands` line and change the github issue ID from 681 to something else).

Sample comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1309645562

NOTE: this change breaks processing old logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88790
Approved by: https://github.com/anijain2305
2022-11-10 01:45:52 +00:00
de53d4143a Fix TorchInductor benchmarking in fbcode (#88689)
Summary: Makes the C++ TorchInductor benchmarking work in fbcode, plus some minor fixes to enable that.

Test Plan: Test added

Differential Revision: D41045910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88689
Approved by: https://github.com/soumith
2022-11-09 18:13:06 +00:00
100b55637b Mark dynamo torchbench dlrm as unsupported (#88712)
- DLRM requires special configuration of embedding layers which are sparse
  and not compatible with DDP.
- I could mark the embedding params as ignored in DDP
  to make the benchmark pass, but this isn't a representative benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88712
Approved by: https://github.com/ezyang
2022-11-09 17:23:56 +00:00
89c5819626 Dynamo DDP accuracy bench uses find_unused_parameters (#88645)
- find_unused_parameters adds a slight overhead, but it is required
  when users do not manually specify which parameters to ignore
  (i.e., the ones that will not receive grads). In some models, some
  parameters do not receive grads, and this causes DDP to throw an
  exception as it waits for a grad for each parameter (see the sketch below).
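
Sketch of the flag in question (an illustrative wrapper, not the benchmark code):

```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_accuracy_bench(model: torch.nn.Module, device_id: int) -> DDP:
    # find_unused_parameters=True lets DDP finish the backward pass even when
    # some parameters receive no gradient in a given iteration.
    return DDP(model, device_ids=[device_id], find_unused_parameters=True)
```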

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88645
Approved by: https://github.com/soumith
2022-11-08 16:13:10 +00:00
1f32c3c087 Add single-process DDP accuracy support to dynamo benchmark suite (#88511)
- does not intend to support multi-process, as that is more complex
  and we have torchbench scripts for that
- currently only works in accuracy mode as this was the main goal,
  but could be extended for measuring single-gpu perf impact of
  graph breaks

Run with

`python benchmarks/dynamo/torchbench.py --inductor --training --accuracy --only hf_Bert --ddp`

Example output
```
cuda train hf_Bert
[2022-11-04 18:52:08,304] torch._inductor.compile_fx: [WARNING] skipping cudagraphs due to complex input striding
PASS
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88511
Approved by: https://github.com/davidberard98, https://github.com/aazzolini
2022-11-05 02:41:17 +00:00
1b575782a0 [dynamo][benchmarks] use fresh inductor cache and raise batch size wherever possible (#88044)
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88044
Approved by: https://github.com/ngimel
2022-10-30 17:10:17 +00:00
e4a8661ab8 torchdynamo and xla integration (#87741)
# Motivation
- torchdynamo and torchxla use different strategies to achieve sound graph capture: the former relies on guards; the latter relies on retracing
- the guard system has quite low overhead, but torchxla's tracing overhead is quite high

The main idea is to leverage the guard system in torchdynamo to avoid retracing in torchxla, so that
- we can integrate torchdynamo with XLA
- we reduce or even completely avoid the tracing overhead of torchxla

# Technique details
## XLA baseline
We found that different frameworks do not generate numerically identical results for the SAME model with the SAME input. By default, torchdynamo uses eager as baseline so the model will run with PyTorch. It would be tricky to compare a model running on XLA with this baseline: it's hard to check correctness. To make the comparison easier, we add a flag `--use-xla-baseline`. When it's enabled, the baseline will be run on XLA.

## New dynamo backends added
We add 2 new dynamo backends, torchxla_trivial and torchxla_trace_once, to control the optimization targets.

torchxla_trivial simply moves inputs/model parameters to XLA and runs the model on XLA. There is tracing overhead on each run, so we should expect the result to be mostly neutral compared to the XLA baseline.

torchxla_trace_once only traces once, at AOT compile time. Here are the steps (a usage sketch follows the list):
1. dynamo captures guards and the subgraph
2. the torchxla_trace_once backend traces the graph with torchxla, lowers it, and records a hash of the graph for later lookup
3. at inference time, the hash is used directly to look up the optimized graph and run it.
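
A hypothetical usage sketch of the two backends (assumes they are registered with dynamo under these names, as the `--backend` flag below suggests):

```
import torch._dynamo as dynamo

def compile_with(model, backend):
    # backend="torchxla_trivial": re-traced on every call (tracing overhead each run)
    # backend="torchxla_trace_once": traced/lowered once, then looked up by graph hash
    return dynamo.optimize(backend)(model)
```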

# Limitations
We cannot handle LTC/torchxla fallback right now. If an op is missing an LTC kernel, we raise an exception, which results in a dynamo fallback (or trying another compiler). People have brainstormed the idea of breaking the graph and stitching the subgraphs together, but it may be easier to just add the missing LTC kernels for those models.

# Results
The models we tested are those that do not cause LTC fallback. We run the tests on **GPU**. We see a **1.38x** geomean speedup for torchxla_trace_once, and torchxla_trivial is mostly neutral, as expected.
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |  XLA (trace every time) |
+=========================+====================+=========================+
| resnet18                |            1.346   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.153   |                 1.007   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.381   |                 1.039   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.045   |                 1.018   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            1.562   |                 1.021   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.303   |                 1.069   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            1.278   |                 1.025   |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.076   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            2.224   |                 0.978   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            1.81    |                 1.025   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.38101 |                 1.02324 |
+-------------------------+--------------------+-------------------------+
```

The speedup is similar to what we saw in previous work for LTC's TorchScript backend (a 1.40x geomean speedup there):
https://docs.google.com/presentation/d/1G09X8v41u_cLKLtSdf7v6R8G19-iZTPcW_VAdOnvYBI/edit#slide=id.g11bf989cb6b_1_5

# Next steps
- Use AOT autograd to enable training
- Share results on XLA devices
- Do more extensive tests on torchbench models

Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --use-xla-baseline --only resnet18 --backend=torchxla_trace_once
```

Thanks @JackCaoG from the torchxla team for helping debug various perf issues and merging the torchxla PR! That was super critical for getting the results above. torchxla side PR: https://github.com/pytorch/xla/pull/4119

topic: not user facing

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87741
Approved by: https://github.com/wconstab
2022-10-29 17:52:26 +00:00
2cb7c3f865 [dynamo][benchmarks] Prepone Cold start setup (#87913)
Parallel compilation warms the thread pool when we call `torch._dynamo.optimize()`. In the current benchmarks, we were setting up TRITON_CACHE_DIR much later. Because of this, parallel compilation artifacts were not used and compilation latency improvements were not visible on the dashboard. This PR just moves the TRITON_CACHE_DIR setup earlier.
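
A hedged sketch of the ordering fix (the cache path is a placeholder): the cache dir has to be set before `torch._dynamo.optimize()` so the warmed worker threads write their artifacts where the benchmark later reads them.

```
import os
import torch._dynamo as dynamo

os.environ["TRITON_CACHE_DIR"] = "/tmp/triton_cache_example"  # set this first
opt_ctx = dynamo.optimize("inductor")  # threadpool warm-up happens on this call
```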

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87913
Approved by: https://github.com/wconstab
2022-10-28 02:41:13 +00:00
83b381d34d [dynamo] add inductor runs w/o cudagraphs (#87847)
as title

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87847
Approved by: https://github.com/jansel
2022-10-27 19:49:29 +00:00
ebe5aad466 [inductor] Revert channels-last support (#87588)
We witnessed slow compilation times last week. Earlier, I thought it was due to parallel compilation, but after a git bisect, I found the source of the extra time to be my PR - https://github.com/pytorch/pytorch/pull/87049

For 1x1 kernels, the current striding check incorrectly declares channels-first 1x1 convs as channels-last. I am not sure why it caused such a large jump in compilation time, or why it did not fail; there was no change in performance speedup. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu to identify what could be the source of this compilation time increase, so that we can manually check that part of the stack.
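
A small repro of the ambiguity being described (illustrative, not the Inductor check itself): a contiguous channels-first 1x1 conv weight also satisfies the channels-last stride pattern, because size-1 dims make the stride check degenerate.

```
import torch

w = torch.randn(64, 32, 1, 1)  # contiguous, channels-first 1x1 conv weight
print(w.is_contiguous())                                   # True
print(w.is_contiguous(memory_format=torch.channels_last))  # also True
```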

With this, `res2next50` compilation time went back to 96 seconds for a single thread (it had been raised to 900 seconds by my earlier PR), and parallel compilation brings it down to ~30 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87588
Approved by: https://github.com/soumith, https://github.com/jansel, https://github.com/ngimel
2022-10-25 19:58:25 +00:00
f047dadab9 Enable inductor CI for TIMM (#87462)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87462
Approved by: https://github.com/anijain2305
2022-10-22 05:50:00 +00:00
c55b332517 Delete unused static runtime experiment (#87473)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87473
Approved by: https://github.com/anijain2305
2022-10-21 20:03:24 +00:00
dfc65f43f9 Delete unused ts experiment (#87472)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87472
Approved by: https://github.com/anijain2305
2022-10-21 20:03:24 +00:00
7baf4b1969 Delete unused ltc experiments (#87471)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87471
Approved by: https://github.com/anijain2305
2022-10-21 20:03:22 +00:00
62d30f5a8a Remove unused cold_start experiment (#87470)
- this `--cold_start` experiment didn't end up being used
- there is a new `--cold_start_latency` flag that is used
- this experiment was only hooked up for nvfuser anyway

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87470
Approved by: https://github.com/anijain2305
2022-10-21 20:00:05 +00:00
96691865b9 [dynamo] Unify raise_on_* config to suppress_errors and raise by default (#87440)
I noticed that a lot of bugs are being suppressed by torchdynamo's default
error suppression, and worse yet, there's no way to unsuppress them.  After
discussion with voz and soumith, we decided that we will unify error suppression
into a single option (suppress_errors) and default suppression to False.

If your model used to work and no longer works, try TORCHDYNAMO_SUPPRESS_ERRORS=1
to bring back the old suppression behavior.
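
For reference, a sketch of the two ways to opt back into suppression after this change (the config attribute and env var come from the description above):

```
import torch._dynamo as dynamo

dynamo.config.suppress_errors = True   # in code
# or, from the shell:
#   TORCHDYNAMO_SUPPRESS_ERRORS=1 python train.py
```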

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87440
Approved by: https://github.com/voznesenskym, https://github.com/albanD
2022-10-21 17:03:29 +00:00
b1cf377cce Enable inductor CI for huggingface (#86792)
Summary: Unit tests will be enabled after they are fixed in trunk. TorchBench and TIMM need
more setup and are coming later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86792
Approved by: https://github.com/jansel, https://github.com/huydhn
2022-10-21 01:38:46 +00:00
f38a88c4dd Revert "[dynamo] use optimizers correctly in benchmarking (#87311)"
This reverts commit 703c19008df4700b6a522b0ae5c4b6d5ffc0906f.

Reverted https://github.com/pytorch/pytorch/pull/87311 on behalf of https://github.com/anijain2305 due to Bin (desertfire) is trying to get torchbench models in CI, and this PR prevents that. I will bring this back after models are in CI.
2022-10-20 22:01:51 +00:00
703c19008d [dynamo] use optimizers correctly in benchmarking (#87311)
We were not setting optimizers correctly

* This hid the issue that we see here - https://github.com/pytorch/torchdynamo/issues/1687
* This has also revealed that we are activating profilers for every dynamo optimized model call. This could affect speedup

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87311
Approved by: https://github.com/mlazos, https://github.com/yanboliang
2022-10-20 05:46:25 +00:00
c30cfb07ab [dynamo][dashboard] Run 2 iterations for the correctness runs (#87104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87104
Approved by: https://github.com/soumith
2022-10-18 15:53:40 +00:00
30f6f6903c [inductor] Move size asserts to C++, fix bug (#87028)
Inductor internally models any `size=1` dimension as having `stride=0` to simplify indexing formulas (sympy will remove these terms from the expression).

This caused a bug in our generated stride assert in detectron2_maskrcnn_r_50_fpn, where we asserted the wrong stride of a size==1 dimension.

This fixes that bug and moves the size/stride assert logic to C++, which should be a small perf gain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87028
Approved by: https://github.com/anijain2305
2022-10-16 20:17:22 +00:00
054a2fd6c2 Sync changes from pytorch/torchdynamo (#87013)
This updates to:
6380959be2

Generated with:
https://github.com/pytorch/torchdynamo/blob/main/copy_to_core.sh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87013
Approved by: https://github.com/voznesenskym
2022-10-15 21:00:57 +00:00
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00