Summary:
When Static Runtime graph node has sub-blocks, the memory planner does not consider sub-blocks' inputs as a node's input in memory planner. As the result, such nodes' inputs' lifetime is incorrect and corresponding tensor memory is released earlier than required and causes errors.
Differential Revision: D69195886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146855
Approved by: https://github.com/swolchok
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.
I added a number of ops to the blocklist:
```
+ "_nested_tensor_storage_offsets",
+ "_nested_get_values", # no CPU backend
+ "_nested_get_values_copy", # no CPU backend
+ "_nested_view_from_jagged", # testing needs to be patched
+ "_nested_view_from_jagged_copy", # testing needs to be patched
+ "_nested_view_from_buffer", # testing needs to be patched
+ "_nested_view_from_buffer_copy", # testing needs to be patched
+ "_int_mm", # testing needs to be patched
+ "_to_sparse_csc", # testing needs to be patched
+ "_to_sparse_csr", # testing needs to be patched
+ "segment_reduce", # testing needs to be patched
```
Most of these are added just because testing doesn't work right now.
Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.
Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
Summary: This permute copy change seems to be causing huge regressions on machines without AVX512. Revert to mitigate. This shouldn't be problematic since the improvement from changing it was super small anyways.
Differential Revision: D41450088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89463
Approved by: https://github.com/hlu1
This build uses the wrong BUILD_ENVIRONMENT `pytorch-linux-focal-py3`, thus it hasn't been run for a long time (forgotten). The name was probably the old name of the build environment we used in the past. The convention today doesn't have the `pytorch-` prefix. There is a TODO for this:
> TODO: this condition is never (BUILD_ENVIRONMENT doesn't start with pytorch-), need to fix this.
This is done as part of [T131829540](https://www.internalfb.com/intern/tasks/?t=131829540), where we want
`static_runtime_benchmark` build and test jobs to run in OSS CI to avoid breaking internal
* I also fix some compiler warning errors `-Werror=sign-compare`, `-Werror,-Wunused-const-variable`, and gcc7 compatibility issue along the way because this hasn't been run for a long time.
* Reviving this test also reveals a small bug in `PrepackWeights` test in `test_static_runtime.cc` added recently in https://github.com/pytorch/pytorch/pull/85289. The test refers to an internal ops and should only be run internally. This has been fixed by https://github.com/pytorch/pytorch/pull/87799 (To be merged)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87660
Approved by: https://github.com/malfet
Summary:
D40798763 broke this op. Unfortunately, it wasn't caught at land time due to the recent OSS Static Runtime test problems.
The problem is C++ overload resolution. After D40798763, the int that we were passing to `at::native::tensor_split` was getting implicitly converted to `IntArrayRef`. Fix this by converting the int to a `SymInt` and calling the correct overload.
Test Plan:
```
buck2 test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Tensor_Split --run-disabled
```
Differential Revision: D40862394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88113
Approved by: https://github.com/hlu1
Summary: `ReplaceWithMaybeCopy` is guarded by `FBCODE_CAFFE` in `OptimizeGraph`. Run the pass manually to ensure it does the replacement.
Test Plan: Existing tests
Differential Revision: D40858743
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88099
Approved by: https://github.com/huydhn
Summary:
The pass introduces an `fb::` operator and thus cannot be used in OSS.
The test failure was not exposed because the Static Runtime tests have been disabled in OSS for a while. The Dev Infra folks encountered this failure when re-enabling the tests.
Test Plan: Existing tests
Differential Revision: D40724547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87799
Approved by: https://github.com/huydhn
Summary:
Someone was running into problems where
1) Static Runtime enablement would fail
2) We would try to fall back to the JIT interpreter *after trying to create `StaticModule`*
3) The fallback fails because Static Runtime mangled the graph.
We don't want to prevent Static Runtime from mutating its input due to memory concerns. The intent of `canEnableStaticRuntime` is to catch issues in the module before Static Runtime messes with it.
With this diff, `StaticModule` instantiation can be avoided by querying `canEnableStaticRuntime` and the issue is fixed.
Test Plan: New unit test
Differential Revision: D40564452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87396
Approved by: https://github.com/tenpercent
Summary:
The test is causing issues:
```
terminate called after throwing an instance of 'std::runtime_error'
what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
graph(%A: Tensor, %driver: str?):
%bias: None = prim::Constant()
%ret = aten::linalg_svdvals(%A, %driver)
~~~~ <--- HERE
%cloned = aten::clone(%ret, %bias)
return (%cloned)
RuntimeError: torch.linalg.svd: keyword argument `driver=` is only supported on CUDA inputs with cuSOLVER backend.
```
Just block the op and re-run the codegen script to remove everything and update the generated ops.
Test Plan: Existing tests
Differential Revision: D39973860
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85983
Approved by: https://github.com/xuzhao9, https://github.com/tenpercent
Summary: We currently assume that a tuple output implies that the prim::If node returns multiple unpacked outputs, but this is not guaranteed to be the case. Add some logic to return the wrapped tuple if necessary
Test Plan: New unit test
Differential Revision: D39712050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85446
Approved by: https://github.com/tenpercent
Summary: Split `quantized_linear_unpacked_weight_v2` into `linear_prepack` and `quantized_linear` so that the prepacking operation may be eliminated by constant folding.
Test Plan:
Fixes a huge regression in an internal model:
```
Before
89.6141 ms. 99.0923%. fb::quantized_linear_unpacked_weight_v2 (12 nodes)
After
0.806852 ms. 53.5365%. quantized::linear (12 nodes, out variant)
(prepacking eliminated)
```
Differential Revision: D39622530
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85289
Approved by: https://github.com/davidberard98
Summary: Apparently static runtime's list construct return value is always a `GenericList`, so we cannot use the `toOptionalTensorList` method in the general case -- we must convert each item individually.
Test Plan: New unit test
Differential Revision: D39628979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85298
Approved by: https://github.com/tenpercent
Summary:
`index_put` can take a list of tensors, but Static Runtime always tries to convert its argument to a list of optional tensors. This was causing crashes for some users. Add some schema checks to prevent this, and add a new overload for the new case.
Also, I found a clear bug in the JIT interpreter (mutating the argument when its not supposed to), so I fixed that too.
Test Plan: New unit test
Differential Revision: D39072214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84152
Approved by: https://github.com/tenpercent
Summary:
The previous implementation assumed that there was only one overload and unconditionally tried to convert its input into a string. Some users were running into crashes because of this. Added a new overload for the list overload and schema checks.
Also, I managed to uncover another bug when writing tests for this case (yikes). Returning inputs didn't work because the input cleanup process would destroy the output. Extended `CreateOwnedRefsForSpecialIValues` to fix that.
Test Plan: CI + new unit tests
Differential Revision: D38870803
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83753
Approved by: https://github.com/tenpercent, https://github.com/albanD
Summary:
- Test cases related to DeepAndWideSciptModel() was crashing at random due to precision issue
- test cases related for precision: DeepWide, KWargsAPI_1, KWargsAPI_2, KWargsAPI_Optional, FusionPass
- test failure was not observed always due to random input to the model (via torch::randn)
- Increasing the absolute tolerance for test cases
Differential Revision: D37639067
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80935
Approved by: https://github.com/mikeiovine
Summary: Added the third bool argument in inalg_solve test case to remove the runtime error.
Test Plan: buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: mikeiovine
Differential Revision: D37324419
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79971
Approved by: https://github.com/tenpercent
Summary: This adds the missing jit prim ops appear in the non ads models for c2->pt mitigation: aten::cpu, aten::list, aten::numel, aten::__range_length
Test Plan: static runtime unit tests
Differential Revision: D36984960
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79111
Approved by: https://github.com/davidberard98
Summary: The test is throwing a jit alias analysis not supporting error. Disabling it for now.
Test Plan: buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: mikeiovine
Differential Revision: D37056032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79574
Approved by: https://github.com/mikeiovine
Summary: This adds the pytorch operators that are currently missing in non-ads models from c2->pt mitigation: aten::index_put, aten::item, aten::tensor_split
Test Plan: buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
Differential Revision: D36984961
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79065
Approved by: https://github.com/davidberard98
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78297
Clone followed by expand/expand_as due to memoryOverlap check on copy_ native method. Refer to T118519310 for more details.
Crashing test case:
a = tensor(3,1) // strides = (1,1)
B = tensor(3,2) // strides = (2,1)
Temp = a.expand_as(b). // creates temp with shape as (3,2) and strides as (1,0)
temp.clone() // crashe on copy_ due to memoryOverlap
Fix: Disable the out variant for the expanded tensor.
- Calls native clone instead of out variant for clone dealing with expanded tensors
- Added test case for both clone variants (out and native clones)
- Increased the tensor size for memory planner test case to trigger dynamic allocation
Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Differential Revision: D36672180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78322
Approved by: https://github.com/mikeiovine
Summary: Previously the op was auto-generated but it only covered the pointwise overload of aten::max. This adds support for reduction, overall and along a dim
Test Plan: Added a unit test
Differential Revision: D36656378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78271
Approved by: https://github.com/mikeiovine
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75661
`fast_sigmoid` is a variant of sigmoid in NNC that is implemented in terms of `fast_tanh` (which is a fast rational function approximation).
ghstack-source-id: 155604086
Reviewed By: navahgar, hlu1
Differential Revision: D35481390
fbshipit-source-id: 1d64b5c375539f3b2461a1f3d9b86cd696eae7a1
(cherry picked from commit 8106c2512b8d7b373cb6545a43c3e8fc04805c4b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76391
I've seen this pattern in many important internal models:
```
x = torch.permute(a, [0, 2, 1])
y = torch.softmax(x, 2)
z = torch.permute(y, [0, 2, 1])
```
This is equivalent to
```
z = torch.softmax(x, 1)
```
The `permute` ops can degrade performance, especially if copy variants are on. Add another pattern to our `EliminateExtraPermuteOpsPass` to handle this.
ghstack-source-id: 155466506
Test Plan: New unit tests
Reviewed By: navahgar, huiguoo
Differential Revision: D35938289
fbshipit-source-id: 398b5528077b0b3f1c6fc5544e483803e96d68e9
(cherry picked from commit d742abd094d1fef23ca6a34703d97a6da2d14bd1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76398
This diff adds the large files that include the newly generated ops from D34913736. Refer to the base diff for more details.
Test Plan: buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: mikeiovine, tenpercent
Differential Revision: D35945633
fbshipit-source-id: 53497bd5c490a57ea1521837762f740deb42bfd8
(cherry picked from commit e0fbdcb0bf09f5c192430f95f450c0a946c80074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75775
fbgemm kernels already implement the fused kernel, no reason not to use it
ghstack-source-id: 155450342
Test Plan: New unit tests
Reviewed By: navahgar
Differential Revision: D35633297
fbshipit-source-id: a744a33a65ce7dbb9ce8900dbe091b6d56dd4e48
(cherry picked from commit b1361b349862715aa17e6318c5e658cd6401a464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774
`list[0:]` is a no-op. This should really be eliminated on the modeling side, implement as a graph pass for now until we can get this into prod models.
Test Plan: New unit tests
Reviewed By: navahgar
Differential Revision: D35632947
fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d
(cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75776
The output was returned directly instead of a clone, so the output of the relevant op would not be managed.
ghstack-source-id: 154935103
Test Plan: CI
Reviewed By: navahgar
Differential Revision: D35633469
fbshipit-source-id: 7b08b7368e0349a12abf8802a4c625ffecdc5abb
(cherry picked from commit 24bed9ba4da39cff7f3b40f5e49dfded2552b373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993
Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0
{F723683014}
More details in https://fb.quip.com/MKumAjz1YD4 (1f47a80e88)a#temp:C:FPD3 (ecd5567980)e5a0871ae5d481286b511ef7
The last 3 outputs of embedding_bag are unused in the graph: P495814049.
* max_indices output isn't necessary for the main output, so remove it when it's not used in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now.
Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`
Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`
Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
0.11156 ms. 99.0457%. aten::embedding_bag (52 nodes, out variant)
After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
0.0789221 ms. 98.7096%. static_runtime::embedding_bag (52 nodes, out variant)
* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/
Reviewed By: mikeiovine
Differential Revision: D35726594
fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)