[pt][static_runtime] Memory model (#46896)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896

The idea of the memory model is quite similar to that of BlackBoxPredictor; however, it's more complicated in PyTorch due to:
1) tensor views that share storage (with a StorageImpl refcount bump) but have different TensorImpls;
2) tensors that share the same TensorImpl and the same storage, with no refcount bump on the StorageImpl;
3) data types such as TensorList and Tuple that contain Tensors;
4) the need to support a mix of out and non-out variants while we migrate the aten ops to out variants.
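
To make points 1 and 4 concrete, here is a minimal Python sketch (illustrative only, not code from this diff):
```
import torch

# Point 1: a view shares the base tensor's StorageImpl (refcount bump)
# while getting its own TensorImpl.
base = torch.randn(4, 4)
view = base.view(16)
assert view.data_ptr() == base.data_ptr()  # same underlying buffer
assert view.storage().data_ptr() == base.storage().data_ptr()

# Point 4: an out variant writes into preallocated memory instead of
# allocating a fresh output tensor.
out = torch.empty(4, 4)
torch.add(base, base, out=out)  # reuses out's storage; no new allocation
assert out.storage().data_ptr() != base.storage().data_ptr()
```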

As a result, I had to make the following adjustments:
1) remove tensors in output Tuples from the internal blob list;
2) for memory allocation/deallocation, take candidate Tensors from the outputs of ops with out variants, extract the StorageImpls from those Tensors, dedup them, remove the StorageImpls of output tensors, and use what remains as the final list of blobs for memory planning (see the sketch after this list);
3) during the clean_up_memory pass, clean up memory held by the StorageImpls, as well as by Tensors/Lists/Tuples in IValues that don't participate in memory planning, to reduce overall memory usage.
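
A hypothetical Python sketch of step 2 (the real implementation is the static runtime's C++ memory planner; `collect_planned_storages`, `out_variant_outputs`, and `graph_outputs` are illustrative names, not APIs from this diff):
```
def collect_planned_storages(out_variant_outputs, graph_outputs):
    # Dedup by storage identity: views and aliases share one StorageImpl,
    # so they collapse into a single planning candidate.
    candidates = {}
    for t in out_variant_outputs:
        candidates[t.storage().data_ptr()] = t.storage()
    # Storages that escape through the graph outputs must not be pooled,
    # or the caller would observe their memory being reused.
    escaping = {t.storage().data_ptr() for t in graph_outputs}
    return [s for ptr, s in candidates.items() if ptr not in escaping]
```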

Risk:
The PyTorch team is planning to deprecate the current resize_output API, which we rely on. This is a pretty big risk.

https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23
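
As a small illustration of the resize_output behavior we depend on (standard eager-mode semantics, not code from this diff): an out-variant op resizes a mismatched `out` tensor in place:
```
import torch

# The out-variant path depends on resize_output semantics: an `out`
# tensor with the wrong shape gets resized (and possibly reallocated).
out = torch.empty(0)
torch.add(torch.ones(2, 3), torch.ones(2, 3), out=out)
assert out.shape == (2, 3)  # resized by the op, not by the caller
```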

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```

| pt_cleanup_activations | pt_enable_out_variant | old ms/iter | new ms/iter |
| --- | --- | --- | --- |
| 0 | 0 | 0.31873 | 0.30228 |
| 0 | 1 | 0.30018 | 0.29184 |
| 1 | 0 | 0.35246 | 0.31895 |
| 1 | 1 | 0.35742 | 0.30417 |

Reviewed By: bwasti, raziel

Differential Revision: D24471854

fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
Commit: 996f444c00 (parent 5c4bd9a38f)
Author: Hao Lu
Date: 2020-11-03 23:42:24 -08:00
Committed by: Facebook GitHub Bot
9 changed files with 398 additions and 53 deletions

```
@@ -128,8 +128,10 @@ class TestStaticRuntime(TestCase):
         attention_a = StaticRuntime(attention)
         o_test = attention_a(src, src, src, src_mask)
         o_test_kw = attention_a(src, src, value=src, mask=src_mask)
         for a, b in zip(o_ref, o_test):
             torch.testing.assert_allclose(a, b)
+        for a, b in zip(o_ref, o_test_kw):
+            torch.testing.assert_allclose(a, b)
@@ -150,9 +152,9 @@ class TestStaticRuntime(TestCase):
         attention = torch.jit.script(attention)
         attention_a = StaticRuntime(attention)
-        attention_a.benchmark([src, src, src, src_mask], {}, 10, 10)
+        attention_a.benchmark([src, src, src, src_mask], {}, 2, 2)
         metrics = attention_a.benchmark_individual_ops(
-            [src, src, src, src_mask], {}, 10, 10
+            [src, src, src, src_mask], {}, 2, 2
         )

     def test_mlp(self):
```