Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;var++)`
to the format
`for(const auto var: irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of hand-written reversions and unused-variable warning suppressions.
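For illustration, a minimal sketch of the transformation on a toy loop (the function and variable names here are hypothetical, not from the diff):
```
#include <c10/util/irange.h>

void scale(float* data, int64_t n, float k) {
  // Before: for (int64_t i = 0; i < n; i++) data[i] *= k;
  // After: c10::irange yields each index as a const value, which avoids
  // accidental mutation of the index inside the loop body.
  for (const auto i : c10::irange(n)) {
    data[i] *= k;
  }
}
```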
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66123
Some models take a list of tensors as input, so their bundled inputs contain `IValue`s of type `c10::List`. For Vulkan models, every tensor in such a list has to be converted to a Vulkan tensor first, and this case is not currently handled by the Vulkan model wrapper in the benchmark binary.
This diff introduces `IValue` type checking in the input processor of the Vulkan model wrapper, and adds support for the Tensor and List types.
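A minimal sketch of the kind of type dispatch this implies (a hypothetical helper, not the wrapper's actual code):
```
#include <torch/script.h>

// Convert a bundled input to its Vulkan equivalent, recursing into tensor lists.
c10::IValue toVulkan(const c10::IValue& ival) {
  if (ival.isTensor()) {
    return ival.toTensor().vulkan();
  }
  if (ival.isTensorList()) {
    c10::List<at::Tensor> out;
    for (at::Tensor t : ival.toTensorList()) {
      out.push_back(t.vulkan());
    }
    return out;
  }
  // Other IValue kinds are passed through unchanged.
  return ival;
}
```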
Test Plan:
```
# Build the binary
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:ptmobile_compareAndroid\#android-arm64 --show-output
# Push it to the device
adb push buck-out/gen/xplat/caffe2/ptmobile_compareAndroid\#android-arm64 /data/local/tmp/compare_models
# Run the benchmark binary
BENCH_CMD="/data/local/tmp/compare_models"
BENCH_CMD+=" --model=$PATH_TO_MODEL"
BENCH_CMD+=" --refmodel=$PATH_TO_REFERENCE_MODEL"
BENCH_CMD+=" --input_type=float --input_dims=$MODEL_INPUT_SIZE"
BENCH_CMD+=" --iter=100"
BENCH_CMD+=" --tolerance 1e-5"
```
Reviewed By: beback4u
Differential Revision: D31276862
fbshipit-source-id: 1d9abf958963da6ecad641202f0458402bee5ced
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65707
Refactor `aotCompile` to return a pair of the compiled function and the LLVM assembly, instead of updating an incoming string with the assembly code.
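As a hedged sketch of the shape of this change (types and signatures here are illustrative, not the actual codebase's):
```
#include <string>
#include <utility>

struct CompiledFunction { /* opaque handle to the compiled code */ };

// Before: the assembly was written into an out-parameter.
// CompiledFunction aotCompile(const std::string& model, std::string& assembly);

// After: the compiled function and the LLVM assembly travel together.
std::pair<CompiledFunction, std::string> aotCompile(const std::string& model) {
  CompiledFunction fn{};
  std::string llvm_asm = "; ...generated LLVM assembly...";
  return {fn, llvm_asm};
}

// Callers can then unpack both results at once:
//   auto [fn, llvm_asm] = aotCompile("mobilenetv3.pt");
```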
Testing: Gives expected results when compiled and run
```
(pytorch) ~/local/pytorch refactor_aot
└─ $ build/bin/aot_model_compiler --model mobilenetv3.pt --model_name=pytorch_dev_mobilenetv3 --model_version=v1 --input_dims="2,2,2"
The compiled model was saved to mobilenetv3.compiled.pt
```
Test Plan: Imported from OSS
Reviewed By: qihqi
Differential Revision: D31220452
Pulled By: priyaramani
fbshipit-source-id: f957c53ba83f876a2e7dbdd4b4571a760b3b6a9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65171
Add the model comparison binary to BUCK, and also add some quality-of-life features such as controlling the input range.
Test Plan:
```
# Build the binary
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:ptmobile_compareAndroid\#android-arm64 --show-output
# Push it to the device
adb push buck-out/gen/xplat/caffe2/ptmobile_compareAndroid\#android-arm64 /data/local/tmp/compare_models
# Run the benchmark binary
BENCH_CMD="/data/local/tmp/compare_models"
BENCH_CMD+=" --model=$PATH_TO_MODEL"
BENCH_CMD+=" --refmodel=$PATH_TO_REFERENCE_MODEL"
BENCH_CMD+=" --input_type=float --input_dims=$MODEL_INPUT_SIZE"
BENCH_CMD+=" --iter=100"
BENCH_CMD+=" --tolerance 1e-5"
```
Reviewed By: beback4u
Differential Revision: D30371322
fbshipit-source-id: 5e520aaf119c90985a1d5a135f76e4057148333b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60919
Update make_mnist_db.cc and make_image_db.cc to work with the DB API changes
in D29204425 (00896cb9ed). This is similar to the changes to make_cifar_db.cc landed in
D29374754 (394f60b0fc).
ghstack-source-id: 132621346
Test Plan: buck build caffe2/binaries/...
Reviewed By: valmikir
Differential Revision: D29447314
fbshipit-source-id: 33aff85c24d8b785211287de23d46704c7eb0726
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60692
Update make_cifar_db.cc to work with the DB API changes in D29204425 (00896cb9ed).
Test Plan: buck build caffe2/binaries:make_cifar_db
Differential Revision: D29374754
fbshipit-source-id: 23d2acd24031d11071791e398433b537215ffd38
Summary:
Some machines don't have a versionless `python` on their PATH, which breaks these existing shebangs.
I'm assuming that all the existing versionless `python` shebangs are meant to be `python3` and not `python2`; please let me know if my assumption was incorrect for any of these.
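Concretely, the change is of this shape (illustrative; the exact interpreter path varies by file):
```
-#!/usr/bin/env python
+#!/usr/bin/env python3
```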
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58275
Test Plan: CI.
Reviewed By: zhouzhuojie
Differential Revision: D28428143
Pulled By: samestep
fbshipit-source-id: 6562be3d12924db72a92a0207b060ef740f61ebf
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408
Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
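The reason capture-free call sites matter: a lambda with an empty capture list converts implicitly to a plain function pointer, so no std::function state is needed. A minimal sketch with hypothetical names:
```
// A callback type that stores a bare function pointer instead of std::function.
using RecordCallback = void (*)(const char* name);

void logName(const char* /*name*/) { /* e.g. write to a log */ }

int main() {
  // A capture-free lambda converts implicitly to the function-pointer type;
  // a capturing lambda would not compile here.
  RecordCallback cb1 = [](const char* /*name*/) { /* inline handler */ };
  RecordCallback cb2 = &logName;
  cb1("forward");
  cb2("backward");
}
```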
ghstack-source-id: 118665808
Test Plan:
Wait for GitHub CI since we had C++14-specific issues with
this one in previous PR https://github.com/pytorch/pytorch/pull/48629
Reviewed By: malfet
Differential Revision: D25563207
fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629
Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118568240
Test Plan: CI
Reviewed By: dhruvbird
Differential Revision: D25135415
fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620
In preparation for storing a bare function pointer (8 bytes)
instead of a std::function (32 bytes).
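The size difference is easy to verify; the exact numbers depend on the standard library, and 32 vs. 8 matches libstdc++ on x86-64:
```
#include <cstdio>
#include <functional>

int main() {
  // Prints 32 and 8 on libstdc++/x86-64, matching the sizes above.
  std::printf("std::function: %zu bytes\n", sizeof(std::function<void()>));
  std::printf("function ptr:  %zu bytes\n", sizeof(void (*)()));
}
```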
ghstack-source-id: 118568242
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D25132183
fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da
Summary:
This PR aims to reduce the include overhead and symbol noise from the `windows.h` headers.
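The standard technique looks like this (a generic sketch, not necessarily the exact macros this PR touches):
```
// Trim windows.h before including it: WIN32_LEAN_AND_MEAN drops rarely-used
// subsystems (e.g. RPC, shell APIs), and NOMINMAX stops the min/max macros
// from colliding with std::min/std::max.
#define WIN32_LEAN_AND_MEAN
#define NOMINMAX
#include <windows.h>
```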
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48009
Reviewed By: gchanan
Differential Revision: D25045840
Pulled By: ezyang
fbshipit-source-id: 01fda70f433ba2dd0cd2d7cd676ab6ffe9d98b90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46112
### Summary
This PR adds support for running TorchScript models on iOS GPUs via Metal (inference only). The feature is currently in a prototype state, and API changes are expected. A tutorial and documentation will be added once it reaches beta.
allow-large-files
- User API
```
auto module = torch::jit::load(model);
module.eval();
at::Tensor input = at::ones({1,3,224,224}, at::ScalarType::Float).metal();
auto output = module.forward({input}).toTensor().cpu();
```
- Supported Models
- Person Segmentation v106 (FB Internal)
- Mobilenetv2
- Supported Operators
- aten::conv2d
- aten::addmm
- aten::add.Tensor
- aten::sub.Tensor
- aten::mul.Tensor
- aten::relu
- aten::hardtanh
- aten::hardtanh_
- aten::sigmoid
- aten::max_pool2d
- aten::adaptive_avg_pool2d
- aten::reshape
- aten::t
- aten::view
- aten::log_softmax.int
- aten::upsample_nearest2d.vec
- Supported Devices
- Apple A9 and above
- iOS 10.2 and above
- CMake scripts
- `IOS_ARCH=arm64 ./scripts/build_ios.sh -DUSE_METAL=ON`
### Test Plan
- Circle CI
ghstack-source-id: 114155638
Test Plan:
1. Sandcastle CI
2. Circle CI
Reviewed By: dreiss
Differential Revision: D23236555
fbshipit-source-id: 98ffc48b837e308bc678c37a9a5fd8ae72d11625
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45364
Plus, add some more comments about the usage, limitations, and drawbacks.
Test Plan: Build and run benchmark binary.
Reviewed By: gchanan
Differential Revision: D23944193
fbshipit-source-id: 30d4f4991d2185a0ab768d94c846d73730fc0835
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44970
Right now, when RecordFunction is not active (the usual case),
we do two TLS accesses: one to check for thread-local callbacks, and one to
check a thread-local boolean.
This PR experiments with reducing the number of TLS accesses in the
RecordFunction constructor.
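One way to cut the two lookups down to one is to keep both pieces of state behind a single thread_local object. A minimal sketch of that idea (names are hypothetical, not the actual implementation):
```
#include <vector>

struct Callback { /* ... */ };

// Both pieces of state live in one thread_local struct, so the inactive
// fast path costs a single TLS access instead of two.
struct RecordFunctionTLS {
  bool enabled = true;
  std::vector<Callback> callbacks;
};

thread_local RecordFunctionTLS rf_tls;

bool shouldRunCallbacks() {
  const auto& tls = rf_tls;  // one TLS access
  return tls.enabled && !tls.callbacks.empty();
}
```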
Test Plan: record_function_benchmark
Reviewed By: dzhulgakov
Differential Revision: D23791165
Pulled By: ilia-cher
fbshipit-source-id: 6137ce4bface46f540ece325df9864fdde50e0a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43721
We can combine the optimization pass and save_for_mobile to reduce friction. Since a lite interpreter model can also be used in the full JIT, I don't think we need the option to save it as a full JIT model.
Also:
- improved the usage message
- print the op list before and after the optimization pass
Test Plan:
```
buck run //xplat/caffe2:optimize_for_mobile -- --model=/home/linbin/sparkspot.pt
Building: finished in 12.4 sec (100%) 2597/2597 jobs, 2 updated
Total time: 12.5 sec
pt_operator_library(
name = "old_op_library",
ops = [
"aten::_convolution",
"aten::adaptive_avg_pool2d",
"aten::add_.Tensor",
"aten::batch_norm",
"aten::mul.Tensor",
"aten::relu_",
"aten::softplus",
"aten::sub.Tensor",
],
)
pt_operator_library(
name = "new_op_library",
ops = [
"aten::adaptive_avg_pool2d",
"aten::add_.Tensor",
"aten::batch_norm",
"aten::mul.Tensor",
"aten::relu_",
"aten::softplus",
"aten::sub.Tensor",
"prepacked::conv2d_clamp_run",
],
)
The optimized model for lite interpreter was saved to /home/linbin/sparkspot_mobile_optimized.bc
```
```
buck run //xplat/caffe2:optimize_for_mobile -- --model=/home/linbin/sparkspot.pt --backend=vulkan
```
Reviewed By: kimishpatel
Differential Revision: D23363533
fbshipit-source-id: f7fd61aaeda5944de5bf198e7f93cacf8368babd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42006
This PR introduces a simple CPU caching allocator. It is specifically
intended for mobile use cases and for inference. Nothing in the
implementation prevents it from being used elsewhere; however, its
simplicity may not be suitable everywhere.
It simply tracks allocations by size and relies on deterministic,
repeatable behavior where allocations of the same sizes are made on every
inference.
Thus, after the first allocation, when the pointer is returned by the
caller, instead of handing it back to the system, the allocator caches it
for subsequent use.
Memory is freed automatically at the end of the process, or it can be
explicitly freed.
This is enabled at the moment in DefaultMobileCPUAllocator only.
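A minimal sketch of the idea (not the actual c10 implementation; a production version would also record each live pointer's size so callers don't have to pass it back):
```
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Size-bucketed caching allocator: freed blocks are cached by their
// allocation size and reused on the next request of that size.
class CachingAllocator {
 public:
  void* allocate(size_t size) {
    auto& bucket = free_blocks_[size];
    if (!bucket.empty()) {
      void* p = bucket.back();  // reuse a cached block of this size
      bucket.pop_back();
      return p;
    }
    return std::malloc(size);   // first inference: go to the system
  }

  void release(void* p, size_t size) {
    free_blocks_[size].push_back(p);  // cache instead of freeing
  }

  ~CachingAllocator() {  // real reclamation happens here
    for (auto& kv : free_blocks_) {
      for (void* p : kv.second) {
        std::free(p);
      }
    }
  }

 private:
  std::unordered_map<size_t, std::vector<void*>> free_blocks_;
};
```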
Test Plan:
android test: cpu_caching_allocator_test
Imported from OSS
Reviewed By: dreiss
Differential Revision: D22726976
fbshipit-source-id: 9a38b1ce34059d5653040a1c3d035bfc97609e6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758
Currently we flip a coin for each sampled callback each time we run
RecordFunction. This PR is an attempt to skip most of the coin flips (for
the low-probability observers) while keeping the distribution close to the
original one.
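One standard way to avoid a Bernoulli draw per call (a sketch of the general technique; the PR's actual implementation may differ) is to sample, once, how many calls to skip before the callback next fires; for probability p that gap is geometrically distributed:
```
#include <random>

// Instead of flipping a p-biased coin on every call, draw the number of
// calls until the next success from a geometric distribution. The pattern
// of fires has the same distribution, but the RNG is only touched on a hit.
class SampledCallback {
 public:
  explicit SampledCallback(double p) : dist_(p) { reset(); }

  bool shouldRun() {
    if (skip_ > 0) {
      --skip_;  // cheap decrement, no RNG access
      return false;
    }
    reset();
    return true;
  }

 private:
  void reset() { skip_ = dist_(gen_); }

  std::mt19937 gen_{std::random_device{}()};
  std::geometric_distribution<int> dist_;
  int skip_ = 0;
};
```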
Test Plan:
CI and record_function_benchmark
```
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 30108 us.
Time per iteration (1x1): 1496.78 us.
Time per iteration (16x16): 2142.46 us.
Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 19051 us.
Time per iteration (1x1): 1581.89 us.
Time per iteration (16x16): 2195.67 us.
Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18715 us.
Time per iteration (1x1): 1566.11 us.
Time per iteration (16x16): 2131.17 us.
Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18814 us.
Time per iteration (1x1): 1536.2 us.
Time per iteration (16x16): 1985.82 us.
Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18278 us.
Time per iteration (1x1): 1526.32 us.
Time per iteration (16x16): 2093.77 us.
Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18545 us.
Time per iteration (1x1): 1524.65 us.
Time per iteration (16x16): 2080 us.
Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048
```
Reviewed By: dzhulgakov
Differential Revision: D22320879
Pulled By: ilia-cher
fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37462
Instead of running all the optimization passes in the optimizeForMobile
method, introduce a whitelist optimizer dictionary as a second parameter of
the method. When it is not passed, the method runs all the optimization
passes; otherwise, the method reads the dict and only runs the passes whose
value is True.
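A sketch of the gating logic this describes (names are hypothetical; the real method operates on a script module):
```
#include <string>
#include <unordered_map>

using PassWhitelist = std::unordered_map<std::string, bool>;

bool shouldRunPass(const std::string& pass, const PassWhitelist* whitelist) {
  // No dict passed: run every optimization pass.
  if (whitelist == nullptr) {
    return true;
  }
  // Dict passed: run only the passes explicitly set to true.
  auto it = whitelist->find(pass);
  return it != whitelist->end() && it->second;
}
```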
ghstack-source-id: 106104503
Test Plan:
python test/test_mobile_optimizer.py
Imported from OSS
Differential Revision: D22096029
fbshipit-source-id: daa9370c0510930f4c032328b225df0bcf97880f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39076
Adds a `--vulkan` argument to run the torch benchmark on the Vulkan backend.
If it is true, inputs are converted to the Vulkan backend before module.forward.
Usage for mobilenetv2 fp32:
```
/build/bin/speed_benchmark_torch --model=mn-fp32.pt --input_type=float --input_dims=1,3,224,224 --warmup=1 --iter=5 --vulkan=true
```
Test Plan: Imported from OSS
Differential Revision: D21962428
Pulled By: IvanKobzarev
fbshipit-source-id: 3136af5386b6bce9ea53ba4a9019af2d312544b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37548
Moving RecordFunction from torch::autograd::profiler into the at namespace.
Test Plan:
CI
Imported from OSS
Differential Revision: D21315852
fbshipit-source-id: 4a4dbabf116c162f9aef0da8606590ec3f3847aa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37491
This PR modernizes the RecordFunction API and adds thread-local callbacks
in addition to the global ones.
Changes:
- support for TLS callbacks; this is going to be the foundation of the profiler and other tools
- modernize the interface around a simple set of functions, (add|remove|has|clear)(Global|ThreadLocal)(Callback), and add RecordFunctionCallback to easily construct the callbacks to be passed in (see the sketch after this list)
- also add `.setShouldRun` to the callback interface to support cases when simple uniform sampling is not enough
- to properly support add/remove, introduce the idea of a callback handle returned by add
- the internal implementation still uses SmallVector to store intermediate state (as before); in this case these are vectors of handles of the callbacks that were picked to run
- to speed up the runtime we keep these vectors sorted; this way we can quickly enumerate the callbacks that need to be run
- added tests for the new functionality
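A hedged sketch of how the modernized interface fits together (written with the later at:: namespace spelling; exact callback signatures have changed across PyTorch versions):
```
#include <ATen/record_function.h>

void installObserver() {
  // addGlobalCallback returns a handle; removeCallback consumes it later.
  auto handle = at::addGlobalCallback(
      at::RecordFunctionCallback(
          [](const at::RecordFunction& fn) { /* start: e.g. log fn.name() */ },
          [](const at::RecordFunction& fn) { /* end */ }));
  // ... run some workload ...
  at::removeCallback(handle);
}
```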
Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py
develop install
./build/bin/test_jit
CI
record_function_benchmark: https://gist.github.com/ilia-cher/f1e094dae47fe23e55e7672ac4dcda2f
Imported from OSS
Differential Revision: D21300448
fbshipit-source-id: 6d55c26dbf20b33d35c3f1604dcc07bb063c8c43
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37721
Even though we disabled the caffe2 test configs in Python, the BUILD_TEST
option was still building the caffe2 C++ test binaries, and various CI
configurations were running them (since they just run every binary in
`torch/test`).
This PR adds a caffe2-specific BUILD_TEST option (BUILD_CAFFE2_TEST),
which defaults to OFF, and gates the compilation of the caffe2 C++ test
binaries behind it.
Test Plan: Imported from OSS
Differential Revision: D21369541
Pulled By: suo
fbshipit-source-id: 669cff70c5b53f016e8e016bcb3a99bf3617e1f9