Prefer dashes over underscores in command-line options. Add `--command-arg-name` to the argument parser; the old underscore arguments (`--command_arg_name`) are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase, and some argument parsers accept only one or the other. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
    def __init__(...):
        if option_string.startswith('--'):
            option_string = '--no-' + option_string[2:]
            _option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift or Caps Lock key, whereas `-` does not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
We're no longer building Caffe2 mobile as part of our CI, and it adds a lot of clutter to our makefiles. Any lingering internal dependencies will use the buck build and so won't be affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84338
Approved by: https://github.com/dreiss
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75807
There is a tension in RecordFunction between two use cases:
1) In the normal eager path we don't run any callbacks, so we need to bail out of the profiling path as soon as possible to minimize eager overhead.
2) When profiling we want to determine which callbacks to run as efficiently as possible to minimize instrumentation overhead.
The confounding factor in all of this is sampling callbacks because they change which callbacks will run on each call, even in steady state operation. This has traditionally been handled with a two stage procedure: first we flip a coin to determine if a sampled callback *might* run. If false (which it usually is), do nothing. This solves (1). If true, check to see if we need to build the full callback set or if it was a false positive. This procedure has two negative effects:
* It forces us to rebuild the set of callbacks to run on every step when profiling
* It leaks the sampling abstraction, requiring other parts of the code to bump certain values and forcing RecordFunction to initialize lazily.
This change introduces a multi-level cache which can (in the common case) quickly determine which callbacks *will* run, rather than if callbacks *might* run. This means that rather than call `shouldRunRecordFunction`, we can simply get the callbacks for an invocation and check if they are empty. (And completely removes the pre-sampling heuristic.) Another major benefit of the new cache structure is that it allows thread-safe registration and unregistration of global callbacks.
It's worth briefly discussing how this maintains eager performance. In the standard eager case (only sampling callbacks registered) the cache first checks that the global callbacks haven't changed (atomic read), decrements a counter to see if a sampling callback fired, and then returns the active callbacks, which is simply a SmallVector of pointer pairs and a couple of POD values (scope, needs inputs/outputs/ids). The biggest cost according to perf is the SmallVector logic; we could consider adopting a hard limit on active callbacks, since more than half a dozen callbacks *running* in a single step would be quite a lot. But the total cost relative to `PYTORCH_DISABLE_PER_OP_PROFILING` is only ~10ns, so it's debatable whether it's worth switching to `std::array`.
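For intuition only, here is a minimal standalone C++ sketch of the core idea: materialize the callbacks that *will* run for an invocation and have the hot path check only for emptiness. The names, data layout, and counter scheme are simplified assumptions, not the actual RecordFunction cache.
```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// A registered callback; sampled callbacks run once every `sampling_period` calls.
struct Callback {
  std::function<void()> fn;
  uint32_t sampling_period;  // 1 = run on every call
  uint32_t countdown;        // decremented per call; the callback fires at zero
};

std::vector<Callback> g_callbacks;  // global registration (illustration only)

// Materialize the callbacks that WILL run for this invocation.
std::vector<const Callback*> activeCallbacks() {
  std::vector<const Callback*> active;
  for (auto& cb : g_callbacks) {
    if (--cb.countdown == 0) {
      cb.countdown = cb.sampling_period;  // re-arm the sampling counter
      active.push_back(&cb);
    }
  }
  return active;
}

void opEntryPoint() {
  auto active = activeCallbacks();
  if (active.empty()) {
    return;  // common eager case: nothing will run, bail out immediately
  }
  for (const Callback* cb : active) {
    cb->fn();
  }
}

int main() {
  // One sampled callback that fires on every 3rd call.
  g_callbacks.push_back({[] { std::puts("sampled callback fired"); }, 3, 3});
  for (int i = 0; i < 9; ++i) {
    opEntryPoint();
  }
}
```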
The primary change is in `record_function.cpp`, which has a more detailed description of the new cache structure. `record_function.h` has some minor changes to align with the new calling convention and the remaining files are simply changes to the call sites.
Future work:
* RecordFunction no longer needs to be lazily initialized.
* We can deprecate the disable/reenable APIs, since we can now safely add and remove global callbacks.
Test Plan:
I tested eager mode performance using the overhead benchmark and found that the non-profiled path was unaffected. However, the no-op observer dropped from 0.41us to 0.37us (0.25us if no observers are active), which is roughly a one-third reduction in the cost of the callback selection machinery.
I also added several C++ unit tests, as the core RecordFunction machinery (especially sampling) was largely untested.
Reviewed By: swolchok, davidberard98
Differential Revision: D35276158
fbshipit-source-id: 35135f444724fba4eb97c0ae7f3f710f0f9016fd
(cherry picked from commit 9e359b87422c18f2a195185f32e7e85c82f956fd)
This PR introduces 3 BC changes:
First, this PR propagates the `BUILD_CAFFE2` flag to `libtorch` and `libtorch_python`, which is necessary for non-Caffe2 ONNX runtimes when using the `ONNX_ATEN_FALLBACK` operator export type.
Second, as a complement of https://github.com/pytorch/pytorch/pull/68490, this PR refactors Caffe2's Aten ops symbolics to consider not only the `operator_export_type` (aka `ONNX_ATEN_FALLBACK`) to emit Caffe2 Aten ops, but also whether `BUILD_CAFFE2` (which is called `torch.onnx._CAFFE2_ATEN_FALLBACK` in python binding) is set.
Lastly, it renames `onnx::ATen` to `aten::ATen` for ONNX spec consistency in a BC fashion.
ONNX doesn't have an `ATen` op in its spec, but the PyTorch ONNX converter emits them. Non-Caffe2 backend engines would be misled by such an operator's name/domain. A non-ideal workaround would be to handle ATen ops based on their name and ignore the (non-compliant) domain. Moreover, users could incorrectly file bugs against either ONNX or ONNX Runtime when they inspect the model and notice the presence of an unspecified ONNX operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73954
Approved by: https://github.com/BowenBao, https://github.com/malfet, https://github.com/garymm, https://github.com/jiafatom
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72995
Add the ability to specify input dimensions that need to be dynamic.
Example: if the dimension of size 115 in the input sizes "1,115;1" can be dynamic, then specify dynamic_dims as "115".
Also recompile and update the CI models and some asm code, as the old ones don't compile with the compiler changes in context.cpp.
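As a rough illustration of the flag format (not the aot_model_compiler code itself), the sketch below splits an `--input_dims`-style string on `;` to separate inputs and on `,` to separate dimensions, and marks dimensions whose value appears in `dynamic_dims` as dynamic; the `split` helper and the flag handling are assumptions for illustration.
```cpp
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a string on a single-character separator.
std::vector<std::string> split(char sep, const std::string& s) {
  std::vector<std::string> parts;
  std::istringstream in(s);
  std::string piece;
  while (std::getline(in, piece, sep)) {
    parts.push_back(piece);
  }
  return parts;
}

int main() {
  // "1,115;1" describes two inputs: one of shape [1, 115] and one of shape [1].
  const std::string input_dims = "1,115;1";
  const std::string dynamic_dims = "115";  // dimension value to treat as dynamic

  for (const std::string& one_input : split(';', input_dims)) {
    for (const std::string& dim : split(',', one_input)) {
      const bool dynamic = dim == dynamic_dims;
      std::printf("%s%s ", dim.c_str(), dynamic ? "(dynamic)" : "");
    }
    std::printf("\n");
  }
}
```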
Test Plan: Compiles and runs the BI Bytedoc model with and without dynamic inputs.
Reviewed By: ZolotukhinM
Differential Revision: D34233121
fbshipit-source-id: 35095e549ebd6d3bec98b9abb3f0764366a0ff6f
(cherry picked from commit 33166a9f9ac9194b5df0a35280b57708df255ebd)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73109
This change updates the Vulkan model runner in `speed_benchmark_torch` to be able to generate inputs for models that have input/output types other than just a single tensor. Input elements are processed depending on their type.
Test Plan: Imported from OSS
Reviewed By: mikaylagawarecki
Differential Revision: D34354839
Pulled By: SS-JIA
fbshipit-source-id: 993e55372d2664fa7eddb16146deba264727f399
(cherry picked from commit 4a140202acb336412676ac090a38d7b93ae49898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72758
Bug: The FLAGS_output_llvm option was recently introduced to specify the LLVM assembly output file. Without the previous default value, the LLVM code is no longer saved to a file when no asmfile input is specified, which makes the compiled output unusable.
Fix: Use a default value if the output_llvm/asmfile input is not specified.
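A minimal sketch of that fallback, assuming the default path is derived from the model path; the helper name is hypothetical and the "<model>.compiled.ll" naming is inferred from the build logs elsewhere in this changelog.
```cpp
#include <cstddef>
#include <string>

// Prefer the user-supplied path; otherwise fall back to a default derived
// from the model path.
std::string llvmOutputPath(const std::string& output_llvm_flag,
                           const std::string& model_path) {
  if (!output_llvm_flag.empty()) {
    return output_llvm_flag;  // explicit --output_llvm / asmfile wins
  }
  const std::size_t dot = model_path.rfind('.');
  const std::string stem =
      dot == std::string::npos ? model_path : model_path.substr(0, dot);
  return stem + ".compiled.ll";  // e.g. "bi.pt" -> "bi.compiled.ll"
}
```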
Test Plan: Verified that output is saved to the default .ll file path
Reviewed By: IvanKobzarev
Differential Revision: D34189107
fbshipit-source-id: ee51e8c17de92d3045690ca871fb9569fc3164d6
(cherry picked from commit 46352d446b3e9d7df0eddf0a29790de6f7757d26)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for (TYPE var = x0; var < x_max; var++)`
to the format
`for (const auto var : irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.
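For readers unfamiliar with `c10::irange`, here is a self-contained sketch of the before/after loop styles using a simplified stand-in range type; the real implementation lives in c10/util/irange.h and also supports an explicit start value.
```cpp
#include <cstddef>
#include <cstdio>

// Simplified stand-in for c10::irange, for illustration only.
struct irange {
  struct iterator {
    std::size_t i;
    std::size_t operator*() const { return i; }
    iterator& operator++() { ++i; return *this; }
    bool operator!=(const iterator& other) const { return i != other.i; }
  };
  std::size_t n;
  explicit irange(std::size_t n) : n(n) {}
  iterator begin() const { return {0}; }
  iterator end() const { return {n}; }
};

int main() {
  const std::size_t x_max = 4;

  // Old style targeted by the upgrader script:
  for (std::size_t var = 0; var < x_max; var++) {
    std::printf("%zu ", var);
  }
  std::printf("\n");

  // New style:
  for (const auto var : irange(x_max)) {
    std::printf("%zu ", var);
  }
  std::printf("\n");
}
```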
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D31705361
fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68751
Add an option to get the input dtype from the user for AOT compilation
Test Plan:
BI model compiles and runs fine
```
(pytorch) ~/fbsource/fbcode/caffe2/fb/nnc
└─ $ buck run //caffe2/binaries:aot_model_compiler -- --model=bi.pt --model_name=pytorch_dev_bytedoc --model_version=v1 '--input_dims=1,115;1' --input_types='int64;int64'
Building... 8.3 sec (99%) 7673/7674 jobs, 0/7674 updated
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1116 14:32:44.632536 1332111 TensorImpl.h:1418] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
E1116 14:32:44.673710 1332111 huge_pages_allocator.cc:287] Not using huge pages because not linked with jemalloc
The compiled llvm assembly code was saved to bi.compiled.ll
The compiled model was saved to bi.compiled.pt
```
An error is thrown when the input_dims and input_types sizes don't match:
```
(pytorch) ~/fbsource/fbcode/caffe2/fb/nnc
└─ $ buck run //caffe2/binaries:aot_model_compiler -- --model=bi.pt --model_name=pytorch_dev_bytedoc --model_version=v1 '--input_dims=1,115;1' --input_types='int64;int64;int64'
.
.
terminate called after throwing an instance of 'c10::Error'
what(): [enforce fail at aot_model_compiler.cc:208] split(';', FLAGS_input_dims).size() == split(';', FLAGS_input_types).size(). Number of input_dims and input_types should be the same
.
.
.
```
Reviewed By: ljk53
Differential Revision: D32477001
fbshipit-source-id: 8977b0b59cf78b3a2fec0c8428f83a16ad8685c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67229
Right now, the assembly code generated for a given method from the model is named `wrapper` or `func` by default. The function name is then replaced with the proper kernel_func_name after target-specific assembly is generated.
This PR propagates the desired kernel_func_name from the aotCompiler API so that the generated function gets the needed name directly and doesn't need to be renamed later.
Note: Most of this change landed in https://github.com/pytorch/pytorch/pull/66337, which had to be reverted because it broke `test_profiler` in `test_jit_fuser_te` by replacing the name generated for the graph with the default kernel_func_name value. This PR fixes that as well.
```
(pytorch) ~/local/pytorch kname
└─ $ python3 test/test_jit_fuser_te.py
CUDA not available, skipping tests
monkeytype is not installed. Skipping tests for Profile-Directed Typing
........................................<string>:3: UserWarning: torch.cholesky is deprecated in favor of torch.linalg.cholesky and will be removed in a future PyTorch release.
L = torch.cholesky(A)
should be replaced with
L = torch.linalg.cholesky(A)
and
.
.
.
......................<string>:3: UserWarning: torch.symeig is deprecated in favor of torch.linalg.eigh and will be removed in a future PyTorch release.
The default behavior has changed from using the upper triangular portion of the matrix by default to using the lower triangular portion.
L, _ = torch.symeig(A, upper=upper)
should be replaced with
L = torch.linalg.eigvalsh(A, UPLO='U' if upper else 'L')
and
L, V = torch.symeig(A, eigenvectors=True)
should be replaced with
L, V = torch.linalg.eigh(A, UPLO='U' if upper else 'L') (Triggered internally at ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2492.)
......[W pybind_utils.cpp:35] Warning: Using sparse tensors in TorchScript is experimental. Many optimization pathways have not been thoroughly tested with sparse tensors. Please include the fact that the network is running sparse tensors in any bug reports submitted. (function operator())
/data/users/priyaramani/pytorch/torch/testing/_internal/common_utils.py:403: UserWarning: Using sparse tensors in TorchScript is experimental. Many optimization pathways have not been thoroughly tested with sparse tensors. Please include the fact that the network is running sparse tensors in any bug reports submitted. (Triggered internally at ../torch/csrc/jit/python/pybind_utils.h:691.)
return callable(*args, **kwargs)
.....................................................................[W Resize.cpp:23] Warning: An output with one or more elements was resized since it had shape [1], which does not match the required output shape [].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function resize_output_check)
[W Resize.cpp:23] Warning: An output with one or more elements was resized since it had shape [1, 5], which does not match the required output shape [5].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function resize_output_check)
........................................................................s.......s...s.s....s......s..sss............................
----------------------------------------------------------------------
Ran 503 tests in 37.536s
OK (skipped=10)
```
Test Plan: Imported from OSS
Reviewed By: navahgar, pbelevich
Differential Revision: D31945713
Pulled By: priyaramani
fbshipit-source-id: f2246946f0fd51afba5cb6186d9743051e3b096b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65967
Graph is an implementation detail. If a user wants access to the
underlying graph, they should explicitly dynamic-cast instead.
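A hedged sketch of what such an explicit downcast can look like; `torch::jit::GraphFunction` and the header path are assumptions about the JIT API rather than something stated in this commit.
```cpp
#include <memory>
#include <torch/csrc/jit/api/function_impl.h>

// Callers that really need the underlying graph opt in explicitly instead of
// relying on a graph() accessor on the base Function interface.
std::shared_ptr<torch::jit::Graph> tryGetGraph(torch::jit::Function& fn) {
  if (auto* graph_fn = dynamic_cast<torch::jit::GraphFunction*>(&fn)) {
    return graph_fn->graph();
  }
  return nullptr;  // not a graph-backed function
}
```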
ghstack-source-id: 141659819
Test Plan: no behavior change.
Reviewed By: gmagogsfm
Differential Revision: D31326153
fbshipit-source-id: a0e984f57c6013494b92a7095bf5bb660035eb84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67198
Fixing a couple of instances where parameters were named method_compile_spec when they were actually compile_specs that could contain multiple method_compile_specs.
Also use the output dtype from the buffer.
Test Plan:
Mobilenetv3 compiles and runs fine
```
(pytorch) ~/fbsource/fbcode/caffe2/fb/nnc
└─ $ PYTORCH_JIT_LOG_LEVEL="aot_compiler" buck run //caffe2/binaries:aot_model_compiler -- --model mobilenetv3.pt --model_name=pytorch_dev_mobilenetv3 --model_version=v1 --input_dims="1,3,224,224"
Downloaded 4501/6195 artifacts, 433.89 Mbytes, 14.3% cache miss (for updated rules)
Building: finished in 06:34.6 min (100%) 20233/20233 jobs, 5467/20233 updated
Total time: 06:35.0 min
BUILD SUCCEEDED
The compiled llvm assembly code was saved to mobilenetv3.compiled.ll
The compiled model was saved to mobilenetv3.compiled.pt
└─ $ ./compile_model.sh -m pytorch_dev_mobilenetv3 -p /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/mobilenetv3.pt -v v1 -i "1,3,224,224"
+ VERSION=v1
+ getopts m:p:v:i:h opt
+ case $opt in
+ MODEL=pytorch_dev_mobilenetv3
.
.
Columns 961 to 970
1e-11 *
-4.2304 -3.9674 2.4473 -0.8664 -0.7513 1.2140 0.0010 3.8675 1.2714 2.2989
Columns 971 to 980
1e-11 *
-2.7203 1.6772 -0.7460 -0.6936 4.4421 -0.9865 -0.5186 -1.4441 1.3047 -1.6112
Columns 981 to 990
1e-11 *
0.1275 -1.8815 2.5105 -0.4871 -2.2342 0.8520 0.8658 1.6180 3.8901 -0.2454
Columns 991 to 1000
1e-11 *
-1.4896 4.1337 -2.6640 0.8226 0.2441 -1.4830 -1.7430 1.8758 0.5481 0.5093
[ CPUFloatType{1,1000} ]
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Milliseconds per iter: 276.255. Iters per second: 3.61984
Memory usage before main runs: 104366080 bytes
Memory usage after main runs: 343441408 bytes
Average memory increase per iter: 2.39075e+07 bytes
0 value means "not available" in above
```
Reviewed By: ljk53
Differential Revision: D31698338
fbshipit-source-id: da6c74c1321ec02e0652f3afe6f97bf789d3361b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66337
Right now, the assembly code generated for a given method from the model is named `wrapper` or `func` by default. The function name is then replaced with the proper kernel_func_name after target-specific assembly is generated.
This PR propagates the desired kernel_func_name from the aotCompiler API so that the generated function gets the needed name directly and doesn't need to be renamed later.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31514095
Pulled By: priyaramani
fbshipit-source-id: b70c8e2c733600a435cd4e8b32092d37b7bf7de5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for (TYPE var = x0; var < x_max; var++)`
to the format
`for (const auto var : irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66123
Some models may take a list of tensors as inputs, so the bundled inputs will contain `IValue`s of type `c10::List`. For Vulkan models, every tensor in the `IValue` list has to be converted to a Vulkan tensor first, and this case is not currently handled by the Vulkan model wrapper in the benchmark binary.
This diff introduces `IValue` type checking to the input processor of the Vulkan model wrapper, and adds support for Tensor and List types.
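A rough sketch of the kind of per-type handling described here, assuming the libtorch C++ API; the function name is hypothetical, and the real logic lives in the benchmark binary's Vulkan model wrapper.
```cpp
#include <torch/script.h>

// Tensors are moved to the Vulkan backend directly; tensor lists are converted
// element by element. Other IValue kinds pass through unchanged in this sketch.
c10::IValue toVulkanInput(const c10::IValue& input) {
  if (input.isTensor()) {
    return input.toTensor().vulkan();
  }
  if (input.isTensorList()) {
    c10::List<at::Tensor> converted;
    for (at::Tensor t : input.toTensorList()) {
      converted.push_back(t.vulkan());
    }
    return converted;
  }
  return input;
}
```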
Test Plan:
```
# Build the binary
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:ptmobile_compareAndroid\#android-arm64 --show-output
# Push it to the device
adb push buck-out/gen/xplat/caffe2/ptmobile_compareAndroid\#android-arm64 /data/local/tmp/compare_models
# Run the benchmark binary
BENCH_CMD="/data/local/tmp/compare_models"
BENCH_CMD+=" --model=$PATH_TO_MODEL"
BENCH_CMD+=" --refmodel=$PATH_TO_REFERENCE_MODEL"
BENCH_CMD+=" --input_type=float --input_dims=$MODEL_INPUT_SIZE"
BENCH_CMD+=" --iter=100"
BENCH_CMD+=" --tolerance 1e-5"
```
Reviewed By: beback4u
Differential Revision: D31276862
fbshipit-source-id: 1d9abf958963da6ecad641202f0458402bee5ced
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65707
Refactoring aotCompile to return a pair of the compiled function and the LLVM assembly, instead of updating an incoming string with the assembly code.
Testing: Gives expected results when compiled and run
```
(pytorch) ~/local/pytorch refactor_aot
└─ $ build/bin/aot_model_compiler --model mobilenetv3.pt --model_name=pytorch_dev_mobilenetv3 --model_version=v1 --input_dims="2,2,2"
The compiled model was saved to mobilenetv3.compiled.pt
```
Test Plan: Imported from OSS
Reviewed By: qihqi
Differential Revision: D31220452
Pulled By: priyaramani
fbshipit-source-id: f957c53ba83f876a2e7dbdd4b4571a760b3b6a9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65171
Add the model comparison binary to BUCK, and also add some quality of life features such as controlling the input range.
Test Plan:
```
# Build the binary
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:ptmobile_compareAndroid\#android-arm64 --show-ou
# Push it to the device
adb push buck-out/gen/xplat/caffe2/ptmobile_compareAndroid\#android-arm64 /data/local/tmp/compare_models
# Run the benchmark binary
BENCH_CMD="/data/local/tmp/compare_models"
BENCH_CMD+=" --model=$PATH_TO_MODEL"
BENCH_CMD+=" --refmodel=$PATH_TO_REFERENCE_MODEL"
BENCH_CMD+=" --input_type=float --input_dims=$MODEL_INPUT_SIZE"
BENCH_CMD+=" --iter=100"
BENCH_CMD+=" --tolerance 1e-5"
```
Reviewed By: beback4u
Differential Revision: D30371322
fbshipit-source-id: 5e520aaf119c90985a1d5a135f76e4057148333b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60919
Update make_mnist_db.cc and make_image_db.cc to work with the DB API changes
in D29204425 (00896cb9ed). This is similar to the changes to make_cifar_db.cc landed in
D29374754 (394f60b0fc).
ghstack-source-id: 132621346
Test Plan: buck build caffe2/binaries/...
Reviewed By: valmikir
Differential Revision: D29447314
fbshipit-source-id: 33aff85c24d8b785211287de23d46704c7eb0726
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60692
Update make_cifar_db.cc to work with the DB API changes in D29204425 (00896cb9ed).
Test Plan: buck build caffe2/binaries:make_cifar_db
Differential Revision: D29374754
fbshipit-source-id: 23d2acd24031d11071791e398433b537215ffd38
Summary:
Some machines don't have a versionless `python` on their PATH, which breaks these existing shebangs.
I'm assuming that all the existing versionless `python` shebangs are meant to be `python3` and not `python2`; please let me know if my assumption was incorrect for any of these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58275
Test Plan: CI.
Reviewed By: zhouzhuojie
Differential Revision: D28428143
Pulled By: samestep
fbshipit-source-id: 6562be3d12924db72a92a0207b060ef740f61ebf