Summary:
fix Semmle warning: Comparison of narrow type with wide type in loop condition
For example, consider the following piece of code:
for (int i=0; i<array.size(); ++i) {}
The problem is that `array.size()` returns `size_t`, which can be a wider type than `int` depending on the implementation, so there is a chance that `i` overflows (for a very large array whose size is beyond the range of `int`) and the loop never terminates.
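The usual fix, sketched below with a hypothetical helper function, is to make the loop counter at least as wide as the `size_t` returned by `size()`, so the comparison never truncates and the counter cannot wrap before reaching the bound:

```cpp
#include <cstddef>
#include <vector>

// Counting with std::size_t matches the width of array.size(), so the
// narrow-vs-wide comparison (and the resulting infinite loop) cannot occur.
int sum_elements(const std::vector<int>& array) {
  int total = 0;
  for (std::size_t i = 0; i < array.size(); ++i) {
    total += array[i];
  }
  return total;
}
```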
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53951
Reviewed By: zou3519
Differential Revision: D27181495
Pulled By: malfet
fbshipit-source-id: 0612c5cedcdc656c193085e7fbb87dd163f20688
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53404
This refactors `TensorSerializer::Serialize()` so that we have a separate
helper function for each data type.
This should make it slightly easier in the future to add new serialization
formats for specific data types.
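As a rough sketch (all helper names here are invented, not the actual Caffe2 ones), the refactored shape amounts to a thin dispatcher over per-type helpers, so a new serialization format for one type only touches that type's helper:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Invented stand-ins for the real data type enum and helpers.
enum class DataType { kFloat, kInt32 };

std::string SerializeFloatData(std::size_t num_elements) {
  return "float[" + std::to_string(num_elements) + "]";
}

std::string SerializeInt32Data(std::size_t num_elements) {
  return "int32[" + std::to_string(num_elements) + "]";
}

// Serialize() itself is reduced to type dispatch.
std::string Serialize(DataType type, std::size_t num_elements) {
  switch (type) {
    case DataType::kFloat:
      return SerializeFloatData(num_elements);
    case DataType::kInt32:
      return SerializeInt32Data(num_elements);
  }
  throw std::runtime_error("unhandled data type");
}
```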
ghstack-source-id: 124085413
Test Plan:
Confirmed the existing tests pass. This diff is not expected to have any
behavior changes.
Reviewed By: mraway, glamtechie
Differential Revision: D26658204
fbshipit-source-id: 232776262db6486ba845a7ba223e3987053dac27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53403
This updates the `TensorProto` field to independently track the data type of
the in-memory (deserialized) data from the serialized data format.
This will allow us to support multiple different serialization formats in the
future. For instance, we could choose to perform quantization of floating
point data types, or varint encoding for integer fields.
For now this diff does not actually change the serialization code path yet,
and does not introduce any new serialization formats, but only refactors the
deserialization code path to make it easier to introduce new formats.
I'm not really that thrilled with the heavy use of macros and templates here,
but I didn't really see better alternatives that made it as simple to specify
new deserialization function implementations.
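For illustration only (this is not the macro/template machinery the diff actually uses), one way to organize deserializers keyed by both the in-memory data type and the serialized format is a small registry:

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Invented enums standing in for the proto's data type and storage format.
enum class DataType { kFloat, kInt32 };
enum class Format { kRaw, kVarint };

using Deserializer = std::function<std::string()>;

// Deserializers are looked up per (in-memory type, serialized format) pair,
// so adding a new on-disk format does not touch existing entries.
std::map<std::pair<DataType, Format>, Deserializer>& Registry() {
  static std::map<std::pair<DataType, Format>, Deserializer> registry;
  return registry;
}

std::string Deserialize(DataType type, Format format) {
  auto it = Registry().find({type, format});
  if (it == Registry().end()) {
    throw std::runtime_error("no deserializer for this type/format pair");
  }
  return it->second();
}
```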
ghstack-source-id: 123594220
Test Plan:
Confirmed that the existing unit tests pass. This diff only touches the
deserialization code path and not the serialization code to help ensure that
the deserialization code works with the existing serialization logic, and that
there are no changes to the current serialization format.
Reviewed By: mraway
Differential Revision: D26658206
fbshipit-source-id: d7297d600aee28b92fd9f4ece437b7f519060942
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53754
Some of the PyTorch CircleCI builds still use gcc 5.4 and compile with
`-Werror=attributes`, causing this old compiler to fail because it does not
understand the `[[nodiscard]]` attribute.
Let's define a `CAFFE2_NODISCARD` macro to work around this.
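A minimal sketch of such a macro, assuming the standard `__has_cpp_attribute` probe (the actual Caffe2 definition may differ): the attribute is only emitted when the compiler advertises support for it, and expands to nothing otherwise.

```cpp
// Only use [[nodiscard]] when the compiler understands it; on older
// compilers like gcc 5.4 the macro expands to nothing.
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(nodiscard)
#define CAFFE2_NODISCARD [[nodiscard]]
#endif
#endif
#ifndef CAFFE2_NODISCARD
#define CAFFE2_NODISCARD
#endif

CAFFE2_NODISCARD int answer() {
  return 42;
}
```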
ghstack-source-id: 123594084
Test Plan: I'm using this macro in subsequent diffs in the stack.
Reviewed By: mraway
Differential Revision: D26959584
fbshipit-source-id: c7ba94f7ea944b6340e9fe20949ba41931e11d41
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53402
Add an `options` field to the `Save` operator which accepts options for how to
serialize different blobs. At the moment this simply allows controlling the
existing `chunk_size` behavior, but in the future we can add other options,
such as the ability to control compression settings or other serialization
formats.
ghstack-source-id: 123567034
Test Plan:
Added a new test to `load_save_test.py` that passes in options and verifies
that blobs were serialized with the expected number of chunks.
buck test caffe2/caffe2:caffe2_test_cpu \
caffe2/caffe2/core:serialization_test \
caffe2/caffe2/python/operator_test:load_save_test
Reviewed By: mraway
Differential Revision: D26502577
fbshipit-source-id: 6e302e530bb96990517c2e35c505db7f14a56284
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53400
This is a reland of D26617038 (b4a8d98247) after rebasing onto D26802576 (f595ba1bae).
Optimize the blob serialization code by using `AddNAlreadyReserved()` when
serializing tensor data, rather than making N separate `Add()` calls.
`AddNAlreadyReserved()` is a simple addition operation, while each `Add()`
call checks to see if it needs to reserve new space, and then updates the
element data, which is unnecessary in this case.
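The effect can be illustrated with `std::vector`, which has the same reserve-once-then-write versus grow-per-element trade-off as protobuf's repeated fields (this is an analogy, not the actual serializer code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One up-front size change, then plain writes -- analogous to
// AddNAlreadyReserved() after reserving space.
std::vector<float> copy_fast(const float* src, std::size_t n) {
  std::vector<float> out;
  out.resize(n);
  std::copy(src, src + n, out.begin());
  return out;
}

// N growth-checking appends -- analogous to N separate Add() calls,
// each of which re-checks capacity before writing.
std::vector<float> copy_slow(const float* src, std::size_t n) {
  std::vector<float> out;
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back(src[i]);
  }
  return out;
}
```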
ghstack-source-id: 123567030
Test Plan:
This appears to improve raw serialization performance by 30 to 35% for float,
double, and int64_t types which use this function. This improvement appears
relatively consistent across large and small tensor sizes.
Reviewed By: mraway
Differential Revision: D26853941
fbshipit-source-id: 4ccaa5bc1dd7f7864068d71a0cde210c699cbdba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53389
Resize was written to take arguments by value, which was
totally fine if they were ArrayRef or a series of integers, but not so
fine if they're std::vector.
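A toy illustration of the cost difference (types invented for the example; the point is that by-value parameters copy a `std::vector` at every call site, while cheap types like `ArrayRef` or plain integers are fine by value):

```cpp
#include <utility>
#include <vector>

struct TensorByValue {
  std::vector<long> dims_;
  // The caller's vector is copied at the call site before Resize even runs.
  void Resize(std::vector<long> dims) { dims_ = std::move(dims); }
};

struct TensorByRef {
  std::vector<long> dims_;
  // No copy at the call site; the only copy is the assignment itself.
  void Resize(const std::vector<long>& dims) { dims_ = dims; }
};
```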
ghstack-source-id: 123212128
Test Plan:
Existing CI should make sure it builds
Inspected assembly for ios_caffe.cc and saw no more vector copy before
calling Resize
Reviewed By: smessmer
Differential Revision: D26852105
fbshipit-source-id: 9c3b9549d50d32923b532bbc60d0246e2c2b5fc7
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53319
Noticed these in profiles.
Also switch to `unordered_map`.
Test Plan: Unit tests.
Reviewed By: swolchok
Differential Revision: D26504408
fbshipit-source-id: 9e14d55909a4af019058b8c27c67ee2348cd02a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52903
Implement BlackBoxPredictor::BenchmarkIndividualOps so that we can clean up the output tensors properly after each iteration and get more accurate per operator timing.
Add four more metrics to track setup_time, memory_alloc_time, memory_dealloc_time, and output_dealloc_time.
Reviewed By: ajyu
Differential Revision: D26657473
fbshipit-source-id: 1cf282192b531513b9ee40b37252087818412f81
Summary:
Optimize the blob serialization code by using `AddNAlreadyReserved()` when
serializing tensor data, rather than making N separate `Add()` calls.
`AddNAlreadyReserved()` is a simple addition operation, while each `Add()`
call checks to see if it needs to reserve new space, and then updates the
element data, which is unnecessary in this case.
Test Plan:
This appears to improve raw serialization performance by 30 to 35% for float,
double, and int64_t types which use this function. This improvement appears
relatively consistent across large and small tensor sizes.
Differential Revision: D26617038
fbshipit-source-id: 97dedbae889d35463628f3016ac56986e685289e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52411
The `TensorDeserializer` code previously did not correctly handle unknown
`data_type` values. It attempted to deserialize the data as floats, rather
than recognizing that it did not understand the data type and erroring out.
Google protobuf will never return unknown values for enum fields. If an
unknown value is found in serialized data, the protobuf code discards it.
As a result `has_data_type()` will return false, but `get_data_type()` will
simply return the default value, which happens to be set to `FLOAT`. As a
result if we ever encounter a serialized blob with an unknown data type the
previous code would incorrectly think the data type was `FLOAT`.
This fixes the code to check if the `data_type` value is present before
reading it.
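The gist of the fix, sketched against a stand-in for the generated protobuf class (names invented): because protobuf discards unknown enum values and the getter then silently returns the `FLOAT` default, the presence bit must be checked before the value is trusted.

```cpp
#include <stdexcept>

enum class TensorDataType { FLOAT, INT32 };

// Stand-in for the generated protobuf message: a presence flag plus a
// value field whose default happens to be FLOAT.
struct TensorProtoLike {
  bool has_data_type = false;
  TensorDataType data_type = TensorDataType::FLOAT;
};

// Error out on a missing/unknown data_type instead of silently
// treating the blob as FLOAT.
TensorDataType CheckedDataType(const TensorProtoLike& proto) {
  if (!proto.has_data_type) {
    throw std::runtime_error("TensorProto has no known data_type");
  }
  return proto.data_type;
}
```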
ghstack-source-id: 121915981
Test Plan:
Included a unit test that verifies this behavior. Confirmed that without this
fix the code proceeded with the float deserialization code path. When
deserializing int32_t data it fortunately did fail later due to an unexpected
field length check, but this isn't guaranteed to be the case. In some cases
it potentially could incorrectly succeed and return wrong data.
Reviewed By: mraway
Differential Revision: D26375502
fbshipit-source-id: 4f84dd82902e18df5e693f4b28d1096c96de7916
Summary:
Sub-step of my attempt to split up the torch_cuda library, as it is huge. Please look at https://github.com/pytorch/pytorch/issues/49050 for details on the split and which files are in which target.
This PR introduces two new macros for Windows DLL purposes, TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API. Both are defined as TORCH_CUDA_API for the time being.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50627
Reviewed By: mruberry
Differential Revision: D25955441
Pulled By: janeyx99
fbshipit-source-id: ff226026833b8fb2fb7c77df6f2d6c824f006869
Summary:
Since caffe2 and torch have been consolidated, CAFFE2_API should be merged with TORCH_API. Addresses a TODO.
Manually edited some references of the removed `CAFFE2_API`:
* `CONTRIBUTING.md`
* `caffe2/proto/CMakeLists.txt`
* `cmake/ProtoBuf.cmake`
* `c10/macros/Export.h`
* `torch/csrc/WindowsTorchApiMacro.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49496
Reviewed By: malfet, samestep
Differential Revision: D25600726
Pulled By: janeyx99
fbshipit-source-id: 7e068d959e397ac183c097d7e9a9afeca5ddd782
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49146
Add support for Storage arguments to IValue and the JIT typing system, and make ops that were blocked on that c10-full.
ghstack-source-id: 118710665
(Note: this ignores all push blocking failures!)
Test Plan: waitforsandcastle
Reviewed By: ezyang
Differential Revision: D25456799
fbshipit-source-id: da14f125af352de5fcf05a83a69ad5a69d5a3b45
Summary:
Many newly added build settings are not saved in `torch.__config__`. This adds them to the mix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48380
Reviewed By: samestep
Differential Revision: D25161951
Pulled By: walterddr
fbshipit-source-id: 1d3dee033c93f2d1a7e2a6bcaf88aedafeac8d31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48308
The original regex that I added didn't correctly match namespaces that started with an underscore (e.g. `_test`), which caused a master-only test to fail.
The only change from the previous commit is that I updated the regex like so:
before: `^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$`
after: `^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$`
I added in a `[_]*` to the beginning of the namespace capture. I did the same for the `_FRAGMENT` regex.
Verified that running `ANALYZE_TEST=1 tools/code_analyzer/build.sh` (as the master-only test does) produces no diff in the output.
Fixing regex pattern to allow for underscores at the beginning of the
namespace
This reverts commit 3c936ecd3c68f395dad01f42935f20ed8068da02.
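The updated pattern can be exercised directly with `std::regex` (the `matches` helper is just for illustration):

```cpp
#include <regex>
#include <string>

// The `[_]*` prefix on the namespace capture lets names like `_test`
// match, which the original `[^_]+` alone rejected.
bool matches(const std::string& symbol) {
  static const std::regex pattern(
      R"(^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$)");
  return std::regex_match(symbol, pattern);
}
```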
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D25123295
Pulled By: bdhirsh
fbshipit-source-id: 54bd1e3f0c8e28145e736142ad62a18806bb9672
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48185
In a scenario where we have Caffe2 wrapped into a dynamic library, we were running into a memory corruption crash at program termination:
"corrupted size vs. prev_size in fastbins"
It turns out the crash occurs in glog's logging.cc, which is not thread-safe and has to initialize a static hostname string when flushing. If this happens on multiple threads simultaneously, it can lead to memory corruption.
```
==1533667== Invalid free() / delete / delete[] / realloc()
==1533667== at 0xA3976BB: operator delete(void*, unsigned long) (vg_replace_malloc.c:595)
==1533667== by 0x37E36AE: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (basic_string.h:647)
==1533667== by 0xAD87F6B: __run_exit_handlers (in /usr/lib64/libc-2.28.so)
==1533667== by 0xAD8809F: exit (in /usr/lib64/libc-2.28.so)
==1533667== by 0xAD71799: (below main) (in /usr/lib64/libc-2.28.so)
==1533667== Address 0x165cd720 is 0 bytes inside a block of size 31 free'd
==1533667== at 0xA3976BB: operator delete(void*, unsigned long) (vg_replace_malloc.c:595)
==1533667== by 0x37E36AE: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (basic_string.h:647)
==1533667== by 0xAD87F6B: __run_exit_handlers (in /usr/lib64/libc-2.28.so)
==1533667== by 0xAD8809F: exit (in /usr/lib64/libc-2.28.so)
==1533667== by 0xAD71799: (below main) (in /usr/lib64/libc-2.28.so)
==1533667== Block was alloc'd at
==1533667== at 0xA39641F: operator new(unsigned long) (vg_replace_malloc.c:344)
==1533667== by 0x37F4E18: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_mutate(unsigned long, unsigned long, char const*, unsigned long) (basic_string.tcc:317)
==1533667== by 0x37F4F2E: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long) (basic_string.tcc:466)
==1533667== by 0x5170344: GetHostName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (logging.cc:227)
==1533667== by 0x51702D4: google::LogDestination::hostname[abi:cxx11]() (logging.cc:555)
==1533667== by 0x5173789: google::(anonymous namespace)::LogFileObject::Write(bool, long, char const*, int) (logging.cc:1072)
==1533667== by 0x51746DF: google::LogDestination::LogToAllLogfiles(int, long, char const*, unsigned long) (logging.cc:773)
==1533667== by 0x5170BDC: google::LogMessage::SendToLog() (logging.cc:1386)
==1533667== by 0x5171236: google::LogMessage::Flush() (logging.cc:1305)
==1533667== by 0x517114D: google::LogMessage::~LogMessage() (logging.cc:1264)
==1533667== by 0x108DC840: caffe2::ReinitializeTensor(caffe2::Tensor*, c10::ArrayRef<long>, c10::TensorOptions) (tensor.cc:0)
==1533667== by 0x103BBED0: caffe2::int8::Int8GivenTensorFillOp::RunOnDevice() (int8_given_tensor_fill_op.h:29)
==1533667==
```
There doesn't seem to be an obvious easy solution here. The logging API used by c10 is fundamentally not thread-safe, at least when it uses glog. Glog does have a thread-safe API (raw_logging), but this doesn't seem to be used by c10 right now. I suspect other callers are not running into this crash because:
- They have other libraries using glog in their module, so the static variable in glog gets initialized before getting into a race condition
- They don't use an int8 network in a glog context, thus avoiding this problematic log statement
An alternative fix would be to correctly initialize the dtype of the int8 tensor, which is currently always uninitialized, making the log statement always trigger for int8 networks. Initializing the int8 tensor correctly in tensor_int8.h is proving to be challenging, though, at least without deeper knowledge of Caffe2's codebase. And even then, it wouldn't fix the issue for all use cases.
Test Plan: Ran my app with valgrind; I no longer get the crash, and valgrind no longer reports memory corruption
Reviewed By: thyu, qizzzh
Differential Revision: D25040725
fbshipit-source-id: 1392a97ccf9b4c9ade1ea713610ee44a1578ae7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48021
This extends the operator schema check from simple memonger to dag memonger as well. As part of this, a fix is made to handle in-place ops (ops where at least one output name is the same as an input blob). Previously all output blobs from ops were treated as shareable, but this failed the assertion that external input blobs with the same name are not allowed to share.
Test Plan: Added corresponding unit tests
Reviewed By: hlu1
Differential Revision: D24968862
fbshipit-source-id: b6679a388a82b0d68f65ade64b85560354aaa3ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47613
This is to test some more cancellation edge cases that were missing before. It passes under the current code.
Test Plan: buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 10
Reviewed By: dahsh
Differential Revision: D24836956
fbshipit-source-id: 3b00dc081cbf4f26e7756d597099636edb49d256
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47718
Distributed Inference splits a predict net into multiple parts, part0 being the main part which contains ops to make remote calls to other parts. The part0 predict net may contain AsyncIf ops to optimize rpc call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops by updating their internal nets to refer to memongered blobs.
As part of this change, I am also updating dag memonger traversal to always start from root ops, i.e. ops with 0 in-degree. The earlier logic started traversing ops based on input head blobs, and if one of the head inputs is used in a non-root op which gets visited before its parent, the traversal throws an assertion error here: https://fburl.com/diffusion/ob110s9z . For almost all distributed inference part0 nets, it was throwing this assertion error.
Test Plan: Added corresponding tests in memonger_test.py . Could not find unit tests in c++ version of memonger.
Reviewed By: hlu1
Differential Revision: D24872010
fbshipit-source-id: 1dc99b2fb52b2bc692fa4fc0aff6b7e4c5e4f5b0
Summary:
Distributed Inference splits a predict net into multiple parts, part0 being the main part which contains ops to make remote calls to other parts. The part0 predict net may contain AsyncIf ops to optimize rpc call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops by updating their internal nets to refer to memongered blobs. Here is one reference part0 predict net with AsyncIf ops: https://www.internalfb.com/intern/paste/P145812115/
As part of this change, I am also updating dag memonger traversal to always start from root ops, i.e. ops with 0 in-degree. The earlier logic started traversing ops based on input head blobs, and if one of the head inputs is used in a non-root op which gets visited before its parent, the traversal throws an assertion error here: https://fburl.com/diffusion/ob110s9z . For almost all distributed inference part0 nets, it was throwing this assertion error.
Reviewed By: hlu1
Differential Revision: D24346771
fbshipit-source-id: ad2dd2e63f3e822ad172682f6d63f8474492255d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46560
Follow-up for D24236604 (16c52d918b).
For nets that pass the schema check, memonger actually makes sure to preserve the inplaceness of operators if they are already inplace. So we can safely enable it for correct input nets.
(Note: this ignores all push blocking failures!)
Differential Revision: D24402482
fbshipit-source-id: a7e95cb0e3eb87adeac79b9b69eef207957b0bd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43987
This replaces the caffe2 CPU random number generator (std::mt19937) with at::mt19937, which is the one currently used in PyTorch. The ATen RNG is 10x faster than the std one and appears to be more robust, given bugs in the std implementation (https://fburl.com/diffusion/uhro7lqb).
For large embedding tables (10GB+) we see UniformFillOp taking upwards of 10 minutes, as we're bottlenecked on the single-threaded RNG. Swapping to at::mt19937 cuts that time to 10% of the current time.
Test Plan: Ran all relevant tests + CI. This doesn't introduce new features (+ is a core change) so existing tests+CI should be sufficient to catch regressions.
Reviewed By: dzhulgakov
Differential Revision: D23219710
fbshipit-source-id: bd16ed6415b2933e047bcb283a013d47fb395814
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46424
Currently, if an exception occurs in a reporter thread, the process is killed via std::terminate. This adds support for handling the reporter exception if FLAGS_caffe2_handle_executor_threads_exceptions is set to true.
Test Plan: buck test mode/opt -c python.package_style=inplace //caffe2/caffe2/python:hypothesis_test //caffe2/caffe2:caffe2_test_cpu -- --stress-runs 100
Reviewed By: dahsh
Differential Revision: D24345027
fbshipit-source-id: 0659495c9e27680ebae41fe5a3cf26ce2f455cb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46110
## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145).
* We need a test to cover and exhibit that we can cancel a stuck net and propagate errors with the plan executor.
## Summary
* Added PlanExecutorTest `ErrorPlanWithCancellableStuckNet` for plan executor.
* Set cancelCount to zero at the beginning of tests to avoid global state being carried over in some test environments.
Test Plan:
## Unit Test Added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 1000
```
Reviewed By: d4l3k
Differential Revision: D24226577
fbshipit-source-id: c834383bfe6ab50747975c229eb42a363eed3458
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46080
Temporary removal of ErrorPlanWithCancellableStuckNet; will fill out more later.
Test Plan:
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
```
remove a test
Reviewed By: fegin
Differential Revision: D24213971
fbshipit-source-id: e6e600bad00b45c726311193b4b3238f1700526e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45319
## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145)
* We need a test to cover and exhibit that we can cancel a stuck net and propagate errors with the plan executor.
## Summary
* Added `ErrorPlanWithCancellableStuckNet` for plan executor.
* We set up a plan with two nets: one stuck net with a blocking operator that never returns, and one error
net with an error op that throws, and tested that it throws and cancels.
Test Plan:
## Unit Test added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
```
```
Summary
Pass: 400
ListingSuccess: 2
```
Reviewed By: d4l3k
Differential Revision: D23920548
fbshipit-source-id: feff41f73698bd6ea9b744f920e0fece4ee44438
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45981
This is a recommit of previously reverted D20850851 (3fbddb92b1).
TL;DR - combining condition_variables and atomics is a bad idea
https://stackoverflow.com/questions/49622713/c17-atomics-and-condition-variable-deadlock
This also adds some ifdefs to disable the death test for mobile, xplat and tsan builds since forking doesn't play nicely with them.
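The safe shape of the pattern, for reference: write the flag under the same mutex the waiter uses, rather than relying on a bare atomic plus `notify`, so the store cannot slip between the waiter's check and its sleep.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
bool done = false;  // plain bool guarded by mu, not a lone std::atomic

void signal_done() {
  {
    std::lock_guard<std::mutex> lock(mu);
    done = true;  // written while holding the waiter's mutex
  }
  cv.notify_one();
}

void wait_done() {
  std::unique_lock<std::mutex> lock(mu);
  cv.wait(lock, [] { return done; });  // predicate re-checked under the lock
}
```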
Test Plan:
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 1000 test_atomic_iter_with_concurrent_steps --timeout 120
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 100
buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
no timeouts https://www.internalfb.com/intern/testinfra/testconsole/testrun/7036874440059883/
will ensure no timeouts in OSS
Reviewed By: walterddr, dahsh
Differential Revision: D24165505
fbshipit-source-id: 17cd23bfbcd9c2826a4067a387023d5186353196
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45297
If we have two concurrent substeps and one of them throws an exception while the other is blocking, we'll currently hang. This waits up to 1 minute for the blocking substep to complete before terminating the process.
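A sketch of the bounded wait (the helper and names are illustrative, and the real code uses a 1-minute deadline): `wait_for` with a predicate returns false if the deadline expires first, at which point the caller can give up.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex step_mu;
std::condition_variable step_cv;
bool finished = false;

// Returns true if the substep finished before the deadline,
// false if the wait timed out (the caller then terminates).
bool wait_for_substep(std::chrono::milliseconds timeout) {
  std::unique_lock<std::mutex> lock(step_mu);
  return step_cv.wait_for(lock, timeout, [] { return finished; });
}
```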
Test Plan: buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
Reviewed By: dahsh
Differential Revision: D20850851
fbshipit-source-id: 330503775d8062a34645ba55fe38e6770de5e3c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062
Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step: calling into a BackendSelect kernel required taking the individual scattered arguments, packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.
Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or re-wrapping is needed for them.
ghstack-source-id: 112825789
Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/
vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/
Reviewed By: ezyang
Differential Revision: D23484192
fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
Summary:
There is a tool called `2to3` whose `future` fixer specifically removes these redundant `__future__` imports; the `caffe2` directory has the most of them:
```2to3 -f future -w caffe2```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38