1480 Commits

Author SHA1 Message Date
92770d25cd fix comparison of narrow type with wide type in loop condition (#53951)
Summary:
fix Semmle warning: Comparison of narrow type with wide type in loop condition

For example, consider the following piece of code:

for (int i=0; i<array.size(); ++i) {}

The problem is that array.size() returns size_t, which can be a wider type than int depending on the implementation, so there is a chance that i overflows (for a very large array whose size is beyond the range of int) and the loop never terminates.
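A minimal standalone sketch of the usual fix for this class of warning (not necessarily the exact change made in this PR) is to give the loop index the container's own width:

```
#include <cstdio>
#include <vector>

int main() {
  std::vector<int> array(10, 1);
  // size_t matches the type returned by array.size(), so the comparison is no
  // longer narrow-vs-wide and the index cannot wrap around before the end.
  for (size_t i = 0; i < array.size(); ++i) {
    std::printf("%d ", array[i]);
  }
  std::printf("\n");
  return 0;
}
```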

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53951

Reviewed By: zou3519

Differential Revision: D27181495

Pulled By: malfet

fbshipit-source-id: 0612c5cedcdc656c193085e7fbb87dd163f20688
2021-03-22 16:40:35 -07:00
ccdcfba5de [caffe2] Refactor tensor serialization function (#53404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53404

This refactors `TensorSerializer::Serialize()` so that we have a separate
helper function for each data type.

This should make it slightly easier in the future to add new serialization
formats for specific data types.
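A standalone sketch of the "one helper per data type" structure described here; the type names and helper functions below are illustrative, not caffe2's actual serialization API:

```
#include <cstdint>
#include <stdexcept>
#include <vector>

enum class DataType { kFloat, kInt64 };

struct TensorBlob {
  DataType type;
  std::vector<float> float_data;
  std::vector<int64_t> int64_data;
};

// One small, focused helper per data type...
static void SerializeFloat(const TensorBlob& t, std::vector<uint8_t>* out) {
  const auto* p = reinterpret_cast<const uint8_t*>(t.float_data.data());
  out->insert(out->end(), p, p + t.float_data.size() * sizeof(float));
}

static void SerializeInt64(const TensorBlob& t, std::vector<uint8_t>* out) {
  const auto* p = reinterpret_cast<const uint8_t*>(t.int64_data.data());
  out->insert(out->end(), p, p + t.int64_data.size() * sizeof(int64_t));
}

// ...and a single dispatcher, so adding a new format only touches one helper.
void SerializeTensor(const TensorBlob& t, std::vector<uint8_t>* out) {
  switch (t.type) {
    case DataType::kFloat:
      SerializeFloat(t, out);
      break;
    case DataType::kInt64:
      SerializeInt64(t, out);
      break;
    default:
      throw std::runtime_error("unsupported data type");
  }
}
```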
ghstack-source-id: 124085413

Test Plan:
Confirmed the existing tests pass.  This diff is not expected to have any
behavior changes.

Reviewed By: mraway, glamtechie

Differential Revision: D26658204

fbshipit-source-id: 232776262db6486ba845a7ba223e3987053dac27
2021-03-17 12:36:31 -07:00
8a5b946ff6 [caffe2] Don't call TensorImpl::size() in dim32() (#53852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53852

dim32() requires that its argument is in range, so we can use the faster `TensorImpl::sizes()` call instead.
ghstack-source-id: 123784862

Test Plan:
Ran MergeNet AdIndexer benchmark under perf stat.

Before:

```
 Performance counter stats for 'scripts/bwasti/static_runtime/run.sh' (5 runs):

          7,008.70 msec task-clock                #    0.997 CPUs utilized            ( +-  0.25% )
             4,203      context-switches          #    0.600 K/sec                    ( +- 14.71% )
                 3      cpu-migrations            #    0.000 K/sec
            93,896      page-faults               #    0.013 M/sec                    ( +-  0.80% )
    13,869,719,763      cycles                    #    1.979 GHz                      ( +-  0.23% )  (50.05%)
    27,561,765,867      instructions              #    1.99  insn per cycle           ( +-  0.06% )  (50.04%)
     4,288,245,412      branches                  #  611.846 M/sec                    ( +-  0.05% )  (50.01%)
        19,633,433      branch-misses             #    0.46% of all branches          ( +-  0.83% )  (50.01%)

            # Table of individual measurements:
            7.0670 (+0.0379) #
            6.9897 (-0.0394) #
            7.0203 (-0.0088) #
            6.9829 (-0.0462) #
            7.0856 (+0.0565) #

            # Final result:
            7.0291 +- 0.0205 seconds time elapsed  ( +-  0.29% )
```

After:
```
 Performance counter stats for 'scripts/bwasti/static_runtime/run.sh' (5 runs):

          6,935.61 msec task-clock                #    0.997 CPUs utilized            ( +-  0.47% )
             2,913      context-switches          #    0.420 K/sec                    ( +- 15.25% )
                 3      cpu-migrations            #    0.000 K/sec
            92,628      page-faults               #    0.013 M/sec                    ( +-  0.50% )
    13,724,940,495      cycles                    #    1.979 GHz                      ( +-  0.47% )  (50.01%)
    27,226,217,974      instructions              #    1.98  insn per cycle           ( +-  0.02% )  (50.03%)
     4,220,129,358      branches                  #  608.472 M/sec                    ( +-  0.06% )  (50.04%)
        19,025,346      branch-misses             #    0.45% of all branches          ( +-  0.53% )  (50.04%)

            # Table of individual measurements:
            6.9402 (-0.0145) #
            6.8570 (-0.0978) #
            6.9311 (-0.0236) #
            7.0101 (+0.0554) #
            7.0352 (+0.0805) #

            # Final result:
            6.9547 +- 0.0315 seconds time elapsed  ( +-  0.45% )

```

Roughly 1% cycles win, which is outside the quoted noise level.

Reviewed By: hlu1

Differential Revision: D26994107

fbshipit-source-id: f4c4963be0a5c268cbcdac5359f8278750218ae6
2021-03-12 16:22:29 -08:00
33aaea912a [caffe2] Support deserializing tensors using alternate serialization formats (#53403)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53403

This updates the `TensorProto` field so that the data type of the in-memory
(deserialized) data is tracked independently of the serialized data format.

This will allow us to support multiple different serialization formats in the
future.  For instance, we could choose to perform quantization of floating
point data types, or varint encoding for integer fields.

For now this diff does not actually change the serialization code path yet,
and does not introduce any new serialization formats, but only refactors the
deserialization code path to make it easier to introduce new formats.

I'm not really that thrilled with the heavy use of macros and templates here,
but I didn't really see better alternatives that made it as simple to specify
new deserialization function implementations.
ghstack-source-id: 123594220

Test Plan:
Confirmed that the existing unit tests pass.  This diff only touches the
deserialization code path and not the serialization code to help ensure that
the deserialization code works with the existing serialization logic, and that
there are no changes to the current serialization format.

Reviewed By: mraway

Differential Revision: D26658206

fbshipit-source-id: d7297d600aee28b92fd9f4ece437b7f519060942
2021-03-12 11:35:15 -08:00
91531d3047 [caffe2] add a CAFFE2_NODISCARD macro to help support old compilers (#53754)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53754

Some of the PyTorch CircleCI builds still use gcc 5.4 and compile with
`-Werror=attributes`, causing this old compiler to fail because it does not
understand the `[[nodiscard]]` attribute.

Let's define a `CAFFE2_NODISCARD` macro to work around this.
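A sketch of the standard feature-detection pattern such a macro typically uses; the actual definition in caffe2 may differ in detail:

```
// Expand to [[nodiscard]] only when the compiler claims to support it;
// otherwise expand to nothing so gcc 5.4 with -Werror=attributes stays happy.
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(nodiscard)
#define CAFFE2_NODISCARD [[nodiscard]]
#endif
#endif
#ifndef CAFFE2_NODISCARD
#define CAFFE2_NODISCARD
#endif

// Example use: new compilers warn if the return value is ignored,
// old compilers simply see a plain function declaration.
CAFFE2_NODISCARD inline bool ParseConfig(const char* path) {
  return path != nullptr;
}
```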
ghstack-source-id: 123594084

Test Plan: I'm using this macro in subsequent diffs in the stack.

Reviewed By: mraway

Differential Revision: D26959584

fbshipit-source-id: c7ba94f7ea944b6340e9fe20949ba41931e11d41
2021-03-12 11:32:30 -08:00
7e5ffbfa94 [caffe2] add a SerializationOptions field for the save operator (#53402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53402

Add an `options` field to the `Save` operator which accepts options for how to
serialize different blobs.  At the moment this simply allows controlling the
existing `chunk_size` behavior, but in the future we can add other options,
such as the ability to control compression settings or other serialization
formats.
ghstack-source-id: 123567034

Test Plan:
Added a new test to `load_save_test.py` that passes in options and verifies
that blobs were serialized with the expected number of chunks.

  buck test caffe2/caffe2:caffe2_test_cpu \
    caffe2/caffe2/core:serialization_test \
    caffe2/caffe2/python/operator_test:load_save_test

Reviewed By: mraway

Differential Revision: D26502577

fbshipit-source-id: 6e302e530bb96990517c2e35c505db7f14a56284
2021-03-11 13:02:58 -08:00
99d7c8ff94 [caffe2] use AddNAlreadyReserved() when serializing blobs (#53400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53400

This is a reland of D26617038 (b4a8d98247) after rebasing onto D26802576 (f595ba1bae).

Optimize the blob serialization code by using `AddNAlreadyReserved()` when
serializing tensor data, rather than making N separate `Add()` calls.
`AddNAlreadyReserved()` is a simple addition operation, while each `Add()`
call checks to see if it needs to reserve new space, and then updates the
element data, which is unnecessary in this case.
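A sketch of the pattern being described, assuming a generated protobuf message with a `repeated float float_data` field (as caffe2's `TensorProto` has); the helper name and surrounding code are illustrative, not the exact code in this diff:

```
#include <algorithm>
#include "caffe2/proto/caffe2_pb.h"

// Copy n floats into the proto's repeated field with a single size bump
// instead of n individual Add() calls.
void AppendFloatData(const float* src, int n, caffe2::TensorProto* proto) {
  auto* field = proto->mutable_float_data();
  field->Reserve(field->size() + n);   // allocate the space up front
  field->AddNAlreadyReserved(n);       // grow the size without per-element checks
  std::copy(src, src + n, field->mutable_data() + field->size() - n);
}
```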
ghstack-source-id: 123567030

Test Plan:
This appears to improve raw serialization performance by 30 to 35% for float,
double, and int64_t types which use this function.  This improvement appears
relatively consistent across large and small tensor sizes.

Reviewed By: mraway

Differential Revision: D26853941

fbshipit-source-id: 4ccaa5bc1dd7f7864068d71a0cde210c699cbdba
2021-03-10 15:27:52 -08:00
b2758cdc77 [PyTorch] Don't copy vector arguments to caffe2::Tensor::Resize (#53389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53389

Resize was written to take its arguments by value, which is totally fine when
they are ArrayRef or a series of integers, but not so fine when they are
std::vector.
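A standalone sketch of the problem and one way around it; the added const-reference overload here is illustrative and not necessarily how the actual PR addresses it:

```
#include <cstdint>
#include <vector>

struct Tensor {
  // Generic by-value form: cheap for an integer pack like Resize(2, 3, 4),
  // but if a std::vector is passed, the deduced by-value parameter copies it.
  template <typename... Ts>
  void Resize(Ts... dims) {
    dims_ = {static_cast<int64_t>(dims)...};
  }

  // Dedicated overload: binding to a const reference avoids copying the
  // caller's vector at the call boundary.
  void Resize(const std::vector<int64_t>& dims) {
    dims_ = dims;
  }

 private:
  std::vector<int64_t> dims_;
};

int main() {
  Tensor t;
  t.Resize(2, 3, 4);                      // integer pack, by value is fine
  std::vector<int64_t> dims = {2, 3, 4};
  t.Resize(dims);                         // picks the const& overload
  return 0;
}
```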
ghstack-source-id: 123212128

Test Plan:
Existing CI should make sure it builds

Inspected assembly for ios_caffe.cc and saw no more vector copy before
calling Resize

Reviewed By: smessmer

Differential Revision: D26852105

fbshipit-source-id: 9c3b9549d50d32923b532bbc60d0246e2c2b5fc7
2021-03-08 12:33:33 -08:00
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
69bb0e0285 [caffe2] Avoid some double (and triple) lookups in workspace (#53319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53319

Noticed these in profiles.

Also switch to `unordered_map`.

Test Plan: Unit tests.

Reviewed By: swolchok

Differential Revision: D26504408

fbshipit-source-id: 9e14d55909a4af019058b8c27c67ee2348cd02a9
2021-03-04 22:57:02 -08:00
cyy
d8730194e7 use device methods (#52899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52899

Reviewed By: zou3519

Differential Revision: D26752203

Pulled By: albanD

fbshipit-source-id: eaef89377999b20655fe85d5a38ca7a2c5882de7
2021-03-02 20:14:23 -08:00
a296fa36ac [Caffe2] Implement BlackBoxPredictor::BenchmarkIndividualOps (#52903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52903

Implement BlackBoxPredictor::BenchmarkIndividualOps so that we can clean up the output tensors properly after each iteration and get more accurate per operator timing.

Add four more metrics to track setup_time, memory_alloc_time, memory_dealloc_time, and output_dealloc_time.

Reviewed By: ajyu

Differential Revision: D26657473

fbshipit-source-id: 1cf282192b531513b9ee40b37252087818412f81
2021-02-27 19:49:22 -08:00
21c3f6f415 Revert D26617038: [caffe2] use AddNAlreadyReserved() when serializing blobs
Test Plan: revert-hammer

Differential Revision:
D26617038 (b4a8d98247)

Original commit changeset: 97dedbae889d

fbshipit-source-id: 6921d0a64dee26e18f16628773953bbe7280998e
2021-02-25 21:32:40 -08:00
b4a8d98247 [caffe2] use AddNAlreadyReserved() when serializing blobs
Summary:
Optimize the blob serialization code by using `AddNAlreadyReserved()` when
serializing tensor data, rather than making N separate `Add()` calls.
`AddNAlreadyReserved()` is a simple addition operation, while each `Add()`
call checks to see if it needs to reserve new space, and then updates the
element data, which is unnecessary in this case.

Test Plan:
This appears to improve raw serialization performance by 30 to 35% for float,
double, and int64_t types which use this function.  This improvement appears
relatively consistent across large and small tensor sizes.

Differential Revision: D26617038

fbshipit-source-id: 97dedbae889d35463628f3016ac56986e685289e
2021-02-25 20:24:01 -08:00
27d89057f8 [caffe2] fix deserialization of unknown tensor data_type values (#52411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52411

The `TensorDeserializer` code previously did not correctly handle unknown
`data_type` values.  It attempted to deserialize the data as floats, rather
than recognizing that it did not understand the data type and erroring out.

Google protobuf will never return unknown values for enum fields.  If an
unknown value is found in serialized data, the protobuf code discards it.
As a result `has_data_type()` will return false, but `get_data_type()` will
simply return the default value, which happens to be set to `FLOAT`.
Consequently, if we ever encounter a serialized blob with an unknown data type,
the previous code would incorrectly think the data type was `FLOAT`.

This fixes the code to check if the `data_type` value is present before
reading it.
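A sketch of that presence check, assuming caffe2's proto2-generated `TensorProto` accessors (`has_data_type()` / `data_type()`); the helper name and error-handling style are illustrative:

```
#include <stdexcept>
#include "caffe2/proto/caffe2_pb.h"

caffe2::TensorProto::DataType GetCheckedDataType(const caffe2::TensorProto& proto) {
  // proto2 drops unrecognized enum values on parse, so data_type() would
  // silently report the default (FLOAT). Checking presence first turns an
  // unknown serialization format into an explicit error.
  if (!proto.has_data_type()) {
    throw std::runtime_error("tensor data_type is missing or unrecognized");
  }
  return proto.data_type();
}
```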
ghstack-source-id: 121915981

Test Plan:
Included a unit test that verifies this behavior.  Confirmed that without this
fix the code proceeded with the float deserialization code path.  When
deserializing int32_t data it fortunately did fail later due to an unexpected
field length check, but this isn't guaranteed to be the case.  In some cases
it potentially could incorrectly succeed and return wrong data.

Reviewed By: mraway

Differential Revision: D26375502

fbshipit-source-id: 4f84dd82902e18df5e693f4b28d1096c96de7916
2021-02-17 19:13:43 -08:00
cyy
39aa3db62b use make_shared and make_unique and clean unneeded code (#51829)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51829

Reviewed By: izdeby

Differential Revision: D26306098

Pulled By: smessmer

fbshipit-source-id: 4f6c0469c68f044c0bfe0925fcf7b030a25d15e2
2021-02-10 21:38:43 -08:00
fa325d7c9f Use sum_integers and multiply_integers (#51146)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51146

Test Plan: Sandcastle tests

Reviewed By: ngimel

Differential Revision: D25903430

fbshipit-source-id: 329c14018c9e5192864eed88a8ed0a5068ff1c69
2021-02-10 18:05:45 -08:00
fc314350ad Make RebatchingBuffer compatible with auto shape inference
Summary: no-op to operator behavior, resolve https://fburl.com/wte0v7tf

Test Plan: buck test

Reviewed By: huangyi1979

Differential Revision: D26333212

fbshipit-source-id: d237e8caf5977bc19fcced6aeedc6464fc905457
2021-02-09 12:37:26 -08:00
094d597679 raise windows tol to 30% (#51733)
Summary:
Up the Windows tolerance set by https://github.com/pytorch/pytorch/pull/35818, as CI is still showing some flakes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51733

Test Plan: CI

Reviewed By: zou3519

Differential Revision: D26258005

Pulled By: robieta

fbshipit-source-id: 864c848b7b31a05a2d07d1e683342b3202377c10
2021-02-04 14:09:10 -08:00
621198978a Move USE_NUMPY to more appropriate targets (#51143)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51143

Test Plan: CI

Reviewed By: wconstab

Differential Revision: D26084123

fbshipit-source-id: af4abe4ef87c1ebe5434938320526a925f5c34c8
2021-01-27 15:44:12 -08:00
533cb9530e Introducing TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API to the code (#50627)
Summary:
Sub-step of my attempt to split up the torch_cuda library, as it is huge. Please look at https://github.com/pytorch/pytorch/issues/49050 for details on the split and which files are in which target.

This PR introduces two new macros for Windows DLL purposes, TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API. Both are defined as TORCH_CUDA_API for the time being.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50627

Reviewed By: mruberry

Differential Revision: D25955441

Pulled By: janeyx99

fbshipit-source-id: ff226026833b8fb2fb7c77df6f2d6c824f006869
2021-01-21 19:09:11 -08:00
71ca600af9 Renaming CAFFE2_API to TORCH_API (#49496)
Summary:
Since caffe2 and torch have been consolidated, CAFFE2_API should be merged with TORCH_API. Addresses a TODO.

Manually edited some references of the removed `CAFFE2_API`:
* `CONTRIBUTING.md`
* `caffe2/proto/CMakeLists.txt`
* `cmake/ProtoBuf.cmake`
* `c10/macros/Export.h`
* `torch/csrc/WindowsTorchApiMacro.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49496

Reviewed By: malfet, samestep

Differential Revision: D25600726

Pulled By: janeyx99

fbshipit-source-id: 7e068d959e397ac183c097d7e9a9afeca5ddd782
2020-12-18 10:54:50 -08:00
4431731c68 Making ops c10-full: Storage arguments (#49146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49146

Add support for Storage arguments to IValue and the JIT typing system, and make ops that were blocked on that c10-full.
ghstack-source-id: 118710665

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25456799

fbshipit-source-id: da14f125af352de5fcf05a83a69ad5a69d5a3b45
2020-12-16 14:00:34 -08:00
da6f249a10 [caffe2] DeserializeToNDArray (#49135)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49135

Differential Revision: D25417845

fbshipit-source-id: 4d8efd440bc2577fb717f911a401e7b81d48b907
2020-12-10 21:59:25 -08:00
54022e4f9b add new build settings to torch.__config__ (#48380)
Summary:
Many newly added build settings are not saved in torch.__config__; this adds them to the mix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48380

Reviewed By: samestep

Differential Revision: D25161951

Pulled By: walterddr

fbshipit-source-id: 1d3dee033c93f2d1a7e2a6bcaf88aedafeac8d31
2020-12-01 14:16:36 -08:00
b5149513ec migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API, update code_analyzer regex (#48308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48308

The original regex that I added didn't correctly match namespaces that started with an underscore (e.g. `_test`), which caused a master-only test to fail.

The only change from the previous commit is that I updated the regex like so:

before: `^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$`
after: `^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$`

I added in a `[_]*` to the beginning of the namespace capture. I did the same for the `_FRAGMENT` regex.

Verified that running `ANALYZE_TEST=1 tools/code_analyzer/build.sh` (as the master-only test does) produces no diff in the output.

Fixing regex pattern to allow for underscores at the beginning of the
namespace

This reverts commit 3c936ecd3c68f395dad01f42935f20ed8068da02.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25123295

Pulled By: bdhirsh

fbshipit-source-id: 54bd1e3f0c8e28145e736142ad62a18806bb9672
2020-11-30 13:05:33 -08:00
3c936ecd3c Revert D25056091: migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API
Test Plan: revert-hammer

Differential Revision:
D25056091 (0ea4982cf3)

Original commit changeset: 0f647ab9bc5e

fbshipit-source-id: e54047b91d82df25460ee00482373c4580f94d50
2020-11-19 19:10:14 -08:00
0ea4982cf3 migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API (#48097)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48097

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25056091

Pulled By: bdhirsh

fbshipit-source-id: 0f647ab9bc5e5aee497dac058df492f6e742cfe9
2020-11-19 17:56:56 -08:00
0f89be616a Removing non-thread-safe log statement from ReinitializeTensor (#48185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48185

In a scenario where we have Caffe2 wrapped into a dynamic library, we were running into a memory corruption crash at program termination:

"corrupted size vs. prev_size in fastbins"

It turns out the crash occurs in glog's logging.cc, which is not thread-safe and has to initialize a static hostname string when flushing. If this ends up happening on multiple threads simultaneously, it can lead to memory corruption.

```
==1533667== Invalid free() / delete / delete[] / realloc()
==1533667==    at 0xA3976BB: operator delete(void*, unsigned long) (vg_replace_malloc.c:595)
==1533667==    by 0x37E36AE: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (basic_string.h:647)
==1533667==    by 0xAD87F6B: __run_exit_handlers (in /usr/lib64/libc-2.28.so)
==1533667==    by 0xAD8809F: exit (in /usr/lib64/libc-2.28.so)
==1533667==    by 0xAD71799: (below main) (in /usr/lib64/libc-2.28.so)
==1533667==  Address 0x165cd720 is 0 bytes inside a block of size 31 free'd
==1533667==    at 0xA3976BB: operator delete(void*, unsigned long) (vg_replace_malloc.c:595)
==1533667==    by 0x37E36AE: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (basic_string.h:647)
==1533667==    by 0xAD87F6B: __run_exit_handlers (in /usr/lib64/libc-2.28.so)
==1533667==    by 0xAD8809F: exit (in /usr/lib64/libc-2.28.so)
==1533667==    by 0xAD71799: (below main) (in /usr/lib64/libc-2.28.so)
==1533667==  Block was alloc'd at
==1533667==    at 0xA39641F: operator new(unsigned long) (vg_replace_malloc.c:344)
==1533667==    by 0x37F4E18: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_mutate(unsigned long, unsigned long, char const*, unsigned long) (basic_string.tcc:317)
==1533667==    by 0x37F4F2E: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long) (basic_string.tcc:466)
==1533667==    by 0x5170344: GetHostName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (logging.cc:227)
==1533667==    by 0x51702D4: google::LogDestination::hostname[abi:cxx11]() (logging.cc:555)
==1533667==    by 0x5173789: google::(anonymous namespace)::LogFileObject::Write(bool, long, char const*, int) (logging.cc:1072)
==1533667==    by 0x51746DF: google::LogDestination::LogToAllLogfiles(int, long, char const*, unsigned long) (logging.cc:773)
==1533667==    by 0x5170BDC: google::LogMessage::SendToLog() (logging.cc:1386)
==1533667==    by 0x5171236: google::LogMessage::Flush() (logging.cc:1305)
==1533667==    by 0x517114D: google::LogMessage::~LogMessage() (logging.cc:1264)
==1533667==    by 0x108DC840: caffe2::ReinitializeTensor(caffe2::Tensor*, c10::ArrayRef<long>, c10::TensorOptions) (tensor.cc:0)
==1533667==    by 0x103BBED0: caffe2::int8::Int8GivenTensorFillOp::RunOnDevice() (int8_given_tensor_fill_op.h:29)
==1533667==
```

There doesn't seem to be an obvious easy solution here. The logging API being used by c10 is fundamentally not thread-safe, at least when it uses glog. Glog does have a threadsafe API (raw_logging), but this doesn't seem to be used by c10 right now. I suspect other callers are not running into this crash because:
- They have other libraries using glog in their module, so the static variable in glog gets initialized before getting into a race condition
- They don't use int8 network in a glog context, thus avoiding this problematic log statement

An alternative fix would be to correctly initialize the dtype of the int8 tensor, which is currently always uninitialized, making the log statement always trigger for int8 networks. Initializing the int8 tensor correctly in tensor_int8.h is proving to be challenging though, at least without knowledge of Caffe2's codebase. And even then, it wouldn't fix the issue for all use cases.

Test Plan: Ran my app under Valgrind; I no longer get the crash, and Valgrind no longer complains about memory corruption.

Reviewed By: thyu, qizzzh

Differential Revision: D25040725

fbshipit-source-id: 1392a97ccf9b4c9ade1ea713610ee44a1578ae7d
2020-11-18 17:42:22 -08:00
549ef1d668 [caffe][memonger] Extend operator schema check to dag memonger (#48021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48021

Extends the operator schema check from the simple memonger to the DAG memonger as well. As part of this, a fix is made to handle in-place ops (ops with at least one output name that is the same as an input blob). Previously all output blobs from ops were treated as shareable, but this failed the assertion that external input blobs with the same name are not allowed to be shared.

Test Plan: Added corresponding unit tests

Reviewed By: hlu1

Differential Revision: D24968862

fbshipit-source-id: b6679a388a82b0d68f65ade64b85560354aaa3ef
2020-11-16 19:17:55 -08:00
825ee7e7f8 [caffe2] plan_executor_test: add test case for should_stop loops (#47613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47613

This is to test some more cancellation edge cases that were missing before. It passes under the current code.

Test Plan: buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 10

Reviewed By: dahsh

Differential Revision: D24836956

fbshipit-source-id: 3b00dc081cbf4f26e7756d597099636edb49d256
2020-11-16 12:59:13 -08:00
c543b3b582 Fix a downcast (#47919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47919

Suppresses a downcast warning.

Test Plan:
Reproduces with
```
buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:gpu_test
```

Reviewed By: suphoff

Differential Revision: D24866987

fbshipit-source-id: 44f19ab37a7d95abe08f570abfebc702827a2510
2020-11-13 22:26:29 -08:00
f743b5639a [caffe2][memonger] Add support for distributed inference predict nets in DAG memonger (#47718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47718

Distributed Inference splits a predict net into multiple parts, part0 being the main part, which contains ops that make remote calls to the other parts. The part0 predict net may contain AsyncIf ops to optimize RPC call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops by updating their internal nets to refer to memongered blobs.

As part of this change, I am also updating the DAG memonger traversal to always start from root ops, i.e. ops with in-degree 0. The earlier logic started traversing ops based on the input head blobs, and if one of the head inputs was used in a non-root op that got visited before its parent, the traversal would throw the assertion error here: https://fburl.com/diffusion/ob110s9z . For almost all of the distributed inference part0 nets, it was throwing this assertion error.

Test Plan: Added corresponding tests in memonger_test.py. Could not find unit tests for the C++ version of memonger.

Reviewed By: hlu1

Differential Revision: D24872010

fbshipit-source-id: 1dc99b2fb52b2bc692fa4fc0aff6b7e4c5e4f5b0
2020-11-13 14:12:07 -08:00
17c58720fe Revert D24346771: [caffe2][memonger] Add support for distributed inference predict nets in DAG memonger
Test Plan: revert-hammer

Differential Revision:
D24346771 (5882f2e540)

Original commit changeset: ad2dd2e63f3e

fbshipit-source-id: 90346f08c890eebe71f068748a8e24e4db88c250
2020-11-10 12:11:22 -08:00
5882f2e540 [caffe2][memonger] Add support for distributed inference predict nets in DAG memonger
Summary:
Distributed Inference splits a predict net into multiple parts, part0 being the main part, which contains ops that make remote calls to the other parts. The part0 predict net may contain AsyncIf ops to optimize RPC call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops by updating their internal nets to refer to memongered blobs. Here is one reference part0 predict net with AsyncIf ops: https://www.internalfb.com/intern/paste/P145812115/

As part of this change, I am also updating the DAG memonger traversal to always start from root ops, i.e. ops with in-degree 0. The earlier logic started traversing ops based on the input head blobs, and if one of the head inputs was used in a non-root op that got visited before its parent, the traversal would throw the assertion error here: https://fburl.com/diffusion/ob110s9z . For almost all of the distributed inference part0 nets, it was throwing this assertion error.

Reviewed By: hlu1

Differential Revision: D24346771

fbshipit-source-id: ad2dd2e63f3e822ad172682f6d63f8474492255d
2020-11-10 09:35:28 -08:00
f05b66b70d pass TypeMeta by value (#45026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45026

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23802943

Pulled By: bhosmer

fbshipit-source-id: 81b06ef00bf8eb4375c0e0ff2032e03bd1d1188a
2020-10-30 10:14:17 -07:00
51bf7bed84 [caffe2] Allow memonger to optimize nets with inplace(enforced) ops (#46560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46560

Follow-up for D24236604 (16c52d918b).

For nets that pass the schema check, memonger actually makes sure to preserve the inplaceness of operators if they are already inplace. So we can safely enable it for correct input nets.

(Note: this ignores all push blocking failures!)

Differential Revision: D24402482

fbshipit-source-id: a7e95cb0e3eb87adeac79b9b69eef207957b0bd5
2020-10-22 13:23:33 -07:00
c44300884e Clarify timing of GetDeviceProperty() (#46715)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46715

Test Plan: N/A

Reviewed By: ezyang

Differential Revision: D24455538

fbshipit-source-id: 1770807d178f618ef6338e28f669f09e4cbd2009
2020-10-22 11:29:31 -07:00
0c9787c758 caffe2: use at::mt19937 instead of std::mt19937 (10x speedup) (#43987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43987

This replaces the caffe2 CPU random number generator (std::mt19937) with at::mt19937, which is the one currently used in PyTorch. The ATen RNG is 10x faster than the std one and appears to be more robust given bugs in the std implementation (https://fburl.com/diffusion/uhro7lqb).

For large embedding tables (10GB+) we see UniformFillOp taking upwards of 10 minutes, as we're bottlenecked on the single-threaded RNG. Swapping to at::mt19937 cuts that time to 10% of its current value.
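A sketch of drawing values from the ATen engine; the header path, the seed constructor, and operator() returning a 32-bit value are assumptions about at::mt19937 rather than code from this diff:

```
#include <cstdint>
#include <ATen/core/MT19937RNGEngine.h>

// Fill a buffer with uniform values in [0, 1) from raw 32-bit draws.
void FillUniform(float* data, int64_t n, uint64_t seed) {
  at::mt19937 gen(seed);
  for (int64_t i = 0; i < n; ++i) {
    const uint32_t bits = gen();                  // one 32-bit draw
    data[i] = (bits >> 8) * (1.0f / (1u << 24));  // keep 24 random bits
  }
}
```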

Test Plan: Ran all relevant tests + CI. This doesn't introduce new features (+ is a core change) so existing tests+CI should be sufficient to catch regressions.

Reviewed By: dzhulgakov

Differential Revision: D23219710

fbshipit-source-id: bd16ed6415b2933e047bcb283a013d47fb395814
2020-10-16 16:08:35 -07:00
dd169ca17c caffe2/plan_executor: propagate exceptions from reporter substeps (#46424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46424

Currently if an exception occurs in a reporter thread the process is killed via std::terminate. This adds support for handling the reporter exception if FLAGS_caffe2_handle_executor_threads_exceptions is set to true.

Test Plan: buck test mode/opt -c python.package_style=inplace //caffe2/caffe2/python:hypothesis_test //caffe2/caffe2:caffe2_test_cpu -- --stress-runs 100

Reviewed By: dahsh

Differential Revision: D24345027

fbshipit-source-id: 0659495c9e27680ebae41fe5a3cf26ce2f455cb3
2020-10-16 12:28:57 -07:00
16c52d918b [caffe2] Bypass memonger for in-place ops (#46378)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46378

Reviewed By: dzhulgakov

Differential Revision: D24236604

fbshipit-source-id: 9f599687467ea969e89243482f8e2a41f7db0a23
2020-10-15 16:03:52 -07:00
85c3ba5588 [caffe2] add PlanExecutorTest ErrorPlanWithCancellableStuckNet (#46110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46110

## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145).
* We need a test to cover and exhibit that we can cancel a stuck net and propagate the error with the plan executor.

## Summary
* Added PlanExecutorTest `ErrorPlanWithCancellableStuckNet` for plan executor.
* Set cancelCount to zero at the beginning of the tests to avoid global state being carried over in some test environments.

Test Plan:
## Unit Test Added

```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 1000
```

Reviewed By: d4l3k

Differential Revision: D24226577

fbshipit-source-id: c834383bfe6ab50747975c229eb42a363eed3458
2020-10-12 12:00:15 -07:00
87226f72d2 [caffe2] temp remove ErrorPlanWithCancellableStuckNet (#46080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46080

Temporary removal of ErrorPlanWithCancellableStuckNet; will fill it out more later.

Test Plan:
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
```
remove a test

Reviewed By: fegin

Differential Revision: D24213971

fbshipit-source-id: e6e600bad00b45c726311193b4b3238f1700526e
2020-10-08 23:35:45 -07:00
487624e369 [caffe2] plan executor error propagation test with blocking cancellable op (#45319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45319

## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145)
* We need a test to cover and exhibit that we can cancel a stuck net and propagate the error with the plan executor.

## Summary
* Added `ErrorPlanWithCancellableStuckNet` for plan executor.
* We set up a plan with two nets: one stuck net with a blocking operator that never returns, and one error
  net with an op that throws, and tested that the plan throws and cancels.

Test Plan:
## Unit Test added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
```
```
Summary
  Pass: 400
  ListingSuccess: 2
```

Reviewed By: d4l3k

Differential Revision: D23920548

fbshipit-source-id: feff41f73698bd6ea9b744f920e0fece4ee44438
2020-10-08 19:54:49 -07:00
59e4803b94 Recommit: caffe2/plan_executor: wait for 1 minute after exception and then abort (#45981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45981

This is a recommit of previously reverted D20850851 (3fbddb92b1).

TL;DR - combining condition_variables and atomics is a bad idea

https://stackoverflow.com/questions/49622713/c17-atomics-and-condition-variable-deadlock

This also adds some ifdefs to disable the death test for mobile, xplat and tsan builds since forking doesn't play nicely with them.
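A standalone illustration of the pitfall behind the TL;DR (a lost wakeup when the flag is updated outside the mutex) and the usual fix of keeping the predicate under the same lock; this is a generic sketch, not code from the diff:

```
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
bool done = false;  // plain bool, only touched while holding `mu`

void worker() {
  {
    std::lock_guard<std::mutex> lock(mu);
    done = true;            // update the predicate under the mutex, so the
  }                         // notify cannot slip between check and wait
  cv.notify_one();
}

int main() {
  std::thread t(worker);
  {
    std::unique_lock<std::mutex> lock(mu);
    cv.wait(lock, [] { return done; });  // predicate re-checked under the lock
  }
  t.join();
  return 0;
}
```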

Test Plan:
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 1000 test_atomic_iter_with_concurrent_steps --timeout 120
  buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 100
  buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100

no timeouts https://www.internalfb.com/intern/testinfra/testconsole/testrun/7036874440059883/

will ensure no timeouts in OSS

Reviewed By: walterddr, dahsh

Differential Revision: D24165505

fbshipit-source-id: 17cd23bfbcd9c2826a4067a387023d5186353196
2020-10-08 14:17:30 -07:00
1bb2d41b68 Revert D20850851: caffe2/plan_executor: wait for 1 minute after exception and then abort
Test Plan: revert-hammer

Differential Revision:
D20850851 (3fbddb92b1)

Original commit changeset: 330503775d80

fbshipit-source-id: 612c6c3c4d5586bc8ad00a112cd00fc74fb44243
2020-10-07 09:04:24 -07:00
3fbddb92b1 caffe2/plan_executor: wait for 1 minute after exception and then abort (#45297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45297

If we have two concurrent substeps and one of them throws an exception and the other is blocking, we'll currently hang. This waits up to 1 minute for it to complete before terminating the process.

Test Plan: buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100

Reviewed By: dahsh

Differential Revision: D20850851

fbshipit-source-id: 330503775d8062a34645ba55fe38e6770de5e3c7
2020-10-06 12:59:09 -07:00
2ac7de7d53 Remove hacky_wrapper from BackendSelect kernels (#44062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062

Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, and pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step: calling into a BackendSelect kernel required taking the individual scattered arguments and packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.

Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789

Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/

Reviewed By: ezyang

Differential Revision: D23484192

fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
2020-09-25 09:04:03 -07:00
27c7158166 Remove __future__ imports for legacy Python2 supports (#45033)
Summary:
There is a tool called `2to3` whose `future` fixer specifically removes these; the `caffe2` directory has the most redundant imports:

```2to3 -f future -w caffe2```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033

Reviewed By: seemethere

Differential Revision: D23808648

Pulled By: bugra

fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
2020-09-23 17:57:02 -07:00
2ae74c0632 Compile less legacy code when BUILD_CAFFE2 is set to False (take 2) (#44453)
Summary:
2nd attempt to land https://github.com/pytorch/pytorch/pull/44079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44453

Reviewed By: walterddr, seemethere

Differential Revision: D23619528

Pulled By: malfet

fbshipit-source-id: c7c206ebd327dcf3994789bd47008b05ff862fe7
2020-09-11 16:27:47 -07:00