Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;var++)`
to the format
`for(const auto var: irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
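As a rough illustration of the loop conversion described above, the sketch below uses a small stand-in range type. `IntRange`, `irange_sketch`, `sum_old`, and `sum_new` are hypothetical names made up for this example; the real helper used in the codebase is `c10::irange`, whose implementation is not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal integer range for range-based for loops.
// It only illustrates the loop shape; it is not the c10::irange implementation.
class IntRange {
 public:
  class Iterator {
   public:
    explicit Iterator(int64_t value) : value_(value) {}
    int64_t operator*() const { return value_; }
    Iterator& operator++() { ++value_; return *this; }
    bool operator!=(const Iterator& other) const { return value_ != other.value_; }

   private:
    int64_t value_;
  };

  // A negative end is clamped to 0 so the loop body never runs.
  explicit IntRange(int64_t end) : end_(end < 0 ? 0 : end) {}
  Iterator begin() const { return Iterator(0); }
  Iterator end() const { return Iterator(end_); }

 private:
  int64_t end_;
};

inline IntRange irange_sketch(int64_t end) { return IntRange(end); }

// Old format: explicit index variable, bound check, and increment.
int64_t sum_old(const std::vector<int64_t>& v) {
  int64_t total = 0;
  for (size_t i = 0; i < v.size(); i++) {
    total += v[i];
  }
  return total;
}

// New format: the index is const inside the body and cannot be modified by accident.
int64_t sum_new(const std::vector<int64_t>& v) {
  int64_t total = 0;
  for (const auto i : irange_sketch(static_cast<int64_t>(v.size()))) {
    total += v[static_cast<size_t>(i)];
  }
  return total;
}
```

Loops whose index is never read in the body are the ones that typically need an unused-variable suppression after this kind of rewrite.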
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D31705361
fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;var++)`
to the format
`for(const auto var: irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59775
This operator is similar to `GetAllBlobNames` but also returns the estimated
size required to serialize each node.
One goal of this operator is to allow checkpoint saving logic to estimate the
amount of space/bandwidth required to save a checkpoint when first starting
training, without actually serializing any blobs yet. Currently the
checkpointing logic uses `GetAllBlobNames` to determine the blobs to
checkpoint. It can instead be updated to use `EstimateAllBlobSizes` to also
get an estimate for how much space will be required for the checkpoint.
ghstack-source-id: 132275153
Test Plan: Included a new unit test.
Reviewed By: mraway
Differential Revision: D29020227
fbshipit-source-id: 811e5d86c4b59183e84e6424c48c97739be09043
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53402
Add an `options` field to the `Save` operator which accepts options for how to
serialize different blobs. At the moment this simply allows controlling the
existing `chunk_size` behavior, but in the future we can add other options,
such as the ability to control compression settings or other serialization
formats.
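To make the `chunk_size` behavior concrete, here is a minimal sketch of how a tensor's elements could be split into chunks. `chunk_ranges` is a hypothetical helper, and treating a non-positive `chunk_size` as "no chunking" is an assumption of this sketch, not necessarily how the `Save` operator interprets it.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Splits [0, num_elements) into half-open [begin, end) ranges that each hold
// at most chunk_size elements.
// Assumption for this sketch only: chunk_size <= 0 means "one single chunk".
std::vector<std::pair<int64_t, int64_t>> chunk_ranges(
    int64_t num_elements, int64_t chunk_size) {
  std::vector<std::pair<int64_t, int64_t>> ranges;
  if (chunk_size <= 0) {
    ranges.emplace_back(0, num_elements);
    return ranges;
  }
  for (int64_t begin = 0; begin < num_elements; begin += chunk_size) {
    ranges.emplace_back(begin, std::min(begin + chunk_size, num_elements));
  }
  return ranges;
}
```

With 10 elements and a chunk size of 4, this yields three chunks, the last one holding only two elements. A test that checks the expected number of chunks is checking this kind of ceiling division.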
ghstack-source-id: 123567034
Test Plan:
Added a new test to `load_save_test.py` that passes in options and verifies
that blobs were serialized with the expected number of chunks.
buck test caffe2/caffe2:caffe2_test_cpu \
caffe2/caffe2/core:serialization_test \
caffe2/caffe2/python/operator_test:load_save_test
Reviewed By: mraway
Differential Revision: D26502577
fbshipit-source-id: 6e302e530bb96990517c2e35c505db7f14a56284
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53400
This is a reland of D26617038 (b4a8d98247) after rebasing onto D26802576 (f595ba1bae).
Optimize the blob serialization code by using `AddNAlreadyReserved()` when
serializing tensor data, rather than making N separate `Add()` calls.
`AddNAlreadyReserved()` is a simple addition operation, while each `Add()`
call checks to see if it needs to reserve new space, and then updates the
element data, which is unnecessary in this case.
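The difference between the two call patterns can be sketched with a toy container. `ToyRepeatedField`, `append_one_by_one`, and `append_bulk` are hypothetical stand-ins written for this example; they only mimic the shape of `Reserve()`, `Add()`, and `AddNAlreadyReserved()` on a protobuf repeated field and say nothing about its real internals.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy stand-in for a repeated field (hypothetical, for illustration only).
template <typename T>
class ToyRepeatedField {
 public:
  // Makes sure storage for at least n elements exists.
  void Reserve(size_t n) {
    if (n > storage_.size()) {
      storage_.resize(n);
    }
  }

  // Per-element append: checks capacity on every call, then writes the value.
  void Add(const T& value) {
    if (size_ == storage_.size()) {
      storage_.resize(storage_.empty() ? 4 : storage_.size() * 2);
    }
    storage_[size_++] = value;
  }

  // Bulk append: only bumps the size and returns a pointer to the new slots.
  // The caller must have reserved enough space beforehand.
  T* AddNAlreadyReserved(size_t n) {
    T* first = storage_.data() + size_;
    size_ += n;
    return first;
  }

  size_t size() const { return size_; }
  const T& Get(size_t i) const { return storage_[i]; }

 private:
  std::vector<T> storage_;
  size_t size_ = 0;
};

// N separate Add() calls, each with its own capacity check.
void append_one_by_one(const float* data, size_t n, ToyRepeatedField<float>* field) {
  for (size_t i = 0; i < n; ++i) {
    field->Add(data[i]);
  }
}

// One reservation, one size update, one bulk copy.
void append_bulk(const float* data, size_t n, ToyRepeatedField<float>* field) {
  field->Reserve(field->size() + n);
  float* dst = field->AddNAlreadyReserved(n);
  std::copy(data, data + n, dst);
}
```

Both functions produce the same contents; the bulk version simply moves the capacity check out of the per-element loop.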
ghstack-source-id: 123567030
Test Plan:
This appears to improve raw serialization performance by 30 to 35% for float,
double, and int64_t types which use this function. This improvement appears
relatively consistent across large and small tensor sizes.
Reviewed By: mraway
Differential Revision: D26853941
fbshipit-source-id: 4ccaa5bc1dd7f7864068d71a0cde210c699cbdba
Summary:
Optimize the blob serialization code by using `AddNAlreadyReserved()` when
serializing tensor data, rather than making N separate `Add()` calls.
`AddNAlreadyReserved()` is a simple addition operation, while each `Add()`
call checks to see if it needs to reserve new space, and then updates the
element data, which is unnecessary in this case.
Test Plan:
This appears to improve raw serialization performance by 30 to 35% for float,
double, and int64_t types which use this function. This improvement appears
relatively consistent across large and small tensor sizes.
Differential Revision: D26617038
fbshipit-source-id: 97dedbae889d35463628f3016ac56986e685289e
Summary:
Since caffe2 and torch have been consolidated, CAFFE2_API should be merged with TORCH_API. Addresses a TODO.
Manually edited some references of the removed `CAFFE2_API`:
* `CONTRIBUTING.md`
* `caffe2/proto/CMakeLists.txt`
* `cmake/ProtoBuf.cmake`
* `c10/macros/Export.h`
* `torch/csrc/WindowsTorchApiMacro.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49496
Reviewed By: malfet, samestep
Differential Revision: D25600726
Pulled By: janeyx99
fbshipit-source-id: 7e068d959e397ac183c097d7e9a9afeca5ddd782
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18123
The motivation for this fix is to resolve cases like:
for(auto i = 0; i < N; i++) where N is larger than the int32 maximum
These comparisons were found by enabling -Wsign-compare.
There are too many instances to fix at once, so this is issued as a series of fixes.
The plan is to fix all of these issues and then enable the flag in Caffe2 to catch future instances.
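A minimal sketch of the kind of fix involved, assuming the loop bound is either a 64-bit count or a container size. `count_iterations` and `count_positive` are hypothetical functions written for this example and are not taken from the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// `auto i = 0` deduces int, because the literal 0 has type int.
static_assert(std::is_same<decltype(0), int>::value, "literal 0 is an int");

// With a 64-bit bound, an int index would overflow before reaching a bound
// above the int32 maximum. Giving the index the same type as the bound avoids that.
int64_t count_iterations(int64_t n) {
  int64_t count = 0;
  for (int64_t i = 0; i < n; ++i) {
    ++count;
  }
  return count;
}

// Comparing an int index against v.size() (an unsigned type) is what
// -Wsign-compare reports. Using size_t for the index removes the warning.
size_t count_positive(const std::vector<int>& v) {
  size_t positives = 0;
  for (size_t i = 0; i < v.size(); ++i) {
    if (v[i] > 0) {
      ++positives;
    }
  }
  return positives;
}
```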
Reviewed By: ZolotukhinM
Differential Revision: D14497094
fbshipit-source-id: bca3927a2188bd33a508fa503ba221c220cdaefe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14197
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13642
Previously we passed a partially initialized Tensor to Deserialize, which filled
it with the result of deserializing a tensor proto. Now we want Deserialize to return
a Tensor directly, since a Tensor is just a shared pointer to TensorImpl.
Reviewed By: dzhulgakov
Differential Revision: D12874357
fbshipit-source-id: 12b80a763375da23cfa64a74d6bc186d8d03b94f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13689
Now that typeid.h lives in c10/util, the include paths should reflect that.
Reviewed By: ezyang
Differential Revision: D12912237
fbshipit-source-id: e54225f049f690de77cb6d5f417994b211a6e1fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12848
Updated all non-test uses of protobuf::MessageLite::SerializeAsString to call
SerializeAsString_EnforceCheck so that the return value is checked and can
throw an exception if failing.
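The idea of a checked serialization call can be sketched as follows. `ToyMessage` and `serialize_or_throw` are hypothetical names for this example. The sketch assumes a message type exposing `SerializeToString(std::string*)` that returns false on failure, and it does not reproduce the actual `SerializeAsString_EnforceCheck` implementation.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical message type whose serialization can be forced to fail.
struct ToyMessage {
  std::string payload;
  bool fail;

  bool SerializeToString(std::string* out) const {
    if (fail) {
      return false;
    }
    *out = payload;
    return true;
  }
};

// Serializes the message and throws if serialization reports failure,
// instead of silently returning an empty string.
template <typename Message>
std::string serialize_or_throw(const Message& msg) {
  std::string out;
  if (!msg.SerializeToString(&out)) {
    throw std::runtime_error("failed to serialize message");
  }
  return out;
}
```

An unchecked call that returns an empty string on failure is indistinguishable from a legitimately empty message, which is why checking the boolean result matters.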
Most of the affected code was called from classes derived from BlobSerializeBase.
Didn't touch most tests and ENFORCE calls because they usually do checks
anyway.
Original commit changeset: c0760e73ecc7
Reviewed By: dzhulgakov
Differential Revision: D10453456
fbshipit-source-id: d2f2b7b4578e721924354149f08f627c7e3bf070
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12799
Updated all non-test uses of protobuf::MessageLite::SerializeAsString to call
SerializeAsString_EnforceCheck so that the return value is checked and can
throw an exception if failing.
Most of the affected code was called from classes derived from BlobSerializeBase.
Didn't touch most tests and ENFORCE calls because they usually do checks
anyway.
Reviewed By: ezyang
Differential Revision: D10416438
fbshipit-source-id: cb842e3e26b0918829d71267a375d4dd40600d58
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11925
This is step 1 in the refactoring to remove Blob::ShareExternal(), i.e. Blob would then always own its contents.
ShareExternal() is for example used to pass non-owning blobs to serialization. This diff prepares removing that.
Reviewed By: ezyang
Differential Revision: D9884177
fbshipit-source-id: d01df9a613a4fc62e5679fe45bfc47e2c899b818
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11817
Blob::Serialize() and Blob::Deserialize() are now free functions SerializeBlob(), DeserializeBlob() instead.
This takes away access to Blob internals from them and makes future refactorings easier.
Reviewed By: ezyang
Differential Revision: D9882726
fbshipit-source-id: 3251ebd4b53fc12f5e6924a6e4a8db3846ab3729
Summary:
Properly annotated all apis for cpu front. Checked with cmake using
cmake -DUSE_ATEN=ON -DUSE_CUDA=OFF -DBUILD_ATEN=ON
and resulting libcaffe2.so has about 11k symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10504
Reviewed By: ezyang
Differential Revision: D9316491
Pulled By: Yangqing
fbshipit-source-id: 215659abf350af7032e9a4b0f28a856babab2454
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later
Previously, the Caffe2 Tensor class was bound to a particular device/context at compile time. With this change, the device becomes a runtime property (stored inside the tensor), while the semantics stay the same. For example, one has to specify a device type in order to create a Tensor; there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument *DeviceType* to most of the Tensor constructors, e.g. Tensor(DeviceType type).
2. The semantics of the constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); have changed. The second argument is a context that lets us call the templated Copy function. Previously it could be a different context from both source and target; now we enforce that the context, if provided, has the same device type as src.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change
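The consequences of a runtime device type can be sketched with toy types. `ToyTensor`, `ToyBlob`, and `make_tensors` are hypothetical and only mirror the shape of the changes listed above; in particular, replacing a mismatched tensor inside `GetMutableTensor` is an assumption of this sketch, not a statement about the real Blob.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

enum class DeviceType { CPU, CUDA };

// Toy tensor: the device type is a runtime property and must be given at
// construction, so there is no default constructor.
class ToyTensor {
 public:
  explicit ToyTensor(DeviceType type) : type_(type) {}
  DeviceType device_type() const { return type_; }

 private:
  DeviceType type_;
};

// Toy blob with a get-or-construct getter that also checks the device type.
class ToyBlob {
 public:
  ToyTensor* GetMutableTensor(DeviceType type) {
    // Assumption of this sketch: a missing or mismatched tensor is replaced.
    if (!tensor_ || tensor_->device_type() != type) {
      tensor_.reset(new ToyTensor(type));
    }
    return tensor_.get();
  }

 private:
  std::unique_ptr<ToyTensor> tensor_;
};

// std::vector<ToyTensor> v(n) would not compile without a default constructor,
// so each element is constructed explicitly with its device type.
std::vector<ToyTensor> make_tensors(size_t n, DeviceType type) {
  std::vector<ToyTensor> tensors;
  tensors.reserve(n);
  for (size_t i = 0; i < n; ++i) {
    tensors.emplace_back(type);
  }
  return tensors;
}
```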
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
Reviewed By: ezyang, houseroad
Differential Revision: D9024330
fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba
Summary:
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later
Previously, the Caffe2 Tensor class was bound to a particular device/context at compile time. With this change, the device becomes a runtime property (stored inside the tensor), while the semantics stay the same. For example, one has to specify a device type in order to create a Tensor; there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument *DeviceType* to most of the Tensor constructors, e.g. Tensor(DeviceType type).
2. The semantics of the constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context); have changed. The second argument is a context that lets us call the templated Copy function. Previously it could be a different context from both source and target; now we enforce that the context, if provided, has the same device type as src.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
Reviewed By: xw285cornell
Differential Revision: D8121878
fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81
Summary:
TSIA. Verified on local machine with VS 2017.
Closes https://github.com/caffe2/caffe2/pull/1455
Differential Revision: D6310658
Pulled By: Yangqing
fbshipit-source-id: 88f4519e8e9a4178719a5627365267f627dcb939
Summary:
The use case is that sometimes we need a Tensor of custom type instead of POD
or string. This diff allows one to delegate to BlobSerializerBase to further
serialize the contents inside the Tensor.
Design choices:
(1) Each element is serialized as a BlobProto string, and stored in the
repeated string field.
(2) UNDEFINED is used as the enum value for the tensor data type, and the exact
type string is stored in the additional field.
(3) BlobSerializer is called on each item to obtain the serialized string.
(4) This requires the custom type to have a copy constructor; otherwise it
will simply not be possible to copy over the deserialized content without
an explicit type.
See blob_test.cc for an example.
Reviewed By: sunnieshang
Differential Revision: D6300196
fbshipit-source-id: 18bf94a22a07337e0fa83d3f1004b3651e38cf27
Summary:
Implementation of polling async net executor.
Notes:
- New net executor async_polling - schedules CPU and GPU ops asynchronously, uses single polling thread
- Events: update to Caffe2 events to support async CPU events, adding new methods:
Query() - non-blocking checking of event states: INITIALIZED -> RECORDED -> SUCCESS/FAILED
ErrorMessage() - when operation runs asynchronously and fails calling this on event will give error message
- Tasks: using existing DAGNet's algorithm to compute CPU and GPU chains, a separate task for each chain
- Polling: using single thread to query state of events - for CPU tasks atomically queries task state, for GPU task - uses cudaEventQuery; using Event
- Scheduling of CPU ops: using global thread pools
- Scheduling of GPU ops: using GPU thread pool per GPU device
Reviewed By: dzhulgakov
Differential Revision: D5985110
fbshipit-source-id: a9de7fcbb71d046a3aa1b573072b89a65dfeee8c
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary: When the int32_data field is used to serialize float16 tensors, the representation can be up to 50% larger than what byte_data achieves. The reason is varint encoding (https://developers.google.com/protocol-buffers/docs/encoding#varints). In the worst case (when the highest bit, the sign bit, is set) a value takes three 8-bit blocks, i.e. 24 bits per number. Saving in byte_data removes this overhead.
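The arithmetic behind the 50% figure can be checked with a small helper that computes the encoded size of an unsigned varint, which stores 7 payload bits per byte. `varint_size` is a hypothetical helper written for this example and is not part of the serialization code.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes a base-128 varint needs for an unsigned value:
// each byte carries 7 payload bits plus one continuation bit.
size_t varint_size(uint32_t value) {
  size_t bytes = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++bytes;
  }
  return bytes;
}
```

A float16 bit pattern occupies 16 bits. Two varint bytes carry only 14 payload bits, so any pattern of 0x4000 or above, which includes every value with the sign bit set, needs three bytes instead of the two raw bytes that byte_data would use.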
Reviewed By: Yangqing
Differential Revision: D5375267
fbshipit-source-id: 0068daed25cd0157ea80a768b6e3899ea2bd8caf
Summary:
The most recent diff from Andrey had a tiny bug that triggered an error in Android.
Closes https://github.com/caffe2/caffe2/pull/543
Differential Revision: D5040516
Pulled By: Yangqing
fbshipit-source-id: d7b11b509a20b8b5e33db74dd383b55f43608c8f
Summary:
At the moment serialization can take up to 3x the memory of the largest blob:
the original blob, its BlobProto, and the SerializeAsString version of the blob.
As a result, in certain cases serialization takes more memory than it should,
which hurts utilization and the maximum model size per machine.
This diff adds an IOBound ThreadPool that should set a fairly strict limit
on the extra memory overhead per blob.
Reviewed By: dzhulgakov
Differential Revision: D5012887
fbshipit-source-id: 12dbb9d3efab136411ddeffd519b602cf606661e
Summary:
This diff introduces a new net type 'singlethread_async' which is based on my investigation of DPER/hogwild MLP bottlenecks.
It only uses one CPU thread, but multiple CUDA streams on each GPU. This is implemented by having each Net submit its list of operators to
a central GPU-specific executor queue and a thread that executes them asynchronously. This executor takes all tasks in the queue, executes them on separate CUDA streams, and then waits for them at the end. This solution can achieve >95% GPU utilization on 8 GPUs when a sufficient number of workers is used.
FYI: I also tried a fancier solution, such as using cudaStreamCallbacks(), but it did not perform as well.
Improved the dper bench by adding the MomentumSGDUpdate operations and adding speed test capabilities. During my testing I also noticed that the startup costs for initializing CUDA streams and contexts are high, so it is important to do a warm-up.
Reviewed By: Yangqing
Differential Revision: D4553941
fbshipit-source-id: bb00524bef653d75de026dd64097b8d9b7a0acb3
Summary: Makes it much nicer to spot errors, especially in iPython notebook.
Reviewed By: kennyhorror
Differential Revision: D4465726
fbshipit-source-id: c0adaf5168248a70987ff9d5dfce54a622ff2219
Summary: The previous implementation just concatenated strings, which I believe is wrong. Instead, let's turn off chunking when we don't ask for it.
Reviewed By: kennyhorror
Differential Revision: D4461311
fbshipit-source-id: 8b9a3325a40a1cd0a8ffeeb20a17bf9f57b7b0a9
Summary: Some DBs don't support duplicate keys. Nvidia had problems with LMDB, where we can potentially set up duplicate keys, and in some other DBs duplicate keys are not possible at all. So instead, let's store different chunks under different keys in the DB. When reading back, we remove the special suffix.
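The key scheme described above can be sketched as follows. The separator string and the helper names `make_chunk_key` and `strip_chunk_suffix` are assumptions made for this example; the commit message does not state the actual suffix format.

```cpp
#include <string>

// Hypothetical separator between the blob name and the chunk id.
const std::string kToyChunkSeparator = "#%";

// Builds a unique DB key for one chunk of a blob.
std::string make_chunk_key(const std::string& blob_name, int chunk_id) {
  return blob_name + kToyChunkSeparator + std::to_string(chunk_id);
}

// Recovers the blob name from a key; keys without a suffix are returned unchanged.
// Limitation of this sketch: a blob name that itself contains the separator
// and is stored without a suffix would be truncated.
std::string strip_chunk_suffix(const std::string& key) {
  const std::string::size_type pos = key.rfind(kToyChunkSeparator);
  return pos == std::string::npos ? key : key.substr(0, pos);
}
```

Because every chunk gets a distinct key, the scheme works on DBs that reject duplicate keys, and the reader can still group chunks by the stripped blob name.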
Reviewed By: dzhulgakov
Differential Revision: D4446583
fbshipit-source-id: 6b345e342840c5fd476029166db131d343467d48
(1) cudnn for conv
(2) cublas: after going through the work I feel it's better to use HOST pointer mode, so changed it.
(3) storage order: although googlenet and multibox use NHWC, it seems better to keep
NCHW as the default to be consistent with caffe and cudnn; moved to NCHW as default.