test_FloatTensor_qr_big test is still a bit flaky on K80. Increasing tolerance to improve reliability as tests are moved around and results change for this test.
* Implement BatchNorm double backwards as a python function called directly from C++.
This will be converted to C++ code once ATen is integrated with autograd.
* Some performance improvements via inplace ops and reusing calculations.
There were two implementations of THPUtils_checkLong/THPUtils_unpackLong; one
that was a macro and one that was not, which is hella bad if you accidentally
include the macro before the real definition. Now we always use the inline
function.
A reasonable follow-up task would be to un-macro-ify the rest of these functions.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* add SharedFunctionMaker to create Function shared in the graph
* Clean shared_ptr usage for only function that will be used in the graph
* make Function binding match the Variable one
* remove unnecessary changes
* fix comments
* proper weakref implementation
* add call to clear in dealloc
* Add examples in CrossEntropyLoss
1. Added examples in CrossEntropyLoss
2. Made the example style consistent with other PyTorch docs
3. Deleted an unnecessary ' character
* Change comments in distance.py
1. Deleted x1, x2 from the arguments and added eps in PairwiseDistance
2. For the shapes, added input1 and input2 for readability (PairwiseDistance and CosineSimilarity).
* Add examples
Added the word 'examples' for PyTorch docs
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
* added tests + removed explicit expand of weight in bce with logits
* add auto broadcasting of weight to BCELoss
* remove the need for _BCELoss
* formatting of warning
* remove TODO
* move across assert from _functions/thnn/loss.py
* flake8 fixes
* add dropout2d and dropout3d to functional
added some loss functions to functional
added tests
using dropout from backend
added docs
fixes
* edited loss modules to call functional
Summary: When performing reductions on fp16 buffers, gloo assumed that both buffers were either aligned to 32 bytes or misaligned by the same offset. This may not hold in intermediate steps of halving-doubling allreduce, when the reduction is performed on some offset within the receive buffer. The fix is to use intrinsic instructions that work with unaligned pointers.
Reviewed By: akyrola
Differential Revision: D5450103
fbshipit-source-id: 9a1c8f8c34d2e62223f6d5c21573ea1cfad6537f
The function iterates over columns and sets a "sparsity" fraction of entries in each column to 0. The number of zeros in a column (num_zeros) is then ceil(rows * sparsity).
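A minimal sketch of the described scheme (illustrative only, not the library code; the std parameter here is an assumption):
```
import math
import torch

def sparse_init(tensor, sparsity, std=0.01):
    # Illustrative sketch: zero out ceil(rows * sparsity) random entries
    # per column and draw the rest from N(0, std).
    rows, cols = tensor.size()
    tensor.normal_(0, std)
    num_zeros = int(math.ceil(rows * sparsity))
    for col in range(cols):
        row_idx = torch.randperm(rows)[:num_zeros]
        tensor[row_idx, col] = 0
    return tensor

w = sparse_init(torch.empty(10, 4), sparsity=0.25)
```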
Summary: When compiling with -Werror=shadow-compatible-local, a variable name cannot be reused. This passed our tests, but some people compile with stricter settings.
Differential Revision: D5440805
fbshipit-source-id: a246af748717fb7e0e7a321e1ac4ddfef68ae524
Summary: To reduce round trips with store handlers, it is better to store all addresses in one key instead of one address per pair. This is what this implements.
Reviewed By: andrewwdye
Differential Revision: D5435893
fbshipit-source-id: 2d3ea3a2822c3b934ff2578d44a262e7bfbde6d0
Summary: Use the CreateCommonWorld timeout for the storehandler as well, not just the device connect.
Reviewed By: andrewwdye
Differential Revision: D5425923
fbshipit-source-id: 936d2129e2db3bfed8759ca097b75843d3931d5f
* add support for groups in double backward
* add tests for group in double backward
* fix lint
* separate some tests to reduce number of test cases
* remove redundant testing for different number of output channels
This is needed because of possible races in SpatialConvolutionMM (and others that use gemm)
if the BLAS library is not thread-safe.
In terms of performance, there's not much benefit to running two gemms in parallel, because the
BLAS libraries have their own all-occupying gemms anyway.
* Improve non-contiguous testing in TestAutograd:
1) Test gradcheck and gradgradcheck with non-contiguous inputs
2) Test gradgradcheck with non-contiguous gradoutputs (gradcheck would take more work)
3) Fix discovered issue in Prod backwards.
* Simplify non-contiguous setting wrt View.
Previously, there were 2 issues with test_autograd randomness:
1) Many random operations (e.g. random selection in prod_zeros) happened
before the torch random seed was set (because it was set in run_tests
at the end of the file).
2) The random seed was not set consistently: run_tests would set it to the
proper value, but each call to setUp would set it to 0 (because SEED wasn't
global in run_tests), which made setting the seed mostly worthless.
Previously, these tests added 5e-2 to the denominator tensor (the same as the div
tests), which only avoids divide by 0, but not issues with computing the numerical
jacobian due to non-linearity of fmod/remainder, when input / divisor is close to an
integer. These tests now add 1.5 to the denominator, which is the same as the non-tensor
version of the tests; Note that we can still hit the above condition but it will be much
less likely.
This takes advantage of the broadcasting behavior of torch.matmul to
support inputs with more than two dimensions. The extra dimensions are
treated like part of the batch dimension, much like nn.Bottle in Lua
Torch.
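For illustration, a usage sketch of the resulting behavior (module and shapes chosen arbitrarily):
```
import torch
import torch.nn as nn

linear = nn.Linear(16, 32)
x = torch.randn(10, 5, 16)   # extra leading dimensions act as batch dims
y = linear(x)                # broadcast matmul gives size (10, 5, 32)
```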
There are a few related small performance changes:
* Addmm computes the gradient in column-major for inputs in
column-major format
* Variable.mm calls Addmm in-place with the desired output buffer
* Add weight normalization implementation
This adds forward "pre-hooks" which get called before the module's
forward() method. Weight norm is implemented as a hook which calculates
the weight variable from the weight_g and weight_v every iteration.
Based on @rtqichen's implementation.
* Specify return type
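A brief usage sketch, assuming the torch.nn.utils.weight_norm entry point this adds:
```
import torch
import torch.nn as nn

m = nn.utils.weight_norm(nn.Linear(16, 32), name='weight')
# The forward pre-hook rebuilds m.weight = g * v / ||v|| before every call.
print(m.weight_g.size(), m.weight_v.size())   # (32, 1) and (32, 16)
y = m(torch.randn(4, 16))
```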
* Fix unused linker argument warnings.
This patch began when I noticed the following clang warning:
clang: warning: -Wl,-rpath,RIGIN: 'linker' input unused
clang: warning: argument unused during compilation:
'-L/home/ezyang/local/pytorch/torch/lib/tmp_install/lib'
The warning is minor, but I was a bit worried our rpath wasn't
setup correctly. Actually, it was, and there wasn't a problem,
but I had to spend some time figuring out exactly what was going
on, and by the end of it, I might as well fix the warning. In the end, I ended
up filing two upstream tickets for ccache and cmake:
- https://github.com/ccache/ccache/issues/189
- https://gitlab.kitware.com/cmake/cmake/issues/17025
We can remove the warning by using CMAKE_EXE_LINKER_FLAGS and
CMAKE_SHARED_LINKER_FLAGS, which have sane macro expansion rules
(although still slightly insane: the first level of escaping gets removed.)
To ensure that the rpath was being set correctly, I ran
objdump -x torch/lib/build/TH/libTH.so | grep RPATH and verified that ORIGIN
was setup correctly.
I also considered using CMAKE_INSTALL_RPATH, but the rpath here doesn't
seem to get set until you actually install, which is a change in behavior,
and I wasn't sure if anyone was relying on rpaths being setup in the build
directory.
There is a SLIGHT behavior change, in that if we happened to need these
LDFLAGS passed to the static linker, they won't get passed. I don't
think we ever build static libraries today, so this shouldn't be a problem.
P.S. Because of the ccache bug, you may continue to see these warnings
after this patch. If you apply https://github.com/ccache/ccache/pull/190
and clear your cache, it will solve the problem.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Remove unnecessary -Qunused-arguments
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
If the left tensor is 3D+ and the right tensor is at most 2D, we can
fold the batch into the matrix dimension and use torch.mm instead of
torch.bmm. In practice, this is faster especially if the right tensor is
column major.
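A rough equivalence sketch of the folding trick (shapes are arbitrary):
```
import torch

a = torch.randn(8, 5, 16)    # 3D left operand
b = torch.randn(16, 32)      # 2D right operand
# Fold the batch dims into the row dimension and use a single mm.
out = a.reshape(-1, 16).mm(b).reshape(8, 5, 32)
assert torch.allclose(out, torch.matmul(a, b))
```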
Summary:
Adds basic CUDA 9 support, including adding Volta arch, and making appropriate modifications for half precision datatype changes
Closes https://github.com/facebookincubator/gloo/pull/49
Differential Revision: D5315336
Pulled By: pietern
fbshipit-source-id: 6468b0f357206d604bdcfec69ba82509a2c91407
Summary:
Adds a separate set of CUDA collectives that run on device as an
alternative to NCCL. Use these collectives as default on-device
collectives instead of NCCL.
Whenever multiple processes on the same machine use Gloo with NCCL and
end up doing concurrent CUDA memory allocations and algorithm
execution, we risk deadlock. A follow up change will enable opt-in
usage of NCCL (e.g. through environment variable).
Benchmark output below with varying number of elements. It shows a
minor improvement over using NCCL for local reduction and broadcast.
Number of elements equal to on-device threshold (256K):
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 262144 2685 2907 3035 3215 562
(after) 262144 2682 2874 3013 3395 577
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring_chunked
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 262144 2045 2133 2325 2643 725
(after) 262144 1533 1673 1834 2048 800
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_halving_doubling
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 262144 1580 1640 1718 2069 893
(after) 262144 1371 1446 1539 1748 1125
```
Larger number of elements (4M):
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 4194304 55543 58058 60103 62659 32
(after) 4194304 54490 57923 60893 66058 33
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring_chunked
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 4194304 18049 22820 24997 26634 105
(after) 4194304 18356 20463 21695 22589 99
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_halving_doubling
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 4194304 18584 24345 27809 29722 95
(after) 4194304 19541 22718 25408 26688 88
```
Reviewed By: akyrola
Differential Revision: D5278192
fbshipit-source-id: 53f09e404663ddc8bb46d06ac87afd8ee3ffc3a2
Summary:
Code in tcp/transport tries to find the network interface a socket was
bound to when creating a TCP device context. Per getifaddrs(3), it is
possible for the ifa_addr field to be NULL (supposedly when an
interface doesn't have an address). Ignore such entries.
Thanks to slayton58 for reporting this.
Reviewed By: wesolwsk
Differential Revision: D5279376
fbshipit-source-id: 039380b95ba4d6d94942c30581e0b230a060870c
Summary:
Previously, `gloo/math.h` inlined methods which use AVX builtins,
which required propagating the `-mavx` flag.
This diff moves these definitions out of the header and into a source
file to avoid this.
Reviewed By: pixelb
Differential Revision: D5271043
fbshipit-source-id: dde4dc560dfb557b46d1a582a8b38e7cb8eb0c37
Summary:
This change prepares for having a separate set of collectives that
use native CUDA calls instead of NCCL. This is needed to workaround
the issue where NCCL deadlocks when it is interleaved with CUDA memory
management operations in other processes on the same machine.
Includes a modification to the host reduction functions to bring them
up to parity with the NCCL reduction functions (they now incorporate
offset/counter arguments).
Reviewed By: wesolwsk
Differential Revision: D5276291
fbshipit-source-id: 8844731760d2c48577d207c026ce0cd641f2fc6d
Fixing error on line 661:
warnings.warn("masked_copy_ is deprecated and renamed to masked_scatter_, and will be removed in v0.3")
NameError: name 'warnings' is not defined
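The fix is presumably just the missing module-level import, roughly:
```
import warnings  # previously absent, hence the NameError

def masked_copy_(self, mask, source):
    # Illustrative deprecation shim; the quoted warning text is from the traceback above.
    warnings.warn("masked_copy_ is deprecated and renamed to masked_scatter_, "
                  "and will be removed in v0.3")
    return self.masked_scatter_(mask, source)
```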
Summary:
cc pietern
Minimal changes to allow gloo to compile and run with NCCL 2.0
Closes https://github.com/facebookincubator/gloo/pull/46
Differential Revision: D5268074
Pulled By: pietern
fbshipit-source-id: 58d625d57b31cfc932f3dbbdd7a4b83d9a2e60a8
* Add torch.matmul function.
Includes test_torch, test_autograd and docs changes.
* Add __all__ to functional so imported names aren't accidentally re-exported.
* Include unbind in __all__.
* Add matmul case for when one argument is 1-dimensional and the other
at least 3-dimensional.
* Add squeeze_ to Variable.
* Use squeeze_ instead of squeeze for matmul.
Primary things I had to fix:
- Suppress _XOPEN_SOURCE warnings by ensuring that Python.h is included
first, because it always unconditionally defines this macro.
- Turn off strict aliasing, because Python 2 doesn't work with strict
aliasing.
- Workaround setuptools bug, where it's incorrectly passing
-Wstrict-prototypes to C++ compilers (where this doesn't make
any sense)
To compile csrc with -Werror, run `CFLAGS="-Werror" python setup.py build_ext`
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Fixes #1783.
There is an undocumented invariant in PyTorch that we should
try to avoid having storage == NULL as much as possible (even
though Torch supports it.) This commit properly documents the
invariant, and fixes a bug in sparse where the invariant was
not respected. This now means that sparse tensors now correctly
remember what GPU they are associated with.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Fixes #1782.
The default operation should be cheap: user can always choose to
explicitly make a copy on the way in. Note that this is a
BACKWARDS COMPATIBILITY BREAKING change. However, we DO create
a new tensor wrapper (so we are not affected by subsequent
size changes, etc.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Replace call to function that is only supported in CUDA 8.0 with one that has been supported in previous releases.
Reviewed By: pietern
Differential Revision: D5231755
fbshipit-source-id: d72aec2a4a1c511064a65142887f8a05b51dad55
1) Line up trailing dimensions in broadcast docs.
2) remove unnecessary expand_as in common_nn test.
3) use view in tensor_str instead of resize_.
4) newExpand remove raiseErrors change.
5) clarify expandedSizes/expandedStrides parameters in inferExpandGeometry.
6) simplify inferSize2/inferSizeN implementations.
7) use new-style classes for warning.
Setting torch.utils.backcompat.broadcast.warning.enabled=True
will cause Python warnings in the case where broadcast occurs
but previously 1-d view style pointwise ops occurred.
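A usage sketch (the flag path is spelled as in the note above; released versions may name it slightly differently):
```
import torch

# Flag path as quoted above: opt in to warnings for ops that now broadcast
# but previously fell back to 1-d view style pointwise math.
torch.utils.backcompat.broadcast.warning.enabled = True

a = torch.randn(4, 1)
b = torch.randn(4)
c = a + b   # now broadcasts to (4, 4); previously a 1-d pointwise add
```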
1) Rename calculateExpandGeometry to inferExpandGeometry for consistency
2) Simplify inferExpandGeometry implementation by using a single pass
through dimensions
3) Implement a two operand expansion, expand2.
4) Implement versions that return error code to use for fallback to
equal nElem support.
* Add SELU activation function
* Remove unnecessary case
* Add Function for SELU + tests and fix RReLU inplace
* Fix extra line in doc
* Fix tests
Remove in-place tests for RReLU. For some reason they fail on legacy nn, but pass on nn
* SELU in new-style Function
It also supports double backprop, verified with gradgradcheck
* Fix flake8
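For reference, a minimal sketch of the activation itself (constants truncated from the SELU paper):
```
import torch
import torch.nn.functional as F

ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    return SCALE * torch.where(x > 0, x, ALPHA * (torch.exp(x) - 1))

x = torch.randn(5)
assert torch.allclose(selu(x), F.selu(x))
```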
Otherwise, on many machines, the size of the OpenMP thread pool will
change between MKL and our OpenMP enabled functions. The constant thread
creation and destruction results in worse performance and leaks memory
on GCC 5.4
Summary:
While debugging #43 I found common/common.h missing some headers as well.
Fixes #43.
Closes https://github.com/facebookincubator/gloo/pull/44
Differential Revision: D5194970
Pulled By: pietern
fbshipit-source-id: 4861cd04c56931d4759f5bc050816788252003ee
When I use named_parameters to modify the lr and weight decay, I hit a bug, because named_parameters yields (name, torch.nn.parameter.Parameter) pairs rather than a plain generator of Parameters.
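For context, a sketch of the intended usage with per-parameter options (model and hyperparameters are made up):
```
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
params = dict(model.named_parameters())   # yields (name, Parameter) pairs
optimizer = torch.optim.SGD(
    [{'params': [params['weight']], 'weight_decay': 1e-4},
     {'params': [params['bias']], 'lr': 1e-2, 'weight_decay': 0.0}],
    lr=1e-1)
```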
Summary: Machines may not create their Gloo pairs at the same time, due to earlier variable time work. Increase the timeout used to establish the initial tcp connection to accommodate without sacrificing the shorter default timeout for outstanding reads/writes. No related change required for ibverbs as there is no communication on init.
Reviewed By: akyrola
Differential Revision: D5184518
fbshipit-source-id: 0e6c9704a2d2f1406b3927f75887f0a42199450b
The correct device must be set when getting the base allocation and when
calling cudaIpcCloseMemHandle. Store the device in the allocator's
context, which was previously always NULL.
Fixes #1707
* Modify torchvision documentation following https://github.com/pytorch/vision/pull/179
* Add new datasets to docs
* Fix wording in torch.datasets
* Small clarification
* Fix gc_refs assertion failure
Ensure that each THPVariable -> THPFunction reference contributes one
ref count to the THPFunction by creating a new shared_ptr for each ref.
Because multiple shared_ptrs can again manage a single THPFunction, it's
not safe to use std::weak_ptr where it may point to a PyFunction. It's
still safe to use weak_ptr for grad_accumulator since these are never
PyFunctions.
Fixes #1626
* Remove stale comment
Before the change, processes were not waiting for master even when they got
'connection refused' (master is not listening yet, so we should wait).
It was because we were closing the socket twice: first by the resource guard,
and second manually in the exception handler. That caused errno to be set to a
different value (9, bad file descriptor), so the `if` that checked whether the
connection was refused failed.
* Add sanity checks
* Refactor InitMethodFile and TCPInitMethod to more logical functions
* Update few error messages
* Add passing parameters by **kwargs, so the order of parameters is now irrelevant
* Review comments
* A pile of misc doc fixes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Handle @apaszke review comments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Initial csrc documentation.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Extended the time-out option from just working on TCP to also working with ibverbs
Reviewed By: pietern
Differential Revision: D5090258
fbshipit-source-id: fee685850d761d0c2130852f513c64ceb19f4e9e
Summary:
For some long running benchmarks, the iteration count could be 0
which would lead to a segfault when printing results
Reviewed By: pietern
Differential Revision: D5149034
fbshipit-source-id: 7b56e8961c302d1ff11ffcd74ca8e909ea046231
Summary:
Adding `include_directories` alone doesn't propagate include paths to the
including targets. Also use `target_include_directories` to do so.
Closes https://github.com/facebookincubator/gloo/pull/39
Differential Revision: D5131001
Pulled By: pietern
fbshipit-source-id: 6c58c4b76ae7fa008e4fb26d1bca7900165884d0
Summary:
The CMake variable CMAKE_BINARY_DIR points to the top level build
directory. For standalone Gloo builds this path lets files include the
generated file "gloo/config.h". When Gloo is included as project, this
variable points to a different path and "gloo/config.h" cannot be
resolved. Fix is to build a path from CMAKE_CURRENT_BINARY_DIR.
Closes https://github.com/facebookincubator/gloo/pull/38
Differential Revision: D5129385
Pulled By: pietern
fbshipit-source-id: 722cebf4892b34f869fe43320153efbb181555b6
Summary: Using Misha's vectorized AVX code to greatly improve performance of reductions on float16 values. Float16 reductions are now 2x faster than float.
Reviewed By: pietern
Differential Revision: D5123331
fbshipit-source-id: 03d4e76886d538b7e24eedaf32a92231a80b1e43
Summary:
The broadcast algorithms use the buffers they were given directly.
There is no inbox/outbox pattern. This means that we can race if the
algorithm is run repeatedly within a short time frame. This hasn't
been an issue so far since we've only used it in combination with
other process wide barriers.
Since this adds a round trip the latency of these ops from the root
rank perspective increases. The variance between the before and after
runs is pretty high since there is no back and forth interaction on
the root. It simply waits for recipients to be ready and then sends
its data.
Before:
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: broadcast_one_to_all
Options: processes=4, inputs=1
elements min (us) p50 (us) p99 (us) max (us) samples
100 1 16 29 50 426075
200 2 17 32 50 179953
500 2 11 31 59 140291
1000 2 12 29 59 177619
2000 3 12 29 62 117882
5000 5 16 31 64 127113
10000 9 21 38 88 60328
20000 19 36 65 130 30427
50000 48 68 221 556 11180
100000 92 136 426 871 7314
200000 193 251 829 2965 4092
500000 492 638 2098 4133 1677
1000000 1195 2024 3513 11646 628
2000000 3446 4216 5007 17100 282
5000000 12956 13919 14941 37751 71
```
After:
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: broadcast_one_to_all
Options: processes=4, inputs=1
elements min (us) p50 (us) p99 (us) max (us) samples
100 15 37 52 107 27332
200 14 40 63 199 28620
500 17 37 52 118 18299
1000 9 39 57 120 33375
2000 20 57 78 180 24779
5000 31 61 84 190 18039
10000 39 70 90 225 8908
20000 57 108 130 940 8313
50000 94 163 217 1933 5326
100000 132 231 331 3501 3681
200000 256 426 560 6509 2272
500000 774 1092 1698 10039 985
1000000 1132 2106 3878 18218 484
2000000 3509 4252 6832 20228 226
5000000 11326 15447 27129 52694 77
```
Reviewed By: wesolwsk
Differential Revision: D5123341
fbshipit-source-id: f3bab4f75ef7c38817f74f00b382f18fe43d85d5
Summary: Vector out-of-range error was being triggered in some tests due to trying to get the address of an element past the end of the vector.
Reviewed By: pietern
Differential Revision: D5123044
fbshipit-source-id: 004f72ebaa27c609290959c12a3d99b16289bfa8
* Fix segfault in autograd:
1) Every "output" variable must have a grad_fn or grad_accumulator
2) compute_partial_exec_callbacks uses Python errors
* assertRaisesRegexp was renamed assertRaisesRegex in 3.2
* Use HANDLE_TH_ERRORS macro
Summary:
In a previous commit where the slot numbering was expanded, I changed
the memory region send/recv path to use a map for the outgoing memory
regions (since they may complete out of order). Before, this was a
fixed size array, which was mutated by both the user thread and device
thread without holding a lock. The map, however, can't be mutated
without a lock. This change adds that lock and a few assertions to
check for this type of problem.
Reviewed By: andrewwdye
Differential Revision: D5108194
fbshipit-source-id: 1908c988112469ecdec6cb6eb9849068d896c409
Summary:
This file can then be used by downstream code to figure out what Gloo
features it can support (e.g. ibverbs transport or not).
Closes https://github.com/facebookincubator/gloo/pull/36
Differential Revision: D5110769
Pulled By: pietern
fbshipit-source-id: 2c0c07537258048737ae764a4978f2f7fdbd992d
Summary:
This is another example where our unsolicited writes may interfere
across calls to the collective function. In this case, it was possible
for a second call to overwrite a pair's address before it had been
used to connect the pair in the previous iteration.
Thinking out loud, we could prevent this from happening by supporting
this pattern natively in the Buffer classes. For example, we can add a
notification mechanism (opt in) to the Buffer class such that the
receiver may call `ackRecv()` to acknowledge receipt and handling of
the data in the buffer. Then the sender will block on new sends until
acknowledgement from the previous send has been received. Until then,
we have to keep an extra eye out.
Reviewed By: wesolwsk, romain-intel
Differential Revision: D5095430
fbshipit-source-id: 4c100433108fccea7457bba4dc00f651f722e6c9
* Check cuDNN version at runtime
This checks that the version from cudnn.h matches the version from
libcudnn.so.
Fixes #1476
* Only check major and minor version numbers
Summary:
The pair was still hardcoding limits on the slot numbers. In this
change those limits are lifted.
This also adds back assertions on work completion status in
handleCompletion.
Reviewed By: wesolwsk
Differential Revision: D5090457
fbshipit-source-id: 7bf884e1f31e48e8f1cdfb179a225999e28171b2
Summary: Add support for collectives over vectors of half-precision floating point values.
Reviewed By: pietern
Differential Revision: D5062938
fbshipit-source-id: 0b39fa53370393fec1edf2d852ff7f1d862b9022
Summary:
The halving/doubling algorithm had two instances where a receive
buffer was registered with a number of elements instead of a number of
bytes. This change adds the assertion that should have caught this in
the first place.
Reviewed By: wesolwsk
Differential Revision: D5089483
fbshipit-source-id: fd0f0724ef04300236c9297ee88b27e61fb1e5a0
Summary:
The original implementation created temporary buffers on the backing
context. This also meant an ordering problem when using the ibverbs
transport, as a call to send will block until the remote side has
created its receive side buffer. Since all buffers are now created
prior to using them, this is no longer an issue.
Reviewed By: romain-intel
Differential Revision: D5082352
fbshipit-source-id: 4c260f06e8f461c0336e7eec7ca891e07ff41cd3
Summary: Fixing a bug in the multiple algorithm test where threads were spawned repeatedly, causing collisions during rendezvous.
Reviewed By: pietern
Differential Revision: D5082945
fbshipit-source-id: 4adbbc963b1ff652f73a44cd9fd75dcd3325f182
Summary:
TSIA
This matches the approach in the TCP transport where all send/recv
logic is contained in the pair code.
Reviewed By: wesolwsk
Differential Revision: D5082503
fbshipit-source-id: b70886ed9aaeb381cdb45fba00704118cff62a23
Summary:
This is necessary to avoid the next iteration of the algorithm
overwriting data in recvBuf_ before it has been consumed by the
receiver of that data. If this does happen, the result of the previous
iteration for the receiving end is corrupted. This can only happen in
async mode on the TCP transport (so all incoming data is unsolicited)
when spinning on the run function.
Reviewed By: wesolwsk
Differential Revision: D5074789
fbshipit-source-id: 66668fbd885888f26266d812e78d61c6d65c2461
* Fix clang warnings
* Raise errors when unsupported ConvNd configurations are used
* Properly handle Variable indexing with LongTensors
* Support both tensors and variables in Variable.type_as
* fix issue #1549, expose bitwise and
* expose C bitwise or of Tensor
* expose C bitwise xor of Tensor
* use built-in method for inplace and, or, xor
* expose C bitwise lshift(ilshift) and rshift(irshift) of Tensor
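Usage these expose, using modern torch.tensor syntax for brevity:
```
import torch

a = torch.tensor([0b1100, 0b1010])
b = torch.tensor([0b1010, 0b0110])
print(a & b, a | b, a ^ b)   # elementwise and / or / xor
print(a << 1, a >> 2)        # elementwise shifts
a &= b                       # in-place variants (iand / ior / ixor / ilshift)
```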
A module that returns a non-standard data structure currently breaks
due to checks for backward hooks. This refactors the code slightly so
it only breaks when backward hooks are actually present.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).
The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
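A small shape illustration of the change (tensor name reused from the example above):
```
import torch

probs = torch.rand(4, 5)
print(probs.sum(1).size())                 # (4,) under the new default
denom = probs.sum(1, keepdim=True)         # (4, 1), still expandable
normalized = probs / denom.expand_as(probs)
```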
Summary:
Added a context factory that allows you to use an existing context to
create other fully connected contexts much more cheaply (without having
to rely on a store).
Limitations:
- The backing context needs to be fully connected
Reviewed By: andrewwdye, pietern
Differential Revision: D4985121
fbshipit-source-id: 31ceabccbb679cedb18ec9927b6c166bef5989bb
Summary: Set deviceId_ to -1 when CudaDevicePointer and CudaStream do not have valid data
Reviewed By: andrewwdye
Differential Revision: D4881374
fbshipit-source-id: e973a70e2e6e4519f5fdc2ad4e76f232d9593751
* Make sparseMask error if mask is uncoalesced.
Fixes #1447.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add test for sparse adagrad.
Previously, the sparse codepath was not exercised at all; this commit
adds a very simple test case "sparse Rosenbrock"; the idea is to do
Rosenbrock but then knock out one of the dimensions so that the
tensor is sparse.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
We weren't handling an edge case where write(2) would return EINTR
when in sync mode. The Pair::write function would return false
indicating it didn't complete the write whereas the send function
expects it to complete when in sync mode. With this change we now
advance the cursor and retry the write when fewer than expected bytes
were written.
Also see https://github.com/facebookincubator/gloo/issues/34
Reviewed By: andrewwdye
Differential Revision: D4996949
fbshipit-source-id: 3bad4fa3d0a01517f20b64904aa71410641fa60f
Fixes #1449.
For future reference, we should have a doc explaining our ref-counting
conventions; it looks like this bug slipped by because we assumed that
newTensor was taking ownership of the pointers it was passed in.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
As discussed in #1441.
I also added some docs giving clear guidance about how to coalescing
in sparse tensors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Simplify _gen_sparse
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Randomly generate an uncoalesced tensor and test with it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Simpler implementation of cpu_only suggested by @apaszke
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Better implementation of randn, suggested by @soumith
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Lint fix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix CUDA type error.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Previous slot offset was not added to the calculated value for the slot to be used in halving-doubling algorithms. If multiple instances were running, slot values could collide.
Reviewed By: pietern
Differential Revision: D4986618
fbshipit-source-id: 56b9220c91f31cc016d37e82907221460de70657
1) Fix "kth" attr specification -- I can't get sphinx to generate `k`th,
but `k` th works with a space, unlike now where the highlighting continues
until the next attr.
2) Specify the size of the return tensors.
3) Add an example of the return tensor sizes with more than 1 dimension.
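An illustrative shape check along the lines of point 3, using torch.kthvalue as the example reducer (input sizes are arbitrary):
```
import torch

x = torch.randn(2, 3, 4)
values, indices = x.kthvalue(2, dim=2)                # k-th smallest along dim 2
print(values.size(), indices.size())                  # (2, 3) and (2, 3)
values, indices = x.kthvalue(2, dim=2, keepdim=True)  # keeps dim 2 with size 1
print(values.size())                                  # (2, 3, 1)
```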
Summary:
This helps guard against programming errors where waitSend is called
before send is called. It uses a std::atomic to keep overhead low.
Reviewed By: andrewwdye
Differential Revision: D4984604
fbshipit-source-id: 04a63b1ba088e3bcba0abff40771af666deb15e5
Summary:
This returns EFAULT when passing a GPU memory pointer (for GPUDirect)
and the ibverbs driver can't map the GPU's memory. Since the error is
pretty cryptic, crash with a more useful message.
```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at gloo/transport/ibverbs/buffer.cc:46] mr_ !=
nullptr. ibv_reg_mr: Bad address (kernel module 'nv_peer_mem' not
loaded; did you specify a GPU pointer?)
```
Reviewed By: andrewwdye
Differential Revision: D4982966
fbshipit-source-id: 72c220fe22a3bc59396cfff992ad5f0f9c5bf83a
* Refactor test_sparse to reduce boilerplate.
Instead of manually creating a helper function, threading an is_cuda
parameter around, and creating a test method for CUDA and non-CUDA
variants, we take a different approach:
- There is now some new member variables initialized in setUp which
control the aspects of how we carry out the test; at the moment,
it's just whether or not we are using CUDA or not. This means
you don't have to pass is_cuda around, or do a conditional to
get the triplet of constructors you need.
I'll note that I am not a big fan of member variables in test
objects, but these are (intended to be) immutable so I think
it should be OK.
- Instead of manually defining test_foo and test_foo_cuda, we now
have a new TestCudaSparse class which overrides setUp (from above)
to swap in the CUDA implementation. Way less boilerplate, and NO
metaprogramming needed.
If you need to opt out of CUDA testing, there is a new cpu_only
decorator you can use.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
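A condensed sketch of the pattern (the test body and tensor types are illustrative):
```
import unittest
import torch

class TestSparse(unittest.TestCase):
    def setUp(self):
        # Immutable knobs consulted by test bodies instead of threading
        # an is_cuda parameter through helper functions.
        self.is_cuda = False
        self.IndexTensor = torch.LongTensor
        self.ValueTensor = torch.DoubleTensor
        self.SparseTensor = torch.sparse.DoubleTensor

    def test_to_dense(self):
        i = self.IndexTensor([[0, 1], [0, 1]])
        v = self.ValueTensor([3.0, 4.0])
        x = self.SparseTensor(i, v, torch.Size([2, 2]))
        self.assertEqual(float(x.to_dense().sum()), 7.0)

class TestCudaSparse(TestSparse):
    def setUp(self):
        super(TestCudaSparse, self).setUp()
        self.is_cuda = True
        self.IndexTensor = torch.cuda.LongTensor
        self.ValueTensor = torch.cuda.DoubleTensor
        self.SparseTensor = torch.cuda.sparse.DoubleTensor
```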
Summary: A generalized version of halving-doubling that supports non-power-of-two number of processes by breaking up execution into blocks that are powers of two and communicating interblock after the intrablock reduce-scatter. Non-power-of-two cases will have some degree of load imbalance compared to power-of-two, but cases with few large blocks (e.g. 8 + 4 or 16 + 8) should still perform relatively well.
Reviewed By: pietern
Differential Revision: D4955947
fbshipit-source-id: af4f218fedb6adf475530c38386978b81f4f2b74
Because of this Variables can no longer appear in the graph.
Every usage of a leaf Variable will leave an AccumulateGrad
function that has no outputs, but modifies var.grad as a side
effect.
Summary:
After running the test suite many times we end up with a zillion
connections in TIME_WAIT state. Setting SO_REUSEADDR seems like it
should help binding to ports regardless of the TIME_WAIT state.
Reviewed By: andrewwdye
Differential Revision: D4979606
fbshipit-source-id: b611f9c9e11aba858dc192f6bca3d64e10100b52
Summary:
It can happen that a pair is destructed while in CONNECTING
state when some unrelated code throws an exception after the connect
function has been called. The most likely place for this to happen is
when connecting pair A is in progress while connecting pair B throws
an exception. The exception will force destruction of all references
to pair A, even if it is in the CONNECTING state.
Also see https://github.com/facebookincubator/gloo/issues/33
Reviewed By: andrewwdye
Differential Revision: D4979557
fbshipit-source-id: 0cddddd3f478106f1694603fe7f2efe15a2d9aa1
Previously, when using the same data channel in a multi-threaded environment,
there was no guarantee against deadlocks or even errors.
Summary: No need to assert on connection errors.
Reviewed By: andrewwdye
Differential Revision: D4957698
fbshipit-source-id: b47f6f0f098dbf7d212701c5cb68e34b2c1c9522
Summary:
This PR makes cmake installs the gloo CUDA headers if USE_CUDA is enabled.
Closes https://github.com/facebookincubator/gloo/pull/29
Differential Revision: D4946856
Pulled By: pietern
fbshipit-source-id: a688c3794c4a5e34b664e7bdeb4e1148f6504419
Summary:
It should be up to the program including Gloo to ignore SIGPIPE.
We have seen a case where the EPIPE errno is not properly handled in
an unrelated piece of code. Having SIGPIPE fire means we can get a
core and debug this further.
Reviewed By: andrewwdye
Differential Revision: D4896727
fbshipit-source-id: f6fe2d3f8dc68a9e6c2c457639b45f8aee2d7b20
* move TopK to generic
* partial genericization of kernel code
* introduce TopKTypeConfig, specialize radix type and conversion for floats
* implement topk for byte tensor
* implement for char tensor
* implement for int tensor, extend test to check indices as well
* works for longs too
* make bitfield set/get a struct, add support for 64-bit types
* extend to double tensor
* implement for half tensor
* asserts; test fix
Summary: This file was left over after a recent refactoring but is not used.
Reviewed By: andrewwdye
Differential Revision: D4940265
fbshipit-source-id: 01f8c5fbc73dd0ca0a92306dbfef22ff28133750
Summary:
While it is theoretically possible to make Gloo work on 32-bit systems, it's unlikely anybody would ever use it on 32-bit systems. This removes the expectation that it should work...
Fixes #28
Closes https://github.com/facebookincubator/gloo/pull/31
Differential Revision: D4939073
Pulled By: pietern
fbshipit-source-id: 8c60804f7ae5cf835332871a424aefa2c498e8a4
Fixes #1267
This fixes a number of issues when PyTorch was compiled with CUDA
support but run on a machine without any GPUs. Now, we treat all errors
from cudaGetDeviceCount() as if the machine has no devices.
This saves an extra memory copy, which speeds up data loading a bit
(5-10% with accimage).
As part of this change:
* torch.cat accepts keyword argument out
* specifying out=None is treated like not specifying out
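A quick usage sketch of both points:
```
import torch

a, b = torch.randn(2, 3), torch.randn(2, 3)
out = torch.empty(4, 3)
torch.cat([a, b], dim=0, out=out)    # writes into the preallocated buffer
torch.cat([a, b], dim=0, out=None)   # same as omitting out entirely
```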
Summary: PrefixStore::wait() uses a default timeout if unspecified. This is incompatible when using PrefixStore to wrap a Store implementation that does not support timeout. Instead the base Store::wait(keys, timeout) implementation is called, throwing an exception. This change modifies the base implementation to ignore the timeout.
Differential Revision: D4916517
fbshipit-source-id: 3cdd83bd209bf938b58442d82f3fc245e68019ad
Summary: Fixes for corner cases with small element counts. Fixed problems include (1) calling range on out of bounds pointers, (2) failing to allocate send or receive buffers in cases where they correspond to out of bounds indices for reduce-scatter, but are needed in the allgather, (3) not allocating enough receive buffer space (more than count_ bytes may be needed in some cases)
Reviewed By: pietern
Differential Revision: D4912656
fbshipit-source-id: 0409d01894ff9c93ef1a1fdf8021c9ecf62f9b57
Summary:
memcpy comes from cstring
See https://github.com/caffe2/caffe2/issues/286
Reviewed By: Yangqing
Differential Revision: D4914228
fbshipit-source-id: de60c2a98feb4228546a8f1fe237a090101f50e4
Summary: Add a default 60s timeout to RedisStore::wait() to avoid blocking indefinitely when peer machines are unavailable.
Reviewed By: pietern
Differential Revision: D4908699
fbshipit-source-id: 39de9066633e8b0c8d1ee198b6bf3f70d3961196
Summary:
It's possible the pair is in the listening state when it is
destructed. The fd will not have been cleaned up in that case, so we
shouldn't assert that being the case.
Reviewed By: andrewwdye
Differential Revision: D4909964
fbshipit-source-id: 7103d74910e3bcf5de9f4658d8f1f682b6c8a70c
Summary: Add AllgatherRing and CudaBroadcastOneToAll to benchmark. Add host info and algorithm sweep to chronos script.
Reviewed By: pietern
Differential Revision: D4901111
fbshipit-source-id: 1421025d39b914b14e857f21c43eac30c9c9dd2f
Summary: Output peer address on network failures. This change will help in root causing network failures.
Differential Revision: D4899129
fbshipit-source-id: 60a762c6551a726081d5335ab478da8dd7f6dad7
* Fix group-convolution w/o biases on CPU.
Not having this guard will cause a crash further down in the `cat`
function when it uses the first element in the passed list to create a
new tensor. (And even after that, cat doesn't handle nulls well.)
* Added test for groupconv w/o bias on CPU.
Summary: Device reduce is more efficient for large buffer sizes. For smaller buffers, host reduce may be more efficient in some cases and frees up the GPU for other work.
Reviewed By: andrewwdye
Differential Revision: D4885855
fbshipit-source-id: 7dc522e8c93e1a94427730aca6af03b7e93e660d
Summary:
Instantiate nccl type templates for gloo (minus half).
half requires at a minumum ifdefing CUDA_HAS_HALF and likely requires
more work given that operators aren't defined on it, so skipping it
for now.
Reviewed By: pietern
Differential Revision: D4876217
fbshipit-source-id: 833d2aec12789cbaf9e0a201b979a420fbe6732f
Summary: Added a pipelined version of cuda halving/doubling algorithm. Half the buffer is reduced prior to first send and the other half prior to reducing the result from first receive. Broadcasts are started asynchronously as soon as each new message is received. New code was added as a new algorithm, as pipelining makes performance worse for small buffer sizes.
Reviewed By: pietern
Differential Revision: D4847109
fbshipit-source-id: 5aa55de95f8c94069380af7396f2b5b6297dcbea
Summary:
The code already asserted, but only on the reply type, so it didn't
include the actual error message. This makes debugging problems much
easier when people have problems running the benchmark suite.
Differential Revision: D4860022
fbshipit-source-id: 659bc461a724603375bff18eac90eca658492b05
Summary: This is cheaper than doing getaddrinfo for every pair.
Reviewed By: andrewwdye
Differential Revision: D4850102
fbshipit-source-id: e77f468f099f63860b52fdd0dcc57a8a7a91a448
Summary:
Part of this change is to perform a getaddrinfo in the TCP device
class so we can figure out the interface and subsequently PCI bus ID
of the NIC used for its traffic. This information can be used in a
later diff to avoid doing getaddrinfo calls in the TCP pairs and have
them reuse the information that is resolved by the device.
The PCI bus ID can be used to compute distance between NICs and GPUs
and make informed decisions on where to allocate scratch buffers.
Reviewed By: andrewwdye
Differential Revision: D4850035
fbshipit-source-id: 575e401a9273300bc720c814fef8971846ec748c
* Add IndexLinear
* Fixes to IndexLinear
- Fix IndexLinear test
- make it better for multithreaded case
- fix a glitch in the C code
- improve the reset() method
- fix the weight allocation.
- remove "fakeBatch" possibility as it's not used
- clamp normalized values at evaluation time instead of just dividing by max.
- add assert on the keys/values dimensions in IndexLinear.
- invert order of weightDecay in the case of output dim > 1.
* Changes required to support IndexLinear in CUDA
* Adding support for flattened inputs for IndexLinear
* Doc for IndexLinear + fix for when the input format changes from one batch to another.
* Cleaning up IndexLinear documentation
* Changes required to build with latest torch
* Adding benchmark script for IndexLinear
* Bugfixes and cleanup of IndexLinear.lua
- Fixed bug that occurs when performing multiple accGradParams +
updateParams
- All the data required for the updates is put in a single table
- Added :parameters() method
Summary:
Forgot to include these in a previous commit.
Closes https://github.com/facebookincubator/gloo/pull/23
Differential Revision: D4847072
Pulled By: pietern
fbshipit-source-id: 08aa9e8fa47377eb8c7747bd577eec7e615789f1
Summary:
With this we can compute the best GPU device to reduce on. It is not
always the one CUDA indicates as GPU 0.
Reviewed By: andrewwdye
Differential Revision: D4845581
fbshipit-source-id: 13e0500f54fd507899646f781a97c09abcd3b056
Summary:
This makes it easier to capture, compare, contrast results with
different parameters.
Reviewed By: andrewwdye
Differential Revision: D4843715
fbshipit-source-id: ba6916dcd5f8bcc615d6edce1a54657241357c31
Summary:
Instead of having every CudaDevicePointer "own" a stream, this change
moves to using CudaStream as first class object. It was pretty clunky
to use the copy{To,From}* functions on the CUDA pointer classes to
copy stuff around. For example it was not clear whether the stream
belonging to the source or destination was used to execute the copy
on. There is no longer such ambiguity after this change.
To make this work the CudaBroadcastOneToAll algorithm was changed to
include the workspace template argument, but only has the
CudaHostWorkspace implementation. The CudaDeviceWorkspace
implementation is left to be done for another change (that's not the
purpose of this change).
Reviewed By: andrewwdye
Differential Revision: D4841615
fbshipit-source-id: d0c1b9ba948ff6167832515afa7bdd2b32b48064
Summary: Make timeout a device attribute. Now the pair will configure timeout when connecting based on device timeout settings, instead of needing to be set explicitly on each pair. Set default tcp timeout to 30 sec.
Reviewed By: pietern
Differential Revision: D4838918
fbshipit-source-id: e6e6ee36c662eb5e7ba5354c904e50f9dcac258f
Summary: cuda_allreduce_halving_doubling was not properly handling the case where buffers are allocated in GPU memory, trying to reduce and copy from them as if they were in system memory.
Reviewed By: pietern
Differential Revision: D4840259
fbshipit-source-id: 2615360cd2f1d9c7a37fb0bcdf33ff35528b2c75
Summary:
Clarify that Redis Cluster is not supported. Also see #21.
Closes https://github.com/facebookincubator/gloo/pull/22
Differential Revision: D4837375
Pulled By: pietern
fbshipit-source-id: 6e3575b3b8dae6ca62beb765da15d8506da4abdb
Summary: Basic port of the CPU halving/doubling algorithm. No pipelining is done between reduce/broadcast and communication.
Reviewed By: pietern
Differential Revision: D4823693
fbshipit-source-id: b18045d64edf90361bf7713f4ccb2e074757780f
Summary:
Required for D4821763
Based on targets from https://fb.facebook.com/groups/fbcode/permalink/1304073246296178/ (I also excluded those targets which do not depend on folly:singleton).
Reviewed By: meyering
Differential Revision: D4832492
fbshipit-source-id: fcb4ce42e9e5359d4752769f77d7271e550201fe
Summary: Refactor AllgatherRing algorithm to remove all memcpy in the communication rounds by using outPtrs as send/receive buffer + remote buffer offset.
Reviewed By: pietern
Differential Revision: D4793186
fbshipit-source-id: 645d0758d246fd0b493e3fe312a8441d86f6d169
Summary:
Combines the top level common.h with algorithm.h. With algorithm.h in
the common package, CUDA algorithms only need a dependency on that
package. CudaBroadcastOneToAll still depended on broadcast.h so this
change also removes that dependency and has it subclass the Algorithm
class.
Reviewed By: andrewwdye
Differential Revision: D4826885
fbshipit-source-id: 930037e39f7a2c941868e53f0bbc54e3f2e0b184
Summary:
GPUDirect support for CudaAllreduceRingChunked by adding a workspace
template parameter and adding workspace specific init functions.
To support this change the CUDA LocalOp classes had to be changed a
bit to take an extra destination/source pointer. This allows reduction
of 1-N pointers into a target pointer, where the target may live on
device or live on host. If it lives on the host, the NCCL operation
that executes the reduction is followed by a D-to-H memory copy. If
there is only a single input pointer, no reduction needs to happen and
the class just executes the D-to-H memory copy. The net result is that
we can interchangeably use device or host pointers as target for
reduction or source for broadcast, and these LocalOp classes do what you
would expect them to do.
Reviewed By: andrewwdye
Differential Revision: D4825236
fbshipit-source-id: 048ec6cbc5a0500bafbe1b3f6abe1e2e5f3a2675
Summary: Fixes for handling errors and timeouts in blocking and polling sync paths. Add test coverage for errors and timeouts.
Reviewed By: pietern
Differential Revision: D4823498
fbshipit-source-id: 93721947a6404ca9cea6a4869f4156f8d270a981
Summary:
Any number of elements below this always fits in a single packet
and will yield ~identical results.
Differential Revision: D4825190
fbshipit-source-id: 71ac77456049e991da5059d5a029c5e9d2a67ed7
Summary:
The existing CudaAllreduceRing with a CudaDeviceWorkspace
template parameter now has the same effect.
Reviewed By: andrewwdye
Differential Revision: D4823393
fbshipit-source-id: 88fe497a983b26a281a3a74fe3bdc02c0c87c523
Summary:
Implement a file store for multi-process transport failure testing. Add test cases to spawn multi-process tcp communication, and verify that all processes throw the expected IoException.
A future diff will add coverage for connectivity failures, sync modes, and ibverbs.
Reviewed By: pietern
Differential Revision: D4807794
fbshipit-source-id: 35212719d46e6d875eacb341fae25681f39053bc
Summary:
Allreduce using the recursive halving and doubling algorithm. The algorithm is described in http://www.mcs.anl.gov/~thakur/papers/ijhpca-coll.pdf (see the top diagram on page 12). It consists of 2 log P stages: the first log P perform a reduce-scatter and the second log P the allgather. Message size varies across steps; the early stages of the reduce-scatter and the late stages of the allgather send the largest messages. The communication is structured such that the largest messages are sent between nearby ranks, which could be useful if elements are ranked in a locality-aware fashion.
So far this supports only a power-of-two number of processing elements.
I have attempted to minimize the amount of synchronization/hand-shaking. Messages are received at different offsets of the output buffer for each communication step. Send offsets in the reduce-scatter steps become receive offsets in the allgather and vice versa. The reuse of buffers across reduce-scatter and allgather steps requires synchronization. Right now the algorithm is inefficient in terms of memory use, requiring 3x memory. This can be reduced, but would require additional synchronization.
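To make the schedule concrete, a pure-Python simulation of the power-of-two case (an illustration of the communication pattern only; the real implementation exchanges these ranges through Gloo pairs and buffers):
```
# Sketch: recursive halving/doubling allreduce for a power-of-two number of
# ranks, buffer length divisible by the rank count.
def allreduce_halving_doubling(buffers):
    p, n = len(buffers), len(buffers[0])
    assert p & (p - 1) == 0 and n % p == 0
    chunk = n // p
    lo, hi = [0] * p, [p] * p          # block of chunks each rank still owns

    # Reduce-scatter: distance halving, vector halving.
    dist = p // 2
    while dist >= 1:
        for rank in range(p):
            peer = rank ^ dist
            if rank > peer:
                continue               # handle each pair once
            for r, q in ((rank, peer), (peer, rank)):
                mid = (lo[r] + hi[r]) // 2
                # r sends the half of its block that q keeps; q reduces it.
                s_lo, s_hi = (lo[r], mid) if r & dist else (mid, hi[r])
                for i in range(s_lo * chunk, s_hi * chunk):
                    buffers[q][i] += buffers[r][i]
                lo[r], hi[r] = (mid, hi[r]) if r & dist else (lo[r], mid)
        dist //= 2

    # Allgather: distance doubling, vector doubling (reverse partner order).
    dist = 1
    while dist < p:
        for rank in range(p):
            peer = rank ^ dist
            if rank > peer:
                continue
            for r, q in ((rank, peer), (peer, rank)):
                for i in range(lo[r] * chunk, hi[r] * chunk):
                    buffers[q][i] = buffers[r][i]
            lo[rank] = lo[peer] = min(lo[rank], lo[peer])
            hi[rank] = hi[peer] = max(hi[rank], hi[peer])
        dist *= 2
    return buffers

# Two ranks, four elements each: every rank ends up with the full sum.
print(allreduce_halving_doubling([[1, 2, 3, 4], [10, 20, 30, 40]]))
```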
Reviewed By: pietern
Differential Revision: D4795878
fbshipit-source-id: fcc6597ef6a99cd102fce2b8e4562d93088d39dc
Summary:
Didn't provide enough value now that ReductionFunction and
CudaReductionFunction are no longer related.
Reviewed By: andrewwdye
Differential Revision: D4819295
fbshipit-source-id: e6479769af7f78d486bee7d9c31f049430cdc775
Summary:
To bring the GPUDirect and non-GPUDirect implementations of CUDA aware
algorithms closer together this change introduces CUDA workspaces.
There's an implementation for a host side workspace and a device side
workspace. The former is used for transports that don't support
GPUDirect and the latter for ones that do. CUDA algorithms will take
an extra template parameter for this workspace and this will determine
whether they can be used for GPUDirect or not.
The workspaces only define their respective pointer types right now
but may contain local operation construction functions at a later
point in time.
Reviewed By: andrewwdye
Differential Revision: D4802826
fbshipit-source-id: cb1d71a224ce0165afd07fb9092ad54d3e07c8cf
Summary:
The CUDA algorithms all had their own version of local reduction and
broadcast. This commit consolidates them and allows all CUDA
algorithms to work with CudaDevicePointer instances.
Reviewed By: andrewwdye
Differential Revision: D4797968
fbshipit-source-id: cccef39fce01905a2cd757ccbcffd29803411409
Summary: Verification was sometimes failing for allreduce halving-doubling. Pieter noticed that it is due to verification step racing with the regular iterations.
Reviewed By: pietern
Differential Revision: D4804558
fbshipit-source-id: f645cb2e332e449a993a634c5bdb42c2dcb8613b
Summary:
This is a copy of CudaAllreduceRing that doesn't stage the locally
reduced buffer in host memory but uses the GPU side buffers directly.
Eventually I would like this to be absorbed back into
CudaAllreduceRing, but for now it's a good place to compare the two
implementations and abstract the parts that make sense, until they are
identical again.
Reviewed By: andrewwdye
Differential Revision: D4791629
fbshipit-source-id: 5ad065cb94adb968aeee2379327be313638f2161
Summary: Add a setTimeout() API to the Pair interface. Implement in the tcp transport for connect, read, and write, and across blocking, polling, and async configurations. Ibverbs implementation to come later.
Reviewed By: pietern
Differential Revision: D4787932
fbshipit-source-id: 6072dc0c0add1700f84a72b83e4388b29b044ec1
Summary:
The header already contained an analysis of required completion queue
depth but the queue pair was still initialized with a maximum queue
depth of kMaxBuffers. This change fixes that and updates the analysis
to talk separately about receive and send completion queues.
Reviewed By: andrewwdye
Differential Revision: D4785786
fbshipit-source-id: 4dc302d523a3b7162dc261d14cfcc755681febf8
Summary:
Predefining the reduction functions makes it easy to provide a set of
fast implementations. Eigen is used to implement them if it is found.
Reviewed By: andrewwdye
Differential Revision: D4780868
fbshipit-source-id: e825cf2e5cfe8ec27d587c5aff4002534b1c670d
Summary: This makes it possible to write to any offset in a remote buffer.
Reviewed By: andrewwdye
Differential Revision: D4779776
fbshipit-source-id: f5a44cc705df5141bd720ff4e3fec8697f707a70
Summary:
All operations supported by NCCL are now available through the Gloo
wrappers. Algorithm wrappers for them are forthcoming so that they
can be used interchangeably with other implementations.
Since not all of them require same-sized source and destination
pointers, I moved assertions on number of elements to the op
constructors.
Reviewed By: andrewwdye
Differential Revision: D4771292
fbshipit-source-id: 2f34629507b5e1cb9ae8d6d2f02de0a7f641a341
Summary: Allgather ring CPU implementation. It does |buffers| x |contextSize| passes.
Reviewed By: pietern
Differential Revision: D4723809
fbshipit-source-id: ffd8366ac7e1746555474e173143d33cee497822
Currently, in-place and out-of-place updateGradOutput produce different results for input=max_val or input=min_val: in-place won't backprop the gradient where input equals max_val or min_val, while out-of-place will.
Summary:
This makes it possible to embed Gloo in a project without CMake
installing Gloo headers and/or libraries, or having a runtime
dependency (and statically link to it).
Also:
* Install benchmark tools
* Statically link to NCCL if the bundled version is used
Closes https://github.com/facebookincubator/gloo/pull/19
Differential Revision: D4762432
Pulled By: pietern
fbshipit-source-id: cf38903e6c51f2480fba4ff18cbdc0c9080df0c4
Summary:
This may be the case when the Gloo CMake files are sources from a
parent project that has already imported CMake CUDA support. If these
checks are not performed then CUDA_NVCC_FLAGS might contain
conflicting options.
Verified this works while working on Gloo for Caffe2.
Closes https://github.com/facebookincubator/gloo/pull/18
Differential Revision: D4756179
Pulled By: pietern
fbshipit-source-id: 32fc39ec2322cce5899a2398ebbf8395d3917502
Summary:
Some small MPI-related changes:
1) Instead of making an object copy of the MPI_Comm, call MPI_Comm_dup.
Because the (passed-in) communicator is used later via the call to
connectFullMesh, this guarantees that the communicator will not have been
freed by the user before connectFullMesh is called.
2) Allreduce for maxLength is done on an unsigned long type; use the
corresponding MPI type.
Closes https://github.com/facebookincubator/gloo/pull/17
Differential Revision: D4754195
Pulled By: pietern
fbshipit-source-id: 863fd33c726f88120f8f5ee61964c3525babbf97
Summary:
This change solidifies IO error handling between threads and successive transport API calls. When an IO exception occurs, signal all buffers of the error, propagating the exception from the device thread or single user thread onto all user threads. Store the exception in the pair and check on future API calls or device events. Swallow all IO exceptions in the device loop.
Right now IO exceptions during portions of the listen/connect phase will result in an indefinite wait in the peer. I will address this with a configurable timeout (t16205269).
Reviewed By: pietern
Differential Revision: D4749248
fbshipit-source-id: c75ee3b20875d561bf84631e5384e28015dabad3
Summary:
Bubble up gloo configuration and network errors as exceptions. The caller may be able to recover. Other unexpected failures continue to be handled as fatal with GLOO_ENFORCE
Modify ibverb API validation to check for != 0 instead of -1 to conform with API definition.
Still need to convert some errors in the rendezvous code and add documentation.
Will pass device loop errors onto the calling thread in a future diff
Reviewed By: pietern
Differential Revision: D4730362
fbshipit-source-id: c801adb353013e7f541ab01ac16a0cc71c1c36b2
- Add additional timeouts to test_multiprocessing to reduce chances of
hanging indefinitely on failure
- Add missing header guards
- Fix typo
- Check that torch_shm_manager exists in torch/__init__.py
This ensures that we use the same library at the C++ level and with
Python ctypes. It moves the searching for the correct library from
run-time to compile-time.
- make each test in test_autograd have a unique name ignoring case
- assemble all tests when test_legacy_nn is imported
- import Python.h in PtrWrapper.h
Summary: Initializing ncclComm_t is expensive. Allocate a set of ncclComm_t for each unique device set and cache for reuse. With this change the CudaAllreduceChunked tests runtime improved from ~170 sec -> ~10 sec on my machine. There is no improvement in the benchmark numbers because the algorithm instance is only allocated once.
Reviewed By: pietern
Differential Revision: D4708943
fbshipit-source-id: 85b85070586d6683a762b8282df593ca831e7bc7
Summary:
This change includes CMake changes to compile the MPI assets when the USE_MPI flag is enabled. If so, the benchmark tool can now be launched through mpirun.
Includes the changes done in #11.
Closes https://github.com/facebookincubator/gloo/pull/12
Reviewed By: Yangqing
Differential Revision: D4712060
Pulled By: pietern
fbshipit-source-id: 0d0e93882f5822583f59304d4256dbdf5dea7483
Summary: NCCLOp::runNCCL is mistakenly recording an event in the source pointer after the NCCL op. This results in NCCLOp::wait() returning without synchronizing with the output buffer. The synchronous tests using NCCL fail.
Reviewed By: pietern
Differential Revision: D4708860
fbshipit-source-id: 0c36511e260b587d410e5c9604552ceedd06d988
Our extension library links against cudart and pulls in the symbols. Use
LoadLibrary(None) to use the same symbols as the _C extension.
This fixes the PyTorch wheel when you don't have system CUDA installed.
Summary:
This is the minimum required CMake version (also the version that is available on Ubuntu Trusty (14.04)).
Closes https://github.com/facebookincubator/gloo/pull/9
Reviewed By: Yangqing
Differential Revision: D4698659
Pulled By: pietern
fbshipit-source-id: bf01541fe485c03e7c665f175c2887feaf9516a3
Summary:
Allocate a set of per-device streams used to serialize NCCL op scheduling. These ensure concurrent NCCL ops are not interleaved across devices (i.e., through priority scheduling), resulting in deadlock.
Synchronize source and destination streams with NCCL streams.
Reviewed By: pietern
Differential Revision: D4685360
fbshipit-source-id: 3c228b195b0a0d9d7cccc720163898d344a5ed4c
Samples elements from `[0,..,len(weights)-1]` with given probabilities (weights). So far there is no means of introducing sample weights, either in loss functions or while sampling from a dataset. This is an attempt to add the functionality for the latter case.
Summary:
This makes it easy to use Gloo transports and algorithms in existing
MPI environments.
Reviewed By: andrewwdye
Differential Revision: D4685999
fbshipit-source-id: cfc7d0e445893512b4e4ed2abe1bb280d83b9c70
Summary:
How pairs are setup and connected to one another is specific to
whatever underlying rendezvous mechanism is used. This change moves
the `connectFullMesh` function into a subclass in the `rendezvous`
directory. This prepares for a separate MPI context that can setup
pairs between processes using an existing MPI communicator.
Reviewed By: andrewwdye
Differential Revision: D4684755
fbshipit-source-id: 9eb643b8ba545b3e6f9a36b65642b3b04a5f0077
Summary: CudaDevicePointer has the information we need for a NCCL op. Refactor NCCLElement as a composition of src and dst CudaDevicePointers. This allows for separate streams for src and dst, and will simplify a future change to use a static set of streams for all NCCL ops.
Reviewed By: pietern
Differential Revision: D4679483
fbshipit-source-id: 75656cc2fa5b5e2a6c096d914d2111769a47291b
* add momentum and centered options
Add two options :
- Momentum (like SGD's momentum)
- Centered RMSprop, as in Graves 2013 ( https://arxiv.org/abs/1308.0850 ) : grad is normalized by running estimation of its variance
* some PEP8
* bug in default
* bug2
* sign mistake
* alloc of momentum & centered only if needed
* add link to docstring
* some pep8 on docstring
* implement __setstate__() for backward compatibility
* correct grammar mistake
* multiply by lr when adding delta to params
* rename momentum variables
* change __init__ params order
This is an important clarification to make: otherwise users are misled as to where they may need to add dropout, and clarifying the situation would require delving into the backend implementation.
4647f753bc/torch/nn/_functions/rnn.py (L73)
Summary:
Add a nextSlot() function to the context that increments and
returns a slot number. This enables multiple algorithms sharing the
pairs part of a context. The slot numbers were hardcoded before this
change, which prevented reuse.
After this change, some of the tests can be changed to run multiple
times (or do a parameter sweep) without respawning a new threadpool or
allocating new fixtures.
Also change some internally used variable names for more consistency.
Reviewed By: andrewwdye
Differential Revision: D4668268
fbshipit-source-id: 65cbc8f2666f0b7d2f1c72574b86d913f5855d62
Summary:
Taking ownership of a std::unique_ptr is a bit awkward. It's actually
useful to reuse the underlying store and create multiple prefix stores
against it.
Reviewed By: andrewwdye
Differential Revision: D4662354
fbshipit-source-id: eaf62f7d5a97d6ee848252ff3124c28da349f6f2
Summary:
This changes the constructor prototype of the broadcast algorithms.
They now take the rank of the root process and the rank of the root
pointer. The root process now also broadcasts locally, among the
specified pointers, in addition to broadcasting to its peer processes.
The broadcast tests are made more robust to use a different value at
every index for every buffer, like the allreduce tests. To accommodate
multiple input buffers for CPU side algorithms, I added a Fixture
helper, and renamed the existing Fixture class to CudaFixture.
The broadcast tests contain a few TODOs since they don't vary the root
process or root pointer yet. I anecdotally verified this does work,
but didn't want to include the necessary changes to do so in this
commit (it requires some changes in rendezvous and NCCL code). A fix
for this is forthcoming.
Reviewed By: andrewwdye
Differential Revision: D4661635
fbshipit-source-id: c069e0d4e8f676a63efd74b15ea1156adcc09477
We were keying hooks by RemovableHandle id. However, we don't hold onto
handles and ids of dead objects can be reused. This replaces id(handle)
with a global counter.
This is similar to THCCachingHostAllocator_recordEvent() but on CUDA
allocations. It's useful for overlapping copies with computation. The
workflow is approximately:
0. allocate dst tensor on copy stream
1. copy from CPU to GPU on copy stream
2. synchronize the main stream with the copy stream via
cudaStreamWaitEvent
3. THCCachingAllocator_recordStream(dst, main_stream)
The recordStream() call is necessary to prevent the dst tensor from
being reused on the copy stream before the main stream finishes work.
Previously, you would need to insert a second cudaStreamWaitEvent before
dst is freed to force the copy stream to wait on the main stream.
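The workflow above is described in terms of the C-level THCCachingAllocator API. For illustration, here is a rough Python-level sketch of the same copy/compute overlap pattern, assuming the `torch.cuda` stream API (`torch.cuda.Stream`, `Stream.wait_stream`, `Tensor.record_stream`) is available; it is not the implementation referred to by this change.
```python
import torch

# A minimal sketch of the overlap pattern (assumes a CUDA device is present).
copy_stream = torch.cuda.Stream()
main_stream = torch.cuda.current_stream()

cpu_src = torch.randn(1 << 20).pin_memory()

with torch.cuda.stream(copy_stream):
    # steps 0/1: allocate dst and copy CPU -> GPU on the copy stream
    dst = cpu_src.cuda(non_blocking=True)

# step 2: make the main stream wait for the copy stream
main_stream.wait_stream(copy_stream)

# step 3: record that dst is also used on the main stream, so the caching
# allocator does not hand its memory back to the copy stream too early
dst.record_stream(main_stream)

result = dst * 2  # computation on the main stream
```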
Summary:
I have seen a stress run crash with unexpected state. Adding these
assertions will give more information when it happens again.
```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at gloo/transport/tcp/pair.cc:407] false. Unexpected state: 5
```
Reviewed By: andrewwdye
Differential Revision: D4652216
fbshipit-source-id: e787f4097f5ab32367dd9fa5a336d0389b97e955
* Use TH_INDEX_BASE when verifying dimension for cat
* Adding tests for cat when no dimension is specified.
- Also renamed ldimension to cat_dimension to be more specific.
Summary:
The fields are public so their names should not end with an
underscore.
Reviewed By: andrewwdye
Differential Revision: D4645038
fbshipit-source-id: c12b47affbe511383a4722717a06abb61918473b
- Code was using dimension specified which was negative
- Changed the cat_dimension variable to be more explicit
- Fixed code to use the cat_dimension variable
Summary:
The NCCL code used in CUDA-aware allreduce does local reduction of N
buffers prior to putting anything on the wire. Supporting this in the
benchmark tool to measure the impact under various configurations.
Other minor tweaks in this change:
* Specify sub-second iteration time
* Templatize allreduce benchmarks (the algorithms share a constructor
prototype)
Reviewed By: andrewwdye
Differential Revision: D4639517
fbshipit-source-id: f7417d3e9f79278a3b1eca48d779f48b77e5260c
Summary: Cuda algorithms take an optional set of device streams to sequence operations. If streams are provided, the algorithms should enqueue final output buffer operations on the associated stream and return asynchronously. Destructors that allocate streams/events should synchronize before tearing down.
Reviewed By: pietern
Differential Revision: D4636447
fbshipit-source-id: 32ec2adc214c83b0b4bc0fff8993ab196459117b
Summary:
With this change, every buffer gets assigned a different
value at every index. This means reordering of segments (e.g. in the
chunked algorithm) would surface as test errors.
Reviewed By: andrewwdye
Differential Revision: D4636368
fbshipit-source-id: 464eb1515d1590e12481961d427a92e2ebb3be82
Summary: CUDA documentation detailing high-level support for CUDA in gloo algorithms, usage of streams, and synchronizing memory management.
Reviewed By: pietern
Differential Revision: D4633120
fbshipit-source-id: d88e230c8dc82fe48cda0f401b61758fa4f07f2e
Summary:
Synchronous mode means using the calling thread instead of the device
thread for completion handling. Since this saves a context switch in
the critical path, this is very beneficial for low latency algorithms.
For example: the p99 of a 4-way barrier drops from 17us to 4us.
Reviewed By: andrewwdye
Differential Revision: D4626948
fbshipit-source-id: 013b1680497589fe5ad0bca38600bce6a410200b
Summary:
All pairs created by a device would use the same completion queue.
Supporting sync mode that way is difficult, as there is no way to
filter completions for a particular pair. This change refactors this
to use a single completion queue per pair so that this is no longer an
issue. This change is a preparation for supporting synchronous mode
(where the calling thread itself will poll the ibv library for
completions instead of the device thread).
This change also includes a refactoring of the way transient memory
regions are handled so that they are properly deregistered and
deallocated when no longer needed.
Reviewed By: andrewwdye
Differential Revision: D4625146
fbshipit-source-id: 21bf5ab321534fbd5c03f12049c10fc67da68944
Summary: std::atomic was not defined for cuda.cu.
Reviewed By: andrewwdye
Differential Revision: D4624611
fbshipit-source-id: 973bba10026e065667d6a576055d00505ee02d62
Summary: Allow gloo consumers to assign a mutex to synchronize CUDA malloc/free and NCCL operations.
Reviewed By: pietern
Differential Revision: D4622135
fbshipit-source-id: 60acd7c01a677a0df5415fe38e6ef5a2e7c8606a
Separates out non-Python part of AutoGPU. This also compiles without
CUDA which is useful for generic tensor code.
Also fixes a bug where THCPAutoGPU may not always switch the device:
THCPAutoGPU guard(-1);
guard.setDevice(0);
guard.setDevice(1);
guard.setDevice(0); // would not switch back to 0
NCCL can deadlock if cudaFree() is called while it's launching kernels.
This exposes a mutex that can be held to prevent cudaFree() calls in the
caching allocator.
Summary: The AllReduceChunked algorithm currently performs the local reduce/broadcast of local device buffers in host memory. This diff updates the algorithm to execute the local reduce/broadcast steps using NCCL operations before copying a single device buffer to/from host memory.
Reviewed By: pietern
Differential Revision: D4587441
fbshipit-source-id: 4de689f59a6cf898b8eecd3c3b9f57f77124c0e3
* Add more detail to CUDA documentation
Also adds better cross-linking to the pages that discuss relevant topics.
* Adds recommendation to torch.save docs
* Make the version numbers for the docs dynamic
Might need tweaks for beta, 1.0, etc.
Backend is SpatialDilatedMaxPooling, so change 3D input (N*C*L)
to 4D size (N*C*1*L). Then output indices will range from 0 to L.
This range will not cause UnMaxPool1D error.
Signed-off-by: Zhou Chang <achang.zhou@gmail.com>
Summary:
Work may be queued on CUDA streams for asynchronous execution. The
memory backed by pointers passed to any algorithm can therefore be
mutated after constructing an algorithm instance. By also passing in
the streams these mutations happen on, the algorithms can synchronize
with these mutations to ensure no invalid data is used.
By passing in these streams, any work done by these algorithms will
*also* be queued, which effectively removes a single synchronization
step from any algorithm run.
Differential Revision: D4589394
fbshipit-source-id: 0c8cd6ba9c9018f33d6f4c55a037083fc4164acb
Summary: I was mistakenly calling the non-chunked algorithm for the chunked test.
Reviewed By: pietern
Differential Revision: D4580160
fbshipit-source-id: 9d62a68e9e86cc6e596d90ff8854c585a0e8855c
Summary:
First pass at a CUDA-aware allreduce chunked implementation. For now the algorithm runs on the CPU and is mostly copy/paste from allreduce_ring.h. A subsequent pass will offload to the GPU.
Serialize cuda test to avoid intermittent failures due to memory contention.
Reviewed By: pietern
Differential Revision: D4576959
fbshipit-source-id: e1f292a05b88ff24c33e549d4a52e770a21f85d2
Summary: Ideally we would want the driver to busy-poll for us. In absence of driver support, spinning with MSG_DONTWAIT flag seems to be helping a lot too. Of course, we pay the price of burning one core for polling. Sigh.
Reviewed By: pietern
Differential Revision: D4576242
fbshipit-source-id: 85d9e1b786fbb6053864fba80f3e5ecc80fe221d
Summary:
Latency optimization is going well and I've seen the odd case of <10us
measurements. This option makes the benchmark tool display nanos
instead.
Differential Revision: D4575925
fbshipit-source-id: 98dbd3b39e31cbcdd4c146613f6630e721187e1e
Summary:
The CudaDevicePointer optionally takes an existing stream on
which it runs any operation associated with the pointer (for now just
memcpy's, but this likely will includes kernel execution in the
future).
Differential Revision: D4574035
fbshipit-source-id: ddd7972a3874012059f1fde1b341fd6edd69102d
Summary:
In synchronous mode, it is not the device thread that is responsible
for handling I/O, but the user thread itself. Calling waitRecv on a
buffer will trigger the read function on the pair to be called. This
eliminates the context switch necessary if the device thread is
handling all I/O. For benchmarks with small numbers of elements this
reduces latency by as much as 20%.
Reviewed By: plapukhov
Differential Revision: D4549998
fbshipit-source-id: ab718ba090c06d7c7aa4065cc9f92bd96b9e4a35
Used .c file changes from 7318e2de13 as a starting point. All changes to .c files (except for whitespace details) are present here.
However, the required .h files were not present in that PR.
Summary:
Implement CUDA BroadcastOneToAll algorithm for GPU addresses. Refactor cuda.h into cuda_private.h to allow inclusion of <cuda.h> in public headers without polluting the namespace.
Port broadcast tests to GPU variants.
* this revision is based on Peter's revision D4546932
Differential Revision: D4547382
fbshipit-source-id: 3d294ad8862b04fb783ba22e5c925b8d7cbc8a8d
Summary:
Separate benchmark build target for CUDA-aware algorithms.
This is needed to keep CUDA an optional dependency.
Differential Revision: D4546932
fbshipit-source-id: b73176ae9067233f883d51ba3ab4efbb13a6f86f
Summary:
This CUDA-aware ring allreduce is based on the regular ring allreduce.
It runs the reduction algorithm on the CPU and is therefore most
suited for smaller buffers.
Both the device-to-host memcpy's at the start of the algorithm and the
host-to-device memcpy's at the end of the algorithm are kicked off
asynchronously in an attempt to parallelize as much as possible.
Reviewed By: Yangqing
Differential Revision: D4542816
fbshipit-source-id: 101dfad276ca79703e37ff93fb1b6d467295f66b
Summary:
The CUDA benchmark suite will be a separate build target, so the
runner should be reused.
Reviewed By: Yangqing
Differential Revision: D4545092
fbshipit-source-id: 6ccf2d30f5d35c74fc59851b25416bfe6863d62c
The core autograd Variable, Function, and Engine no longer depend on the
Python API. This lets us implement functions in C++. In the future, we
can also multithread the engine and release the GIL for most of the
non-Python backwards.
Summary:
In the GitHub repository this directory will be mirrored similar to
folly, such that the repository has a single top level directory
called "gloo". This allows for versioning or renaming of the
project root, without having to mangle the include paths; they will
always use the "gloo" prefix.
fbshipit-source-id: 24502e4185fc7cbe19b5249f83609e2b8118e9d7
In cases where copyAsync is a large percentage of the work,
processing events in recordEvent can cause a large bottleneck.
Here, we relax the constraint that we reclaim blocks as fast as possible
(i.e. in copyAsync); instead, we only check that a block can be re-allocated
in malloc and free.
These methods are useful from C because they don't require constructing
THLongStorages to wrap the sizes and strides, which can lead to leaked
memory in case of an error. Instead the sizes and strides can be
represented on the stack using standard C long arrays.
Moves THPObjectPtr into a separate header, so that it can be included
independently. Currently, utils.h requries all of THP.h. Also adds RAII
structs for acquiring and releasing the GIL.
Due to bad rank mapping, broadcast and reduce were connecting the
wrong processes, which resulted in errors or in tensors not being received/sent.
* Introduced a new mapping method to solve this problem.
* Added and improved tests for these cases.
Here's the command I used to invoke autopep8 (in parallel!):
git ls-files | grep '\.py$' | xargs -n1 -P`nproc` autopep8 -i
Several rules are ignored in setup.cfg. The goal is to let autopep8
handle everything which it can handle safely, and to disable any rules
which are tricky or controversial to address. We may want to come back
and re-enable some of these rules later, but I'm trying to make this
patch as safe as possible.
Also configures flake8 to match pep8's behavior.
Also configures TravisCI to check the whole project for lint.
* Fix error in ELU backward
* Add --seed flag for tests
* Add test for BatchNorm eval
* Fix autograd.backward docs
* Support cc flags in cuDNN search
* Fix IndexSelect backward formula
Scales `delta` before it is applied to the parameters in order to control the learning rate of the optimizer (inspired from climin optim lib for theano).
Also changed the link to the Adadelta paper to point to the right location.
* Always compile .numpy() for all types
* Add torch.nn.functional docs and hidden headers
* Use sphinx to generate torchvision docs
* Remove unused import in ffi utils
Transposed convolutions are often (but incorrectly) referred to as Deconvolutional operations. Made mention of this in the docstring to make it easier for people to search for this operation in the documentation.
Depending on how PyTorch is compiled, the source code for DataLoader
might not be fully available which can cause a spurious error in
test_dataloader.py
Improve the error messages raised for invalid combinations of arguments.
For example:
>>> torch.randn(5, 5).geqrf('invalid arg')
TypeError: geqrf received an invalid combination of arguments - got (str), but expected ()
This is because the current version of luaffifb fails to pass
custom structs (i.e. half) as arguments or accept them as return
values.
The accreal parameters are immediately converted to real internally.
This is done to ensure none of the internal code needs to be changed.
This change also removes transform_reals_to_half which is no longer
necessary.
Change-Id: I978151d001de5492576fb0eddfa0608cd4e99149
The load_state_dict() function now raises an error if the argument
state_dict has extra keys or is missing keys.
Previously, load_state_dict() ignored extra and missing keys, which made
it hard to notice when you load an invalid state_dict. This could
happen, for example, if you save the state_dict for a DataParallel, but
load it into a single model.
The state_dict() function now only includes the Tensor data from the
parameters, which reduces checkpoint size by not saving gradients.
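For illustration, a small sketch of the DataParallel mismatch scenario mentioned above; the exact exception type is not specified here, so the example simply catches a generic exception.
```python
import torch.nn as nn

model = nn.Linear(4, 2)
wrapped = nn.DataParallel(model)       # parameter keys get a "module." prefix

checkpoint = wrapped.state_dict()      # e.g. "module.weight", "module.bias"
plain = nn.Linear(4, 2)

try:
    plain.load_state_dict(checkpoint)  # now fails loudly instead of silently ignoring keys
except Exception as err:               # exact exception type is an assumption
    print("state_dict mismatch:", err)

plain.load_state_dict(wrapped.module.state_dict())  # matching keys load fine
```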
The register hook calls now return an object that can be used to remove
the hook. For example,
>>> h = module.register_forward_hook(callback)
>>> h.remove() # removes hook
Or as a context manager:
>>> with module.register_forward_hook(callback):
... pass
This makes it easier for libraries to use hooks without worrying about
name collisions.
- Non differentiable outputs could prevent a gradient computation (see
test_dep_nograd)
- Crash in backward on variable which doesn't requires_grad (issue
#438)
- Stochastic functions could be backproped through multiple times
- don't use cuDNN for half inputs because weight, bias, running_mean,
etc. are required to be of different type than for THCUNN
- accept 3D inputs (N,C,L) in BatchNorm1d
- remove accidental 'use_cudnn=False'
* Add support for torch.HalfTensor.
* Improvements/Simplifications for torch.HalfTensor.
Improvements/Simplifications:
1) Defines half type as TH_Half, so as to not conflict with cutorch
version. Previously, these were defined as the same "half" type and
required proper ordering of includes to ensure type was only defined
once, which would have affected all downstream projects.
2) No longer generates math functions that are not actually defined
on torch.HalfTensor, e.g. maskedFill, map, etc.
3) Adds tests for all available torch.HalfTensor functions
4) Allows compiling without TH_GENERIC_USE_HALF (so if there's a
problem can just unset that in CMakeLists rather than backing out)
5) Some simplifications: removes a new copy optimization and
some TH_HALF literal definitions
Limitations:
Because math functions are not defined, some "non-math" operators
on torch.HalfTensor give an error message, e.g. __index__/__newindex__
with a ByteTensor apply a mask, but masks aren't implemented. These
limitations aren't always obvious (e.g. for documentation purposes),
but they should always give an error message.
* Rename TH_HALF to THHalf.
This hooks into the (internal) ForkingPickler class in multiprocessing
to reduce tensors, storages, and CUDA events instead of our queue from
joblib. This makes it easier to use the standard multiprocessing classes
in later versions of Python.
This also exposes:
- Tensor/Storage.share_memory_()
- Module.share_memory()
These methods move the CPU tensors and storages to shared memory. If
you're using the "fork" method of multiprocessing, these objects can be
directly inherited instead of serialized through a queue.
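A minimal sketch of this sharing workflow; the function and tensor names here are only for illustration.
```python
import torch
import torch.multiprocessing as mp

def add_one(t):
    t.add_(1)                 # visible in the parent: the storage is shared

if __name__ == '__main__':
    x = torch.zeros(5)
    x.share_memory_()         # move the underlying storage to shared memory
    p = mp.Process(target=add_one, args=(x,))
    p.start()
    p.join()
    print(x)                  # all elements are now 1
```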
Added support for the fill, diff, scale, mul and add functions using
PPC CPU vector instructions. These are used in place of the versions
of these functions written for x86, when compiled on PPC.
This fixes a compile failure on PPC
Occasionally, my PyTorch checkout gets into a bad state where libnccl.so
does not exist, but the NCCL makefile doesn't build it because
libnccl.so.1 exists. Switch to copying libnccl.so.1 to work around this.
Fix a bug in cat when catting with an empty tensor along first dim (it added an extra dim).
Fix the ambiguous 'catting along last dimension' sentence in the doc and change the behavior to pick the maximum last dimension over all input tensors.
Now empty tensors are allowed.
CUDA IPC only works with Python 3 using the "spawn" start method. You
can select the start method using the get_context method:
import torch.multiprocessing as mp
ctx = mp.get_context('spawn')
queue = ctx.Queue()
event = ctx.Event()
Uses the assignment syntax to get deterministic ordering of parameters.
The ordering of parameters using the constructor syntax is
non-deterministic because kwargs use dict() in Python 3.5 and earlier.
Without this, the cuda_events could continuously grow from calls to
cudaMemcpyAsync, but would never be processed if there were no new
pinned memory allocations.
For example:
t1 = cutorch.createCudaHostTensor(10)
t2 = torch.CudaTensor(10)
while true do t2:copyAsync(t1) end
Adds a caching allocator for CUDA pinned (page-locked) memory. This
avoid synchronization due to cudaFreeHost or cudaHostUnregister at the
expense of potentially higher host memory usage.
Correctness is preserved by recording CUDA events after each
cudaMemcpyAsync involving the pinned memory. The pinned memory
allocations are not reused until all events associated with it have
completed.
Exceptions are:
1) SparseLinear
requires additional parameters to be passed in (e.g. nbatches),
so it's not clear it's worth moving to C since it won't really simplify the binding
code logic.
2) BatchNormalization
requires "makeBatch", which isn't a trivial translation to C.
3) LookupTable
requires "view" in C, which is already a TODO
4) SpatialUpSamplingBilinear
requires "view" in C, which is already TODO
DataLoader now supports the constructor argument 'pin_memory'. When set
to true, tensors in the sample are copied to pinned memory. This happens
in a background thread when num_workers > 1.
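A minimal usage sketch of the new flag; the dataset contents here are placeholders.
```python
import torch
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(torch.randn(100, 3), torch.zeros(100).long())
loader = DataLoader(dataset, batch_size=10, num_workers=2, pin_memory=True)

for inputs, targets in loader:
    # batches now live in pinned (page-locked) host memory, so the
    # host-to-device copy below can run asynchronously
    if torch.cuda.is_available():
        inputs = inputs.cuda(non_blocking=True)
```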
Previously, cutorch would initialize every CUDA device and enable P2P
access between all pairs. This slows down start-up, especially with 8
devices. Now, THCudaInit does not initialize any devices and P2P access
is enabled lazily. Setting the random number generator seed also does
not initialize the device until random numbers are actually used.
Previously, cutorch would initialize every CUDA device and enable P2P
access between all pairs. This slows down start-up, especially with 8
devices. Now, THCudaInit does not initialize any devices and P2P access
is enabled lazily. Setting the random number generator seed also does
not initialize the device until random numbers are actually used.
Only references to their data and version counters are stored.
Also, it is now possible to have None arguments in save_for_backward
and return too many values from backward (as long as the excessive
results are None).
Without the PyObject_GC_UnTrack call, the tp_dealloc handler could get
called twice if a referred to object triggers a garbage collection from
its destructor.
See http://bugs.python.org/issue28737
This change removes HtoD copies inside baddbmm. These copies
introduce a syncing point which causes slow downs in a multi
gpu training.
Test plan: Run unittests for baddbmm.
On ARMv8, NEON support is inherent and is instead listed as 'asimd' in /proc/cpuinfo
Replace assembly with C
Original authors:
- @dusty-nv
FindARM-patch.txt
CMakeLists-patch.txt
- @rtarquini
NEON.c
Differences from nn equivalent:
1) No changes to VolumetricConvolutionMM, which doesn't exist in cunn.
2) No changes to HardShrink, which doesn't exist in cunn.
3) LookupTable doesn't verify that all inputs are within range.
Math is done at accreal precision. At real precision,
forward pass fails, but backward passes. We do backward
pass at accreal precision for consistency.
Half types fail on backward, probably because we don't consistently
accumulate in accreal. This is difficult because gradInput is
accumulated directly (either with atomicAdd or not) rather than
in another variable.
This is the first instance of functions that take a lua number but
are not reals in C. So, instead of automatically converting lua
numbers in the half case, we parse the function definitions to
find the argument positions to convert.
Math is done at accreal precision (e.g. for half,
math is done at float precision). Originally code
called __expf, which doesn't have a double equivalent;
we call exp instead of converting down.
This maintains the existing logic of doing the math in
double precision and converting back to the intended
type (previously: just float). We do the same for
half here, although perhaps we should do the math
at float in that case.
There is some question about what to do with conversions;
Sigmoid did math in double before converting back to float;
we keep this intent, although there is some question on whether
this was intentional and for half -- should we just go up to
float or up to double?
Adds the ability to "genericize" cunn modules that can exist
simultaneously with non-generic modules (i.e. modules can
be genericized one at a time). Allowing both generic and
non-generic modules simultaneously requires some extra code
that can be removed once every module is genericized.
Also genericizes SoftPlus in this way.
The __getstate__ and __setstate__ functions are called from copy.copy as
well as pickling. The source code inspection currently slows down the
data parallel code because it makes a copy of the object every
iteration.
This adds three small pieces to help with sharing THCStorages across
processes:
1. THCIpcAllocator: a THCDeviceAllocator to close shared memory handles in the
child process.
2. THCCachingAllocator_getBaseAllocation which returns the pointer and
size of the underlying cudaMalloc allocation. This is necessary
because cudaIpcGetMemHandle requires 'base' pointers
3. Support for TH_STORAGE_VIEW in THCStorage_(free). This is useful in
child processes to represent THCCachingAllocator allocations split
from a larger cudaMalloc call.
See issue #20
The torch.Size class is a tuple subclass which distinguishes sizes from
other tuples so that torch.Tensor(size) is interpreted as size instead
of data.
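A quick illustration of the distinction (a minimal sketch):
```python
import torch

s = torch.randn(2, 3).size()
print(type(s), isinstance(s, tuple))   # torch.Size is a tuple subclass

a = torch.Tensor(s)          # interpreted as a shape: an uninitialized 2x3 tensor
b = torch.Tensor([2, 3])     # interpreted as data: a 1-D tensor holding 2 and 3
print(a.size(), b.size())
```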
Expose omp_set_num_threads and similar APIs through the TH lib. This
means third-party libraries using TH don't need to be compiled with
OpenMP support just to control the number of TH OMP threads.
Use a single, global THCCachingAllocator instance.
Previously, each Lua thread had its own THCCachingAllocator instance.
However, threads can share storages, which means a segment could be
allocated from one THCCachingAllocator and freed on another, which
breaks.
Fixes #539
modules(): returns an iterator over all modules in the network
children(): returns an iterator over immediate children
Also fix __getitem__ in Sequential
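A short sketch of the two iterators on a nested Sequential:
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 20),
    nn.Sequential(nn.ReLU(), nn.Linear(20, 5)),
)

print([type(c).__name__ for c in model.children()])
# immediate children only: ['Linear', 'Sequential']

print([type(m).__name__ for m in model.modules()])
# the module itself plus every descendant, recursively

print(model[0])  # __getitem__ on Sequential returns the first submodule
```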
THSetErrorHandler still modifies per-thread pointers, but
THSetDefaultErrorHandler allows to set a handler that's
used by all threads that haven't specified any function.
THSetErrorHandler still modifies per-thread pointers, but
THSetDefaultErrorHandler allows to set a handler that's
used by all threads that haven't specified any function.
Switching the device, setting the stream, and switching BLAS handles is
now thread-safe. Some other operations, like reserveStreams, are still
not thread-safe.
Prior to this change, there was a circular reference between Leaf and
Variable. This means that the objects (and referenced Tensors) are not
collected as soon as they go out of scope, which led to higher memory
usage and out-of-memory errors.
Adds indexAdd via atomicAdd for unsigned char, char, short, long,
half, double. Integer types are templatized based on sizeof.
Floating point types are implemented via intrinsics.
Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.
Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
Also cleaned up makefile so that tests and lib are not built unnecessarily.
Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
The project is still under active development and is likely to drastically change in short periods of time.
We will be announcing API changes and important developments via a newsletter and GitHub issues, and we will post a link to the issues on Slack.
Please remember that at this stage, this is an invite-only closed alpha, and please don't distribute code further.
This is done so that we can control development tightly and rapidly during the initial phases with feedback from you.
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system
You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed.
We are in an early-release beta. Expect some adventures and rough edges.
- [More about PyTorch](#more-about-pytorch)
- [Installation](#installation)
- [Binaries](#binaries)
- [From Source](#from-source)
- [Docker Image](#docker-image)
- [Getting Started](#getting-started)
- [Communication](#communication)
- [Releases and Contributing](#releases-and-contributing)
- [The Team](#the-team)
| System | 2.7 | 3.5 |
| --- | --- | --- |
| Linux CPU | [](https://travis-ci.org/pytorch/pytorch) | [](https://travis-ci.org/pytorch/pytorch) |
| Linux GPU | [](https://build.pytorch.org/job/pytorch-master-py2-linux) | [](https://build.pytorch.org/job/pytorch-master-py3-linux) |
| macOS CPU | [](https://build.pytorch.org/job/pytorch-master-py2-osx-cpu) | [](https://build.pytorch.org/job/pytorch-master-py3-osx-cpu) |
## More about PyTorch
At a granular level, PyTorch is a library that consists of the following components:
<table>
<tr>
<td><b> torch </b></td>
<td> a Tensor library like NumPy, with strong GPU support </td>
</tr>
<tr>
<td><b> torch.autograd </b></td>
<td> a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch </td>
</tr>
<tr>
<td><b> torch.nn </b></td>
<td> a neural networks library deeply integrated with autograd designed for maximum flexibility </td>
</tr>
<tr>
<td><b> torch.multiprocessing </b></td>
<td> Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training. </td>
</tr>
<tr>
<td><b> torch.utils </b></td>
<td> DataLoader, Trainer and other utility functions for convenience </td>
</tr>
<tr>
<td><b> torch.legacy(.nn/.optim) </b></td>
<td> legacy code that has been ported over from torch for backward compatibility reasons </td>
</tr>
</table>
Usually one uses PyTorch either as:
- a replacement for NumPy to use the power of GPUs.
- a deep learning research platform that provides maximum flexibility and speed
Elaborating further:
### A GPU-Ready Tensor Library
If you use NumPy, then you have used Tensors (a.k.a ndarray).
PyTorch is not a Python binding into a monolithic C++ framework.
It is built to be deeply integrated into Python.
You can use it naturally like you would use NumPy / SciPy / scikit-learn etc.
You can write your new neural network layers in Python itself, using your favorite libraries
and use packages such as Cython and Numba.
Our goal is to not reinvent the wheel where appropriate.
### Imperative Experiences
PyTorch is designed to be intuitive, linear in thought and easy to use.
When you execute a line of code, it gets executed. There isn't an asynchronous view of the world.
When you drop into a debugger, or receive error messages and stack traces, understanding them is straightforward.
The stack trace points to exactly where your code was defined.
We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines.
### Fast and Lean
PyTorch has minimal framework overhead. We integrate acceleration libraries
such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed.
At the core, its CPU and GPU Tensor and neural network backends
(TH, THC, THNN, THCUNN) are written as independent libraries with a C99 API.
They are mature and have been tested for years.
Hence, PyTorch is quite fast – whether you run small or large neural networks.
The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives.
We've written custom memory allocators for the GPU to make sure that
your deep learning models are maximally memory efficient.
This enables you to train bigger deep learning models than before.
### Extensions without Pain
Writing new neural network modules, or interfacing with PyTorch's Tensor API, was designed to be straightforward
and to involve minimal abstractions.
You can write new neural network layers in Python using the torch API
[or your favorite NumPy-based libraries such as SciPy](http://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html).
If you want to write your layers in C/C++, we provide an extension API based on
[cffi](http://cffi.readthedocs.io/en/latest/) that is efficient and with minimal boilerplate.
There is no wrapper code that needs to be written. You can see [a tutorial here](http://pytorch.org/tutorials/advanced/c_extension.html) and [an example here](https://github.com/pytorch/extension-ffi).
## Installation
### Binaries
- Anaconda
Commands to install from binaries via Conda or pip wheels are on our website:
[http://pytorch.org](http://pytorch.org)
### From Source
If you are installing from source, we highly recommend installing an [Anaconda](https://www.continuum.io/downloads) environment.
You will get a high-quality BLAS library (MKL) and you get a controlled compiler version regardless of your Linux distro.
Once you have [Anaconda](https://www.continuum.io/downloads) installed, here are the instructions.
If you want to compile with CUDA support, install
- [NVIDIA CUDA](https://developer.nvidia.com/cuda-downloads) 7.5 or above
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v5.x or above
If you want to disable CUDA support, export environment variable `NO_CUDA=1`.
* Slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . If you need a slack invite, ping us at soumith@pytorch.org
* newsletter: no-noise, one-way email newsletter with important announcements about pytorch. You can sign-up here: http://eepurl.com/cbG0rv
## Timeline
## Releases and Contributing
We will run the alpha releases weekly for 6 weeks.
After that, we will reevaluate progress, and if we are ready, we will hit beta-0. If not, we will do another two weeks of alpha.
PyTorch has a 90 day release cycle (major releases).
Its current state is Beta, and we expect no obvious bugs. Please let us know if you encounter a bug by [filing an issue](https://github.com/pytorch/pytorch/issues).
* ~~alpha-0: Working versions of torch, cutorch, nn, cunn, optim fully unit tested with seamless numpy conversions~~
* ~~alpha-1: Serialization to/from disk with sharing intact. initial release of the new neuralnets package based on a Chainer-like design~~
* ~~alpha-2: sharing tensors across processes for hogwild training or data-loading processes. a rewritten optim package for this new nn.~~
* ~~alpha-3: binary installs, contbuilds, etc.~~
* alpha-4: a ton of examples across vision, nlp, speech, RL -- this phase might make us rethink parts of the APIs, and hence we want to do this in alpha rather than beta
* alpha-5: Putting a simple and efficient story around multi-machine training. Probably simplistic like torch-distlearn. Building the website, release scripts, more documentation, etc.
* alpha-6: [no plan yet]
We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.
The beta phases will lean more towards working with all of you, covering your use cases, and active development on non-core aspects.
If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without discussion might end up resulting in a rejected PR, because we might be taking the core in a different direction than you might be aware of.
## pytorch vs torch: important changes
**For the next release cycle, these are the 3 big features we are planning to add:**
We've decided that it's time to rewrite/update parts of the old torch API, even if it means losing some backward compatibility (we can hack up a model converter that converts correctly).
This section lists the biggest changes, and suggests how to shift from torch to pytorch.
1. [Distributed PyTorch](https://github.com/pytorch/pytorch/issues/241) (a draft implementation is present in this [branch](https://github.com/apaszke/pytorch-dist) )
2. Backward of Backward - Backpropagating through the optimization process itself. Some past and recent papers such as
[Double Backprop](http://yann.lecun.com/exdb/publis/pdf/drucker-lecun-91.pdf) and [Unrolled GANs](https://arxiv.org/abs/1611.02163) need this.
3. Lazy Execution Engine for autograd - This will enable us to optionally introduce caching and JIT compilers to optimize autograd code.
For now there's no pytorch documentation.
Since all currently implemented modules are very similar to the old ones, it's best to use the torch7 docs for now (bearing in mind several differences described below).
### Library structure
## The Team
All core modules are merged into a single repository.
Most of them will be rewritten and will be completely new (more on this below), but we're providing a Python version of old packages under torch.legacy namespace.
* torch (torch)
* cutorch (torch.cuda)
* nn (torch.legacy.nn)
* cunn (torch.legacy.cunn)
* optim (torch.legacy.optim)
* nngraph (torch.legacy.nngraph - not implemented yet)
PyTorch is a community driven project with several skillful engineers and researchers contributing to it.
### 0-based indexing
pytorch uses 0-based indexing everywhere.
This includes arguments to `index*` functions and nn criterion weights.
Under the hood, on the C side, we've changed logic on TH / THC / THNN / THCUNN to introduce a TH_INDEX_BASE compile-time definition to switch between 0 and 1 indexing logic.
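For example (a minimal sketch):
```python
import torch

x = torch.Tensor([[1, 2, 3],
                  [4, 5, 6]])

print(x[0][0])                                    # 1.0: indices start at 0
print(x.index_select(0, torch.LongTensor([0])))   # selects the first row
```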
### New Tensor API
**All methods operating on tensors are now out-of-place by default.**
This means that although `a.add(b)` used to have the side effect of mutating the elements of `a`, it will now return a new Tensor holding the result.
All methods that mutate the Tensor/Storage are now marked with a trailing underscore (including `copy` -> `copy_`, `fill` -> `fill_`, `set` -> `set_`, etc.).
Most math methods have in-place counterparts, so the equivalent of Lua's `a.add(b)` is now `a.add_(b)` (or `torch.add(a, a, b)`, which is not recommended in this case).
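A minimal illustration of the convention:
```python
import torch

a = torch.ones(3)
b = torch.ones(3)

c = a.add(b)    # out-of-place: returns a new Tensor, a is unchanged
print(a)        # still all ones

a.add_(b)       # in-place: trailing underscore, mutates a
print(a)        # now all twos
```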
### CUDA module
All tensors have their CUDA counterparts in torch.cuda module.
There is no `torch.cuda.setDevice` anymore. By default the 0th device is always selected, but code can be placed in a `with` statement to change it:
```python
with torch.cuda.device(1):
    a = torch.cuda.FloatTensor(10)  # a is allocated on GPU1
```
Calling `.cuda()` on a tensor no longer converts it to a GPU float tensor, but to a CUDA tensor of the same type located on the currently selected device.
So, for example: `a = torch.LongTensor(10).cuda() # a is a CudaLongTensor`
Calling `.cuda(3)` will send it to the third device.
`.cuda()` can also be used to transfer CUDA tensors between devices (calling it on a GPU tensor with a different device selected will copy it to the current device).
```python
a = torch.LongTensor(10)
b = a.cuda()      # b is a torch.cuda.LongTensor placed on GPU0
c = a.cuda(2)     # c is a torch.cuda.LongTensor placed on GPU2
with torch.cuda.device(1):
    d = b.cuda()  # d is a copy of b, but on GPU1
    e = d.cuda()  # a no-op, d is already on the current GPU, so e is d == True
```
Also, setting the device is now only important for specifying where to allocate new Tensors. You can perform operations on CUDA Tensors irrespective of the currently selected device (but all arguments have to be on the same device) - the result will also be allocated there. See below for an example:
```python
a = torch.randn(2, 2).cuda()
b = torch.randn(2, 2).cuda()
with torch.cuda.device(1):
    c = a + b                     # c is on GPU0
    d = torch.randn(2, 2).cuda()  # d is on GPU1
```
In the near future, we also plan to use a CUDA caching allocator, which alleviates the problem of cudaMalloc/cudaFree being a sync point.
This will mean we don't have to worry about using buffers for every intermediate computation in a module when doing multi-GPU training, for example.
See: https://github.com/torch/cutorch/pull/443
### Numpy integration
Because numpy is a core numerical package in Python, and is used by many other libraries like matplotlib, we've implemented a two-way bridge between pytorch and numpy.
```python
import numpy
import torch

a = torch.randn(2, 2)
b = a.numpy()               # b is a numpy array of the type corresponding to a
# no memory copy is performed, they share the same storage
c = numpy.zeros((5, 5))
d = torch.DoubleTensor(c)   # it's possible to construct Tensors from numpy arrays
# d shares memory with c - there's no copy
```
### New neural network module
After looking at several framework designs, looking at the current design of `nn` and thinking through a few original design ideas, this is what we've converged to:
* Adopt a Chainer-like design
* Makes it extremely natural to express Recurrent Nets and weight sharing
* Each module can operate in-place, but marks used variables as dirty - errors will be raised if they're used again
* RNN example:
```python
class Network(nn.Container):
    def __init__(self):
        super(Network, self).__init__(
            conv1=nn.SpatialConvolution(3, 16, 3, 3, 1, 1),
            relu1=nn.ReLU(True),
            lstm=nn.LSTM(),
        )
    def __call__(self, input):
        y = self.conv1(input)
        y = self.relu1(y)
        y = self.lstm(y)
        return y
model = Network()
input = nn.Variable(torch.zeros(256, 3, 224, 224))
output = model(input)
loss = 0
for i in range(ITERS):
    input, target = ...
    # That's all you need for an RNN
    for t in range(TIMESTEPS):
        loss += loss_fn(model(input), target)
    loss.backward()
```
* Here, nn.Variable will have a complete tape-based automatic differentiation implemented
* To access states, have hooks for forward / backward (this also makes multi-GPU easier to implement)
* This has the advantage of not having to worry about in-place / out-of-place operators for accessing .output or .gradInput
* When writing the module, make sure debuggability is straightforward. Dropping into pdb and inspecting things should be natural, especially when going over the backward graph.
* Pulling handles to a module after constructing a chain should be very natural (apart from having a handle at construction)
* It's easy, since modules are assigned as Container properties
* Drop overly verbose names. Example:
* SpatialConvolution → conv2d
* VolumetricConvolution → conv3d
#### Some notes on new nn implementation
As shown above, structure of the networks is fully defined by control-flow embedded in the code. There are no rigid containers known from Lua. You can put an `if` in the middle of your model and freely branch depending on any condition you can come up with. All operations are registered in the computational graph history.
There are two main objects that make this possible - variables and functions. They will be denoted as squares and circles respectively.
Variables are the objects that hold a reference to a tensor (and optionally to gradient w.r.t. that tensor), and to the function in the computational graph that created it. Variables created explicitly by the user (`Variable(tensor)`) have a Leaf function node associated with them.
Functions are simple classes that define a function from a tuple of inputs to a tuple of outputs, and a formula for computing the gradient w.r.t. its inputs. Function objects are instantiated to hold references to other functions, and these references make it possible to reconstruct the history of a computation. An example graph for a linear layer (`Wx + b`) is shown below.
Please note that function objects never hold references to Variable objects, except when they're necessary in the backward pass. This allows all the unnecessary intermediate values to be freed. A good example of this is addition when computing e.g. `y = Wx + My`:
The matrix multiplication operation keeps references to its inputs because it will need them, but addition doesn't need `Wx` and `My` after it computes the result, so as soon as they go out of scope they are freed. To access intermediate values in the forward pass you can either copy them while you still have a reference, or you can use a system of hooks that can be attached to any function. Hooks also allow you to access and inspect gradients inside the graph.
Another nice thing about this is that a single layer doesn't hold any state other than its parameters (all intermediate values are alive as long as the graph references them), so it can be used multiple times before calling backward. This is especially convenient when training RNNs. You can use the same network for all timesteps and the gradients will sum up automatically.
To compute the backward pass you can call `.backward()` on a variable if it's a scalar (a 1-element Variable), or you can provide a gradient tensor of matching shape if it's not. This creates an execution engine object that manages the whole backward pass. It's been introduced so that the code for analyzing the graph and scheduling node processing order is decoupled from other parts, and can be easily replaced. Right now it simply processes the nodes in topological order, without any prioritization, but in the future we can implement algorithms and heuristics for scheduling independent nodes on different GPU streams, deciding which branches to compute first, etc.
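A tiny example of the two cases described above (scalar vs. non-scalar), written as a sketch against the Variable API used throughout this document:
```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)

y = (x * 3).sum()          # scalar result: backward() needs no arguments
y.backward()
print(x.grad)              # every entry is 3

z = x * 2                  # non-scalar: provide a gradient of matching shape
z.backward(torch.ones(2, 2))
print(x.grad)              # gradients accumulate: now 5 everywhere
```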
### Serialization
Pickling tensors is supported, but requires making a temporary copy of all data and breaks sharing.
For this reason we're providing `torch.load` and `torch.save`, which are free of these problems.
They have the same interfaces as `pickle.load` (file object) and `pickle.dump` (serialized object, file object) respectively.
For now the only requirement is that the file should have a `fileno` method, which returns a file descriptor number (this is already implemented by objects returned by `open`).
Objects are serialized in a tar archive consisting of four files:
`sys_info` - protocol version, byte order, long size, etc.
`pickle` - pickled object
`tensors` - tensor metadata
`storages` - serialized data
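A minimal round-trip sketch of the interface described above; the filename is just an example.
```python
import torch

x = torch.randn(3, 3)

with open('tensor.pt', 'wb') as f:   # a real file object, so fileno() is available
    torch.save(x, f)

with open('tensor.pt', 'rb') as f:
    y = torch.load(f)

print(torch.equal(x, y))             # True
```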
### Multi-GPU
Proposed solutions need to address:
* Kernel launch latency
* without affecting the user's code
* Implementation should be as transparent as possible
* Should we expose DPT as:
* Split
* ParallelApply (scheduling kernels in breadth first order, to address launch latency)
* Join
* In backward phase, send parameters as soon as the module finishes computation
**Rough solution:**
```python
# This is an example of a network that has a data parallel part inside
#
# B is data parallel
# +->A+-->B+-+
# +--+ +->D
# +->C+------+
class Network(nn.Container):
    def __init__(self):
        super(Network, self).__init__(
            A=...,
            B=GPUReplicate(B, [0, 1, 2, 3]),  # Copies the module onto a list of GPUs
            C=...,
            D=...
        )
    def __call__(self, x):
        a = self.A(x)
        c = self.C(x)
        a_split = Split(a)                  # a_split is a list of Tensors placed on different devices
        b = ParallelApply(self.B, a_split)  # self.B is a list-like object containing copies of B
        d_input = Join(b + [c])             # gathers Tensors on a single GPU
        return self.D(d_input)
```
Each module is assigned to a single GPU.
For Kernel Launch Latency:
* Python threading
* Generators
For parameter reductions ASAP:
* In the forward pass, register a hook on every parameter; each hook is evaluated as soon as the last backward step for that parameter has executed. The hook will then “all-reduce” that parameter across GPUs (see the sketch after this list)
* Problem with multiple forward calls - how do you know that the parameters won't be used anymore?
* Well, last usage in backward graph = first usage in forward graph, so this should be straightforward
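A rough sketch of the per-parameter hook idea from the list above; the all-reduce itself is only a hypothetical placeholder here, since the cross-GPU reduction mechanism isn't pinned down yet.
```python
import torch
import torch.nn as nn
from torch.autograd import Variable

def allreduce(grad):
    # hypothetical placeholder for the cross-GPU "all-reduce" described above
    return grad

model = nn.Linear(10, 10)

# one hook per parameter; it fires once that parameter's gradient
# has been fully accumulated during the backward pass
for p in model.parameters():
    p.register_hook(allreduce)

out = model(Variable(torch.randn(4, 10)))
out.sum().backward()
```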
### Multiprocessing with Tensor sharing
In Torch, or in general, one uses "threads" to build parallel data loaders, as well as to do Hogwild training.
Threads are powerful, as one can share Tensors between threads.
This allows you to:
* transfer data between threads efficiently, with zero memory-copy and serialization overhead.
* share tensors among threads for parameter sharing models
Sharing Tensors among threads is very useful when you do Hogwild training, i.e. if you want to train several models in parallel, but want to share their underlying parameters.
This is often used for non-ConvNet models, like training word embeddings, RL-for-games, etc.
With Python, one cannot use threads because of a few technical issues.
Python has what is called [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock), which does not allow threads to concurrently execute python code.
Hence, the most pythonic way to use multiple CPU cores is [multiprocessing](http://docs.python.org/2/library/multiprocessing.html)
We made PyTorch integrate seamlessly with Python multiprocessing.
This involved solving some complex technical problems to make this an air-tight solution, and more can be read [in this in-depth technical discussion](http://github.com/pytorch/pytorch/wiki/Multiprocessing-Technical-Notes).
What this means for you as the end-user is that you can simply use multiprocessing in this way:
```python
# loaders.py
# Functions from this file run in the workers
def fill(queue):
    while True:
        tensor = queue.get()
        tensor.fill_(10)
        queue.put(tensor)

def fill_pool(tensor):
    tensor.fill_(10)
```
```python
# Example 1: Using multiple persistent processes and a Queue
# process.py
import torch
import torch.multiprocessing as multiprocessing
from loaders import fill
# torch.multiprocessing.Queue automatically moves Tensor data to shared memory
```
PyTorch is currently maintained by [Adam Paszke](https://apaszke.github.io/), [Sam Gross](https://github.com/colesbury) and [Soumith Chintala](http://soumith.ch) with major contributions coming from 10s of talented individuals in various forms and means. A non-exhaustive but growing list needs to mention: Sergey Zagoruyko, Adam Lerer, Francisco Massa, Andreas Kopf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein.
Note: this project is unrelated to [hughperkins/pytorch](https://github.com/hughperkins/pytorch) with the same name. Hugh is a valuable contributor in the Torch community and has helped with many things Torch and PyTorch.