Summary:
"then the output would also has k tensors" -> "then the output would also have k tensors"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20425
Differential Revision: D15320152
Pulled By: zou3519
fbshipit-source-id: b04e2ccd29c6a3e33ad1040d0ea975a01a7bd9b5
Summary:
As a first step for this plan: https://github.com/pytorch/pytorch/issues/19508#issuecomment-485178192, this PR moves `THCTensor_(uniform)` to ATen. Major changes are:
- `uniform_` CUDA kernel now uses a Philox generator.
- The kernel also uses TensorIterator.
- The kernel uses a grid-stride loop to achieve peak effective bandwidth.
- Since the engine has changed from `curandStateMTGP32` to `curandStatePhilox4_32_10`, the random numbers generated will now differ.
- Here is the diff showing codegen changes: https://gist.github.com/syed-ahmed/4af9ae0d42b6c7dbaa13b9dd0d1dd1e8 (BC breaking change if any)
- Philox4_32_10 is known to pass the standard TestU01 Big Crush test (https://www.thesalmons.org/john/random123/papers/random123sc11.pdf) and hence the quality of random numbers generated isn't an issue when compared to the previously used `curandStateMTGP32`.
- I have added a test case in `aten/src/ATen/test/cuda_distributions_test.cu` which verifies that the Philox offset is incremented properly.
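As a user-level illustration of what the new generator guarantees (a minimal sketch assuming a CUDA device is available; this is not the internal test, which checks the offset directly):

```python
import torch

# Reseeding restarts the Philox counter from the same seed/offset pair,
# so the same uniform stream must be reproduced. The values themselves
# differ from the old curandStateMTGP32-based kernel.
torch.cuda.manual_seed(42)
a = torch.empty(2**20, device='cuda').uniform_()
torch.cuda.manual_seed(42)
b = torch.empty(2**20, device='cuda').uniform_()
assert torch.equal(a, b)
```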
The benchmark was done on a DGX station with 4 V100s.
I modified the script from jcjohnson's [multinomial benchmark](https://github.com/jcjohnson/pytorch-multinomial-benchmark) to produce this notebook, which shows a general speedup with this PR and no regression: https://gist.github.com/syed-ahmed/9d26d4e96308aed274d0f2c7be5218ef
To reproduce the notebook:
- Run https://gist.github.com/syed-ahmed/4208c22c541f1d30ad6a9b1efc1d728f against the current pytorch top of tree in a container, with the command: `python uniform_benchmark.py --stats_json before.json`
- Apply this diff to the current pytorch top of tree and run the same script in a container, with the command: `python uniform_benchmark.py --stats_json after.json`
- Run the notebook attached above with the `after.json` and `before.json` in the same directory
The effective bandwidth was calculated using this script (thanks to ngimel): https://gist.github.com/syed-ahmed/f8b7384d642f4bce484228b508b4bc68
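For reference, the reported numbers are consistent with treating `uniform_` as a write-only kernel on float32, i.e. 4 bytes per element (my arithmetic below, not the script's exact code):

```python
# First row of the "before" table: 65536 float32 elements in ~5.17e-6 s.
elements, seconds = 65536, 5.168914794921875e-06
bandwidth_gbs = elements * 4 / seconds / 1e9  # 4 bytes written per element
print(bandwidth_gbs)  # ~50.7 GB/s, matching the table
```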
Following are the numbers before (first block) and after (second block).
```
uniform, size, elements 65536 forward 5.168914794921875e-06 bandwidth (GB/s) 50.71548098597786
uniform, size, elements 131072 forward 5.056858062744141e-06 bandwidth (GB/s) 103.67860705101367
uniform, size, elements 262144 forward 7.164478302001953e-06 bandwidth (GB/s) 146.357621001797
uniform, size, elements 524288 forward 1.1217594146728515e-05 bandwidth (GB/s) 186.9520302275877
uniform, size, elements 1048576 forward 1.923084259033203e-05 bandwidth (GB/s) 218.10297600317384
uniform, size, elements 2097152 forward 3.640890121459961e-05 bandwidth (GB/s) 230.39992200138826
uniform, size, elements 4194304 forward 6.778717041015625e-05 bandwidth (GB/s) 247.49839679819922
uniform, size, elements 8388608 forward 0.00012810707092285157 bandwidth (GB/s) 261.92490202361347
uniform, size, elements 16777216 forward 0.00025241613388061524 bandwidth (GB/s) 265.86598474620627
uniform, size, elements 33554432 forward 0.000497891902923584 bandwidth (GB/s) 269.5720239913193
```
```
uniform, size, elements 65536 forward 5.550384521484375e-06 bandwidth (GB/s) 47.22988091821306
uniform, size, elements 131072 forward 5.581378936767578e-06 bandwidth (GB/s) 93.93520954942333
uniform, size, elements 262144 forward 6.165504455566406e-06 bandwidth (GB/s) 170.071404141686
uniform, size, elements 524288 forward 6.3276290893554685e-06 bandwidth (GB/s) 331.4277702414469
uniform, size, elements 1048576 forward 8.509159088134765e-06 bandwidth (GB/s) 492.91639239047356
uniform, size, elements 2097152 forward 1.2989044189453124e-05 bandwidth (GB/s) 645.8218077979443
uniform, size, elements 4194304 forward 2.347707748413086e-05 bandwidth (GB/s) 714.6211452997259
uniform, size, elements 8388608 forward 4.4286251068115234e-05 bandwidth (GB/s) 757.6715389250498
uniform, size, elements 16777216 forward 8.672237396240235e-05 bandwidth (GB/s) 773.8356427961071
uniform, size, elements 33554432 forward 0.00016920566558837892 bandwidth (GB/s) 793.2224227438523
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20292
Differential Revision: D15277761
Pulled By: ezyang
fbshipit-source-id: 8bfe31a01eeed77f0ed6e7ec4d2dda4c6472ecaa
Summary:
Fully supporting the incremental_state function requires several additional utils available in fairseq. However, we lack a suitable problem for unit testing it, so the incremental_state function will be disabled for now. If it is needed in the future, a feature request could be created. Fixes #20132
Add some unit tests to cover the arguments of the MultiheadAttention module, including bias, add_bias_kv, add_zero_attn, key_padding_mask, need_weights, and attn_mask.
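For context, a minimal sketch of the surface these tests exercise (shapes and mask values are illustrative, not taken from the tests themselves):

```python
import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, bias=True,
                             add_bias_kv=False, add_zero_attn=False)
q = k = v = torch.rand(5, 2, 16)  # (seq_len, batch, embed_dim)
key_padding_mask = torch.zeros(2, 5, dtype=torch.uint8)  # nonzero marks padding
attn_mask = torch.zeros(5, 5)  # additive mask over attention scores
out, weights = attn(q, k, v, key_padding_mask=key_padding_mask,
                    need_weights=True, attn_mask=attn_mask)
```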
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20177
Differential Revision: D15304575
Pulled By: cpuhrsch
fbshipit-source-id: ebd8cc0f11a4da0c0998bf0c7e4e341585e5685a
Summary:
We don't need to overlay the VC env when not using Ninja; CMake deals with it automatically. Overlaying is a no-op when the env matches the specified generator, but it generates the error "Cannot find CMAKE_CXX_COMPILER" when they differ.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20417
Differential Revision: D15317081
Pulled By: ezyang
fbshipit-source-id: 5d9100321ecd593e810c31158f22c67d3e34973b
Summary:
This is an attempt to isolate unrelated changes from #19228 for easier review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20150
Differential Revision: D15314891
Pulled By: ezyang
fbshipit-source-id: 8c429747ba83ad5aca4cdd8f8086bcf65a326921
Summary:
* Constructs a new type at runtime so that `isinstance` checks work for weak modules assigned to `ScriptModule`s (see the sketch below)
* Fix some extraneous names in `__constants__`
* Add `in_features` and `out_features` to `nn.Linear` `__constants__`
Fixes #19363
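A hypothetical sketch of the general pattern (names are illustrative, not the PR's actual code): building a subclass at runtime keeps `isinstance` checks against the original module class working.

```python
import torch.nn as nn

def make_weak_type(original_cls):
    # The runtime-constructed type inherits from original_cls, so instances
    # still satisfy isinstance(x, original_cls).
    return type('Weak' + original_cls.__name__, (original_cls,), {})

WeakLinear = make_weak_type(nn.Linear)
m = WeakLinear(3, 4)
assert isinstance(m, nn.Linear)
```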
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20190
Pulled By: driazati
Differential Revision: D15302350
fbshipit-source-id: 1d4d21ed44ab9578a4bc2a72396a82e9bbcd387c
Summary:
TensorList, DoubleList, and BoolList were missing from the pickler, so
this adds them.
As a follow-up, much of this code could be templated and cut down.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20191
Pulled By: driazati
Differential Revision: D15299106
fbshipit-source-id: f10c0c9af9d60a6b7fb8d93cea9f550b1a7e2415
Summary:
Given that both tensorboardX and our PyTorch 1.1 release use `log_dir` as the argument for SummaryWriter initialization and as a member variable (which some users access), we need to preserve this name. However, we might deprecate it in the future, so I've added a `get_logdir` method that can be used going forward.
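For illustration, the preserved surface (the directory name is hypothetical):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/exp1')  # same argument name as tensorboardX
print(writer.log_dir)       # member variable some users access directly
print(writer.get_logdir())  # new accessor to prefer going forward
writer.close()
```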
cc natalialunova, lanpa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20382
Reviewed By: NarineK
Differential Revision: D15300941
Pulled By: orionr
fbshipit-source-id: a29a70fcbc614a32ebfa6c655962fdff081af1af
Summary:
This code is unused and has been superseded by TensorIterator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20207
Differential Revision: D15240832
Pulled By: cpuhrsch
fbshipit-source-id: 4f600bb8645f9b28a137e2cefb099978f5152d05
Summary:
This PR adds Poisson NLL loss to ATen and replaces the Python implementation with a call to the C++ one.
Fixes #19186.
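A minimal usage sketch of the (unchanged) Python surface whose body now dispatches to C++; the values are illustrative:

```python
import torch

# With log_input=True, the loss per element is exp(input) - target * input.
loss_fn = torch.nn.PoissonNLLLoss(log_input=True)
inp = torch.randn(4, requires_grad=True)
target = torch.tensor([1.0, 2.0, 3.0, 4.0])
loss = loss_fn(inp, target)
loss.backward()
```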
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19316
Differential Revision: D15012957
Pulled By: ezyang
fbshipit-source-id: 0a3f56e8307969c2f9cc321b5357a496c3d1784e
Summary:
This PR is an intermediate step toward the ultimate goal of eliminating "caffe2" in favor of "torch". This PR moves all of the files that had constituted "libtorch.so" into the "libcaffe2.so" library, and wraps "libcaffe2.so" with a shell library named "libtorch.so". This means that, for now, `caffe2/CMakeLists.txt` becomes a lot bigger, and `torch/CMakeLists.txt` becomes smaller.
The torch Python bindings (`torch_python.so`) still remain in `torch/CMakeLists.txt`.
The follow-up to this PR will rename references to `caffe2` to `torch`, and flatten the shell into one library.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17783
Differential Revision: D15284178
Pulled By: kostmo
fbshipit-source-id: a08387d735ae20652527ced4e69fd75b8ff88b05
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19803
There is no reason to set a specific logging level for this module. Removing it to just use the default logging level.
Differential Revision: D15098834
fbshipit-source-id: 1654c04500c19690ddde03343f2e84b04bb0f1ef
Summary:
Fixes #20250
Not sure if there's any specific design reason to use `add_dependency()` and manually add a few include dirs instead of linking the target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20319
Differential Revision: D15294584
Pulled By: ezyang
fbshipit-source-id: 97f813a6b1829dad49958e0f880b33eb95747607
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20351
This was broken because of a merge race between #20282 and the stack in #20236.
Cleaned up the test and comments a bit as well.
Differential Revision: D15292786
fbshipit-source-id: a4379ea700cad959d3a6921fc5ddf9384fb8f228
Summary:
The trick here is that creating a mapping from const values to
const values means that downstream clients that want to mutate
the output of the mapping are stuck. However, a mapping from
const values to non-const values is just fine and doesn't put
constraints on downstream clients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20303
Differential Revision: D15284076
fbshipit-source-id: 16206fd910dd5f83218525ca301b1889df0586cb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20236
Use the new version of broadcast_coalesced that deals with both CPU
and CUDA models. Add tests that evaluate correctness of
DistributedDataParallel for CPU models.
Closes #17757.
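As a sketch of what this enables at the user level (a minimal sketch assuming `torch.distributed.init_process_group(...)` has already been called, e.g. with the gloo backend):

```python
import torch
from torch.nn.parallel import DistributedDataParallel

model = torch.nn.Linear(10, 10)       # a plain CPU model; no device_ids given
ddp = DistributedDataParallel(model)  # gradients are averaged across ranks
```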
Reviewed By: mrshenli
Differential Revision: D15245428
fbshipit-source-id: d2fa09f68593b3cd1b72efeb13f5af23ebd5c80a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20235
The tests used to run only for CUDA models. In a future commit we
need to update this to work for CPU models as well, so we can
no longer rely on only integers being passed as device identifiers.
With this change we pass both the materialized list of devices to use
(as `torch.Device` objects), as well as an optional list of integers.
The latter is specified to exercise the code in the
DistributedDataParallel constructor that turns a list of integers into
CUDA devices, IFF it is used to wrap a single-device CUDA module.
This commit also groups together the 'str' and non-'str' tests. These
used to test passing the list of devices as integers or as
`torch.Device` instances. These are now executed from the same test.
Reviewed By: mrshenli
Differential Revision: D15245429
fbshipit-source-id: 5797ba9db33d2c26db8e7493c91bb52f694285ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20234
The differences from the existing function _dist_broadcast_coalesced
are that this one works for both CPU and CUDA tensors and that it
caps the number of in-flight operations.
This should be the final change needed to have only a single version
of DistributedDataParallel that supports both CPU and CUDA models, or
even a mix of both.
See #17757 for more information.
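A hypothetical sketch of the in-flight cap using only public `torch.distributed` primitives (the real helper is implemented in C++ and also coalesces tensors into buckets):

```python
import torch.distributed as dist

def broadcast_capped(tensors, src=0, max_in_flight=4):
    pending = []
    for t in tensors:  # works uniformly for CPU and CUDA tensors
        pending.append(dist.broadcast(t, src, async_op=True))
        if len(pending) >= max_in_flight:
            pending.pop(0).wait()  # bound the number of outstanding ops
    for work in pending:
        work.wait()
```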
Reviewed By: mrshenli
Differential Revision: D15228099
fbshipit-source-id: a2113ba6b09b68cb5328f49f4c1960031eb43c93
Summary:
isConvFusion(...) applies only to Conv ops; calling it on a non-Conv op crashes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20139
Differential Revision: D15280604
Pulled By: yinghai
fbshipit-source-id: eb45be11990b3bf7c5b45f02ebb6018444ab5357