Summary:
Recent versions of GCC split unaligned load and store intrinsics into
two 128-bit instructions. On older processors (Sandy Bridge) this was a
bit faster for unaligned data, but a bit slower for aligned data. On newer
processors (Intel Haswell+, recent AMD) splitting loads is slower for
both aligned and unaligned data.
Clang, MSVC, and ICC do not split unaligned load and store intrinsics.
There's a good explanation here:
https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd#tab-top
Splitting load and store intrinsics makes no sense in our AVX2
configuration because the CPUs that support AVX2 instructions are the
same CPUs where splitting is disadvantageous regardless of data alignment.
Note that this doesn't change the AVX configuration (used by CPUs that
support AVX but not AVX2). It's possible this change would be beneficial for
that configuration too (our data is usually 32-byte aligned), but I'd
prefer the conservative change for now.
torch.add generated assembly (hot loop, GCC 7.3.0):
before:
https://gist.github.com/colesbury/066376537bccd514daf8fe4ab54d8295
after:
https://gist.github.com/colesbury/8b4b948145001d44b225c51d2428bb91
Timing of `torch.add(x, y, out=z)` for size 10240 (1 thread, Broadwell,
no turbo):
before: 7.35 us
after: 6.39 us
(Take the torch.add timings with a grain of salt. The difference in timings
is much larger than I would expect.)
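For reference, a minimal sketch of how a timing like this could be reproduced; the thread pinning, warmup, and iteration counts here are assumptions for illustration, not the harness used for the numbers above.
```python
# Rough micro-benchmark sketch for torch.add(x, y, out=z) at size 10240.
# Warmup and iteration counts are assumptions, not the original setup.
import timeit
import torch

torch.set_num_threads(1)
x = torch.randn(10240)
y = torch.randn(10240)
z = torch.empty(10240)

# Warm up so one-time dispatch overhead doesn't skew the measurement.
for _ in range(1000):
    torch.add(x, y, out=z)

n = 100000
elapsed = timeit.timeit(lambda: torch.add(x, y, out=z), number=n)
print(f"{elapsed / n * 1e6:.2f} us per call")
```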
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20609
Differential Revision: D15385800
Pulled By: colesbury
fbshipit-source-id: 66415b148a3b19360b9de9881af594ab46547b6f
Summary:
Change `Inputs` to `Shape` to unify the docstring format of the CTCLoss class, and add the type of `Output` to `Shape`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20422
Differential Revision: D15393484
Pulled By: ezyang
fbshipit-source-id: 5b49647f9740de77db49a566fa2de74fcecd9110
Summary:
CUDA 8 is no longer supported and has been removed from CI, so these checks are no longer relevant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20482
Differential Revision: D15393438
Pulled By: ezyang
fbshipit-source-id: ac0979bf660b3314eec502c745e34ce4940bda0e
Summary:
Fixes #20568.
It looks like CMake passes `/MD` when we call `add_library`. We need to apply the same fix for C source files too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20574
Differential Revision: D15392682
Pulled By: ezyang
fbshipit-source-id: c92034d8725fcec48fd7db6cf5322868e956dc6b
Summary:
Fixes #20523.
nn.Upsample was unable to accept tuple inputs for the scale_factor argument because of the direct cast to float introduced in #17732.
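For illustration, a minimal sketch of the restored behavior (the module configuration and sizes below are made up):
```python
# Passing a per-dimension tuple as scale_factor; sizes are arbitrary.
import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=(2, 3), mode='nearest')
x = torch.randn(1, 4, 8, 8)   # NCHW input
y = up(x)
print(y.shape)                # torch.Size([1, 4, 16, 24])
```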
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20581
Differential Revision: D15392622
Pulled By: ezyang
fbshipit-source-id: b56ba8197a5bbf8891bc7e1bebf5cad63dcab04d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20441
This op is fairly complex and the fact that it isn't formatted
correctly makes things that much harder to reason about. Clean it up.
Reviewed By: dreiss
Differential Revision: D15220006
fbshipit-source-id: 30632d8bdbf15f96e73d8b6c96c5f29c052e6e7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20502
Following D15307410, remove more floating point exceptions in unit tests.
Reviewed By: hx89
Differential Revision: D15340930
fbshipit-source-id: 269fc75e0800bc9d39126767a0f3ca15cd8b0cad
Summary:
(Reopens https://github.com/pytorch/pytorch/pull/20330 and fixes test error.)
After the Variable/Tensor merge, there is no guarantee that `indices` and `values` passed into the sparse tensor constructor don't contain AutogradMeta. However, we want to maintain the existing invariant that `indices_` and `values_` of a sparse tensor don't contain AutogradMeta, and to achieve this we need to do a shallow copy in the sparse tensor constructor.
Note that this is BC-breaking for code that changes the sizes / strides of the indices or values tensor after it's used to create a sparse tensor. In current master, such changes will be reflected in the sparse tensor and break sparse tensor invariants. After this PR, those changes will not be reflected in the sparse tensor, and thus the sparse tensor invariants are always preserved. Specifically, running in-place size/stride-changing ops such as `resize_` / `resize_as_` / `as_strided_` / `set_` / `transpose_` on the original values tensor will not update the sparse tensor's `values_`. For example:
```python
# Calling resize_ on non-requires-grad value tensor
i2 = torch.zeros([1, 1])
v2 = torch.ones([1, 2, 3])
t2 = torch.sparse_coo_tensor(i2, v2, torch.Size([2, 2, 3]))
v2.resize_(4, 5)
t2.coalesce().values().size()
# On current master, this throws "indices and values must have same nnz, but got nnz from indices: 1, nnz from values: 4", because resizing the original value tensor affects `values_` of the sparse tensor.
# After this PR, this prints "torch.Size([1, 2, 3])", which means resizing the original value tensor doesn't affect `values_` of the sparse tensor.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20614
Differential Revision: D15385811
Pulled By: yf225
fbshipit-source-id: e963fcf5e4097f8c881b56145f408565d97cf5c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19932
In preparation for adding the int8_t data type for QTensor.
Reviewed By: zafartahirov
Differential Revision: D15137838
fbshipit-source-id: 59462c36d6fc5982986d4196bf3f32f49bb294d7
Summary:
After the Variable/Tensor merge, there is no guarantee that `indices` and `values` passed into the sparse tensor constructor don't contain AutogradMeta. However, we want to maintain the existing invariant that `indices_` and `values_` of a sparse tensor don't contain AutogradMeta, and to achieve this we need to do a shallow copy in the sparse tensor constructor.
Note that this is BC-breaking for code that changes the sizes / strides of the indices or values tensor after it's used to create a sparse tensor. In current master, such changes will be reflected in the sparse tensor and break sparse tensor invariants. After this PR, those changes will not be reflected in the sparse tensor, and thus the sparse tensor invariants are always preserved. Specifically, running in-place size/stride-changing ops such as `resize_` / `resize_as_` / `as_strided_` / `set_` / `transpose_` on the original values tensor will not update the sparse tensor's `values_`. For example:
```python
# Calling resize_ on non-requires-grad value tensor
i2 = torch.zeros([1, 1])
v2 = torch.ones([1, 2, 3])
t2 = torch.sparse_coo_tensor(i2, v2, torch.Size([2, 2, 3]))
v2.resize_(4, 5)
t2.coalesce().values().size()
# On current master, this throws "indices and values must have same nnz, but got nnz from indices: 1, nnz from values: 4", because resizing the original value tensor affects `values_` of the sparse tensor.
# After this PR, this prints "torch.Size([1, 2, 3])", which means resizing the original value tensor doesn't affect `values_` of the sparse tensor.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20330
Differential Revision: D15373683
Pulled By: yf225
fbshipit-source-id: 32e7275d7121e17937c7cc258e8a60bb0848ff25
Summary:
Currently `bmm()` has very heavy performance overhead on CPU due to the construction/destruction of `TensorImpl`. Applying `TensorAccessor` when indexing tensor data can greatly improve the performance.
I tested this on the `fairseq` Transformer model. Results on a Xeon 6148 (20*2 cores, 2.5 GHz) indicate this PR improves Transformer training performance by approximately **10%** (seconds per iteration reduced from **3.60** to **3.21**). Given that `bmm()` takes only **14%** of the total time, a 10% overall improvement indicates `bmm()` itself improves by roughly **3x**.
Before:
```
| epoch 001: 0%| | 43/25337 [02:34<25:17:11, 3.60s/it, loss=16.179, nll_loss=16.137, ppl=72045.59, wps=1320, ups=0, wpb=4758.767, bsz=136.558, num_updates=43, lr=6.45e-06, gnorm=6.88
```
After:
```
| epoch 001: 0%| | 23/25337 [01:13<22:32:48, 3.21s/it, loss=17.072, nll_loss=17.068, ppl=137419.42, wps=1478, ups=0, wpb=4746.870, bsz=128.348, num_updates=23, lr=3.45e-06, gnorm=10.
```
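A simple way to gauge the `bmm()` change in isolation is a CPU micro-benchmark along these lines; the shapes and iteration count are arbitrary assumptions (chosen so per-batch-element overhead is visible), not the Transformer workload measured above.
```python
# CPU micro-benchmark sketch for bmm(); many small per-batch matrices so
# that per-element indexing overhead (rather than raw FLOPs) dominates.
import timeit
import torch

torch.set_num_threads(1)
a = torch.randn(512, 16, 16)
b = torch.randn(512, 16, 16)

n = 1000
elapsed = timeit.timeit(lambda: torch.bmm(a, b), number=n)
print(f"{elapsed / n * 1e3:.3f} ms per bmm call")
```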
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20266
Differential Revision: D15262201
Pulled By: cpuhrsch
fbshipit-source-id: c2e4e406c06714b04cc7534f3da71e986eddca35
Summary:
In the ONNX spec, the only supported input/output type for `And` and `Or` is `Bool`.
Thus, during export, casts to/from `Bool` are inserted for the inputs/outputs.
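As a hypothetical sketch (module name and shapes made up), an export that hits the ONNX `And` op with non-`Bool` inputs would look roughly like this; the exporter is expected to wrap the op in the inserted casts.
```python
# Hypothetical export sketch: `&` on uint8 inputs maps to ONNX `And`,
# so the exporter inserts casts to/from Bool around it.
import torch

class AndModule(torch.nn.Module):
    def forward(self, x, y):
        return x & y

x = torch.ones(2, 3, dtype=torch.uint8)
y = torch.zeros(2, 3, dtype=torch.uint8)
torch.onnx.export(AndModule(), (x, y), "and.onnx")
```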
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17894
Reviewed By: zrphercule
Differential Revision: D15103148
Pulled By: houseroad
fbshipit-source-id: 3e1068ea236c743260d42882fb11f0e3a21707e6
Summary:
The first time this was merged it broke master and was reverted. This time I do not add `set -u` to the .circleci/scripts/setup* scripts. There's still a chance that `set -u` breaks the binary builds on master, but at least those can be fixed in parallel and won't completely eliminate signal from all merges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20540
Differential Revision: D15373444
Pulled By: pjh5
fbshipit-source-id: 0203c20865827366ecd8fa07b2db74d255549ed1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20501
Fixing unit tests related to optimizer operators.
Reviewed By: hx89
Differential Revision: D15307410
fbshipit-source-id: e5400c26e08f26191ee542fe6b02e0a69bc4e1ae
Summary:
#19975 was split into 2 PRs.
This one:
Introduces the MemoryFormat argument to the `x.is_contiguous(memory_format=torch.channels_last)` and `y = x.contiguous(memory_format=torch.channels_last)` functions.
At this moment both functions just operate on strides and don't store any tensor state.
(Original RFC #19092)
-----
Expands the functionality of the two tensor functions `.is_contiguous` and `.contiguous` (both the Python and C++ APIs); see the usage sketch after this list.
Note: We had several complaints about the `.to(memory_format)` function, and decided not to support it.
1. `.contiguous` now supports an optional keyword-only argument, `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- Using `torch.contiguous_format` preserves the existing `.contiguous()` behavior.
- Calling `x.contiguous(memory_format=torch.channels_last)` returns a new tensor which maintains the same semantic layout (NCHW) but has a different memory allocation pattern.
`x.contiguous(memory_format=torch.channels_last)` expects the input tensor to be 3d, 4d, or 5d, and fails otherwise.
2. `.is_contiguous` now supports an optional keyword-only argument, `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- `x.is_contiguous(memory_format=torch.contiguous_format)` provides the same functionality as `x.is_contiguous()` and remains unchanged.
- `x.is_contiguous(memory_format=torch.channels_last)` returns true if A) the input tensor is contiguous in memory AND B) it is allocated in memory in NHWC (or similar for 3d/5d) format.
Note: By the end of phase one, `x.is_contiguous(memory_format=torch.channels_last)` will calculate the state of the tensor on every call. This functionality is going to be updated later.
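A short usage sketch of the extended API (tensor sizes are arbitrary):
```python
# Usage sketch for the memory_format argument; sizes are arbitrary.
import torch

x = torch.randn(2, 3, 4, 5)   # NCHW, default (contiguous) layout
print(x.is_contiguous())                                   # True
print(x.is_contiguous(memory_format=torch.channels_last))  # False

y = x.contiguous(memory_format=torch.channels_last)
print(y.shape)     # same semantic layout: torch.Size([2, 3, 4, 5])
print(y.stride())  # channels-last allocation, e.g. (60, 1, 15, 3)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```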
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20455
Differential Revision: D15341577
Pulled By: VitalyFedyunin
fbshipit-source-id: bbb6b4159a8a49149110ad321109a3742383185d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20513
They've been using an old API; switch them to the new one instead.
Reviewed By: li-roy
Differential Revision: D15346349
fbshipit-source-id: 538eb460897ec6addebeebf88b316eb0d6b1dd6f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20379
The legacy custom op API allowed nesting of std::unordered_map and std::vector. While we haven't yet figured out how to do that with the new API,
we at least have to keep backwards compatibility. This diff adds the feature so we can switch to the new API without breaking third party code.
Reviewed By: li-roy
Differential Revision: D15287693
fbshipit-source-id: bb5b8429fddf6298719cbf567b584ed371f8fc81
Summary:
Previously, the caller of `shallow_copy_and_detach()` was responsible for deciding whether the shallow copy should share the source TensorImpl's version counter or have its own new version counter. However, since this decision is crucial for ensuring the correctness of the shallow copy's version counter, we want to require users of `shallow_copy_and_detach()` to pass a version counter to the function call, so that they make the decision at the time of API usage, not as an afterthought.
For similar reasons, we want to require users of `shallow_copy_and_detach()` to pass `allow_tensor_metadata_change` to the function call, so that they decide "whether the TensorImpl shallow copy should allow tensor metadata change" at the time of API usage, not as an afterthought.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20496
Differential Revision: D15363620
Pulled By: yf225
fbshipit-source-id: a65e74738b10452668d6dc644b43aad5b3d8c9e6
Summary:
Remove weak_script. After recently splitting the forward() function in the MultiheadAttention module, we noticed a memory leak on GPU. Fix the problem by removing the "weak_script" decorators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20563
Differential Revision: D15368262
Pulled By: zhangguanheng66
fbshipit-source-id: 475db93c9ee0dbaea8fb914c004e7d1e0d419bc2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20543
All of that code for concatenating strings together adds up. Just discard it all for mobile builds.
Reviewed By: ljk53
Differential Revision: D15353447
fbshipit-source-id: a82dd0b884335d662605aabf7dd3d09dfcc1478b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20439
This is the QTensorProto workflow for multi-group quantization on the C2 side.
Nothing related to DNNLOWP tensors is included in this PR, so once we finish the Glow side, we should be able to test this PR using ResNet-50.
Reviewed By: yinghai
Differential Revision: D15096919
fbshipit-source-id: 741eecd59eb79d24d9fe2b035f6246d42422d25c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19816
We need this for the quantization of bias.
Add a third argument of type ScalarType to `quantize_linear`.
Differential Revision: D15094174
fbshipit-source-id: f19ec8f4716cf5fe0aa21b38d45af6d27c9ab377
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20512
Fixing typos in the schema description of one of the inputs of the BatchMatMul operator.
Reviewed By: jianyuh, BIT-silence
Differential Revision: D15343879
fbshipit-source-id: 06354e8e6b0d79fea937ed2703bb457b2d04f859
Summary:
Fix a typo in the doc.
Add an AssertionError check back to the MultiheadAttention module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20492
Differential Revision: D15349008
Pulled By: cpuhrsch
fbshipit-source-id: 2d898345f03787c713e537673613a748ad826b34
Summary:
The current variance kernels compute the mean at the same time. We often want both statistics together, so it seems reasonable to have a kwarg/function that allows us to get both values without launching an extra kernel.
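Assuming the combined statistic ends up exposed as `torch.var_mean` (a name not stated in this summary), the intended usage would look roughly like:
```python
# Sketch of fused variance+mean usage; the function name torch.var_mean
# is an assumption, not confirmed by this summary.
import torch

x = torch.randn(4, 1000)

var, mean = torch.var_mean(x, dim=1)   # single fused reduction

# Equivalent results from two separate kernel launches:
assert torch.allclose(var, x.var(dim=1))
assert torch.allclose(mean, x.mean(dim=1))
```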
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18731
Differential Revision: D14726082
Pulled By: ifedan
fbshipit-source-id: 473cba0227b69eb2240dca5e61a8f4366df0e029