torch.vmap is a prototype feature and should not be in the stable
binary. This PR:
- Removes the `torch.vmap` API
- Removes the documentation entry for `torch.vmap`
- Changes the vmap tests to use an internal API instead of `torch.vmap`.
Test Plan:
- Tested locally (test_torch, test_type_hints, test_vmap), but also waiting on CI.
Summary:
Originally proposed at https://github.com/pytorch/pytorch/issues/44473#issuecomment-690670989 by colesbury .
This PR adds the functionality to print relevant tensor shapes and convolution parameters along with the stack trace once a cuDNN exception is thrown.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45023
Reviewed By: gchanan
Differential Revision: D23932661
Pulled By: ezyang
fbshipit-source-id: 5f5f570df6583271049dfc916fac36695f415331
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45482
Working on some models that need these ops on the lite interpreter.
Test Plan: Locally built and loaded/ran the TS model without problems.
Reviewed By: iseeyuan
Differential Revision: D23906581
fbshipit-source-id: 01b9de2af2046296165892b837bc14a7e5d59b4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396.
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506
Reviewed By: zhangguanheng66
Differential Revision: D23991410
Pulled By: Krovatkin
fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
Summary:
This PR adds a new GraphManipulation library for operating on the GraphModule nodes.
It also adds an implementation of replace_target_nodes_with, which replaces all nodes in the GraphModule matching a specific op/target with a new specified op/target. An example use of this function would be replacing a generic operator with an optimized operator for specific sizes and shapes.
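For illustration only, here is a minimal sketch of the kind of node replacement this enables, written against core `torch.fx` APIs; the module `M` and the `torch.add` -> `operator.add` swap are assumptions for the example, and the actual `replace_target_nodes_with` helper in GraphManipulation may have a different signature.
```python
import operator
import torch
import torch.fx

class M(torch.nn.Module):
    def forward(self, x, y):
        return torch.add(x, y) * 2

gm = torch.fx.symbolic_trace(M())

# Rewrite every call_function node targeting torch.add to use operator.add instead.
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target == torch.add:
        node.target = operator.add

gm.graph.lint()   # sanity-check the mutated graph
gm.recompile()    # regenerate the GraphModule's forward from the graph
print(gm.code)
```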
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44775
Reviewed By: jamesr66a
Differential Revision: D23874561
Pulled By: gcatron
fbshipit-source-id: e1497cd11e0bbbf1fabdf137d65c746248998e0b
Summary:
In coordination with jlin27.
This PR is meant to build documentation when the repo is tagged. For instance, tagging the repo with 1.7.0rc1 will push that commit's documentation to pytorch/pytorch.github.io/docs/1.7.
Subsequently tagging 1.7.0rc2 will overwrite the 1.7 docs, as will 1.7.0 and 1.7.1. I think this is as it should be: there should be a single, latest version of the 1.7 docs. This can be tweaked differently if desired.
There is probably work that needs to be done to adjust the [versions.html](https://pytorch.org/docs/versions.html) to add the new tag?
Is there a way to test the tagging side of this without breaking the production documentation?
As an aside, the documentation is being built via the `pytorch_linux_xenial_py3_6_gcc5_4_build` image. Some projects are starting to move on from Python 3.6, since [it is in security-only support mode](https://devguide.python.org/#status-of-python-branches) and no new binaries are being released.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45204
Reviewed By: zhangguanheng66
Differential Revision: D23996800
Pulled By: seemethere
fbshipit-source-id: a94a080348a47738c1de5832ab37b2b0d57d2d57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45364
Also adds some more comments about the usage, limitations, and drawbacks.
Test Plan: Build and run benchmark binary.
Reviewed By: gchanan
Differential Revision: D23944193
fbshipit-source-id: 30d4f4991d2185a0ab768d94c846d73730fc0835
Summary:
Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410
Reviewed By: ngimel
Differential Revision: D23974988
Pulled By: mruberry
fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d
Summary:
Updated `cholesky_backward` to work correctly for complex input.
Note that the current implementation gives the conjugate of what JAX would return. anjali411, is that the correct thing to do?
Ref. https://github.com/pytorch/pytorch/issues/44895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45267
Reviewed By: bwasti
Differential Revision: D23975269
Pulled By: anjali411
fbshipit-source-id: 9908b0bb53c411e5ad24027ff570c4f0abd451e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45488
model_name logging was broken; the issue came from the recent change that assigned the method name to the module name. This diff fixes it.
ghstack-source-id: 113103942
Test Plan:
Made sure that the model_name is now logged from module_->name().
Verified with one model that does not contain the model metadata; the model_name field is logged as below:
```
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING run() module = __torch__.Model
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING metadata does not have model_name assigning to __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log model_name = __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log method_name = labels
09-28 21:59:30.068 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod()
```
Reviewed By: linbinyu
Differential Revision: D23984165
fbshipit-source-id: 5b00f50ea82106b695c2cee14029cb3b2e02e2c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45261
**Summary**
This commit enables the `unused` syntax for ignoring
properties. Ignoring properties is more intuitive with this feature enabled.
`ignore` is not supported because class type properties cannot be executed
in Python the way an `ignored` function can (they exist only as TorchScript
types), and module properties that cannot be scripted are not added to the
`ScriptModule` wrapper so that they may execute in Python.
**Test Plan**
This commit updates the existing unit tests for class type and module
properties to test properties ignored using `unused`.
Test Plan: Imported from OSS
Reviewed By: navahgar, Krovatkin, mannatsingh
Differential Revision: D23971881
Pulled By: SplitInfinity
fbshipit-source-id: 8d3cc1bbede7753d6b6f416619e4660c56311d33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45479
Add a top level boolean attribute to the model called mobile_optimized that is set to true if it is optimized.
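For illustration, a hedged sketch of how a caller might check for such a flag after running the mobile optimizer; the attribute access shown here is an assumption and may differ for lite-interpreter modules.
```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

scripted = torch.jit.script(torch.nn.Linear(4, 4).eval())
optimized = optimize_for_mobile(scripted)

# After this change, optimized modules should carry a top-level boolean flag.
print(getattr(optimized, "mobile_optimized", False))
```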
Test Plan: buck test //caffe2/test:mobile passes
Reviewed By: kimishpatel
Differential Revision: D23956728
fbshipit-source-id: 79c5931702208b871454319ca2ab8633596b1eb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45083
This PR just reorders the methods in Sorting.cpp, placing related methods next to each other.
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23908817
Pulled By: heitorschueroff
fbshipit-source-id: 1dd7b693b5135fddf5dff12303474e85ce0c2f83
Summary:
- Fix `torch._C._autocast_*_nesting` declarations in __init__.pyi
- Fix iterable constructor logic: not every iterable can be constructed using the `type(val)(val)` trick; for example, it does not work for `val=range(10)` even though `isinstance(val, Iterable)` is True (see the sketch below)
- Change optional resolution logic to meet mypy expectations
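A small standalone demonstration of the iterable-constructor issue mentioned above (purely illustrative):
```python
from collections.abc import Iterable

val = range(10)
print(isinstance(val, Iterable))   # True

try:
    type(val)(val)                 # range(range(0, 10)) is not valid
except TypeError as e:
    print("type(val)(val) failed:", e)

# A list, by contrast, survives the round trip:
lst = [1, 2, 3]
print(type(lst)(lst))              # [1, 2, 3]
```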
Fixes https://github.com/pytorch/pytorch/issues/45436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45480
Reviewed By: walterddr
Differential Revision: D23982822
Pulled By: malfet
fbshipit-source-id: 6418a28d04ece1b2427dcde4b71effb67856a872
Summary:
This PR makes the deprecation warnings for existing fft functions more prominent and makes the torch.stft deprecation warning consistent with our current deprecation planning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45409
Reviewed By: ngimel
Differential Revision: D23974975
Pulled By: mruberry
fbshipit-source-id: b90d8276095122ac3542ab625cb49b991379c1f8
Summary:
This avoids unnecessary memory allocations in `view_as_complex` and `view_as_real`. I construct the new tensor directly with the existing storage to avoid creating a new storage object and also use `DimVector`s to avoid allocating for the sizes and strides. Overall, this saves about 2 us of overhead from `torch.fft.fft` which currently has to call `view_as_real` and `view_as_complex` for every call.
I've used this simple benchmark to measure the overhead:
```python
In [1]: import torch
...: a = torch.rand(1, 2)
...: ac = torch.view_as_complex(a)
...: %timeit torch.view_as_real(ac)
...: %timeit torch.view_as_complex(a)
...: %timeit ac.real
```
Results before:
```
2.5 µs ± 62.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.22 µs ± 36 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.17 µs ± 8.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
and after:
```
1.83 µs ± 9.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.57 µs ± 7.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3.47 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44908
Reviewed By: agolynski
Differential Revision: D23793479
Pulled By: anjali411
fbshipit-source-id: 64b9cad70e3ec10891310cbfa8c0bdaa1d72885b
Summary:
PR opened just to run the CI tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44465
Reviewed By: ngimel
Differential Revision: D23907565
Pulled By: mruberry
fbshipit-source-id: 620661667877f1e9a2bab17d19988e2dc986fc0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44846
The save function traverses the model state dict to pick out the observer stats.
The load function traverses the module hierarchy to load the state dict into module attributes, depending on the observer type.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_save_observer_state_dict
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23746821
fbshipit-source-id: 05c571b62949a2833602d736a81924d77e7ade55
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390
Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from that of its function,
but having it in two places seems incorrect and dangerous.
Differential Revision: D23952865
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388
Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.
Differential Revision: D23952867
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
Summary:
In the profiler, CUDA ops did not report self time, so for composite functions there was no way to determine which function was really taking the time. In addition, the "total CUDA time" reported was frequently more than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time based on self CUDA time, similar to how it's done for CPU. Also, slight formatting changes to make the table more compact. Before:
```
-------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls
-------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
aten::matmul 0.17% 890.805us 99.05% 523.401ms 5.234ms 49.91% 791.184ms 7.912ms 100
aten::mm 98.09% 518.336ms 98.88% 522.511ms 5.225ms 49.89% 790.885ms 7.909ms 100
aten::t 0.29% 1.530ms 0.49% 2.588ms 25.882us 0.07% 1.058ms 10.576us 100
aten::view 0.46% 2.448ms 0.46% 2.448ms 12.238us 0.06% 918.936us 4.595us 200
aten::transpose 0.13% 707.204us 0.20% 1.058ms 10.581us 0.03% 457.802us 4.578us 100
aten::empty 0.14% 716.056us 0.14% 716.056us 7.161us 0.01% 185.694us 1.857us 100
aten::as_strided 0.07% 350.935us 0.07% 350.935us 3.509us 0.01% 156.380us 1.564us 100
aten::stride 0.65% 3.458ms 0.65% 3.458ms 11.527us 0.03% 441.258us 1.471us 300
-------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s
Recorded timeit time: 789.0814 ms
```
Note that the recorded timeit time (with proper CUDA syncs) is 2x smaller than the "CUDA time total" reported by the profiler.
After:
```
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::matmul 0.15% 802.716us 99.06% 523.548ms 5.235ms 302.451us 0.04% 791.151ms 7.912ms 100
aten::mm 98.20% 519.007ms 98.91% 522.745ms 5.227ms 790.225ms 99.63% 790.848ms 7.908ms 100
aten::t 0.27% 1.406ms 0.49% 2.578ms 25.783us 604.964us 0.08% 1.066ms 10.662us 100
aten::view 0.45% 2.371ms 0.45% 2.371ms 11.856us 926.281us 0.12% 926.281us 4.631us 200
aten::transpose 0.15% 783.462us 0.22% 1.173ms 11.727us 310.016us 0.04% 461.282us 4.613us 100
aten::empty 0.11% 591.603us 0.11% 591.603us 5.916us 176.566us 0.02% 176.566us 1.766us 100
aten::as_strided 0.07% 389.270us 0.07% 389.270us 3.893us 151.266us 0.02% 151.266us 1.513us 100
aten::stride 0.60% 3.147ms 0.60% 3.147ms 10.489us 446.451us 0.06% 446.451us 1.488us 300
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms
Recorded timeit time: 788.9832 ms
```
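For reference, a minimal usage sketch of the profiler with CUDA timing; it assumes the new self-CUDA column is exposed to `table()` through a sort key named `self_cuda_time_total`.
```python
import torch

a = torch.randn(1024, 1024, device="cuda")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(100):
        torch.matmul(a, a.t())

# Sorting by self CUDA time shows where device time is actually spent instead of
# double-counting children of composite ops (sort key name assumed as above).
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
```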
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209
Reviewed By: zou3519
Differential Revision: D23925491
Pulled By: ngimel
fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44970
Right now, when RecordFunction is not active (the usual case),
we do two TLS accesses (a check for thread-local callbacks and a check for
a thread-local boolean).
Experimenting with reducing the number of TLS accesses in the RecordFunction
constructor.
Test Plan: record_function_benchmark
Reviewed By: dzhulgakov
Differential Revision: D23791165
Pulled By: ilia-cher
fbshipit-source-id: 6137ce4bface46f540ece325df9864fdde50e0a4
Summary:
To support detection of abnormal test time spikes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45457
Reviewed By: malfet
Differential Revision: D23975628
Pulled By: walterddr
fbshipit-source-id: f28d0f12559070004d637d5bde83289f029b15b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069
`torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input.
Test Plan: Imported from OSS
Reviewed By: glaringlee, malfet
Differential Revision: D23818397
Pulled By: anjali411
fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45233
**Summary**
This commit modifies `TestClassType.test_properties` to check that
properties on class types can be ignored with the same syntax as
ignoring properties on `Modules`.
**Test Plan**
`python test/test_jit.py TestClassType.test_properties`
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23971885
Pulled By: SplitInfinity
fbshipit-source-id: f2228f61fe26dff219024668cc0444a2baa8834c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45232
**Summary**
This commit updates the TorchScript language reference to include
documentation on recently added TorchScript enums. It also removes
`torch.no_grad` from the list of known unsupported `torch` modules and
classes because it is now supported.
**Test Plan**
Continuous integration.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23971884
Pulled By: SplitInfinity
fbshipit-source-id: 5e2c164ed59bc0926b11201106952cff86e9356e
Summary:
Inline pytorch into wrapper, which is especially helpful in combination
with dead code elimination to reduce IR size and compilation times when
a lot of parameters are unused.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45445
Test Plan: CI
Reviewed By: ZolotukhinM
Differential Revision: D23969009
Pulled By: asuhan
fbshipit-source-id: a21509d07e4c130b6aa6eae5236bb64db2748a3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43612
**Summary**
This commit modifies the `torch._C._jit_to_backend` function so that it
accepts `ScriptModules` as inputs. It already returns `ScriptModules`
(as opposed to C++ modules), so this makes sense and makes the API more
intuitive.
**Test Plan**
Continuous integration, which includes unit tests and out-of-tree tests
for custom backends.
**Fixes**
This commit fixes #41432.
Test Plan: Imported from OSS
Reviewed By: suo, jamesr66a
Differential Revision: D23339854
Pulled By: SplitInfinity
fbshipit-source-id: 08ecef729c4e1e6bddf3f483276947fc3559ea88
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45280
Performance is the same on CPU, and on CUDA it is only 1-1.05x slower. This change is necessary for the future nan ops, including nan(min|max|median).
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D23908796
Pulled By: heitorschueroff
fbshipit-source-id: c2b57acbe924cfa59fbd85216811f29f4af05088
Summary:
Stumbled upon a little gem in the audio conversion for `SummaryWriter.add_audio()`: two Python `for` loops to convert a float array to little-endian int16 samples. On my machine, this took 35 seconds for a 30-second 22.05 kHz excerpt. The same can be done directly in numpy in 1.65 milliseconds. (No offense, I'm glad that the functionality was there!)
Would also be ready to extend this to support stereo waveforms, or should this become a separate PR?
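For context, roughly the kind of vectorized numpy conversion described above (the exact scaling/clipping used in the PR may differ):
```python
import numpy as np

# A float waveform in [-1.0, 1.0], e.g. 30 s at 22.05 kHz.
waveform = np.random.uniform(-1.0, 1.0, size=30 * 22050).astype(np.float32)

# One vectorized expression instead of per-sample Python loops:
samples = (np.clip(waveform, -1.0, 1.0) * 32767).astype("<i2")  # little-endian int16
print(samples.dtype, samples.shape)
```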
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44201
Reviewed By: J0Nreynolds
Differential Revision: D23831002
Pulled By: edward-io
fbshipit-source-id: 5c8f1ac7823d1ed41b53c4f97ab9a7bac33ea94b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45214
When in verbose mode the package exporter will produce an html visualization
of dependencies of a module to make it easier to trim out unneeded code,
or debug inclusion of things that cannot be exported.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D23873525
Pulled By: zdevito
fbshipit-source-id: 6801991573d8dd5ab8c284e09572b36a35e1e5a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45402
Previous diffs in this stack implemented the getNumKeys and deleteKey
APIs in the c10d Store as well as added tests at the C++ layer. This diff adds
tests at the Python level in test_c10d.py
ghstack-source-id: 112997161
Test Plan: Running these new python tests as well as previous C++ tests
Reviewed By: mrshenli
Differential Revision: D23955729
fbshipit-source-id: c7e0af7c884de2d488320e2a1d94aec801a782e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162
Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.
Reviewed By: mrshenli
Differential Revision: D23955730
fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
Summary: As title.
Test Plan:
FBL job without this diff failed:
f221545832
Error message:
```
NonRetryableException: AssertionError: Label is missing in training stage for HistogramBinningCalibration
```
FBL job with canary package built in this diff is running without failure:
f221650379
Reviewed By: chenshouyuan
Differential Revision: D23959508
fbshipit-source-id: c077230de29f7abfd092c84747eaabda0b532bcc
Summary:
Recent changes to the seq_num correlation behavior in the profiler (PR https://github.com/pytorch/pytorch/issues/42565) changed the behavior of emit_nvtx(record_shapes=True), which no longer prints the name of the operator properly.
This PR dumps out the name in roctx traces, irrespective of the sequence number, for ROCm only.
cc: jeffdaily sunway513
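A minimal usage sketch of the code path in question, to be run under nvprof/rocprof; the model and input are placeholder assumptions.
```python
import torch

model = torch.nn.Linear(16, 16).cuda()
inp = torch.randn(8, 16, device="cuda")

with torch.cuda.profiler.profile():
    with torch.autograd.profiler.emit_nvtx(record_shapes=True):
        model(inp)  # operator names (and shapes) are attached to the NVTX/ROCTX ranges
```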
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45229
Reviewed By: zou3519
Differential Revision: D23932902
Pulled By: albanD
fbshipit-source-id: c782667ff002b70b51f1cc921afd1b1ac533b39d
Summary:
This PR cleans up some of the rough edges around `Timer` and `Compare`
* Moves `Measurement` to be dataclass based
* Adds a bunch of type annotations. MyPy is now happy.
* Allows missing entries in `Compare`. This is one of the biggest usability issues with `Compare` right now, both from an API perspective and because the current failure mode is really unpleasant.
* Greatly expands the testing of `Compare`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45361
Test Plan: Changes to Timer are covered under existing tests, changes to `Compare` are covered by the expanded `test_compare` method.
Reviewed By: bwasti
Differential Revision: D23966816
Pulled By: robieta
fbshipit-source-id: 826969f73b42f72fa35f4de3c64d0988b61474cd
Summary:
Export of the view op with dynamic input shape is broken when using tensors with a 0-dim.
This fix removes the symbolic use of the static input size to resolve the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43558
Reviewed By: ailzhang
Differential Revision: D23965090
Pulled By: bzinodev
fbshipit-source-id: 628e9d7ee5d53375f25052340ca6feabf7ba7c53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45291
It's not necessary; you can just check if the dtype is integral.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23911963
Pulled By: gchanan
fbshipit-source-id: 230139e1651eb76226f4095e31068dded30e03e8
Summary: Adding support for type double to caffe2 MeanOp and MeanGradientOp.
Test Plan:
All tests passed.
Example FBL job failed without this diff:
f221169563
Error message:
```
c10::Error: [enforce fail at mean_op.h:72] . Mean operator only supports 32-bit float, but input was of type double (Error from operator:
input: "dpsgd_8/Copy_3" input: "dpsgd_8/Copy_4" output: "dpsgd_8/Mean_2" name: "" type: "Mean" device_option { device_type: 0 device_id: 0 })
```
Example FBL job is running without failure with the canary package built from this diff:
f221468723
Reviewed By: chenshouyuan
Differential Revision: D23956222
fbshipit-source-id: 6c81bbc390d812ae0ac235e7d025141c8402def1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294
While tracking down a recent memory corruption bug we found that
cuda-memcheck wasn't finding the bad accesses, and ngimel pointed out that
it's because we use a caching allocator so a lot of "out of bounds" accesses
land in a valid slab.
This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc. This way, cuda-memcheck will actually work.
Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.
Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826
And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```
Reviewed By: ngimel
Differential Revision: D23964734
Pulled By: bertmaher
fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45433
Primarily in order to pick up the fix landed in https://github.com/pytorch/tensorpipe/pull/225 which fixes the handling of scopes in link-local IPv6 addresses, which was reported by a user.
Test Plan: The specific upstream change is covered by new unit tests. The submodule update will be validated by the PyTorch CI.
Reviewed By: beauby
Differential Revision: D23962289
fbshipit-source-id: 4ed762fc19c4aeb1398d1337d61b3188c4c228be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45438
Adds torchelastic (as well as its dependencies) to the official docker
images
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: tierex
Differential Revision: D23963787
Pulled By: seemethere
fbshipit-source-id: 54ebb4b9c50699e543f264975dadf99badf55753
Summary:
As per title. Fixes [#38948](https://github.com/pytorch/pytorch/issues/38948). Therein you can find some blueprints for the algorithm being used in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43002
Reviewed By: zou3519
Differential Revision: D23931326
Pulled By: albanD
fbshipit-source-id: e6994af70d94145f974ef87aa5cea166d6deff1e
Summary:
Changes the deprecation of norm to a docs deprecation, since PyTorch components still rely on norm and some behavior, like automatically flattening tensors, may need to be ported to torch.linalg.norm. The documentation is also updated to clarify that torch.norm and torch.linalg.norm are distinct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45415
Reviewed By: ngimel
Differential Revision: D23958252
Pulled By: mruberry
fbshipit-source-id: fd54e807c59a2655453a6bcd9f4073cb2c12e8ac
Summary:
Fix a couple of issues with scripting inplace indexing in the prepare_inplace_ops_for_onnx pass.
1- Tracing index copy (such as cases like x[1:3] = data) already applies broadcasting on the rhs if needed. The broadcasting node (aten::expand) is missing in scripting cases.
2- Inplace indexing with ellipsis (aten::copy_) is replaced with aten::index_put and then handled with slice+select in this pass.
Support for negative indices for this op is added.
Shape inference is also enabled for scripting tests using new JIT API.
A few more tests are enabled for scripting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44351
Reviewed By: ezyang
Differential Revision: D23880267
Pulled By: bzinodev
fbshipit-source-id: 78b33444633eb7ae0fbabc7415e3b16001f5207f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45143
This PR prevents freezing from cleaning up a submodule when the user requests
that the submodule be preserved.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23844969
Pulled By: bzinodev
fbshipit-source-id: 80e6db3fc12460d62e634ea0336ae2a3551c2151
Summary:
In the ONNX NegativeLogLikelihoodLoss specification, ignore_index is optional and has no default value.
Therefore, when converting the nll op to ONNX, we need to set the ignore_index attribute even if it is not specified (e.g. ignore_index=-100).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44816
Reviewed By: ezyang
Differential Revision: D23880354
Pulled By: bzinodev
fbshipit-source-id: d0bdd58d0a4507ed9ce37133e68533fe6d1bdf2b
Summary:
Optimize the export_onnx API to reduce string and model proto exchanges in export.cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44332
Reviewed By: bwasti, eellison
Differential Revision: D23880129
Pulled By: bzinodev
fbshipit-source-id: 1d216d8f710f356cbba2334fb21ea15a89dd16fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419
Closes https://github.com/pytorch/pytorch/issues/39969
This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.
This is done by saving the shapes as an ivalue list and recovering it as the expected type (`std::vector<std::vector<int>>` on the client). A test is added to ensure that remote ops have the same `input_shapes` as if the op were run locally.
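A hedged sketch of the caller-side usage this enables; it assumes RPC has already been initialized and that a peer named "worker1" exists.
```python
import torch
import torch.distributed.rpc as rpc

x, y = torch.randn(4, 4), torch.randn(4, 4)
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    rpc.rpc_sync("worker1", torch.add, args=(x, y))

# Remote events should now carry input shapes, just as local ops do.
for evt in prof.function_events:
    if evt.input_shapes:
        print(evt.name, evt.input_shapes)
```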
ghstack-source-id: 112977899
Reviewed By: pritamdamania87
Differential Revision: D23591274
fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967
When enabling the profiler on a server, the server may be a different machine
that does not have CUDA while the caller does. Previously we would crash in this
case; now we fall back to CPU-only profiling and log a warning.
ghstack-source-id: 112977906
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D23790729
fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
Summary: Currently GetSingleArgument is overflowing since it expects an int instead of an int64 when using a 1cycle (hill policy) annealing schedule.
Test Plan:
unittest
buck test caffe2/caffe2/python/operator_test:learning_rate_op_test
Differential Revision: D23938169
fbshipit-source-id: 20d65df800d7a0f1dd9520705af31f63ae716463
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45223
Previous diffs in this stack implemented the getNumKeys and deleteKey
APIs in the c10d Store as well as added tests at the C++ layer. This diff adds
tests at the Python level in test_c10d.py
ghstack-source-id: 112939763
Test Plan: Ensured these new python tests as well as previous C++ tests pass
Reviewed By: jiayisuse
Differential Revision: D23878455
fbshipit-source-id: 0a17ecf66b28d46438a77346e5bf36414e05e25c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762
Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.
Reviewed By: jiayisuse
Differential Revision: D23009117
fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962
TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761
Test Plan: Adding tests to C++ Store Tests
Reviewed By: pritamdamania87
Differential Revision: D22985085
fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
Summary:
This PR adds a get_all_users_of function. The function returns all the users of a specific node. A unit test is also added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45216
Reviewed By: ezyang
Differential Revision: D23883572
Pulled By: scottxu0730
fbshipit-source-id: 3eb68a411c3c6db39ed2506c9cb7bb7337520ee4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45306
Adds details to the main quantization doc on specifically how
users can skip or customize quantization of layers.
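As a hedged sketch of the eager-mode pattern being documented (the module structure and qconfig choice are illustrative assumptions):
```python
import torch
import torch.quantization as tq

model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 2),
).eval()

model.qconfig = tq.get_default_qconfig("fbgemm")  # global default
model[2].qconfig = None                           # skip quantization for this layer

tq.prepare(model, inplace=True)   # inserts observers, honoring per-module qconfigs
# ... run calibration data through the model here ...
tq.convert(model, inplace=True)   # swaps in quantized modules where a qconfig is set
```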
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23917034
Pulled By: vkuzo
fbshipit-source-id: ccf71ce4300c1946b2ab63d1f35a07691fd7a2af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45305
Adds an explanation for reduce_range to the main quantization
doc page.
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23916669
Pulled By: vkuzo
fbshipit-source-id: ef93fb774cb15741cd92889f114f6ab76c39f051
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45135
The previous quantization summary had steps on what to do for
dynamic, static, QAT. This PR moves these steps to comments in the
example code, so it is more clear how to accomplish the steps.
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23842456
Pulled By: vkuzo
fbshipit-source-id: db2399e51e9ae33c8a1ac610e3d7dbdb648742b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45093
This adds a tl;dr; style summary of the quantization API
to the documentation. Hopefully this will make this easier
for new folks to learn how to use quantization.
This is not meant to be all-encompassing. Future PRs
can improve the documentation further.
Test Plan:
1. build the doc as specified in https://github.com/pytorch/pytorch#building-the-documentation
2. inspect the quantization page in Chrome, format looks good
Reviewed By: jerryzh168
Differential Revision: D23828257
Pulled By: vkuzo
fbshipit-source-id: 9311ee3f394cd83af0aeafb6e2fcdc3e0321fa38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221
This PR introduces a distributed functional optimizer, so that the
distributed optimizer can reuse the functional optimizer APIs and
maintain its own states. This enables TorchScript-compatible
functional optimizers when using the distributed optimizer, helps get rid
of the GIL, and improves the overall performance of training, especially
distributed model parallel training.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D23935256
Pulled By: wanchaol
fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44715
We have provided a nice and intuitive API in Python. But in the context of large scale distributed training (e.g. Distributed Model Parallel), users often want to use multithreaded training instead of multiprocess training, as it provides better resource utilization and efficiency.
This PR introduces the functional optimizer concept (similar to the concept of `nn.functional`): we split the optimizer into two parts: 1. optimizer state management, 2. optimizer computation. We expose the computation part as a separate functional API that is available to internal and OSS developers; the caller of the functional API maintains their own states in order to call the functional API directly. While keeping the end user API the same, the functional API is TorchScript friendly and could be used by the distributed optimizer to speed up training without the GIL.
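A toy sketch of the split between state management and computation described above; this is illustrative only, not the actual functional optimizer API, and the names and signature are assumptions.
```python
from typing import List
import torch

def functional_sgd_step(params: List[torch.Tensor],
                        grads: List[torch.Tensor],
                        lr: float) -> None:
    # Pure computation: no internal state; the caller owns params, grads, and lr.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=-lr)

# Caller-side "state management": the caller keeps the parameters and produces grads.
w = torch.randn(3, requires_grad=True)
loss = (w * 2.0).sum()
loss.backward()
functional_sgd_step([w], [w.grad], lr=0.1)
```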
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D23935258
Pulled By: wanchaol
fbshipit-source-id: d2a5228439edb3bc64f7771af2bb9e891847136a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353
Temporarily removing this feature, will add this back after branch cut.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D23939865
Pulled By: mrshenli
fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44433
Not entirely sure why, but changing the type of beta from `float` to `double in autocast_mode.cpp and FunctionsManual.h fixes my compiler errors, failing instead at link time
fixing some type errors, updated fn signature in a few more files
removing my usage of Scalar, making beta a double everywhere instead
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23636720
Pulled By: bdhirsh
fbshipit-source-id: caea2a1f8dd72b3b5fd1d72dd886b2fcd690af6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181
`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race that after initializing the process group on say rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.
To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return we are guaranteed about correct
initialization on all ranks.
Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.
Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112
Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.
Reviewed By: mrshenli
Differential Revision: D23858025
fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45356
In this PR, I'm adding a warning to the PG backend mentioning that it will
be deprecated in the future. In addition, I removed the warning from the
TP backend that said it is a beta feature.
ghstack-source-id: 112940501
Test Plan: waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D23940144
fbshipit-source-id: d44054aa1e4ef61004a40bbe0ec45ff07829aad4
Summary:
This should get the builds green again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45354
Reviewed By: zhangguanheng66
Differential Revision: D23939615
Pulled By: bwasti
fbshipit-source-id: e93b11bc9592205e52330bb15928603b0aea21ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45188
This is a symbolically traceable alternative to Python's `assert`.
It should be useful to allow people who want to use FX to also
be able to assert things.
A bunch of TODO(before land) notes are inline - would love thoughts
on where the best place for this code to live is, and what this
function should be called (since `assert` is reserved).
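A hedged sketch of what such a wrapper could look like; `fx_assert` is a hypothetical name (the PR explicitly leaves the final name open), and `torch.fx.wrap` is used so tracing records the call instead of evaluating the symbolic condition.
```python
import torch
import torch.fx

# Hypothetical name for the wrapper; the real function may be named differently.
def fx_assert(condition, message):
    assert condition, message

# Treat fx_assert as a leaf so symbolic tracing records a call_function node.
torch.fx.wrap("fx_assert")

class M(torch.nn.Module):
    def forward(self, x):
        fx_assert(x.shape[0] > 0, "batch dimension must be non-empty")
        return x + 1

gm = torch.fx.symbolic_trace(M())
print(gm.graph)           # contains a call_function node for fx_assert
gm(torch.randn(2, 3))     # the assertion actually runs at execution time
```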
Test Plan:
```
python test/test_fx.py TestFX.test_symbolic_trace_assert
```
Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23861567
fbshipit-source-id: d9d6b9556140faccc0290eba1fabea401d7850de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44923
This ensures that RPC profiling works in single-threaded server
scenarios and that we won't make the assumption that we'll have multiple
threads when working on this code. For example, this assumption resulted in a
bug in the previous diff (which was fixed)
ghstack-source-id: 112868469
Test Plan: CI
Reviewed By: lw
Differential Revision: D23691304
fbshipit-source-id: b17d34ade823794cbe949b70a5ab35723d974203
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664
Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.
To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.
For example, if the following async function is ran on a server over RPC:
```
def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```
we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:
```
Name  Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
------------------------------------------------------------------------------------------------------
rpc_async#slow_async_add(worker1 -> worker2)  0.00%  0.000us  0  1.012s  1.012s  1  1
aten::empty  7.02%  11.519us  7.02%  11.519us  11.519us  1  1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)  0.00%  0.000us  0  1.006s  1.006s  1  2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty  7.21%  11.843us  7.21%  11.843us  11.843us  1  2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add  71.94%  118.107us  85.77%  140.802us  140.802us  1  3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty  13.82%  22.695us  13.82%  22.695us  22.695us  1  3
Self CPU time total: 164.164us
```
This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470
Test Plan:
```
rvarm1@devbig978:fbcode (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```
Reviewed By: mrshenli
Differential Revision: D23638387
fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45264
Context for why we are porting to gtest in: https://github.com/pytorch/pytorch/pull/45018.
This PR completes the process of porting and removes unused files/macros.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23901392
Pulled By: suo
fbshipit-source-id: 89526890e1a49462f3f77718f4ee273c5bc578ba
Summary:
The Cuda HalfChecker casts up all loads and stores of Half to Float, so we do math in Float on the device. It didn't cast up HalfImmediate (i.e. constants), so those could introduce mixed-size ops. The fix is to cast them up as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45213
Reviewed By: ezyang
Differential Revision: D23885287
Pulled By: nickgg
fbshipit-source-id: 912991d85cc06ebb282625cfa5080d7525c8eba9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45257
Currently we inline fork-wait calls when we insert observers for quantization.
In the case where fork and wait are in different subgraphs, inlining the fork-wait calls
only gets rid of the fork. This leaves the aten::wait call in the graph with a torch.Tensor as input,
which is currently not supported.
To avoid this, we check in the cleanup phase that the input to all wait calls in the graph is of type Future[Tensor].
Test Plan:
python test/test_quantization.py TestQuantizeJitPasses.test_quantize_fork_wait
Imported from OSS
Reviewed By: qizzzh
Differential Revision: D23895412
fbshipit-source-id: 3c58c6be7d7e7904eb6684085832ac21f827a399
Summary:
I noticed while working on https://github.com/pytorch/pytorch/issues/45163 that edits to python files in the `tools/codegen/api/` directory wouldn't trigger rebuilds. This tells CMake about all of the dependencies, so rebuilds are triggered automatically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45275
Reviewed By: zou3519
Differential Revision: D23922805
Pulled By: ezyang
fbshipit-source-id: 0fbf2b6a9b2346c31b9b0384e5ad5e0eb0f70e9b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44071
Previously, tracing re-gathered ScalarType, Layout, Device, bool into a TensorOptions object and called `tracer::addInput()` on the gathered TensorOptions argument. `tracer::addInput()` then scattered them again and added the individual scattered arguments to the traced graph. This PR avoids the extraneous gathering and re-scattering step and calls `tracer::addInput()` on the individual arguments directly. This avoid the perf hit for an unnecessary gathering step.
This applies to both c10-full and non-c10-full ops. In the case of c10-full ops, the tracing kernels takes scattered arguments and we can directly pass them to `tracer::addInput()`. In the case of non-c10-full ops, the kernel takes a `TensorOptions` argument but we still call `tracer::addInput()` on the scattered arguments.
ghstack-source-id: 112825793
Test Plan:
waitforsandcastle
vs master: https://www.internalfb.com/intern/fblearner/details/216129483/
vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170069/
Reviewed By: ezyang
Differential Revision: D23486638
fbshipit-source-id: e0b53e6673cef8d7f94158e718301eee261e5d22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062
Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step. Calling into a BackencSelect kernel required taking the individual scattered arguments, packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.
Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789
Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/
vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/
Reviewed By: ezyang
Differential Revision: D23484192
fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44005
Previously, VariableType and TraceType kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable.
Now with this PR, variable and tracing kernels are written in the new way and no hacky_wrapper is needed for them.
ghstack-source-id: 112825791
Test Plan:
waitforsandcastle
https://www.internalfb.com/intern/fblearner/details/215954270/
Reviewed By: ezyang
Differential Revision: D23466042
fbshipit-source-id: bde730a9e3bb4cb80ad484417be1ebecbdc2d377
Summary:
A lot of changes are in this update, some highlights:
- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved sync threads placement with shared memory and removed a read-before-write race
- Fixes to FP16 reduction fusions where output would come back as FP32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218
Reviewed By: ezyang
Differential Revision: D23905183
Pulled By: soumith
fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45317
Eager mode quantization depends on the presence of the `qconfig`
model attribute. Currently, converting a model to use `SyncBatchNorm`
removes the qconfig; this PR fixes that. This is important if a BN is not
fused to anything during quantization convert.
Test Plan:
```
python test/test_quantization.py TestDistributed.test_syncbn_preserves_qconfig
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23922072
fbshipit-source-id: cc1bc25c8e5243abb924c6889f78cf65a81be158
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45134
Per-Op-Registration was a mechanism used for mobile selective build v0. Since then, a new dispatching mechanism has been built for PyTorch, and this code path isn't used any more. Remove it to simplify understanding/updating the code-generator's code-flow.
ghstack-source-id: 112723942
Test Plan: `buck build` and sandcastle.
Reviewed By: ezyang
Differential Revision: D23806632
fbshipit-source-id: d93cd324650c541d9bfc8eeff2ddb2833b988ecc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45284
This is the 2nd batch of the change described in #45010.
In this batch we relaxed some filters to cover more 'backend specific' ops:
* ops that do not call any 'Tensor::is_xxx()' method OR only call
'Tensor::is_cuda()' - we are adding the CUDA dispatch key anyway;
* ops that call other ATen ops but ARE differentiable - differentiability
is a fuzzy indicator of not being 'composite';
Inherited other filters from the 1st batch:
* These ops don't already have dispatch section in native_functions.yaml;
* These ops call one or more DispatchStub (thus "backend specific");
Differential Revision: D23909901
Test Plan: Imported from OSS
Reviewed By: ailzhang
Pulled By: ljk53
fbshipit-source-id: 3b31e176324b6ac814acee0b0f80d18443bd81a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44430
Log metadata even when model loading fails.
Test Plan: {F331550976}
Reviewed By: husthyc
Differential Revision: D23577711
fbshipit-source-id: 0504e75625f377269f1e5df0f1ebe34b8e564c4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45315
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45314
In D23858329 (721cfbf842), we put the PriorCorrectionCalibrationPrediction unit test in an OSS file, which causes a test failure in the public trunk.
This diff moves it to the FB-only test file.
Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op
buck test //caffe2/caffe2/fb/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op
```
all pass.
Reviewed By: houseroad
Differential Revision: D23899012
fbshipit-source-id: 1ed97d8702e2765991e6caf5695d4c49353dae82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45162
This test was flaky because it was not able to validate that the
overall record_function's CPU times are greater than the sum of its children.
It turns out that this is a general bug in the profiler that can be reproduced
without RPC, see https://github.com/pytorch/pytorch/issues/45160. Hence,
removing this from the test and replacing it by just validating the expected
children.
Ran the test 1000 times and they all passed.
ghstack-source-id: 112632327
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23851854
fbshipit-source-id: 5d9023acd17800a6668ba4849659d8cc902b8d6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44972
Previously, our fusion strategy would be:
- start at the end of the block, find a fusable node
- iteratively try to merge inputs into the fusion group, sorted topologically
This strategy works pretty well, but has the possibility of missing fusion groups. See my attached test case for an example where we wouldn't find all possible fusion groups. bertmaher found an example of a missed fusion group in one of our rnn examples (jit_premul) that caused a regression from the legacy fuser.
Here, I'm updating our fusion strategy to be the same as our other fusion passes - create_autodiff_subgraphs, and graph_fuser.cpp.
The basic strategy is:
- iterate until you find a fusible node
- try to merge the node's inputs; whenever a successful merge occurs, restart at the beginning of the node's inputs
- after you've exhausted a node, continue searching the block for fusion opportunities from the node
- continue doing this on the block until we go through an iteration without any successful merges
Since we create the fusion groups once, and only re-specialize within the fusion groups, we should be running this very infrequently (it only re-triggers when we fail undefinedness specializations). Also, because it's the same algorithm as the existing fuser, it is unlikely to cause a regression.
Test Plan: Imported from OSS
Reviewed By: Krovatkin, robieta
Differential Revision: D23821581
Pulled By: eellison
fbshipit-source-id: e513d1ef719120dadb0bfafc7a14f4254cd806ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44238
Refactor create_autodiff_subgraphs to use the same logic for updating output aliasing properties as the tensorexpr fuser, and factor that logic out into a common function in subgraph utils.
Test Plan: Imported from OSS
Reviewed By: Krovatkin, robieta
Differential Revision: D23871565
Pulled By: eellison
fbshipit-source-id: 72df253b16baf8e4aabf3d68b103b29e6a54d44c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45096
Add operator to compute the equalization scale. This will be used in the integration of equalization into dper int8 fixed quant scheme quantization flow.
Design docs:
https://fb.quip.com/bb7SAGBxPGNC
https://fb.quip.com/PDAOAsgoLfRr
Test Plan: buck test caffe2/caffe2/quantization/server:compute_equalization_scale_test
Reviewed By: jspark1105
Differential Revision: D23779870
fbshipit-source-id: 5e6a8c220935a142ecf8e61100a8c71932afa8d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45178
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.
## Summary
* Adds a hypothesis test for queue ops cancellation.
Test Plan:
## Unit test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```
```
Summary
Pass: 1000
ListingSuccess: 1
```
Reviewed By: d4l3k
Differential Revision: D23847576
fbshipit-source-id: 2fc351e1ee13ea8b32d976216d2d01dfb6fcc1ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45177
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.
## Summary
* When an error occurs in a net or it is cancelled, running ops will have their
`Cancel` method called.
This diff adds a `Cancel` method to `SafeEnqueueBlobsOp`
and `SafeDequeueBlobsOp` that calls queue->close() to force all the
blocking ops to return.
* Adds unit test that verified the error propagation.
Test Plan:
## Unit test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```
```
Summary
Pass: 1000
ListingSuccess: 1
```
Reviewed By: d4l3k
Differential Revision: D23846967
fbshipit-source-id: c7ddd63259e033ed0bed9df8e1b315f87bf59394
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44643
This method is not used anywhere else.
Also formatted the file.
Test Plan: buck test caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks
Reviewed By: pritamdamania87
Differential Revision: D23675945
fbshipit-source-id: 2d04f94589a20913e46b8d71e6a39b70940c1461
Summary:
Modify contbuild to disable sanitizers, and add an option to run the "cuda" test using TPX RE.
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: walterddr, cspanda
Differential Revision: D23854578
fbshipit-source-id: 327d7cc3655c17034a6a7bc78f69967403290623
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with things like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check that these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`
cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240
Reviewed By: mruberry
Differential Revision: D23882498
Pulled By: ngimel
fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45238
Adds a warning when the discrepancy in the number of inputs across different
processes is much higher than expected when running with uneven inputs. A skew
in the thousands can reduce performance by a nontrivial amount, as shown in
benchmarks, so adding this warning was proposed. Tested by running the tests so
that the threshold is hit and observing the output.
ghstack-source-id: 112773552
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23719270
fbshipit-source-id: 306264f62c1de65e733696a912bdb6e9376d5622
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45144
Moves prim ops from C10 back to JIT.
These were originally moved to C10 from JIT in D19237648 (f362cd510d)
ghstack-source-id: 112775781
Test Plan:
buck test //caffe2/test/cpp/jit:jit
https://pxl.cl/1l22N
buck test adsatlas/gavel/lib/ata_processor/tests:ata_processor_test
https://pxl.cl/1lBxD
Reviewed By: iseeyuan
Differential Revision: D23697598
fbshipit-source-id: 36d1eb8c346e9b161ba6af537a218440a9bafd27
Summary:
I noticed that the recently introduced adaptive_autorange tests occasionally time out in CI, and I've been meaning to improve the Timer tests for a while. This PR allows unit tests to swap the measurement portion of `Timer` with a deterministic mock so we can thoroughly test behavior without having to worry about flaky CI measurements. It also means that the tests can be much more detailed and still finish very quickly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45173
Test Plan: You're lookin' at it.
Reviewed By: ezyang
Differential Revision: D23873548
Pulled By: robieta
fbshipit-source-id: 26113e5cea0cbf46909b9bf5e90c878c29e87e88
Summary:
In this PR:
1) Added binary operations with ScalarLists.
2) Fixed _foreach_div(...) bug in native_functions
3) Covered all possible cases with scalars and scalar lists in tests
4) [minor] fixed bug in native_functions by adding "use_c10_dispatcher: full" to all _foreach functions
tested via unit tests
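As a hedged usage sketch of the scalar-list overloads described above (the exact exposed overloads are an assumption based on this summary, not a guaranteed API):
```python
import torch

tensors = [torch.ones(2), torch.ones(3)]
scalars = [1.0, 2.0]

# Elementwise: tensors[i] + scalars[i] for each pair in the two lists.
out = torch._foreach_add(tensors, scalars)
print(out)  # two tensors: [2., 2.] and [3., 3., 3.]

# In-place variant operating on the same tensor list.
torch._foreach_mul_(tensors, scalars)
```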
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743
Reviewed By: bwasti, malfet
Differential Revision: D23753711
Pulled By: izdeby
fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984
Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops
Reviewed By: ezyang
Differential Revision: D23885259
Pulled By: asuhan
fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44550
Part of the `torch.fft` work (gh-42175).
This adds n-dimensional transforms: `fftn`, `ifftn`, `rfftn` and `irfftn`.
This is aiming for correctness first, with the implementation on top of the existing `_fft_with_size` restrictions. I plan to follow up later with a more efficient rewrite that makes `_fft_with_size` work with arbitrary numbers of dimensions.
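As a quick usage sketch (illustrative, using the public `torch.fft` module rather than code from this PR):
```python
import torch
import torch.fft  # the torch.fft module needs an explicit import in early releases

# Complex n-dimensional FFT and its inverse round-trip.
x = torch.randn(8, 8, dtype=torch.complex64)
assert torch.allclose(torch.fft.ifftn(torch.fft.fftn(x)), x, atol=1e-4)

# Real-input transforms: rfftn keeps only the non-redundant half of the spectrum.
r = torch.randn(8, 8)
R = torch.fft.rfftn(r)                            # shape (8, 5)
assert torch.allclose(torch.fft.irfftn(R, s=r.shape), r, atol=1e-4)
```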
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D23846032
Pulled By: mruberry
fbshipit-source-id: e6950aa8be438ec5cb95fb10bd7b8bc9ffb7d824
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45235
This is so that users know that the profiler works as expected with
RPC and they can learn how to use it to profile RPC-based workloads.
ghstack-source-id: 112773748
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23777888
fbshipit-source-id: 4805be9b949c8c7929182f291a6524c3c6a725c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45149
`choose_qparams_optimized` calculates the optimized qparams.
It uses a greedy approach to nudge the min and max, and tries to minimize the
quantization error `torch.norm(x - fake_quant(x, s, z))`.
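A rough Python sketch of the objective being minimized (the simple range-shrinking search, bin count, and trial count below are illustrative assumptions, not the actual greedy implementation):
```python
import torch

def qparams_search_sketch(x, n_bins=256, n_trials=50):
    # Illustrative only: shrink the observed [min, max] range step by step and
    # keep the candidate that minimizes torch.norm(x - fake_quant(x, s, z)).
    lo, hi = x.min().item(), x.max().item()
    step = (hi - lo) / (2.0 * n_trials)

    def quant_error(lo_c, hi_c):
        scale = max((hi_c - lo_c) / (n_bins - 1), 1e-8)
        zero_point = min(max(int(round(-lo_c / scale)), 0), n_bins - 1)
        xq = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, n_bins - 1)
        return torch.norm(x - xq).item()

    best_lo, best_hi, best_err = lo, hi, quant_error(lo, hi)
    for i in range(1, n_trials):
        cand_lo, cand_hi = lo + i * step, hi - i * step
        err = quant_error(cand_lo, cand_hi)
        if err < best_err:
            best_lo, best_hi, best_err = cand_lo, cand_hi, err
    return best_lo, best_hi

print(qparams_search_sketch(torch.randn(1000)))
```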
Test Plan: Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23848060
fbshipit-source-id: c6c57c9bb07664c3f1c87dd7664543e09f634aee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45231
Two operators,
`PriorCorrectionCalibrationPrediction` and `GatherRangesToDense`, are not supported in PT, which prevents GLOW from working.
To unblock, we first use a C2->PT conversion. In the long term, we need to implement PT custom ops.
This diff adds that conversion to unblock the current project.
Test Plan:
Run unit tests. The test input is from a current DPER example.
All pass.
```buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op --print-passing-details
> c2 reference output
> [0.14285715 0.27272728 0.39130434 0.5 ]
> PT converted output
> tensor([0.1429, 0.2727, 0.3913, 0.5000])
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op --print-passing-details
c2 reference output
> [array([[6, 5, 4, 3], [0, 0, 0, 0]], dtype=int64)]
> PT converted output
> [tensor([[6, 5, 4, 3], [0, 0, 0, 0]])]
```
Reviewed By: allwu, qizzzh
Differential Revision: D23858329
fbshipit-source-id: ed37118ca7f09e1cd0ad1fdec3d37f66dce60dd9
Summary:
There is a tool called `2to3` whose `future` fixer specifically removes these imports; the `caffe2` directory has the most redundant imports:
```2to3 -f future -w caffe2```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
We need to check if dtypes differ in scalar type or lanes to decide between
Cast and Broadcast.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45179
Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyBroadcastTermExpander
Reviewed By: bwasti
Differential Revision: D23873316
Pulled By: asuhan
fbshipit-source-id: ca141be67e10c2b6c5f2ff9c11e42dcfc62ac620
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44835
This is for feature parity with fx graph mode quantization
Test Plan: Imported from OSS
Reviewed By: z-a-f
Differential Revision: D23745086
fbshipit-source-id: ae2fc86129f9896d5a9039b73006a4da15821307
Summary:
Arithmetic operations on Bool aren't fully supported in the evaluator. Moreover,
such semantics can be implemented by the client code through insertion of
explicit casts to widen and narrow to the desired types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44677
Test Plan:
test_tensorexpr --gtest_filter=TensorExprTest.ExprDisallowBoolArithmetic
python test/test_jit_fuser_te.py
Reviewed By: agolynski
Differential Revision: D23801412
Pulled By: asuhan
fbshipit-source-id: fff5284e3a216655dbf5a9a64d1cb1efda271a36
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43872
This PR allows recursive scripting to use a separate
`submodule_stubs_fn` to create its submodules with specific user-provided
rules.
Fixes https://github.com/pytorch/pytorch/issues/43729
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D23430176
Pulled By: wanchaol
fbshipit-source-id: 20530d7891ac3345b36f1ed813dc9c650b28d27a
Summary:
When doing a splitWithMask we only mask if the loop extent is not cleanly divided by the split factor. However, the logic does not simplify the extents, so any nontrivial loop extent (e.g. one produced by a previous split) will always cause a mask to be added. Unlike splitWithTail, the masks added by splitWithMask are always overhead, and we don't have the analysis to optimize them out when they are unnecessary, so it's good to avoid inserting them if we can.
The fix is simply to simplify the loop extents before doing the extent calculation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45141
Reviewed By: ezyang
Differential Revision: D23869170
Pulled By: nickgg
fbshipit-source-id: 44686fd7b802965ca4f5097b0172a41cf837a1f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44856
Support following format of qconfig_dict
```python
qconfig_dict = {
# optional, global config
"": qconfig?,
# optional, used for module and function types
# could also be split into module_types and function_types if we prefer
"object_type": [
(nn.Conv2d, qconfig?),
(F.add, qconfig?),
...,
],
# optional, used for module names
"module_name": [
("foo.bar", qconfig?)
...,
],
# optional, matched in order, first match takes precedence
"module_name_regex": [
("foo.*bar.*conv[0-9]+", qconfig?)
...,
]
# priority (in increasing order): global, object_type, module_name_regex, module_name
# qconfig == None means fusion and quantization should be skipped for anything
# matching the rule
}
```
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23751304
fbshipit-source-id: 5b98f4f823502b12ae2150c93019c7b229c49c50
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44684
The ad-hoc quantization benchmarking script in D23689062 recently highlighted that quantized ops were surprisingly slow after the introduction of support for custom ops in torch.fx in D23203204 (f15e27265f).
Using strobelight, it's immediately clear that up to 66% of samples were seen in `c10::get_backtrace`, which descends from `torch::is_tensor_and_append_overloaded -> torch::check_has_torch_function -> torch::PyTorch_LookupSpecial -> PyObject_HasAttrString -> PyObject_GetAttrString`.
I'm no expert by any means so please correct any/all misinterpretation, but it appears that:
- `check_has_torch_function` only needs to return a bool
- `PyTorch_LookupSpecial` should return `NULL` if a matching method is not found on the object
- in the impl of `PyTorch_LookupSpecial` the return value from `PyObject_HasAttrString` only serves as a bool to return early, but ultimately ends up invoking `PyObject_GetAttrString`, which raises, spawning the generation of a backtrace
- `PyObject_FastGetAttrString` returns `NULL` (stolen ref to an empty py::object if the if/else if isn't hit) if the method is not found, anyway, so it could be used singularly instead of invoking both `GetAttrString` and `FastGetAttrString`
- D23203204 (f15e27265f) compounded (but maybe not directly caused) the problem by increasing the number of invocations
so, removing it in this diff and seeing how many things break :)
before:
strobelight: see internal section
output from D23689062 script:
```
$ ./buck-out/gen/scripts/v/test_pt_quant_perf.par
Sequential(
(0): Quantize(scale=tensor([0.0241]), zero_point=tensor([60]), dtype=torch.quint8)
(1): QuantizedLinear(in_features=4, out_features=4, scale=0.017489388585090637, zero_point=68, qscheme=torch.per_tensor_affine)
(2): DeQuantize()
)
fp 0.010896682739257812
q 0.11908197402954102
```
after:
strobelight: see internal section
output from D23689062 script:
```
$ ./buck-out/gen/scripts/v/test_pt_quant_perf.par
Sequential(
(0): Quantize(scale=tensor([0.0247]), zero_point=tensor([46]), dtype=torch.quint8)
(1): QuantizedLinear(in_features=4, out_features=4, scale=0.012683945707976818, zero_point=41, qscheme=torch.per_tensor_affine)
(2): DeQuantize()
)
fp 0.011141300201416016
q 0.022639036178588867
```
which roughly restores original performance seen in P142370729
UPDATE: 9/22 mode/opt benchmarks
```
buck run //scripts/x:test_pt_quant_perf mode/opt
Sequential(
(0): Quantize(scale=tensor([0.0263]), zero_point=tensor([82]), dtype=torch.quint8)
(1): QuantizedLinear(in_features=4, out_features=4, scale=0.021224206313490868, zero_point=50, qscheme=torch.per_tensor_affine)
(2): DeQuantize()
)
fp 0.002968311309814453
q 0.5138928890228271
```
with patch:
```
buck run //scripts/x:test_pt_quant_perf mode/opt
Sequential(
(0): Quantize(scale=tensor([0.0323]), zero_point=tensor([70]), dtype=torch.quint8)
(1): QuantizedLinear(in_features=4, out_features=4, scale=0.017184294760227203, zero_point=61, qscheme=torch.per_tensor_affine)
(2): DeQuantize()
)
fp 0.0026655197143554688
q 0.0064449310302734375
```
Reviewed By: ezyang
Differential Revision: D23697334
fbshipit-source-id: f756d744688615e01c94bf5c48c425747458fb33
Summary:
Although PyTorch already supports CUDA 11, the Dockerfile still relies on CUDA 10. This pull request upgrades all the necessary versions such that recent NVIDIA GPUs like A100 can be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45071
Reviewed By: ezyang
Differential Revision: D23873224
Pulled By: seemethere
fbshipit-source-id: 822c25f183dcc3b4c5b780c00cd37744d34c6e00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43790
Interface calls were not handled properly when they are used in a fork
subgraph. This PR fixes this issue.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23402039
Pulled By: bzinodev
fbshipit-source-id: 41adc5ee7d942250e732e243ab30e356d78d9bf7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45159
By default, pybind11 binds void* to be capsules. After a lot of
Googling, I have concluded that this is not actually useful:
you can't actually create a capsule from Python land, and our
data_ptr() function returns an int, which means that the
function is effectively unusable. It didn't help that we had no
tests exercising it.
I've replaced the void* with uintptr_t, so that we now accept int
(and you can pass data_ptr() in directly). I'm not sure if we
should make these functions accept ctypes types; unfortunately,
pybind11 doesn't seem to have any easy way to do this.
Fixes #43006
Also added cudaHostUnregister which was requested.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D23849731
Pulled By: ezyang
fbshipit-source-id: 8a79986f3aa9546abbd2a6a5828329ae90fd298f
Summary:
This is a small developer quality of life improvement. I commonly try to run some snippet of python as I'm working on a PR and forget that I've cd-d into the local clone to run some git commands, resulting in annoying failures like:
`ImportError: cannot import name 'default_generator' from 'torch._C' (unknown location)`
This actually took a non-trivial amount of time to figure out the first time I hit it, and even now it's annoying because it happens just infrequently enough to not sit high in the mental cache.
This PR adds a check to `torch/__init__.py` and warns if `import torch` is likely resolving to the wrong thing:
```
WARNING:root:You appear to be importing PyTorch from a clone of the git repo:
/data/users/taylorrobie/repos/pytorch
This will prevent `import torch` from resolving to the PyTorch install
(instead it will try to load /data/users/taylorrobie/repos/pytorch/torch/__init__.py)
and will generally lead to other failures such as a failure to load C extensions.
```
so that the soon to follow internal import failure makes some sense. I elected to make this a warning rather than an exception because I'm not 100% sure that it's **always** wrong. (e.g. weird `PYTHONPATH` or `importlib` corner cases.)
EDIT: There are now separate cases for `cwd` vs. `PYTHONPATH`, and failure is an `ImportError`.
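A minimal sketch of the kind of check described above (illustrative only, not the exact code added to `torch/__init__.py`):
```python
import logging
import os

def warn_if_importing_from_source_tree():
    # If the current working directory looks like a PyTorch source checkout,
    # `import torch` will resolve to the repo's torch/ package instead of the
    # installed one, and C extensions will fail to load.
    candidate = os.path.join(os.getcwd(), "torch", "__init__.py")
    if os.path.exists(candidate):
        logging.warning(
            "You appear to be importing PyTorch from a clone of the git repo:\n"
            "  %s\n"
            "This will prevent `import torch` from resolving to the installed "
            "PyTorch package.", os.getcwd())

warn_if_importing_from_source_tree()
```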
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39995
Reviewed By: malfet
Differential Revision: D23817209
Pulled By: robieta
fbshipit-source-id: d9ac567acb22d9c8c567a8565a7af65ac624dbf7
Summary:
This prevents DrCI from misidentifying compilation failures, such as the one below, as test failures:
```
/var/lib/jenkins/workspace/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: use of undeclared identifier 'strtod_l'
return ((int*)(&strtod_l))[argc];
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45183
Reviewed By: ezyang
Differential Revision: D23859267
Pulled By: malfet
fbshipit-source-id: 283d9bd2ab712f23239b72f3758d121e2d026fb0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44983
`_all_gather` was converted from `_wait_all_workers` and inherited its
fixed 5-second timeout. As `_all_gather` is meant to support a broader
set of use cases, the timeout configuration should be more flexible.
This PR makes `rpc._all_gather` use the global default RPC timeout.
Test Plan: Imported from OSS
Reviewed By: pritamdamania87
Differential Revision: D23794383
Pulled By: mrshenli
fbshipit-source-id: 382f52c375f0f25c032c5abfc910f72baf4c5ad9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44960
Since we have templated selective build, it should be safe to move the operators to prim so that they can be selectively built on mobile.
Test Plan: CI
Reviewed By: linbinyu
Differential Revision: D23772025
fbshipit-source-id: 52cebae76e4df5a6b2b51f2cd82f06f75e2e45d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45065
To preserve backwards compatibility with applications that were passing in some ProcessGroupRpcBackendOptions but were not explicitly setting backend=BackendType.PROCESS_GROUP, we now infer the backend type from the options if only the latter are passed. If neither is passed, we default to TensorPipe, as before this change.
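For illustration, a hedged sketch of the calling pattern this preserves (addresses and option values are placeholders):
```python
import torch.distributed.rpc as rpc

# Only ProcessGroupRpcBackendOptions is passed, with no explicit backend:
# the backend is now inferred to be PROCESS_GROUP rather than silently
# switching to the TensorPipe default.
opts = rpc.ProcessGroupRpcBackendOptions(init_method="tcp://localhost:29500")
rpc.init_rpc("worker0", rank=0, world_size=1, rpc_backend_options=opts)
rpc.shutdown()
```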
ghstack-source-id: 112586258
Test Plan: Added new unit tests.
Reviewed By: pritamdamania87
Differential Revision: D23814289
fbshipit-source-id: f4be7919e0817a4f539a50ab12216dc3178cb752
Summary:
combineMultilane used the wrong order when ramp was on the left hand side,
which matters for subtract.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45157
Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyRampSubBroadcast
Reviewed By: ailzhang
Differential Revision: D23851751
Pulled By: asuhan
fbshipit-source-id: 864d1611e88769fb43327ef226bb3310017bf858
Summary:
Otherwise, invoking something like `python -c "import torch._C;print(torch._C.ListType(None))"` will result in SIGSEGV
Discovered while trying to create a torch script for a function with the following type annotation: `Tuple[int, Ellipsis] -> None`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44958
Reviewed By: suo
Differential Revision: D23799906
Pulled By: malfet
fbshipit-source-id: 916a243007d13ed3e7a5b282dd712da3d66e3bf7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45015
torch.package allows you to write packages of code, pickled python data, and
arbitrary binary and text resources into a self-contained package.
torch.package.PackageExporter writes the packages and
torch.package.PackageImporter reads them.
The importers can load this code in a hermetic way, such that code is loaded
from the package rather than the normal python import system. This allows
for the packaging of PyTorch model code and data so that it can be run
on a server or used in the future for transfer learning.
The code contained in packages is copied file-by-file from the original
source when it is created, and the file format is a specially organized
zip file. Future users of the package can unzip the package, and edit the code
in order to perform custom modifications to it.
The importer for packages ensures that code in the module can only be loaded from
within the package, except for modules explicitly listed as external using :method:`extern_module`.
The file `extern_modules` in the zip archive lists all the modules that a package externally depends on.
This prevents "implicit" dependencies where the package runs locally because it is importing
a locally-installed package, but then fails when the package is copied to another machine.
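A hedged usage sketch of the exporter/importer pair described above (method names follow the current public `torch.package` API and may differ from this prototype):
```python
from torch.package import PackageExporter, PackageImporter

# Write a text resource into a self-contained zip package.
with PackageExporter("demo_package.pt") as exporter:
    exporter.save_text("config", "settings.txt", "learning_rate=0.01")

# Read it back through the package's own importer rather than the filesystem.
importer = PackageImporter("demo_package.pt")
print(importer.load_text("config", "settings.txt"))
```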
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23824337
Pulled By: zdevito
fbshipit-source-id: 1247c34ba9b656f9db68a83e31f2a0fbe3bea6bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44655
Since `toHere()` does not execute operations over RPC and simply
transfers the value to the local node, we don't need to enable the profiler
remotely for this message. This causes unnecessary overhead and is not needed.
Since `toHere` is a blocking call, we already profile the call on the local node using `RECORD_USER_SCOPE`, so this does not change the expected profiler results (validated by ensuring all remote profiling tests pass).
ghstack-source-id: 112605610
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23641466
fbshipit-source-id: 109d9eb10bd7fe76122b2026aaf1c7893ad10588
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44653
This changes the profiler per a discussion with ilia-cher offline that enables `disableProfiler()` event consolidation logic to be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387 where we defer profiling event collection until executing an async callback that can execute on a different thread, to support RPC async function profiling.
This is done by introducing 2 flags, `cleanupTLSState` and `consolidate`, which control whether we should clean up thread local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatibility is ensured since both options are true by default.
Added a test in `test_misc.cpp` to test this.
ghstack-source-id: 112605620
Reviewed By: mrshenli
Differential Revision: D23638499
fbshipit-source-id: f5bbb0d41ef883c5e5870bc27e086b8b8908f46b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44646
Per a discussion with ilia-cher, this is not needed anymore and
removing it would make some future changes to support async RPC profiling
easier. Tested by ensuring profiling tests in `test_autograd.py` still pass.
ghstack-source-id: 112605618
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23683998
fbshipit-source-id: 4e49a439509884fe04d922553890ae353e3331ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45098
**Summary**
This commit adds support for default arguments in methods of class
types. Similar to how default arguments are supported for regular
script functions and methods on scripted modules, default values are
retrieved from the definition of a TorchScript class in Python as Python
objects, converted to IValues, and then attached to the schemas of
already compiled class methods.
**Test Plan**
This commit adds a set of new tests to TestClassType to test default
arguments.
**Fixes**
This commit fixes #42562.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23844769
Pulled By: SplitInfinity
fbshipit-source-id: ceedff7703bf9ede8bd07b3abcb44a0f654936bd
Summary:
This flag simply allows users to get fusion groups that will *eventually* have shapes (such that `getOperation` is valid).
This is useful for doing early analysis and compiling just in time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44401
Reviewed By: ZolotukhinM
Differential Revision: D23656140
Pulled By: bwasti
fbshipit-source-id: 9a26c202752399d1932ad7d69f21c88081ffc1e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45110
A recent change in DSNN quantizes the ad embedding to 8 bits. Ad embeddings are part of the inputs to the DSNN merge net. To correctly pass shape hints of input tensors including quantized ad embeddings, we need to be able to annotate the data types in shape hints.
A note on corner cases: if the type is omitted or is not a valid type (e.g., whitespace), instead of throwing an exception I decided to return the default type, float.
Test Plan:
```
buck test caffe2/caffe2/fb/opt:shape_info_utils_test
```
Reviewed By: yinghai
Differential Revision: D23834091
fbshipit-source-id: 5e072144a7a7ff4b5126b618062dfc4041851dd3
Summary:
The ATen/native/cuda headers were copied to torch/include, but then not included in the final package. Further, add the ATen/native/hip headers to the installation as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45097
Reviewed By: mruberry
Differential Revision: D23831006
Pulled By: malfet
fbshipit-source-id: ab527928185faaa912fd8cab208733a9b11a097b
Summary:
NVIDIA GPUs are binary compatible within a major compute capability revision.
This prevents "GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation." messages from appearing, since CUDA-11 does not support code generation for sm_86.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45130
Reviewed By: ngimel
Differential Revision: D23841556
Pulled By: malfet
fbshipit-source-id: bcfc9e8da63dfe62cdec06909b6c049aaed6a18a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44766
There might be modules that are not symbolically traceable, e.g. LSTM (since it has
input-dependent control flow). To support quantization in these cases, the user will provide
the corresponding observed and quantized versions of the custom module: the observed
custom module has observers already inserted in the module, and the quantized version
has the corresponding ops quantized. Use
```
from torch.quantization import register_observed_custom_module_mapping
from torch.quantization import register_quantized_custom_module_mapping
register_observed_custom_module_mapping(CustomModule, ObservedCustomModule)
register_quantized_custom_module_mapping(CustomModule, QuantizedCustomModule)
```
to register the custom module mappings. We'll also need to define a custom delegate class
for symbolic trace in order to prevent the custom module from being traced:
```python
class CustomDelegate(DefaultDelegate):
def is_leaf_module(self, m):
return (m.__module__.startswith('torch.nn') and
not isinstance(m, torch.nn.Sequential)) or \
isinstance(m, CustomModule)
m = symbolic_trace(original_m, delegate_class=CustomDelegate)
```
Test Plan: Imported from OSS
Reviewed By: z-a-f
Differential Revision: D23723455
fbshipit-source-id: 50d666e29b94cbcbea5fb6bcc73b00cff87eb77a
Summary:
This is a sub-task for addressing https://github.com/pytorch/pytorch/issues/42969. We re-enable the type check for `autocast_test_lists`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45107
Test Plan:
`python test/test_type_hints.py` passed:
```
(pytorch) bash-5.0$ with-proxy python test/test_type_hints.py
....
----------------------------------------------------------------------
Ran 4 tests in 103.871s
OK
```
Reviewed By: walterddr
Differential Revision: D23842884
Pulled By: Hangjun
fbshipit-source-id: a39f3810e3abebc6b4c1cb996b06312f6d42ffd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45106
**Summary**
This commit fixes `WithTest.test_with_exceptions`. It's been running
in regular Python this whole time; none of the functions created and
invoked for the test were scripted. Fortunately, the tests still pass
after being fixed.
**Test Plan**
Ran unit tests + continuous integration.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23848206
Pulled By: SplitInfinity
fbshipit-source-id: fd975ee34db9441ef4e4a4abf2fb21298166bbaa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45153
Xcode 9 is being deprecated within the CircleCI infra, so we should get
everything else on a more recent version of Xcode.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D23852774
Pulled By: seemethere
fbshipit-source-id: c02e162f1993d408de439fee21b340e9640e5a24
Summary:
Fixes a subtask of https://github.com/pytorch/pytorch/issues/42969
Tested the following and no warnings were seen:
```
python test/test_type_hints.py
....
----------------------------------------------------------------------
Ran 4 tests in 180.759s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44971
Reviewed By: walterddr
Differential Revision: D23822274
Pulled By: visweshfb
fbshipit-source-id: e3485021e348ee0a8508a9d128f04bad721795ef
Summary:
Previously, `prim::EnumValue` was serialized to `ops.prim.EnumValue`, which doesn't have the right implementation to refine the return type. This diff correctly serializes it to `enum.value`, thus fixing the issue.
Fixes https://github.com/pytorch/pytorch/issues/44892
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44891
Reviewed By: malfet
Differential Revision: D23818962
Pulled By: gmagogsfm
fbshipit-source-id: 6edfdf9c4b932176b08abc69284a916cab10081b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43680
As discussed [here](https://github.com/pytorch/pytorch/issues/43342),
adding in a Python-only implementation of the triplet-margin loss that takes a
custom distance function. Still discussing whether this is necessary to add to
PyTorch Core.
Test Plan:
python test/run_tests.py
Imported from OSS
Reviewed By: albanD
Differential Revision: D23363898
fbshipit-source-id: 1cafc05abecdbe7812b41deaa1e50ea11239d0cb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39111
In our present alias analysis, we consider any Value that enters another container as entering the heap, and thus aliasing all other heap values of the same type. There are a number of advantages to this approach:
- it is not too hard to maintain the aliasDb implementation
- it is much easier from an op schema perspective - there are many composite list ops registered internally and externally that would be tricky to register and get right if we did something more complicated
- it limits the size of the AliasDb, because a container of size 10 only contains a single memory dag element instead of 10 elements.
The downside is that we are unable to handle the simple and extremely common case of a list of tensors being used in an ATen op.
In an example like:
```
def foo(input):
x = torch.tensor([1, 2, 3, 4])
y = [x, x]
input.add_(1)
return torch.cat(y)
```
we will consider x to be written to. Any write to any wildcard element (an element that enters a tuple, an element that is taken from a list) will mark x as written to. This can be limiting for our ability to create a functional subset and fuse graphs - as a result, 4 of the TorchVision classification models could not be functionalized.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23828003
Pulled By: eellison
fbshipit-source-id: 9109fcb6f2ca20ca897cae71683530285da9d537
Summary:
Change from self to self.__class__() in _DecoratorManager to ensure a new object is created every time a function is called recursively.
Fixes https://github.com/pytorch/pytorch/issues/44531
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44633
Reviewed By: agolynski
Differential Revision: D23783601
Pulled By: albanD
fbshipit-source-id: a818664dee7bdb061a40ede27ef99e9546fc80bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44639
As title; this will unblock migration of several modules that need learning rate functionality.
Test Plan:
```
buck test //dper3/dper3/modules/low_level_modules/tests:learning_rate_test
```
Reviewed By: yf225
Differential Revision: D23681733
fbshipit-source-id: 1d98cb35bf6a4ff0718c9cb6abf22401980b523c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955
resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0`
This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide the autograd behavior (JAX vs TF) and add gradcheck.
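A quick illustration of the behaviour described above:
```python
import torch

z = torch.tensor([3 + 4j, 0j, -2 + 0j])
print(torch.sgn(z))
# tensor([ 0.6000+0.8000j,  0.0000+0.0000j, -1.0000+0.0000j])
```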
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460526
Pulled By: anjali411
fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
Summary:
We currently fetch an allreduced tensor from Python in C++ and store the resulting tensor in a struct's parameter. This PR removes the extra tensor parameter from the function and fetches it from a single place.
Fixes https://github.com/pytorch/pytorch/issues/43960
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44914
Reviewed By: rohan-varma
Differential Revision: D23798888
Pulled By: bugra
fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45010
The motivation of this change is to differentiate "backend specific" ops
and "generic" ops.
"backend specific" ops are those invoking backend specific kernels thus
only able to run on certain backends, e.g.: CPU, CUDA.
"generic" ops are those not *directly* invoking backend specific kernels.
They are usually calling other "backend specific" ops to get things
done. Thus, they are also referred to as "composite" ops, or "math" ops
(because they are usually pure C++ code constructed from math formula).
The other way to see the difference is that: we have to implement new
kernels for the "backend specific" ops if we want to run these ops on a
new backend. In contrast, "generic"/"composite" ops can run on the new
backend if we've added support for all the "backend specific" ops to
which they delegate their work.
Historically we didn't make a deliberate effort to always populate
the supported backends in the "dispatch" section for all the "backend specific"
ops in native_functions.yaml. So now there are many ops which don't have
a "dispatch" section but are actually "backend specific" ops. The majority
of them call "DispatchStub" kernels, which usually only support
CPU/CUDA (via TensorIterator) or QuantizedCPU/CUDA.
The ultimate goal is to be able to differentiate these two types of ops
by looking at the "dispatch" section in native_functions.yaml.
This PR leveraged the analysis script on #44963 to populate missing
dispatch keys for a set of "backend specific" ops. As the initial step,
we only deal with the simplest case:
* These ops don't already have dispatch section in native_functions.yaml;
* These ops call one or more DispatchStub (thus "backend specific");
* These ops don't call any other aten ops - except for some common
ones almost every op calls via framework, e.g. calling aten::eq via
Dispatcher::checkSchemaCompatibility. Calling other nontrivial aten
ops is a sign of being "composite", so we don't want to deal with this
case now;
* These ops don't call Tensor::is_quantized() / Tensor::is_sparse() / etc.
Some ops call these Tensor::is_XXX() methods to dispatch to quantized /
sparse kernels internally. We don't deal with this case now.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23803951
Pulled By: ljk53
fbshipit-source-id: aaced7c34427d1ede72380af4513508df366ea16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44936
Need to provide the max sequence size and max element size instead of the
total.
Added a check that onnxifi was successful.
Test Plan: sls tests
Reviewed By: yinghai
Differential Revision: D23779437
fbshipit-source-id: 5048d6536ca00f0a3b0b057c4e2cf6584b1329d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45088
Fixes #45082
Found a few problems while working on #44983
1. We deliberately swallow RPC timeouts during shutdown, as we haven't
found a good way to handle those. When we convert `_wait_all_workers`
into `_all_gather`, the same logic was inherited. However, as
`_all_gather` meant to be used in more general scenarios, we should
no longer keep silent about errors. This commit let the error throw
in `_all_gather` and also let `shutdown()` to catch them and log.
2. After fixing (1), I found that `UnpickledPythonCall` needs to
acquire GIL on destruction, and this can lead to deadlock when used
in conjunction with `ProcessGroup`, because the `ProcessGroup` ctor is a
synchronization point which holds the GIL. In `init_rpc`, followers
(`rank != 0`) can exit before the leader (`rank == 0`). If the two
happen together, we could get: a) on a follower, it exits `init_rpc`
after running `_broadcast_to_followers` and before the reaching dtor
of `UnpickledPythonCall`. Then it runs the ctor of `ProcessGroup`,
which holds the GIL and waits for the leader to join. However, the
leader is waiting for the response from `_broadcast_to_followers`,
which is blocked by the dtor of `UnpickledPythonCall`. And hence
the deadlock. This commit drops the GIL in `ProcessGroup` ctor.
3. After fixing (2), I found that `TensorPipe` backend
nondeterministically fails with `test_local_shutdown`, due to a
similar reason as (2), but this time it is that `shutdown()` on a
follower runs before the leader finishes `init_rpc`. This commit
adds a join for `TensorPipe` backend `init_rpc` after `_all_gather`.
The 3rd one should be able to solve the 2nd one as well. But since
I didn't see a reason to hold GIL during `ProcessGroup` ctor, I
made that change too.
Test Plan: Imported from OSS
Reviewed By: pritamdamania87
Differential Revision: D23825592
Pulled By: mrshenli
fbshipit-source-id: 94920f2ad357746a6b8e4ffaa380dd56a7310976
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44912
This adds the vec256 tests to the Linux CI system.
The whole test takes 50 to 70 seconds to run.
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D23772923
Pulled By: glaringlee
fbshipit-source-id: ef929b53f3ea7894abcd9510a8e0389979cab4a2
Summary:
int8_t is not vectorized in vec256_int.h. This PR adds vectorization for
int8_t. As pointed out in https://github.com/pytorch/pytorch/issues/43033, this is an important type for vectorization because
a lot of images are loaded in this data type.
Related issue: https://github.com/pytorch/pytorch/issues/43033
Benchmark (Debian Buster, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Turbo off, Release build):
```python
import timeit
dtype = 'torch.int8'
for op in ('+', '-'):
for n, t in [(10_000, 200000),
(100_000, 20000)]:
print(f'a {op} b, numel() == {n} for {t} times, dtype={dtype}')
print(timeit.timeit(f'c = a {op} b', setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})', number=t))
```
Results:
Before:
```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
1.2223373489978258
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6108450189931318
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
1.256775538000511
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6101213909860235
```
After:
```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5713336059998255
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.39169703199877404
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5838428330025636
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.37486923701362684
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44759
Reviewed By: malfet
Differential Revision: D23786383
Pulled By: glaringlee
fbshipit-source-id: 67f5bcd344c0b5014bacbc876143231fca156713
Summary:
This would force jit.script to raise an error if someone tries to mutate a tuple:
```
Tuple[int, int] does not support subscripted assignment:
File "/home/nshulga/test/tupleassignment.py", line 9
torch.jit.script
def foo(x: Tuple[int, int]) -> int:
x[-1] = x[0] + 1
~~~~~ <--- HERE
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44929
Reviewed By: suo
Differential Revision: D23777668
Pulled By: malfet
fbshipit-source-id: 8efaa4167354ffb4930ccb3e702736a3209151b6
Summary:
This PR was originally authored by slayton58. I took his implementation and added some tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44986
Reviewed By: mruberry
Differential Revision: D23806039
Pulled By: ngimel
fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43622
- Moves the model loading part of `torch.hub.load()` into a new `torch.hub.load_local()` function that takes in a path to a local directory that contains a `hubconf.py` instead of a repo name.
- Refactors `torch.hub.load()` so that it now calls `torch.hub.load_local()` after downloading and extracting the repo.
- Updates `torch.hub` docs to include the new function + minor fixes.
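A hedged sketch of the new entry point (the directory path and entrypoint name below are placeholders; the signature follows the summary above and may differ in detail):
```python
import torch

# Load a model from a locally cloned / already extracted hub repo that
# contains a hubconf.py, skipping the download-and-extract step.
model = torch.hub.load_local('/path/to/local/vision_repo', 'resnet18', pretrained=True)

# torch.hub.load() still works as before; it now downloads the repo and
# then delegates to load_local() internally.
```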
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44204
Reviewed By: malfet
Differential Revision: D23817429
Pulled By: ailzhang
fbshipit-source-id: 788fd83c87a94f487b558715b2809d346ead02b2
Summary:
This PR introduces a (Const)StridedRandomAccessor, a [random access iterator](https://en.cppreference.com/w/cpp/named_req/RandomAccessIterator) over a strided array, and a CompositeRandomAccessor, a random access iterator over two random access iterators.
The main motivation is to be able to use a handful of operations from STL and thrust in numerous dim-apply types of algorithms and eliminate unnecessary buffer allocations. Plus more advanced algorithms are going to be available with C++17.
Porting `sort` provides a hands-on example of how these iterators could be used.
Fixes [https://github.com/pytorch/pytorch/issues/24770](https://github.com/pytorch/pytorch/issues/24770).
Some benchmarks:
```python
from IPython import get_ipython
torch.manual_seed(13)
ipython = get_ipython()
sizes = [
[10000, 10000],
[1000, 1000, 100]
]
for size in sizes:
t = torch.randn(*size)
dims = len(size)
print(f"Tensor of size {size}")
for dim in range(dims):
print(f"sort for dim={dim}")
print("float:")
ipython.magic("timeit t.sort(dim)")
print()
```
#### Master
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.7 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.27 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tensor of size [1000, 1000, 100]
sort for dim=0
float:
7.21 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.1 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.58 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
#### This PR
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.5 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.16 s ± 28.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tensor of size [1000, 1000, 100]
sort for dim=0
float:
5.94 s ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
5.1 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.43 s ± 8.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
As you can see, the legacy sorting routine is actually quite efficient. The performance gain is likely due to the improved reduction with TensorIterator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39744
Reviewed By: malfet
Differential Revision: D23796486
Pulled By: glaringlee
fbshipit-source-id: 7bddad10dfbc0a0e5cad7ced155d6c7964e8702c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45017
this is the default indexing folder for clangd 11.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23817619
Pulled By: suo
fbshipit-source-id: 6a60136e591b2fec3d432ac5343cb76ac0934502
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45018
Now that https://github.com/pytorch/pytorch/pull/44795 has landed, we
can convert the bulk of our cpp tests to use gtest APIs. Eventually
we'll want to get rid of our weird harness for cpp tests entirely in
favor of using regular gtest everywhere. This PR demonstrates some of
the benefits of this approach:
1. You don't need to register your test twice (once to define it, once
in tests.h).
2. Consequently, it's easier to have many individual test cases.
Failures can be reported independently (rather than having huge
functions to test entire modules.
3. Some nicer testing APIs, notably test fixtures.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23802297
Pulled By: suo
fbshipit-source-id: 774255da7716294ac573747dcd5e106e5fe3ac8f
Summary:
Includes commits to fix the Windows CI failure of the "enable distributed training on Windows" PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45025
Reviewed By: beauby
Differential Revision: D23807995
Pulled By: mrshenli
fbshipit-source-id: a2f4c1684927ca66d7d3e9920ecb588fb4386f7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45014
Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/219
Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/212
+ Introduce buffer.h defining the buffer struct(s). The `CpuBuffer`
struct is always defined, while the `CudaBuffer` struct is defined
only when `TENSORPIPE_SUPPORTS_CUDA` is true.
+ Update all channels to take a `CpuBuffer` or `CudaBuffer` for
`send`/`recv` rather than a raw pointer and a length.
+ Make the base `Channel`/`Context` classes templated on `TBuffer`,
effectively creating two channel hierarchies (one for CPU channels,
one for CUDA channels).
+ Update the Pipe and the generic channel tests to use the new API. So
far, generic channel tests are CPU only, and tests for the CUDA IPC
channel are (temporarily) disabled. A subsequent PR will take care of
refactoring tests so that generic tests work for CUDA channels. Another
PR will add support for CUDA tensors in the Pipe.
Differential Revision: D23598033
Test Plan: Imported from OSS
Reviewed By: lw
Pulled By: beauby
fbshipit-source-id: 1d6c3f91e288420858835cd5e7962e8da051b44b
Summary:
A previous fix for masking Cuda dimensions (https://github.com/pytorch/pytorch/issues/44733) changed how thread synchronization barriers are inserted in the Cuda CodeGen, causing the CudaSharedMemReduce_1 test to be flaky and ultimately disabled.
The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area where we could improve performance. To address this somewhat I've added a simplifier pass that removes obviously unnecessary syncThreads.
To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.
Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44909
Reviewed By: agolynski
Differential Revision: D23800565
Pulled By: nickgg
fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44894
Looks like we added double backwards support but only turned on the ModuleTests.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23762544
Pulled By: gchanan
fbshipit-source-id: b5cef579608dd71f3de245c4ba92e49216ce8a5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43208
This PR adds gradcheck for complex. The logic used for complex gradcheck is described in Section 3.5.3 here: https://arxiv.org/pdf/1701.00392.pdf
More concretely, this PR introduces the following changes:
1. Updates get_numerical_jacobian to take as input a scalar value for vector (v). Adds gradcheck logic for C -> C, C-> R, R -> C. For R -> C functions, only the real value of gradient is propagated.
2. Adds backward definition for `torch.complex` and also adds a test to verify the definition added.
3. Updates backward for `mul`, `sin`, `cos`, `sinh`, `cosh`.
4. Adds tests for all `torch.real`, `torch.imag`, `torch.view_as_real`, `torch.view_as_complex`, `torch.conj`.
Follow up tasks:
1. Add more thorough tests for R -> C cases. Specifically, add R -> C test variants for functions, e.g. `torch.mul(complex_tensor, real_tensor)`.
2. Add back commented test in `common_methods_invocation.py`.
3. Add more special case checking for complex gradcheck to make debugging easier.
4. Update complex autograd note.
5. Disable complex autograd for operators not tested for complex.
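A minimal sketch of the C -> C and C -> R checks that point 1 of the changes above enables (double-precision complex inputs; illustrative only):
```python
import torch
from torch.autograd import gradcheck

a = torch.randn(3, dtype=torch.complex128, requires_grad=True)
b = torch.randn(3, dtype=torch.complex128, requires_grad=True)

# C -> C: elementwise multiplication of two complex tensors.
print(gradcheck(torch.mul, (a, b)))

# C -> R: taking the real part of a complex tensor.
print(gradcheck(torch.real, (a,)))
```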
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23655088
Pulled By: anjali411
fbshipit-source-id: caa75e09864b5f6ead0f988f6368dce64cf15deb
Summary:
These aliases are consistent with NumPy. Note that C++'s naming would be different (std::multiplies and std::divides), and that PyTorch's existing names (mul and div) are consistent with Python's dunders.
This also improves the instructions for adding an alias to clarify that dispatch keys should be removed when copying native_function.yaml entries to create the alias entries.
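For illustration:
```python
import torch

a = torch.tensor([2.0, 4.0])
b = torch.tensor([1.0, 2.0])

# The new NumPy-style names dispatch to the same kernels as mul/div.
assert torch.equal(torch.multiply(a, b), torch.mul(a, b))
assert torch.equal(torch.divide(a, b), torch.div(a, b))
```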
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44463
Reviewed By: ngimel
Differential Revision: D23670782
Pulled By: mruberry
fbshipit-source-id: 9f1bdf8ff447abc624ff9e9be7ac600f98340ac4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44956
Makes the buffers for HistogramObserver have the
same shapes in the uninitialized and initialized states.
This is useful because the detectron2 checkpointer assumes
that these states will stay the same, so it removes the
need for manual hacks around the shapes changing.
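For illustration, a small hedged sketch of the consistency this provides (assuming the observer's buffers are exposed through `state_dict()`):
```python
import torch
from torch.quantization import HistogramObserver

obs = HistogramObserver()
shapes_before = {k: tuple(v.shape) for k, v in obs.state_dict().items()}

obs(torch.randn(16))  # observe some data
shapes_after = {k: tuple(v.shape) for k, v in obs.state_dict().items()}

# With this change the buffer shapes match between the two states, so
# checkpoint loaders that assume fixed shapes keep working.
print(shapes_before == shapes_after)
```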
Test Plan:
```
python test/test_quantization.py TestObserver.test_histogram_observer_consistent_buffer_shape
```
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23785382
fbshipit-source-id: 1a83fd4f39b244b00747c368d5d305a07d877c92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44861
We were redefining things like ASSERT_EQ to take a `__VA_ARGS__` parameter, so compiling these files with gtest (instead of pytorch's custom python-based cpp test infra) fails.
Test Plan: buck build //caffe2/test/cpp/tensorexpr
Reviewed By: asuhan
Differential Revision: D23711293
fbshipit-source-id: 8af14fa7c1f1e8169d14bb64515771f7bc3089e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44889
This HACK doesn't seem to be necessary any more - there is no 'real'
type in generated Declarations.yaml file.
Verified by comparing generated code before/after.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23761624
Pulled By: ljk53
fbshipit-source-id: de996f04d77eebea3fb9297dd90a8ebeb07647bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44934
Fix build errors when using clang to build cuda sources:
```
In file included from aten/src/ATen/native/cuda/DistributionBernoulli.cu:4:
In file included from aten/src/ATen/cuda/CUDAApplyUtils.cuh:5:
caffe2/aten/src/THC/THCAtomics.cuh:321:1: error: control reaches end of non-void function [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for sm_70.
In file included from aten/src/ATen/native/cuda/DistributionBernoulli.cu:4:
In file included from aten/src/ATen/cuda/CUDAApplyUtils.cuh:5:
caffe2/aten/src/THC/THCAtomics.cuh:321:1: error: control reaches end of non-void function [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for sm_60.
In file included from aten/src/ATen/native/cuda/DistributionBernoulli.cu:4:
In file included from aten/src/ATen/cuda/CUDAApplyUtils.cuh:5:
caffe2/aten/src/THC/THCAtomics.cuh:321:1: error: control reaches end of non-void function [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for sm_52.
```
Test Plan: CI
Reviewed By: ngimel
Differential Revision: D23775266
fbshipit-source-id: 141e6624e2da870a8c50ff9f71fcf0717222fb17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44795
Today, we build our cpp tests twice, once as a standalone gtest binary,
and once linked in `libtorch_python` so we can call them from
`test_jit.py`.
This is convenient (it means that `test_jit.py` is a single entry point
for all our tests), but has a few drawbacks:
1. We can't actually use the gtest APIs, since we don't link gtest into
`libtorch_python`. We're stuck with the subset that we want to write
polyfills for, and an awkward registration scheme where you have to
write a test then include it in `tests.h`).
2. More seriously, we register custom operators and classes in these
tests. In a world where we may be linking many `libtorch_python`s, this
has a tendency to cause errors with `libtorch`.
So now, only tests that explicitly require cooperation with Python are
built into `libtorch_python`. The rest are built into
`build/bin/test_jit`.
There are tests which require that we define custom classes and
operators. In these cases, I've built them into separate `.so`s that we
call `torch.ops.load_library()` on.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity, ZolotukhinM
Differential Revision: D23735520
Pulled By: suo
fbshipit-source-id: d146bf4e7eb908afa6f96b394e4d395d63ad72ff
Summary:
Adds a pass to the IR Simplifier which fuses together the bodies of Cond statements which have identical conditions. e.g.
```
if (i < 10) {
do_thing_1;
} else {
do_thing_2;
}
if (i < 10) {
do_thing_3;
}
```
is transformed into:
```
if (i < 10) {
do_thing_1;
do_thing_3;
} else {
do_thing_2;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44886
Reviewed By: glaringlee
Differential Revision: D23768565
Pulled By: nickgg
fbshipit-source-id: 3fe40d91e82bdfff8dcb8c56a02a4fd579c070df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44699
Original commit changeset: 3b1ec928e3db
Previous revert (D23698861) was on the wrong diff stack. Backing out the revert.
Test Plan: Passed unit tests and previously landed.
Reviewed By: mruberry
Differential Revision: D23702258
fbshipit-source-id: 5c3e197bca412f454db5a7e86251ec85faf621c1
Summary:
Moved description of tool and changes in function name
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44124
Reviewed By: albanD
Differential Revision: D23674618
Pulled By: bzinodev
fbshipit-source-id: 5db0bb14fc106fc96358b1e0590f08e975388c6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44254
Add a device parameter to RemoteModule, so it can be placed on any device
and not just CPU.
Original PR issue: RemoteModule enhancements #40550
Test Plan: buck test test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: pritamdamania87
Differential Revision: D23483803
fbshipit-source-id: 4918583c15c6a38a255ccbf12c9168660ab7f6db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44786
This predates gradcheck and gradcheck does the same and more.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23731902
Pulled By: gchanan
fbshipit-source-id: 425fd30e943194f63a663708bada8960265b8f05
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175, fixes https://github.com/pytorch/pytorch/issues/34797
This adds complex support to `torch.stft` and `torch.istft`. Note that there are really two issues with complex here: complex signals, and returning complex tensors.
## Complex signals and windows
`stft` currently assumes all signals are real and uses `rfft` with `onesided=True` by default. Similarly, `istft` always takes a complex fourier series and uses `irfft` to return real signals.
For `stft`, I now allow complex inputs and windows by calling the full `fft` if either are complex. If the user gives `onesided=True` and the signal is complex, then this doesn't work and raises an error instead. For `istft`, there's no way to automatically know what to do when `onesided=False` because that could either be a redundant representation of a real signal or a complex signal. So there, the user needs to pass the argument `return_complex=True` in order to use `ifft` and get a complex result back.
## stft returning complex tensors
The other issue is that `stft` returns a complex result, represented as a `(... X 2)` real tensor. I think ideally we want this to return proper complex tensors but to preserver BC I've had to add a `return_complex` argument to manage this transition. `return_complex` defaults to false for real inputs to preserve BC but defaults to True for complex inputs where there is no BC to consider.
In order to `return_complex` by default everywhere without a sudden BC-breaking change, a simple transition plan could be:
1. introduce `return_complex`, defaulted to false when BC is an issue but giving a warning. (this PR)
2. raise an error in cases where `return_complex` defaults to false, making it a required argument.
3. change `return_complex` default to true in all cases.
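A usage sketch of the transition (illustrative only):
```python
import torch

x = torch.randn(1024)

# New behaviour: request a proper complex tensor.
spec = torch.stft(x, n_fft=64, return_complex=True)
print(spec.dtype)   # complex dtype
print(spec.shape)   # (n_fft // 2 + 1, n_frames)

# The legacy (..., 2) real layout is recoverable via view_as_real.
spec_real_view = torch.view_as_real(spec)
print(spec_real_view.shape)

# istft accepts the complex spectrogram and returns the real signal.
x_back = torch.istft(spec, n_fft=64, length=x.numel())
```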
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43886
Reviewed By: glaringlee
Differential Revision: D23760174
Pulled By: mruberry
fbshipit-source-id: 2fec4404f5d980ddd6bdd941a63852a555eb9147
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800
`dX` is a Tensor, comparing `dX` with `nullptr` was wrong.
cc BIT-silence who wrote the kernel.
The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I have to update that to `1e-5`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44863
Reviewed By: mruberry
Differential Revision: D23754101
Pulled By: BIT-silence
fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44345
As part of enhancing profiler support for RPC, when executing TorchScript functions over RPC, we would like to be able to support user-defined profiling scopes created by `with record_function(...)`.
Since after https://github.com/pytorch/pytorch/pull/34705, we support `with` statements in TorchScript, this PR adds support for `with torch.autograd.profiler.record_function` to be used within TorchScript.
This can be accomplished via the following without this PR:
```
torch.ops.profiler._record_function_enter(...)
# Script code, such as forward pass
torch.ops.profiler._record_function_exit(....)
```
This is a bit hacky and it would be much cleaner to use the context manager now that we support `with` statements. Also, `_record_function_` type operators are internal operators that are subject to change, this change will help avoid BC issues in the future.
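A minimal sketch of what this enables, assuming the scripted with-statement support described above (the scope name is arbitrary):
```python
import torch

@torch.jit.script
def scripted_forward(x: torch.Tensor) -> torch.Tensor:
    y = x
    # User-defined profiling scope inside TorchScript, enabled by this change.
    with torch.autograd.profiler.record_function("my_forward_scope"):
        y = x * 2
    return y

with torch.autograd.profiler.profile() as prof:
    scripted_forward(torch.randn(4))
print(prof.key_averages().table())
```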
Tested with `python test/test_jit.py TestWith.test_with_record_function -v`
ghstack-source-id: 112320645
Test Plan:
Repro instructions:
1) Change `def script_add_ones_return_any(x) -> Any` to `def script_add_ones_return_any(x) -> Tensor` in `jit/rpc_test.py`
2) `buck test mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_record_function_on_caller_rpc_async --print-passing-details`
3) The function which ideally should accept `Future[Any]` is `def _call_end_callbacks_on_future` in `autograd/profiler.py`.
python test/test_jit.py TestWith.test_with_foo -v
Reviewed By: pritamdamania87
Differential Revision: D23332074
fbshipit-source-id: 61b0078578e8b23bfad5eeec3b0b146b6b35a870
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44798
[test all]
Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well.
Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112279261
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D23735185
fbshipit-source-id: c26e0efeecb3511640120faa1122a2c856cd694e
Summary:
Fixes the `true_divide` symbolic to cast tensors correctly.
The logic depends on knowing input types at export time, which is a known gap for exporting scripted modules. On that end we are improving exporter by enabling ONNX shape inference https://github.com/pytorch/pytorch/issues/40628, and starting to increase coverage for scripting support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43991
Reviewed By: mruberry
Differential Revision: D23674614
Pulled By: bzinodev
fbshipit-source-id: 1b1b85340eef641f664a14c4888781389c886a8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44840
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44762
Move CostInferenceForFCGradient to fc_inference.cc/h to be used in multiple .cc files.
Test Plan: CI
Reviewed By: qizzzh
Differential Revision: D23714877
fbshipit-source-id: d27f33e270a93b0e053f2af592dc4a24e35526cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44000
This wasn't documented, so add a doc saying all ranks are used when
ranks=None
ghstack-source-id: 111206308
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D23465034
fbshipit-source-id: 4c51f37ffcba3d58ffa5a0adcd5457e0c5676a5d
Summary:
* Implement tuple sort by traversing contained IValue types and generating a lambda function as the comparator for sort.
* Tuples and class objects can now nest arbitrarily within each other and still be sortable.
Fixes https://github.com/pytorch/pytorch/issues/43219
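A small sketch of the kind of program this is meant to enable (assuming a list of tuples sorted with the default lexicographic order):
```python
import torch
from typing import List, Tuple

@torch.jit.script
def sort_pairs(pairs: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    # sort() on a list of tuples now compiles, using the generated comparator.
    pairs.sort()
    return pairs

print(sort_pairs([(3, "c"), (1, "a"), (2, "b")]))  # [(1, "a"), (2, "b"), (3, "c")] expected
```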
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43448
Reviewed By: eellison
Differential Revision: D23352273
Pulled By: gmagogsfm
fbshipit-source-id: b6efa8d00e112178de8256da3deebdba7d06c0e1
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43699
- Changed the order of `TORCH_CHECK` and `if (options.layout() == kSparse && self.is_sparse())`
inside `empty_like` method.
- [x] Added tests
EDIT:
More details on that and why we cannot take the zeros_like approach.
Python code :
```python
res = torch.zeros_like(input_coalesced, memory_format=torch.preserve_format)
```
is routed to
```c++
// TensorFactories.cpp
Tensor zeros_like(
const Tensor& self,
const TensorOptions& options,
c10::optional<c10::MemoryFormat> optional_memory_format) {
if (options.layout() == kSparse && self.is_sparse()) {
auto res = at::empty({0}, options); // to be resized
res.sparse_resize_and_clear_(
self.sizes(), self.sparse_dim(), self.dense_dim());
return res;
}
auto result = at::empty_like(self, options, optional_memory_format);
return result.zero_();
}
```
and passed to `if (options.layout() == kSparse && self.is_sparse())`
When we call in Python
```python
res = torch.empty_like(input_coalesced, memory_format=torch.preserve_format)
```
it is routed to
```c++
Tensor empty_like(
const Tensor& self,
const TensorOptions& options_,
c10::optional<c10::MemoryFormat> optional_memory_format) {
TORCH_CHECK(
!(options_.has_memory_format() && optional_memory_format.has_value()),
"Cannot set memory_format both in TensorOptions and explicit argument; please delete "
"the redundant setter.");
TensorOptions options =
self.options()
.merge_in(options_)
.merge_in(TensorOptions().memory_format(optional_memory_format));
TORCH_CHECK(
!(options.layout() != kStrided &&
optional_memory_format.has_value()),
"memory format option is only supported by strided tensors");
if (options.layout() == kSparse && self.is_sparse()) {
auto result = at::empty({0}, options); // to be resized
result.sparse_resize_and_clear_(
self.sizes(), self.sparse_dim(), self.dense_dim());
return result;
}
```
cc pearu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44058
Reviewed By: albanD
Differential Revision: D23672494
Pulled By: mruberry
fbshipit-source-id: af232274dd2b516dd6e875fc986e3090fa285658
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44773
The model is created and prepared using fx APIs and then scripted for training.
In order to test QAT on the scripted model we need to be able to disable/enable fake_quant
and observer modules on it.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qat_and_script
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23741354
fbshipit-source-id: 3fee7aa9b049d9901313b977710f4dc1c4501532
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44330
Part of relanding PR #41954, this refactor separates initialize_bucket_views and populate_bucket_views_out, as they do different things and are called from different call sites.
ghstack-source-id: 112257271
Test Plan: unit tests
Reviewed By: mrshenli
Differential Revision: D23583347
fbshipit-source-id: a5f2041b2c4f2c2b5faba1af834c7143eaade938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44729
Switches our MAX_JOBS from a hardcoded value to a more dynamic value so
that we can always utilize all of the cores that are available to us
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D23759643
Pulled By: seemethere
fbshipit-source-id: ad26480cb0359c988ae6f994e26a09f601b728e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44745
Much like CriterionTest and NewCriterionTest, these are outdated formulations and we should just use the new one.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23717808
Pulled By: gchanan
fbshipit-source-id: eb91982eef23452456044381334bfc9a5bbd837e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393
torch.quantile now correctly propagates nan, and torch.nanquantile has been implemented, similar to numpy.nanquantile.
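A brief sketch of the behavior described above:
```python
import torch

t = torch.tensor([1., 2., float('nan'), 4.])
print(torch.quantile(t, 0.5))     # tensor(nan)  -- nan now propagates
print(torch.nanquantile(t, 0.5))  # tensor(2.)   -- nan values are ignored
```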
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23649613
Pulled By: heitorschueroff
fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33394 .
This PR does two things:
1. Implement CUDA scatter reductions with revamped GPU atomic operations.
2. Remove support for divide and subtract for CPU reduction as was discussed with ngimel .
I've also updated the docs to reflect the existence of only multiply and add.
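A short sketch of the remaining reduction modes, assuming the `reduce=` keyword on `Tensor.scatter_` that the updated docs describe:
```python
import torch

index = torch.tensor([0, 1, 1, 3])
src = torch.tensor([2., 3., 4., 5.])

added = torch.ones(5).scatter_(0, index, src, reduce='add')       # accumulate with +
scaled = torch.ones(5).scatter_(0, index, src, reduce='multiply') # accumulate with *
# 'subtract' and 'divide' are no longer accepted on CPU after this change.
```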
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977
Reviewed By: mruberry
Differential Revision: D23748888
Pulled By: ngimel
fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c
Summary:
buck build has -Wall for downcasts - need to add safe_downcast<int32_t> everywhere
BUCK build changes for aten_vulkan to include vulkan_wrapper lib
Test Plan: The next diff with segmentation demo works fine
Reviewed By: dreiss
Differential Revision: D23739445
fbshipit-source-id: b22a30e1493c4174c35075a68586defb0fccd2af
Summary:
Enabled type checking in common_distributed by using tensors of ints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44821
Test Plan: Run python test/test_type_hints.py, errors are no longer ignored by mypy.ini
Reviewed By: walterddr
Differential Revision: D23747466
Pulled By: alanadakotashine
fbshipit-source-id: 820fd502d7ff715728470fbef0be90ae7f128dd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44570
**Summary**
This commit improves subtype checking for futures so that
`Future[T]` is considered to be a subtype of `Future[U]` if `T` is a
subtype of `U`.
**Test Plan**
This commit adds a test case to `test_async.py` that tests this.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23660588
Pulled By: SplitInfinity
fbshipit-source-id: b606137c91379debab91b9f41057f7b1605757c5
Summary:
Adds a new optimization to the IRSimplifier which changes this pattern:
```
for ...
if ...
do thing;
```
into:
```
if ...
for ...
do thing;
```
Which should be almost strictly better.
There are many cases where this isn't safe to do, hence tests. Most obviously when the condition depends on something modified within the loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44764
Reviewed By: mruberry
Differential Revision: D23734463
Pulled By: nickgg
fbshipit-source-id: 51617e837de96b354fb702d0090ac65ddc523d36
Summary:
### Java, CPP
Introducing an additional parameter `device` to LiteModuleLoader to specify the device on which `forward` will run.
On the Java side this is an enum that contains CPU and VULKAN; it is passed as a jint to the JNI side and stored as a member field at the same level as the module.
In pytorch_jni_lite.cpp, all input tensors are converted to Vulkan.
In pytorch_jni_common.cpp (which also goes to OSS), if the result Tensor is not on CPU, cpu() is called (at the moment the only non-CPU backend is Vulkan).
### BUCK
Introducing the `pytorch_jni_lite_with_vulkan` target, which depends on `pytorch_jni_lite` and adds `aten_vulkan`, so that `pytorch_jni_lite_with_vulkan` can be used when Vulkan support is needed.
Test Plan:
After the following diff with aidemo segmentation:
```
buck install -r aidemos-android
```
{F296224521}
Reviewed By: dreiss
Differential Revision: D23198335
fbshipit-source-id: 95328924e398901d76718c4d828f96e112dfa1b0
Summary:
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
occurs we need to be able to safely stop all net execution so we can throw
the exception to the caller.
* When an error occurs in a net or it gets cancelled, running ops will have their
`Cancel` method called.
* This diff adds a `Cancel` method to `SafeEnqueueBlobsOp`
and `SafeDequeueBlobsOp` that calls queue->close() to force all the
blocking ops to return.
* Adds a unit test that verifies the error propagation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44495
Test Plan:
## Unit Test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
```
Reviewed By: dzhulgakov
Differential Revision: D23236088
Pulled By: dahsh
fbshipit-source-id: daa90d9ee32483fb51195e269a52cf5987bb0a5a
Summary:
PyObject_IsSubclass may set the Python live exception bit if the given object is not a class. `IsNamedTuple` is currently using it incorrectly, which may trip all following Python operations in a debug build of Python. A normal release build of Python is not affected because `assert` is a no-op in release builds.
Fixes https://github.com/pytorch/pytorch/issues/43577
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44769
Reviewed By: jamesr66a
Differential Revision: D23725584
Pulled By: gmagogsfm
fbshipit-source-id: 2dabd4f8667a045d5bf75813500876c6fd81542b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44586
**Summary**
This commit disallows plain `Optional` type annotations without
any contained types both in type comments and in-line as
Python3-style type annotations.
**Test Plan**
This commit adds a unit test for these two situations.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23721517
Pulled By: SplitInfinity
fbshipit-source-id: ead411e94aa0ccce227af74eb0341e2a5331370a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43796
This diff adds an option for the process group NCCL backend to pick high priority cuda streams.
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D23404286
fbshipit-source-id: b79ae097b7cd945a26e8ba1dd13ad3147ac790eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44577
I would like to move this to cmake so that I can depend on it
happening from other parts of the build.
This PR pulls out the logic for determining the version string and
writing the version file into its own module. `setup.py` still receives
the version string and uses it as before, but now the code for writing
out `torch/version.py` lives in a custom command in torch/CMakeLists.txt
I noticed a small inconsistency in how version info is populated.
`TORCH_BUILD_VERSION` is populated from `setup.py` at configuration
time, while `torch/version.py` is written at build time. So if, e.g. you
configured cmake on a certain git rev, then built it in on another, the
two versions would be inconsistent.
This does not appear to matter, so I opted to preserve the existing
behavior.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23734781
Pulled By: suo
fbshipit-source-id: 4002c9ec8058503dc0550f8eece2256bc98c03a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44585
**Summary**
This commit disallows plain `Tuple` type annotations without any
contained types both in type comments and in-line as Python3-style
type annotations.
**Test Plan**
This commit adds a unit test for these two situations.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23721515
Pulled By: SplitInfinity
fbshipit-source-id: e11c77a4fac0b81cd535c37a31b9f4129c276592
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44584
**Summary**
This commit extends the work done in #38130 and disallows plain
Python3-style `List` type annotations.
**Test Plan**
This commit extends `TestList.test_no_element_type_annotation` to the
Python3-style type annotation.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23721514
Pulled By: SplitInfinity
fbshipit-source-id: 48957868286f44ab6d5bf5e1bf97f0a4ebf955df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44334
**Summary**
This commit detects and prohibits the case in which `typing.Dict` is
used as an annotation without type arguments (i.e. plain `typing.Dict` rather than `typing.Dict[K, V]`).
At present, `typing.Dict` is always assumed to have two arguments, and
when it is used without them, `typing.Dict.__args__` is nonempty and
contains some `typing.TypeVar` instances, which have no JIT type equivalent.
Consequently, trying to convert `typing.Dict` to a JIT type results in
a `c10::DictType` with `nullptr` for its key and value types, which can cause
a segmentation fault.
This is fixed by returning a `DictType` from
`jit.annotations.try_ann_to_type` only if the key and value types are converted
successfully to a JIT type and returning `None` otherwise.
**Test Plan**
This commit adds a unit test to `TestDict` that tests the plain `Dict`
annotations throw an error.
**Fixes**
This commit closes #43530.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23610766
Pulled By: SplitInfinity
fbshipit-source-id: 036b10eff6e3206e0da3131cfb4997d8189c4fec
Summary:
Unifies a number of partial solutions to the thread and block dimension extent masking, including the NoThreadIdxWriter and my last fix https://github.com/pytorch/pytorch/issues/44325. The NoThreadIdxWriter is gone in favour of tracking the current loop extents and masking any statements that have a lower rank than the launch parameters in any Block or Thread dimension, which handles both the "no" and "smaller" axis binding cases.
For example it will transform the following:
```
for i in 0..10 // blockIdx.x
for j in 0..10 // threadIdx.x
do thing(i, j);
for k in 0..5 // threadIdx.x
do other thing(i, k);
```
Into:
```
do thing(blockIdx.x, threadIdx.x);
if (threadIdx.x < 5) {
do other thing(blockIdx.x, threadIdx.x);
}
```
And handle the case where statements are not bound by any axis, eg.
```
do outer thing;
for i in 0..10 // blockIdx.x
for j in 0..10 // threadIdx.x
do thing(i, j);
do other thing(i);
```
will become:
```
if (blockIdx.x < 1) {
if (threadIdx.x < 1) {
do outer thing;
}
}
syncthreads();
do thing(blockIdx.x, threadIdx.x);
syncthreads();
if (threadIdx.x < 1) {
do other thing(blockIdx.x);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44733
Reviewed By: mruberry
Differential Revision: D23736878
Pulled By: nickgg
fbshipit-source-id: 52d08626ae8043d53eb937843466874d479a6768
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44819
Please review the following patch; it should address these issues that our validation team found:
A) test_op_nnpi_fp16: hypothesis to trigger max_example*max_example.
B) batchnorm: batchNorm is derived from a unit test which doesn't have the setting required for hypothesis, hence the default value of 100 gets set.
Test Plan:
buck test //caffe2/caffe2/contrib/fakelowp/test/...
https://our.intern.facebook.com/intern/testinfra/testrun/5910974543950859
Reviewed By: hyuen
Differential Revision: D23740970
fbshipit-source-id: 16fcc49f7bf84a5d7342786f671cd0b4e0fc87d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44703
The description of this public function should be in the header file.
Also fix some typos.
Test Plan: N/A.
Reviewed By: pritamdamania87
Differential Revision: D23703661
fbshipit-source-id: 24ae63de9498e321b31dfb2efadb44183c6370df
Summary:
Make `gcs_cuda_only` and `gcs_gpu_only` return empty device lists if CUDA/GPU (CUDA or ROCm) is not available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44578
Reviewed By: walterddr
Differential Revision: D23664227
Pulled By: malfet
fbshipit-source-id: 176b5d964c0b02b8379777cd9a38698c11818690
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44649
To unblock #43208, which adds "is_complex" checks to backward formulas
that are being tested for batched gradient support with vmap.
Test Plan: - `pytest test/test_vmap.py -v`
Reviewed By: anjali411
Differential Revision: D23685356
Pulled By: zou3519
fbshipit-source-id: 29e41a9296336f6d1008e3040cade4c643bf5ebf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44663
The new API returns the type of the data object referenced by this
`RRef`. On the owner, this is same as `type(rref.local_value())`.
On a user, this will trigger an RPC to fetch the `type` object from
the owner. After this function is run once, the `type` object is
cached by the `RRef`, and subsequent invocations no longer trigger
RPC.
closes #33210
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D23691990
Pulled By: mrshenli
fbshipit-source-id: a2d87cd601a691dd75164b6bcd7315245e9cf6bd
Summary:
[Tests for Vec256 classes https://github.com/pytorch/pytorch/issues/15676](https://github.com/pytorch/pytorch/issues/15676)
Testing
Current list:
- [x] Blends
- [x] Memory: UnAlignedLoadStore
- [x] Arithmetics: Plus, Minus, Multiplication, Division
- [x] Bitwise: BitAnd, BitOr, BitXor
- [x] Comparison: Equal, NotEqual, Greater, Less, GreaterEqual, LessEqual
- [x] MinMax: Minimum, Maximum, ClampMin, ClampMax, Clamp
- [x] SignManipulation: Absolute, Negate
- [x] Interleave: Interleave, DeInterleave
- [x] Rounding: Round, Ceil, Floor, Trunc
- [x] Mask: ZeroMask
- [x] SqrtAndReciprocal: Sqrt, RSqrt, Reciprocal
- [x] Trigonometric: Sin, Cos, Tan
- [x] Hyperbolic: Tanh, Sinh, Cosh
- [x] InverseTrigonometric: Asin, ACos, ATan, ATan2
- [x] Logarithm: Log, Log2, Log10, Log1p
- [x] Exponents: Exp, Expm1
- [x] ErrorFunctions: Erf, Erfc, Erfinv
- [x] Pow: Pow
- [x] LGamma: LGamma
- [x] Quantization: quantize, dequantize, requantize_from_int
- [x] Quantization: widening_subtract, relu, relu6
Missing:
- [ ] Constructors, initializations
- [ ] Conversion , Cast
- [ ] Additional: imag, conj, angle (note: imag and conj only checked for float complex)
#### Notes on tests and testing framework
- some math functions are tested within domain range
- mostly testing framework randomly tests against std implementation within the domain or within the implementation domain for some math functions.
- some functions are tested against the local version. ~~For example, std::round and vector version of round differs. so it was tested against the local version~~
- round was tested against pytorch at::native::round_impl. ~~for double type on **Vsx vec_round failed for (even)+0 .5 values**~~ . it was solved by using vec_rint
- ~~**complex types are not tested**~~ **After enabling complex testing due to precision and domain some of the complex functions failed for vsx and x86 avx as well. I will either test it against local implementation or check within the accepted domain**
- ~~quantizations are not tested~~ Added tests for quantizing, dequantize, requantize_from_int, relu, relu6, widening_subtract functions
- the testing framework should be improved further
- ~~For now `-DBUILD_MOBILE_TEST=ON `will be used for Vec256Test too~~
Vec256 Test cases will be built for each CPU_CAPABILITY
Fixes: https://github.com/pytorch/pytorch/issues/15676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42685
Reviewed By: malfet
Differential Revision: D23034406
Pulled By: glaringlee
fbshipit-source-id: d1bf03acdfa271c88744c5d0235eeb8b77288ef8
Summary:
Per the title: if `beta=0` and the slow path was taken, `nan` and `inf` in the result were not masked, as is done in other linear algebra functions. Similarly, since `mv` is implemented as `addmv` with `beta=0`, wrong results were sometimes produced on the `mv` slow path.
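A small sketch of the expected behavior after the fix (any nan in `input` should be ignored when `beta=0`):
```python
import torch

inp = torch.full((3,), float('nan'))
mat = torch.randn(3, 4)
vec = torch.randn(4)

out = torch.addmv(inp, mat, vec, beta=0)  # beta=0: inp is masked, so no nan should leak into out
print(torch.isnan(out).any())             # tensor(False) expected
```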
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44681
Reviewed By: mruberry
Differential Revision: D23708653
Pulled By: ngimel
fbshipit-source-id: e2d5d3e6f69b194eb29b327e1c6f70035f3b231c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44702
Original commit changeset: c6bd6d277aca
The original diff caused the Windows build to fail due to a compiler bug in VS2019 (lambda capture of a constant int value). This backout works around the issue with an explicit capture of the const int value.
Test Plan: Tested and previously landed.
Reviewed By: mruberry
Differential Revision: D23703215
fbshipit-source-id: f9ef23be97540bc9cf78a855295fb8c69f360459
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44763
The file had separate rules for RPC and DDP/c10d, consolidated all of
it together and placed all the distributed rules together.
ghstack-source-id: 112140871
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D23721162
fbshipit-source-id: d41c757eb1615376d442bd6b2802909624bd1d3f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44439
Adds a test to ddp_under_dist_autograd_test to ensure that the uneven
inputs join() API works properly when DDP + RPC is combined. We test that when
running in outside DDP mode (DDP applied to the whole hybrid module) we can
correctly process uneven inputs across different trainers.
ghstack-source-id: 112156980
Test Plan: CI
Reviewed By: albanD
Differential Revision: D23612409
fbshipit-source-id: f1e328c096822042daaba263aa8747a9c7e89de7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44749
Ensure fx module is scriptable after calling prepare_qat on it
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qat_and_script
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23718380
fbshipit-source-id: abf63ffb21e707f7def8f6c88246877f5aded58c
Summary:
The subclass sets "self.last_epoch" when this is set in the parent class's init function. Why would we need to set last_epoch twice? I think calling "super" resets last_epoch anyway, so I am not sure why we would want to include this in the subclass. Am I missing something?
For the record, I am just a Pytorch enthusiast. I hope my question isn't totally silly.
Fixes #{issue number}
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44613
Reviewed By: albanD
Differential Revision: D23691770
Pulled By: mrshenli
fbshipit-source-id: 080d9acda86e1a2bfaafe2c6fcb8fc1544f8cf8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44566
The Delegate objects were confusing. They were supposed to be a way to
configure how tracing works, but in some cases they appeared necessary
for constructing graphs, which was not true. This makes the organization
clearer by removing Delegate and moving its functionality into a Tracer class,
similar to how pickle has a Pickler class.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23683177
Pulled By: zdevito
fbshipit-source-id: 7605a34e65dfac9a487c0bada39a23ca1327ab00
Summary:
Modified files in `benchmarks/tensorexpr` to add support for NVIDIA's Fuser for the jit compiler.
This support has some modifications besides adding an option to support the NVIDIA fuser:
* Adds FP16 Datatype support
* Fixes SOL/Algo calculations to generally use the data type instead of being fixed to 4 bytes
* Adds IR printing and kernel printing knobs
* Adds a knob `input_iter` to create ranges of inputs currently only for reductions
* Adds further reduction support for Inner and Outer dimension reductions that are compatible with the `input_iter` knob.
* Added `simple_element`, `reduce2d_inner`, and `reduce2d_outer` to isolate performance on elementwise and reduction operations in the most minimal fashion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44101
Reviewed By: ngimel
Differential Revision: D23713658
Pulled By: bertmaher
fbshipit-source-id: d6b83cfab559aefe107c23b3c0f2df9923b3adc1
Summary:
There's an annoying O(N^2) in the module export logic that makes saving some models (if they have many classes) take an eternity.
I'm not familiar enough with this code to properly untangle the deps and make it a pure hash lookup, so I just added a side lookup table for raw pointers. It's still quadratic, but it's O(num_classes^2) instead of O(num_classes * num_references), which already gives huge savings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44589
Test Plan:
Tested with one of the offending models - just loading and saving a TorchScript file:
```
Before:
load 1.9239683151245117
save 165.74712467193604
After:
load 1.9409027099609375
save 1.4711427688598633
```
Reviewed By: suo
Differential Revision: D23675278
Pulled By: dzhulgakov
fbshipit-source-id: 8f3fa7730941085ea20d9255b49a149ac1bf64fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44540
Support output type to be fp16 for UniformFill
Reviewed By: jianyuh
Differential Revision: D23558030
fbshipit-source-id: 53a5b2c92cfe78cd11f55e6ee498e1bd682fe4a1
Summary:
This is a reup https://github.com/pytorch/pytorch/issues/43885 with an extra commit which should fix the bugs that caused it to be reverted. Read that for general context.
The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_` which get invalidated by any transform of the IR (rather than just any transform that isn't computeInline). I added a comment about this but didn't actually address our usages of it.
I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231
Reviewed By: albanD
Differential Revision: D23689688
Pulled By: nickgg
fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44654
Previously we weren't creating a fallback graph as intended in specialize autograd zero, so if a Tensor failed one of our undefinedness checks we would run the backward normally without reprofiling & optimizing.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23691764
Pulled By: eellison
fbshipit-source-id: 10c6fa79518c84a6f5ef2bfbd9ea10843af751eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44089
Add support for fp16 as an input type in the SparseLengthSum/Mean caffe2 operator
Reviewed By: xianjiec
Differential Revision: D23436877
fbshipit-source-id: 02fbef2fde17d4b0abea9ca5d17a36aa989f98a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44656
All this time, test_vmap wasn't running in the CI. Fortunately all the
tests pass locally for me. h/t to anjali411 for pointing this out.
Test Plan: - Wait for CI
Reviewed By: anjali411
Differential Revision: D23689355
Pulled By: zou3519
fbshipit-source-id: 543c3e6aed0af77bfd6ea7a7549337f8230e3d32
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44651
Adds pruning for our anaconda channels (pytorch-nightly, pytorch-test)
into our CI pipeline so that it gets run on a more consistent basis.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D23692851
Pulled By: seemethere
fbshipit-source-id: fa69b506b73805bf2ffbde75d221aef1ee3f753e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44326
Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112011490
Test Plan: unit tests
Reviewed By: mrshenli
Differential Revision: D23583017
fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
Summary:
- Bump oneDNN (mkl-dnn) to 1.6 for bug fixes
- Fixes https://github.com/pytorch/pytorch/issues/42446. RuntimeError: label is redefined for convolutions with large filter size on Intel AVX512
- Implemented workaround for internal compiler error when building oneDNN with Microsoft Visual Studio 2019 (https://github.com/pytorch/pytorch/pull/43169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44706
Reviewed By: ngimel
Differential Revision: D23705967
Pulled By: albanD
fbshipit-source-id: 65e8fecc52a76c9f3324403a8b60ffa8a8948bc6
Summary:
For https://github.com/pytorch/pytorch/issues/44206 and https://github.com/pytorch/pytorch/issues/42218, I'd like to update trilinear interpolate backward and grid_sample backward to use `fastAtomicAdd`.
As a prelude, I spotted a UB risk in `fastAtomicAdd`. I think existing code incurs a misaligned `__half2` atomicAdd when `index` is odd and `tensor` is not 32-bit aligned (`index % 2 == 1` and `reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2) == 1`). In this case we think we're `!low_bit` and go down the `!low_bit` code path, but in fact we are `low_bit`. It appears the original [fastAtomicAdd PR](https://github.com/pytorch/pytorch/pull/21879#discussion_r295040377)'s discussion did not consider that case explicitly.
I wanted to push my tentative fix for discussion ASAP. cc jjsjann123 and mkolod as original authors of `fastAtomicAdd`. (I'm also curious why we need to `reinterpret_cast<std::uintptr_t>(tensor...` for the address modding, but that's minor.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44642
Reviewed By: mruberry
Differential Revision: D23699820
Pulled By: ngimel
fbshipit-source-id: 0db57150715ebb45e6a1fb36897e46f00d61defd
Summary:
This PR adds dilation to _ConvTransposeNd._output_padding method and tests using a bunch of different sized inputs.
Fixes https://github.com/pytorch/pytorch/issues/14272
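A hypothetical sketch of the case this enables, where an explicit `output_size` is passed to the forward call and the internal padding computation must account for `dilation` (the exact sizes here are illustrative assumptions):
```python
import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, dilation=2)
x = torch.randn(1, 1, 8, 8)

# Requesting an explicit output size goes through _output_padding, which now
# takes dilation into account when validating and choosing the padding.
y = deconv(x, output_size=(19, 19))
print(y.shape)  # torch.Size([1, 1, 19, 19]) expected
```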
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43793
Reviewed By: zou3519
Differential Revision: D23493313
Pulled By: ezyang
fbshipit-source-id: bca605c428cbf3a97d3d24316d8d7fde4bddb307
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42390
**Summary**
This commit extends support for properties to include
ScriptModules.
**Test Plan**
This commit adds a unit test that has a ScriptModule with
a user-defined property.
`python test/test_jit_py3.py TestScriptPy3.test_module_properties`
Test Plan: Imported from OSS
Reviewed By: eellison, mannatsingh
Differential Revision: D22880298
Pulled By: SplitInfinity
fbshipit-source-id: 74f6cb80f716084339e2151ca25092b6341a1560
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44645
Moved CallbackManager as a nested class of RecordFunction to allow private access to the call handles and context without exposing them publicly. It still hides the singleton instance of the CallbackManager inside record_function.cpp.
Test Plan: Unit tests.
Reviewed By: ilia-cher
Differential Revision: D23494065
fbshipit-source-id: 416d5bf6c9426e112877fbd233a6f4dff7bef455
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44252
Add tracing to DPP client. Because DPP requests are async, we need to be able to start a trace event in one thread and potentially end it in a different thread. RecordFunction and LibgpumonObserver previously assumed each trace event starts and finishes in the same thread, so they used a thread-local context to track enter and exit callbacks. Async events break this assumption. This change attaches the event context to the RecordFunction object so we do not need to use a thread-local context.
Test Plan:
Tested with dpp perf test and able to collect trace.
{F307824044}
Reviewed By: ilia-cher
Differential Revision: D23323486
fbshipit-source-id: 4b6ca6c0e32028fb38a476cd1f44c17a001fc03b
Summary:
We were hitting an assert error when you passed in an empty `List[List[int]]` - this fixes that error by not recursing into 0-element tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44652
Reviewed By: ZolotukhinM
Differential Revision: D23688247
Pulled By: eellison
fbshipit-source-id: d48ea24893044fae96bc39f76c0f1f9726eaf4c7
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 1d710393d5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44647
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D23684528
fbshipit-source-id: 316ff2e448707a6e5a83248c9b22e58118bc8741
Summary:
This PR:
- updates div to perform true division
- makes torch.true_divide an alias of torch.div
This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
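A quick sketch of the updated behavior:
```python
import torch

a = torch.tensor([3, 5])
b = torch.tensor([2, 2])

print(torch.div(a, b))          # tensor([1.5000, 2.5000]) -- true division, even for integer inputs
print(torch.true_divide(a, b))  # same result; true_divide is now an alias of div
```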
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907
Reviewed By: ngimel
Differential Revision: D23622114
Pulled By: mruberry
fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
Summary:
* Support sequence type (de)serialization, enables onnx shape inference on sequence nodes.
* Fix shape inference with block input/output: e.g. Loop and If nodes.
* Fix bugs in symbolic discovered by coverage of onnx shape inference.
* Improve debuggability: added more jit logs. For simplicity, the default log level, when jit log is enabled, will not dump ir graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43929
Reviewed By: albanD
Differential Revision: D23674604
Pulled By: bzinodev
fbshipit-source-id: ab6aacb16d0e3b9a4708845bce27c6d65e567ba7
Summary:
When caller / callee pairs are inserted into the mapping, verify that
the arity of the buffer access is consistent with its declared rank.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44561
Test Plan: CI, test_tensorexpr --gtest_filter=TensorExprTest.DetectInlineRankMismatch
Reviewed By: albanD
Differential Revision: D23684342
Pulled By: asuhan
fbshipit-source-id: dd3a0cdd4c2492853fa68381468e0ec037136cab
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43389.
This PR replaces the old ELU formula from the docs that yields wrong results for negative alphas with the new one that fixes the issue and relies on the cases notation which makes the formula more straightforward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43764
Reviewed By: ailzhang
Differential Revision: D23425532
Pulled By: albanD
fbshipit-source-id: d0931996e5667897d926ba4fc7a8cc66e8a66837
Summary:
Improve simplification of nested Min and Max patterns.
Specifically, handles the following pattern simplications:
* `Max(A, Max(A, Const)) => Max(A, Const)`
* `Max(Min(A, B), Min(A, C)) => Min(A, Max(B, C))`
* `Max(Const, Max(A, OtherConst)) => Max(A, Max(Const, OtherConst))`
- This case can have an arbitrarily long chain of Max ops. For example: `Max(5, Max(x, Max(y, Max(z, 8)))) => Max(Max(Max(x, 8), y), z)`
Similarly, for the case of Min as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44142
Reviewed By: albanD
Differential Revision: D23644486
Pulled By: navahgar
fbshipit-source-id: 42bd241e6c2af820566744c8494e5dee172107f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44562
Add a note that torch.median returns the smaller of the two middle elements for even-sized input and refer user to torch.quantile for the mean of the middle values.
fixes https://github.com/pytorch/pytorch/issues/39520
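A short sketch of the distinction the note draws:
```python
import torch

x = torch.tensor([1., 2., 3., 4.])
print(torch.median(x))         # tensor(2.)     -- smaller of the two middle values
print(torch.quantile(x, 0.5))  # tensor(2.5000) -- mean of the two middle values
```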
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23657208
Pulled By: heitorschueroff
fbshipit-source-id: 2747aa652d1e7f10229d9299b089295aeae092c2
Summary:
We run remove-profile-nodes and specialize-types before batch_mm, so we cannot run peepholes on the type information of tensors, since these properties have not been guarded and are therefore not guaranteed to be correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44565
Reviewed By: albanD
Differential Revision: D23661538
Pulled By: eellison
fbshipit-source-id: 0dd23a65714f047f49b4db4ec582b21870925fe1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44622
Remove an extra empty line in the warning comments.
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D23674070
fbshipit-source-id: 4ee570590c66a72fb808e9ee034fb773b833efcd
Summary:
This adds HIP version info to the `collect_env.py` output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44106
Reviewed By: VitalyFedyunin
Differential Revision: D23652341
Pulled By: zou3519
fbshipit-source-id: a1f5bce8da7ad27a1277a95885934293d0fd43c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44442
I noticed lock contention on startup as lookupByLiteral() was
calling registerPendingOperators() - some calls were holding the
lock for 10+ ms, as operators were being registered.
canonicalSchemaString() was using ostringstream, which isn't typically
particularly fast (partly because of c++ spec locale requirements).
If we replace it with regular c++ string appends, it's somewhat faster
(which isn't hard when comparing with stringstream; albeit a bit
more codegen).
Over the first minute or so, this cuts out 1.4 seconds under the
OperatorRegistry lock (as part of registerPendingOperators) in the
first couple minutes of run time (mostly front-loaded) when running
sync sgd.
As an example, before:
registerPendingOperators 12688 usec for 2449 operators
After:
registerPendingOperators 6853 usec for 2449 operators
ghstack-source-id: 111862971
Test Plan: buck test mode/dev-nosan caffe2/test/cpp/...
Reviewed By: ailzhang
Differential Revision: D23614515
fbshipit-source-id: e712f9dac5bca0b1876e11fb8f0850402f03873a
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 0725301da5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44581
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia, VitalyFedyunin
Differential Revision: D23665173
fbshipit-source-id: 03cee22335eef0517e561827795bbe2036942ea0
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44219
Rebasing https://github.com/pytorch/pytorch/pull/44288 and fixing the git history.
This allows users to benchmark code without having to specify how long to run the benchmark. It runs the benchmark until the variance (IQR / Median) is low enough that we can be confident in the measurement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44607
Test Plan: There are unit tests, and we manually tested using Examples posted in git.
Reviewed By: robieta
Differential Revision: D23671208
Pulled By: bitfort
fbshipit-source-id: d63184290b88b26fb81c2452e1ae701c7d513d12
Summary:
This fixes a `katex` error I was getting trying to build the docs:
```
ParseError: KaTeX parse error: Undefined control sequence: \0 at position 55: …gin{cases}
```
This failure was introduced in https://github.com/pytorch/pytorch/issues/42523.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44481
Reviewed By: colesbury
Differential Revision: D23627700
Pulled By: mruberry
fbshipit-source-id: 9cc09c687a7d9349da79a0ac87d6c962c9cfbe2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44337
Add a new run_method to mobile Module which is variadic (takes any number of arguments) to match full jit.
ghstack-source-id: 111909068
Test Plan: Added new unit test to test_jit test suite
Reviewed By: linbinyu, ann-ss
Differential Revision: D23585763
fbshipit-source-id: 007cf852290f03615b78c35aa6f7a21287ccff9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44588
1) SOURCE_DUMP crashes when invoked on a backward graph since
`prim::GradOf` nodes can't be printed as sources (they don't have
schema).
2) Dumping graph each time we execute an optimized plan produces lots of
output in tests where we run the graph multiple times (e.g.
benchmarks). Outputting that on the least level of verbosity seems
like an overkill.
3) Duplicated log statement is removed.
Differential Revision: D23666812
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: b9a30e34fd39c85f3e13c3f1e3594e157e1c130f
Summary:
**BC-breaking note**
This change is BC-breaking for C++ callers of linspace and logspace if they were providing a steps argument that could not be converted to an optional.
**PR note**
This PR deprecates calling linspace and logspace without setting steps explicitly by:
- updating the documentation to warn that not setting steps is deprecated
- warning (once) when linspace and logspace are called without steps being specified
A test for this behavior is added to test_tensor_creation_ops. The warning only appears once per process, however, so the test would pass even if no warning were thrown. Ideally there would be a mechanism to force all warnings, include those from TORCH_WARN_ONCE, to trigger.
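A brief sketch of the deprecation from the Python side:
```python
import torch

deprecated = torch.linspace(0, 1)         # relies on the default steps; warns once per process
explicit = torch.linspace(0, 1, steps=5)  # preferred: tensor([0.00, 0.25, 0.50, 0.75, 1.00])
```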
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43860
Reviewed By: izdeby
Differential Revision: D23498980
Pulled By: mruberry
fbshipit-source-id: c48d7a58896714d184cb6ff2a48e964243fafc90
Summary: As title; this will unblock migration of several modules that need learning rate functionality.
Test Plan:
```
buck test //dper3/dper3/modules/low_level_modules/tests:learning_rate_test
```
WIP: need to add more learning rate tests for the different policies
Reviewed By: yf225
Differential Revision: D23584071
fbshipit-source-id: f6656531b1caba38c3e3a7d6e16d9591563391e2
Summary:
Adding snpe dependencies to caffe2_benchmark so that this can benchmark SNPE models on portal devices.
Also need to change ndk_libcxx to gnustl till snpe is updated to work with ndk.
Test Plan: Tested on top of the stack.
Reviewed By: linbinyu
Differential Revision: D23569397
fbshipit-source-id: a6281832804ed4fbb5a8406f436caeae1ff4fd2b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44340
Changed the constructor of GradBucket to pass the input by const
reference and hence avoid unnecessary explicit move semantics. Since
the declaration and definition were previously separated, passing the input
tensor vector by value looked quite bizarre.
Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest
Reviewed By: pritamdamania87
Differential Revision: D23569939
fbshipit-source-id: db761d42e76bf938089a0b38e98e76a05bcf4162
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44339
Moved the inline implementations of GradBucket class to the header for
succinctness and readability. This coding style is also consistent with
reducer.h under the same directory.
Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest
Reviewed By: pritamdamania87
Differential Revision: D23569701
fbshipit-source-id: 237d9e2c5f63a6bcac829d0fcb4a5ba3bede75e5
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/36404
Adding prim::device and prim::dtype to the list of skipped peepholes when we run inlining. In the long term, another fix may be to not encode shape / dtype info on the traced graph, because it is not guaranteed to be correct. This is blocked by ONNX currently.
Partial fix for https://github.com/pytorch/pytorch/issues/43134
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43363
Reviewed By: glaringlee
Differential Revision: D23383987
Pulled By: eellison
fbshipit-source-id: 2e9c5160d39d690046bd9904be979d58af8d3a20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44564
Before this change we sometimes inlined autodiff subgraph containing
fusion groups. This happened because we didn't look for 'unsupported'
nodes recursively (maybe we should), but fusion groups were inside
if-nodes.
The problem was detected by bertmaher in 'LearningToPaint' benchmark
investigation where this bug caused us to keep constantly hitting
fallback paths of the graph.
Test Plan: Imported from OSS
Reviewed By: bwasti
Differential Revision: D23657049
Pulled By: ZolotukhinM
fbshipit-source-id: 7c853424f6dce4b5c344d6cd9c467ee04a8f167e
Summary:
Fix an issue where loops of different sizes are bound to the same Cuda dimension / metavar.
Coming soon more info and tests...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44325
Reviewed By: colesbury
Differential Revision: D23628859
Pulled By: nickgg
fbshipit-source-id: 3621850a4cc38a790b62ad168d32e7a0e2462fad
Summary:
CentOS 8 on AArch64 has the vld1_* intrinsics but lacks the vst1q_f32_x2 one.
This patch checks for it and handles it separately from the vld1_* ones.
Fixes https://github.com/pytorch/pytorch/issues/44198
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44199
Reviewed By: seemethere
Differential Revision: D23641273
Pulled By: malfet
fbshipit-source-id: c2053c8e0427705eaeeeb82ec030925bff22623a
Summary:
According to the [documentation](https://github.com/pytorch/pytorch/blob/master/tools/setup_helpers/cmake.py#L265), only options starting with `BUILD_` / `USE_` / `CMAKE_` in `CMakeLists.txt` can be imported from environment variables.
---
This diff was originally intended to enable `c++` source coverage with `CircleCI` and `codecov.io`, but we will finish that in the future; you can find the related information in the diff history. The following was the original procedure:
Based on [this pull request](1bda5e480c), life becomes much easier this time.
1.in `build.sh`
- Enable coverage build option for c++
- `apt-get install lcov`
2.in `test.sh`
- run `lcov`
3.in `pytorch-job-specs.yml`
- copy coverage.info to `test/` folder and upload it to codecov.io
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43999
Test Plan: Test on github
Reviewed By: malfet
Differential Revision: D23464656
Pulled By: scintiller
fbshipit-source-id: b2365691f04681d25ba5c00293fbcafe8e8e0745
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44568
With `lcov`, we can generate beautiful html, which is better than the current file report and line report. Therefore, for oss gcc, remove the `export` code and the `file/line level report` code and only use the html report.
For clang, since such a tool is not available, we will still use the file report and line report generated by ourselves.
Test Plan:
Test in docker ubuntu machine.
## Mesurement
1. After running `atest`, it takes about 15 mins to collect code coverage and generate the report.
```
# gcc code coverage
python oss_coverage.py --run-only=atest
```
## Presentation
**The html result looks like:**
*Top Level:*
{F328330856}
*File Level:*
{F328336709}
Reviewed By: malfet
Differential Revision: D23550784
fbshipit-source-id: 1fff050e7f7d1cc8e86a6a200fd8db04b47f5f3e
Summary: Some tests like `test_dataloader.py` are not able to run under `clang` in oss, because they generate intermediate files that are too large (~40G) to be merged by `llvm`. Skip them when the user doesn't specify the `--run-only` option.
Test Plan: Tested locally. Still, running `clang` coverage in default mode is not recommended, because it takes too much space.
Reviewed By: malfet
Differential Revision: D23549829
fbshipit-source-id: 0737e6e9dcbe3f38de00580ee6007906e743e52f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44066
Add STL Input iterator to DispatchKeySet:
* The iterator is able to iterate from the first non-undefined DispatchKey
to NumDispatchKeys.
* The iterator is invalidated once the underlying DispatchKeySet is invalidated.
Note see http://www.cplusplus.com/reference/iterator/ for comparisons of
different iterators.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23611405
Pulled By: linux-jedi
fbshipit-source-id: 131b287d60226a1d67a6ee0f88571f8c4d29f9c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44494
These tests check (most) operations that are useful for bayesian logistic
regression (BLR) models. Said operators are basically those found in the
log_prob functions of Distributions objects. This PR is not a general,
structured solution for testing batched gradients (see "Alternative
solution" for that), but I wanted to test a small subset of operations
to confirm that the BLR use case works.
There will be follow-up PRs implementing support for some missing
operations for the BLR use case.
Alternative solution
=====================
Ideally, and in the future, I want to autogenerate tests from
common_method_invocations and delete all of the manual tests
introduced by this PR. However, if we were to do this now,
we would need to store the following additional metadata somewhere:
- operator name, supports_batched_grad, allow_vmap_fallback_usage
We could store that metadata as a separate table from
common_method_invocations, or add two columns to
common_method_invocations. Either way that seems like a lot of work and
the situation will get better once vmap supports batched gradients for
all operators (on the fallback path).
I am neutral between performing the alternative approach now v.s. just
manually writing out some tests for these operations, so I picked the
easier approach. Please let me know if you think it would be better to
pursue the alternative approach now.
Test Plan: - `pytest test/test_vmap.py -v -k "BatchedGrad"`
Reviewed By: anjali411
Differential Revision: D23650408
Pulled By: zou3519
fbshipit-source-id: 2f26c7ad4655318a020bdaab5c767cd3956ea5eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43043
This adds support for rpc_sync in TorchScript in a way similar to
rpc_async
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23252039
Pulled By: wanchaol
fbshipit-source-id: 8a05329cb8a24079b2863178b73087d47273914c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44537
Originally, the `min_val`, `max_val`, `min_vals`, `max_vals`
attributes of observers were Tensors but not buffers. They had custom
state_dict save/load code to ensure their state was saved.
At some point, these attributes became buffers, and the custom
save/load code remained. This introduced a subtle bug:
* create model A, move it to a device (cpu/cuda) and save its state_dict
* create model B, load its state dict.
* `min_val|min_vals|max_val|max_vals` would always be loaded to model A's device, even if the rest of model B was on a different device
* the above is inconsistent with how save/load on different devices is expected to work (see https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-across-devices)
In practice, the case people would sometimes hit is:
* model A is on CPU, state dict is saved
* model B is created and moved to GPU, state_dict from model A is loaded
* assertions throw when operations are attempted across different devices
This PR fixes the behavior by removing the custom save/load where
possible and letting the default `nn.Module` save/load code handle
device assignment. We special case `PerChannelMinMaxObserver` and its
children to allow for loading buffers or different size, which is
normal.
There are some followups to also enable this for HistogramObserver
and FakeQuantize, which can be done in separate PRs due to higher
complexity.
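A minimal sketch of the scenario being fixed, using a bare `MinMaxObserver` as a stand-in for a full model (the `Observed` wrapper is a placeholder and a CUDA device is assumed to be available):
```python
import torch
import torch.nn as nn
from torch.quantization import MinMaxObserver

class Observed(nn.Module):
    def __init__(self):
        super().__init__()
        self.obs = MinMaxObserver()

    def forward(self, x):
        return self.obs(x)

model_a = Observed()
model_a(torch.randn(4))                      # populate min_val / max_val on CPU
torch.save(model_a.state_dict(), "a.pt")

model_b = Observed().cuda()
model_b.load_state_dict(torch.load("a.pt", map_location="cuda"))
# With this change, model_b.obs.min_val / max_val follow model_b's device
# instead of silently staying on the CPU.
```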
Test Plan:
```
python test/test_quantization.py TestObserver.test_state_dict_respects_device_affinity
```
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23644493
fbshipit-source-id: 0dbb6aa309ad569a91a663b9ee7e44644080032e
Summary:
Solves `the '-j' option requires a positive integer argument` error on some systems when MAX_JOBS is not defined
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44557
Reviewed By: vkuzo
Differential Revision: D23653511
Pulled By: malfet
fbshipit-source-id: 7d86fb7fb6c946c34afdc81bf2c3168a74d00a1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44145
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
occurs we need to be able to safely stop all net execution so we can throw
the exception to the caller.
## Summary
* Adds `NetBase::Cancel()` to NetBase which iterates over the entire list of
operators and call Cancel.
* Cancel on all ops was added to Net since there's nothing Async-specific about it.
* `AsyncSchedulingNet` calls parent Cancel.
* To preserve backwards compatibility, `AsyncSchedulingNet`'s Cancel still calls
`CancelAndFinishAsyncTasks` .
* Adds `Cancel()` to `OperatorBase`.
Reviewed By: dzhulgakov
Differential Revision: D23279202
fbshipit-source-id: e1bb0ff04a4e1393f935dbcac7c78c0baf728550
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44486
SmoothL1Loss had a completely different (and incorrect, see #43228) path when target.requires_grad was True.
This PR does the following:
1) adds derivative support for target via the normal derivatives.yaml route
2) kill the different (and incorrect) path for when target.requires_grad was True
3) modify the SmoothL1Loss CriterionTests to verify that the target derivative is checked.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23630699
Pulled By: gchanan
fbshipit-source-id: 0f94d1a928002122d6b6875182867618e713a917
Summary:
Add new transforms `sliceHead` and `sliceTail` to `LoopNest`, for example:
Before transformation:
```
for x in 0..10:
A[x] = x*2
```
After `sliceHead(x, 4)`:
```
for x in 0..4:
A[x] = x*2
for x in 4..10:
A[x] = x*2
```
After additionally applying `sliceTail(x, 1)` to the second loop:
```
for x in 0..4:
A[x] = x*2
for x in 4..9:
A[x] = x*2
for x in 9..10:
A[x] = x*2
```
`sliceHead(x, 10)` and `sliceTail(x, 10)` are no-ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43854
Test Plan: Tests are added in `test_loopnest.cpp`, the tests cover the basic transformations, and also tests the combination with other transformations such as `splitWithTail`.
Reviewed By: nickgg
Differential Revision: D23417366
Pulled By: cheng-chang
fbshipit-source-id: 06c6348285f2bafb4be3286d1642bfbe1ea499bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44235
Removes nonvariadic run_method() from mobile Module entirely (to be later replaced by a variadic version). All use cases should have been migrated to use get_method() and Method::operator() in D23436351
ghstack-source-id: 111848220
Test Plan: CI
Reviewed By: iseeyuan
Differential Revision: D23484577
fbshipit-source-id: 602fcde61e13047a34915b509da048b9550103b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44202
In preparation for changing mobile run_method() to be variadic, this diff:
* Implements get_method() for mobile Module, which is similar to find_method but expects the method to exist.
* Replaces calls to the current nonvariadic implementation of run_method() by calling get_method() and then invoking the operator() overload on Method objects.
ghstack-source-id: 111848222
Test Plan: CI, and all the unit tests which currently contain run_method that are being changed.
Reviewed By: iseeyuan
Differential Revision: D23436351
fbshipit-source-id: 4655ed7182d8b6f111645d69798465879b67a577
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43025
- Use new overloads that better reflect the arguments to interpolate.
- More uniform interface for upsample ops allows simplifying the Python code.
- Also reorder overloads in native_functions.yaml to give them priority.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37177
ghstack-source-id: 106938111
Test Plan:
test_nn has pretty good coverage.
Relying on CI for ONNX, etc.
Didn't test FC because this change is *not* forward compatible.
To ensure backwards compatibility, I ran this code before this change
```python
import torch

def test_func(arg):
    interp = torch.nn.functional.interpolate
    with_size = interp(arg, size=(16, 16))
    with_scale = interp(arg, scale_factor=[2.1, 2.2], recompute_scale_factor=False)
    with_compute = interp(arg, scale_factor=[2.1, 2.2])
    return (with_size, with_scale, with_compute)

traced_func = torch.jit.trace(test_func, torch.randn(1, 1, 1, 1))
sample = torch.randn(1, 3, 7, 7)
output = traced_func(sample)
assert not torch.allclose(output[1], output[2])
torch.jit.save(traced_func, "model.pt")
torch.save((sample, output), "data.pt")
```
then this code after this change
```python
import torch

model = torch.jit.load("model.pt")
sample, golden = torch.load("data.pt")
result = model(sample)
for r, g in zip(result, golden):
    assert torch.allclose(r, g)
```
Reviewed By: AshkanAliabadi
Differential Revision: D21209991
fbshipit-source-id: 5b2ebb7c3ed76947361fe532d1dbdd6faa3544c8
Summary: I think these were missed due to a code landing race condition.
Test Plan: Fixes CUDA tests with PR 43025 applied.
Reviewed By: iseeyuan, AshkanAliabadi
Differential Revision: D23639566
fbshipit-source-id: 1322d7708e246b075a66588e7e54f4e12092477f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44471
L1Loss had a completely different (and incorrect, see #43228) path when target.requires_grad was True.
This PR does the following:
1) adds derivative support for target via the normal derivatives.yaml route
2) kills the different (and incorrect) path used when target.requires_grad was True
3) modifies the L1Loss CriterionTests to verify that the target derivative is checked.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23626008
Pulled By: gchanan
fbshipit-source-id: 2828be16b56b8dabe114962223d71b0e9a85f0f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44500
Some user models are using those operators. Unblock them while keeping the ops selective.
Test Plan: CI
Reviewed By: linbinyu
Differential Revision: D23634769
fbshipit-source-id: 55841d1b07136b6a27b6a39342f321638dc508cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44525
Since `TEST_SKIPS` is a global `multiprocessing.Manager` dict, this was causing
issues where one test failure would make the rest of the tests fail during
setup due to networking errors.
See the failed CI job: https://app.circleci.com/pipelines/github/pytorch/pytorch/212491/workflows/0450151d-ca09-4cf6-863d-272de6ed917f/jobs/7389065 for an example, where `test_ddp_backward` failed but then caused the rest of the tests to fail at the line `test_skips.update(TEST_SKIPS)`.
To fix this issue, at the end of every test we revert `TEST_SKIPS` back to a regular dict, and redo the conversion to a `multiprocessing.Manager` dict in the next test, which prevents these errors.
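Roughly, the pattern looks like the sketch below (names simplified and hypothetical; the real code lives in the distributed test harness):
```python
import multiprocessing

TEST_SKIPS = {"no_cuda": 74}  # a plain dict between tests

def setup_shared_skips():
    # Convert to a managed dict so child processes can record skip codes.
    global TEST_SKIPS
    manager = multiprocessing.Manager()
    shared = manager.dict()
    shared.update(TEST_SKIPS)
    TEST_SKIPS = shared

def teardown_shared_skips():
    # Revert to a plain dict so a dead Manager from a failed test
    # cannot break the next test's setup.
    global TEST_SKIPS
    TEST_SKIPS = dict(TEST_SKIPS)
```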
ghstack-source-id: 111844724
Test Plan: CI
Reviewed By: malfet
Differential Revision: D23641618
fbshipit-source-id: 27ce823968ece9804bb4dda898ffac43ef732b89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44437
MSELoss had a completely different (and incorrect, see https://github.com/pytorch/pytorch/issues/43228) path when target.requires_grad was True.
This PR does the following:
1) adds derivative support for target via the normal derivatives.yaml route
2) kills the different (and incorrect) path used when target.requires_grad was True
3) modifies the MSELoss CriterionTests to verify that the target derivative is checked.
TODO:
1) do we still need check_criterion_jacobian when we run grad/gradgrad checks?
2) ensure the Module tests check the case when target.requires_grad is True
3) do we actually test when reduction='none' and reduction='mean'?
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23612166
Pulled By: gchanan
fbshipit-source-id: 4f74d38d8a81063c74e002e07fbb7837b2172a10
Summary:
Fixes a bug in the NNC registerizer for Cuda where it would hoist reads out of a conditional context when trying to cache them. As a quick fix, prevent scalar replacement if a usage is within a condition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44223
Reviewed By: gchanan
Differential Revision: D23551247
Pulled By: nickgg
fbshipit-source-id: 17a7bf2be4c8c3dd8a9ab7997dce9aea200c3685
Summary:
Previously we were not removing profiling nodes in graphs that required grad and contained diff graphs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44420
Reviewed By: bertmaher
Differential Revision: D23607482
Pulled By: eellison
fbshipit-source-id: af095f3ed8bb3c5d09610f38cc7d1481cbbd2613
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44493
This function allows executing a graph exactly as it is, without going
through a graph executor, which would run passes on the graph before
interpreting it. I found this feature extremely helpful when I worked on
a stress-testing script to shake out bugs from the TE fuser: I needed to
run a very specific set of passes on a graph and nothing else, and
then execute exactly that graph.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23632505
Pulled By: ZolotukhinM
fbshipit-source-id: ea81fc838933743e2057312d3156b77284d832ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44411
This basically aborts errored NCCL communicators only if either blocking
wait or async error handling is enabled. Otherwise we may abort NCCL
communicators where neither is enabled, and this may result in subsequent GPU
operations using corrupted data.
ghstack-source-id: 111839264
Test Plan: Successful Flow run: f217591683
Reviewed By: jiayisuse
Differential Revision: D23605382
fbshipit-source-id: 6c16f9626362be3b0ce2feaf0979b2dff97ce61b
Summary:
`stdbuf` affects not only the process it launches, but all of its subprocesses, which has a very negative effect on the IPC communication between nvcc and the C++ preprocessor and results in a 2x slowdown, for example:
```
$ time /usr/local/cuda/bin/nvcc /pytorch/aten/src/THC/generated/THCTensorMathPointwiseByte.cu -c ...
real 0m34.623s
user 0m31.736s
sys 0m2.825s
```
but
```
time stdbuf -i0 -o0 -e0 /usr/local/cuda/bin/nvcc /pytorch/aten/src/THC/generated/THCTensorMathPointwiseByte.cu -c ...
real 1m14.113s
user 0m37.989s
sys 0m36.104s
```
because the OS spends lots of time transferring the preprocessed source back to nvcc byte by byte, as requested via the stdbuf call
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44532
Reviewed By: ngimel
Differential Revision: D23643411
Pulled By: malfet
fbshipit-source-id: 9fdaf8b8a49574e6b281f68a5dd9ba9d33464dff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44410
See #44052 for context. One of the cumprod_backward overloads was unused
so I just deleted it.
Test Plan: - `pytest test/test_autograd.py -v`
Reviewed By: mrshenli
Differential Revision: D23605503
Pulled By: zou3519
fbshipit-source-id: f9c5b595e62d2d6e71f26580ba96df15cc9de4f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44427
Closes https://github.com/pytorch/pytorch/issues/44425
DDP join API currently does not work properly with `model.no_sync()`, see https://github.com/pytorch/pytorch/issues/44425 for details. This PR fixes the problem via the approach mentioned in the issue, namely scheduling an allreduce that tells joined ranks whether to sync in the backwards pass or not. Tests are added for skipping gradient synchronization for various `sync_interval`s.
ghstack-source-id: 111786479
Reviewed By: pritamdamania87
Differential Revision: D23609070
fbshipit-source-id: e8716b7881f8eee95e3e3499283e716bd3d7fe76
Summary:
Noticed this bug in `torch.movedim` (https://github.com/pytorch/pytorch/issues/41480). [`std::unique`](https://en.cppreference.com/w/cpp/algorithm/unique) only guarantees uniqueness for _sorted_ inputs. The current check lets through non-unique values when they aren't adjacent to each other in the list, e.g. `(0, 1, 0)` wouldn't raise an exception and instead the algorithm fails later with an internal assert.
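A minimal sketch of the failure mode this change guards against:
```python
import torch

t = torch.randn(2, 3, 4)
# (0, 1, 0) repeats a source dim non-adjacently, which std::unique alone misses;
# with this fix it raises a regular error instead of tripping an internal assert.
try:
    torch.movedim(t, (0, 1, 0), (0, 1, 2))
except RuntimeError as e:
    print(e)
```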
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44307
Reviewed By: mrshenli
Differential Revision: D23598311
Pulled By: zou3519
fbshipit-source-id: fd6cc43877c42bb243cfa85341c564b6c758a1bf
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:
- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases
The functions moved are:
- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2
In a follow-up PR most or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277
Reviewed By: mrshenli, ngimel
Differential Revision: D23617361
Pulled By: mruberry
fbshipit-source-id: edb292947769967de9383f6a84eb327f027509e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44224
The purpose of this file is to help developers on PT distributed get
up to speed on the code structure and layout for PT Distributed.
ghstack-source-id: 111644842
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D23548377
fbshipit-source-id: 561d5b8e257642de172def8fdcc1311fae20690b
Summary:
To help with further typing, move dynamically added native contributions from `torch.autograd` to `torch._C._autograd`
Fix invalid error handling pattern in
89ac30afb8/torch/csrc/autograd/init.cpp (L13-L15)
`PyImport_ImportModule` already raises a Python exception, so nullptr should be returned to properly propagate it to the Python runtime.
Add all native methods/types in `torch/autograd/__init__.py` after `torch._C._init_autograd()` has been called.
Use f-strings instead of `.format` in test_type_hints.py
Fixes https://github.com/pytorch/pytorch/issues/44450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44451
Reviewed By: ezyang
Differential Revision: D23618261
Pulled By: malfet
fbshipit-source-id: fa5f739d7cff8410641128b55b810318c5f636ae
Summary:
Previously the specialized types were copied over to the fallback function, although the tensors in the fallback type were not of that type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44434
Reviewed By: SplitInfinity
Differential Revision: D23611943
Pulled By: eellison
fbshipit-source-id: 2ea88a97529409f6c5c4c1f59a14b623524933de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43651
This is a forward compatibility follow-up to
https://github.com/pytorch/pytorch/pull/43086/. We switch the
conv serialization to output the v2 format instead of the v1 format.
The plan is to land this 1 - 2 weeks after the base PR.
Test Plan:
```
python test/test_quantization.py TestSerialization.test_conv2d_graph_v2
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph_v2
```
Imported from OSS
Reviewed By: z-a-f
Differential Revision: D23355480
fbshipit-source-id: 4cb04ed8b90a0e3e452297a411d641a15f6e625f
Summary:
cuda builds using clang error out when building caffe2 due to an incorrect std::move
This does not fix all known errors, but it's a step in the right direction.
Differential Revision: D23626667
fbshipit-source-id: 7d9df886129f671ec430a166dd22e4af470afe1e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44347
Cloned from Pull Request resolved: https://github.com/pytorch/pytorch/pull/44097, because the original author Sinan has completed the internship and now is unable to submit this diff.
As johnsonpaul mentioned in D23277575 (7d517cf96f), it looks like all processes were allocating memory on GPU-ID=0.
I was able to reproduce it by running `test_ddp_comm_hook_allreduce_with_then_hook_nccl` unit test of `test_c10d.py` and running `nvidia-smi` while test was running. The issue was reproduced as:
```
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3132563 C python 777MiB |
| 0 3132564 C python 775MiB |
| 4 3132564 C python 473MiB |
+-----------------------------------------------------------------------------+
```
I realized that as we initialize ProcessGroupNCCL both processes were initially allocating memory on GPU 0.
We later also realized that I had forgotten the `isHighPriority` input of `getStreamFromPool`, so `futureNCCLCallbackStreams_.push_back(std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool(device_index)));` was just creating a vector of GPU 0 streams. After I changed `at::cuda::getStreamFromPool(device_index)` to `at::cuda::getStreamFromPool(false, device_index)`, `nvidia-smi` looked like:
```
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 673925 C python 771MiB |
| 0 673926 C python 771MiB |
| 1 673925 C python 771MiB |
| 1 673926 C python 771MiB |
| 2 673925 C python 771MiB |
| 2 673926 C python 771MiB |
| 3 673925 C python 771MiB |
| 3 673926 C python 771MiB |
| 4 673925 C python 771MiB |
| 4 673926 C python 771MiB |
| 5 673925 C python 771MiB |
| 5 673926 C python 771MiB |
| 6 673925 C python 771MiB |
| 6 673926 C python 771MiB |
| 7 673925 C python 707MiB |
| 7 673926 C python 623MiB |
+-----------------------------------------------------------------------------+
```
This confirms that we were just getting GPU 0 streams for the callback. I think this does not explain the `fp16_compress` stability issue, because we were able to reproduce that even without any then callback and just calling copy from fp32 to fp16 before allreduce. However, this can explain other issues where `allreduce` was not on par with `no_hook`. I'll run some additional simulations with this diff.
I tried to replace `getStreamFromPool` with `getDefaultCUDAStream(deviceIndex)` and it didn't cause additional memory usage. In this diff, I temporarily solved the issue by just initializing null pointers for each device in the constructor and setting the callback stream for the corresponding devices inside `ProcessGroupNCCL::getNCCLComm`. After the fix it looks like the memory issue is resolved:
```
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2513142 C python 745MiB |
| 4 2513144 C python 747MiB |
+-----------------------------------------------------------------------------+
```
I could use a dictionary instead of a vector for `futureNCCLCallbackStreams_`, but since number of devices is fixed, I think it isn't necessary. Please let me know what you think in the comments.
ghstack-source-id: 111485483
Test Plan:
`test_c10d.py` and some perf tests. Also check `nvidia-smi` while running tests to validate memory looks okay.
This diff also fixes the regression in HPC tests as we register a hook:
{F322730175}
See https://fb.quip.com/IGuaAbD8 (474fdd7e2d)bnvy for details.
Reviewed By: pritamdamania87
Differential Revision: D23495436
fbshipit-source-id: ad08e1d94343252224595d7c8a279fe75e244822
Summary:
This PR fixes unexpected `SystemError` when warnings are emitted and warning filters are set.
## Current behavior
```
$ python -Werror
>>> import torch
>>> torch.range(1, 3)
UserWarning: torch.range is deprecated in favor of torch.arange and will be removed in 0.5. Note that arange generates values in [start; end), not [start; end].
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
SystemError: <built-in method range of type object at 0x7f38c7703a60> returned a result with an error set
```
## Expected behavior
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UserWarning: torch.range is deprecated and will be removed in a future release because its behavior is inconsistent with Python's range builtin. Instead, use torch.arange, which produces values in [start, end).
```
## Note
Python exception must be raised if `PyErr_WarnEx` returns `-1` ([python docs](https://docs.python.org/3/c-api/exceptions.html#issuing-warnings)). This PR fixes warnings raised in the following code:
```py
import torch
torch.range(1, 3)
torch.autograd.Variable().volatile
torch.autograd.Variable().volatile = True
torch.tensor(torch.tensor([]))
torch.tensor([]).new_tensor(torch.tensor([]))
```
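For example, the fixed behavior can be checked with a warning filter turned into an error (a quick sketch, not part of the test suite):
```python
import warnings
import torch

with warnings.catch_warnings():
    warnings.simplefilter("error")  # same effect as running with `python -Werror`
    try:
        torch.range(1, 3)
    except UserWarning as e:
        print("raised as UserWarning:", e)  # previously this surfaced as SystemError
```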
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44371
Reviewed By: mrshenli
Differential Revision: D23598410
Pulled By: albanD
fbshipit-source-id: 2fbcb13fe4025dbebaf1fd837d4c8e0944e05010
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44398
These end up executing the same tests, so no reason to have them separate.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23600855
Pulled By: gchanan
fbshipit-source-id: 0952492771498bf813f1bf8e1d7c8dce574ec965
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43958
There is not any difference between these tests (I'm merging them), so let's merge them in the JIT as well.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23452337
Pulled By: gchanan
fbshipit-source-id: e6d13cdb164205eec3dbb7cdcd0052b02c961778
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44381
Perhaps this was necessary when the test was originally introduced, but it's difficult to figure out what is actually tested. And I don't think we actually use NotImplementedErrors.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23598646
Pulled By: gchanan
fbshipit-source-id: aa18154bfc4969cca22323e61683a301198823be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44406
This fix makes fakelowp identical to HW:
- mask out the floating point number with 0x7fff so we are always dealing
with positive numbers
- the DSP implementation is correct; ice-ref suffers from this same problem
Test Plan: - tested with test_fusions.py, can't enable the test until the fix in ice-ref appears
Reviewed By: venkatacrc
Differential Revision: D23603878
fbshipit-source-id: a72d93a4bc811f98d1b5e82ddb204be028addfeb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44440
`aten-op.cc` takes a long time to compile due to the large generated constructor. For each case, the `std::function` constructor and the initialization functions are inlined, producing a huge amount of intermediate code that takes a long time to optimize, given that many compiler optimization passes are superlinear in the function size.
This diff moves each case to a separate function, so that each one is cheap to optimize, and the constructor is just a large jump table, which is easy to optimize.
Reviewed By: dzhulgakov
Differential Revision: D23593741
fbshipit-source-id: 1ce7a31cda10d9b0c9d799716ea312a291dc0d36
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44226
**Summary**
At present, the `share_types` argument to `create_script_module` is used
to decide whether to reuse a previously created type for a top-level
module that has not yet been compiled. However, that setting does not apply
to the compilation of submodules of the top-level module; types are
still reused if possible.
This commit modifies `create_script_module` so that the `share_types`
flag is honoured during submodule compilation as well.
**Test Plan**
This commit adds a unit test to `TestTypeSharing` that checks that
submodule types are not shared or reused when `share_types` is set to
`False`.
**Fixes**
This commit fixes #43605.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23602371
Pulled By: SplitInfinity
fbshipit-source-id: b909b8b6abbe3b4cb9be8319ac263ade90e83bd3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44352
**Summary**
This commit adds support for `del` with class instances. If a class
implements `__delitem__`, then `del class_instance[key]` is syntactic
sugar for `class_instance.__delitem__(key)`.
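A small sketch of the supported pattern (hypothetical class, for illustration only):
```python
import torch
from typing import Dict

@torch.jit.script
class Bag(object):
    def __init__(self):
        self.items: Dict[str, int] = {"a": 1, "b": 2}

    def __delitem__(self, key: str):
        self.items.pop(key)

@torch.jit.script
def drop(b: Bag, key: str):
    del b[key]  # compiled as a call to b.__delitem__(key)
```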
**Test Plan**
This commit adds a unit test to TestClassTypes to test this feature.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23603102
Pulled By: SplitInfinity
fbshipit-source-id: 28ad26ddc9a693a58a6c48a0e853a1c7cf5c9fd6
Summary:
Expose the `nesterov` option of the SGD optimizer from caffe2 to dper.
The dper SGD optimizer (https://fburl.com/diffusion/chpobg0h) already refers to the NAG SgdOptimizer in caffe2 (https://fburl.com/diffusion/uat2lnan), so we only need to add the 'nesterov' parameter to the dper SGD optimizer.
Analysis of run results: N345540.
- train_ne increases as momentum (m) decreases.
- for m=0.95, 0.9: eval_ne is lower with NAG than in production (no NAG, m = 0.95).
- for m=0.99: eval_ne with or without NAG is higher than in production, indicating larger variance in validation and overfitting in training (lower train_ne).
Test Plan:
1. unit tests:
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_without_nesterov`
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_with_nesterov`
.
1. build dper front end package: `flow-cli canary ads.dper3.workflows.sparse_nn.train --mode opt --entitlement ads_global --run-as-secure-group team_ads_ml_ranking`. The build result (refreshed) is here https://www.internalfb.com/intern/buck/build/2a368b55-d94b-45c1-8617-2753fbce994b. Flow package version is ads_dper3.canary:856b545cc6b249c0bd328f845adeb0d2.
.
2. To build dper back end package: `flow-cli canary dper.workflows.dper3.train --mode opt --entitlement ads_global --run-as-secure-group team_ads_ml_ranking`. The build result (refreshed) is here: https://www.internalfb.com/intern/buck/build/70fa91cd-bf6e-4a08-8a4d-41e41a77fb52. Flow package version is aml.dper2.canary:84123a34be914dfe86b1ffd9925869de.
.
3. Compare prod with NAG-enabled runs:
a) refreshed prod run (m=0.95): f213877098
NAG enabled run (m=0.95): f213887113
.
b) prod run (m=0.9): f214065288
NAG enabled run (m=0.9): f214066319
.
c) prod run (m=0.99): f214065804
NAG enabled run (m=0.99): f214066725
.
d) changed the data type of nesterov to `bool` and launched a validation run
NAG enabled (m=0.95): f214500597
Reviewed By: ustctf
Differential Revision: D23152229
fbshipit-source-id: 61703ef6b4e72277f4c73171640fb8afc6d31f3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44043
To invoke `cancel` from the net instance in Python, we expose it through pybind state.
Reviewed By: dzhulgakov
Differential Revision: D23249660
fbshipit-source-id: 45a1e9062dca811746fcf2e5e42199da8f76bb54
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43384
Much like the FileStoreTests, the HashStoreTests were also run in a single blob and threw exceptions upon failure. This modularizes the test by separating each function into separate gtest test cases.
ghstack-source-id: 111690834
Test Plan: Confirmed that the tests pass on devvm.
Reviewed By: jiayisuse
Differential Revision: D23257579
fbshipit-source-id: 7e821f0e9ee74c8b815f06facddfdb7dc2724294
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43383
FileStore Test currently has a large blob of tests that throw
exceptions upon failure. This PR modularizes each test so they can run
independently, and migrates the framework to gtest.
ghstack-source-id: 111690831
Test Plan: Confirmed tests pass on devvm
Reviewed By: jiayisuse
Differential Revision: D22879473
fbshipit-source-id: 6fa5468e594a53c9a6b972757068dfc41645703e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43382
StoreTestCommon defines standard helper functions that are used by all of our Store tests. These helpers currently throw exceptions upon failure, this PR changes them to use gtest assertions instead.
ghstack-source-id: 111690833
Test Plan: Tested the 2 PR's above this on devvm
Reviewed By: jiayisuse
Differential Revision: D22828156
fbshipit-source-id: 9e116cf2904e05ac0342a441e483501e00aad3dd
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/41946/, to suggest enumerating a module as an alternative if a user tries indexing into a modulelist/sequential with a non-integer literal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43361
Reviewed By: mrshenli
Differential Revision: D23602388
Pulled By: eellison
fbshipit-source-id: 51fa28d5bc45720529b3d45e92d367ee6c9e3316
Summary: Exporting the Bucketize operator on CUDA. Also adding unit test.
Test Plan: buck test mode/dev-nosan caffe2/torch/fb/sparsenn:gpu_test -- test_bucketize
Differential Revision: D23581321
fbshipit-source-id: 7f21862984c04d840410b8718db93006f526938a
Summary:
Do not add gencode flags to NVCC_FLAGS twice: they are first added in `cmake/public/cuda.cmake`, so there is no need to do it again in `cmake/Dependencies.cmake`.
Copy `additional_unittest_args` before appending local options to it in the `run_test()` method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44414
Reviewed By: seemethere
Differential Revision: D23605733
Pulled By: malfet
fbshipit-source-id: 782a0da61650356a978a892fb03c66cb1a1ea26b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44382
This fixes a typo that was introduced in #44032.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23601316
Pulled By: glaringlee
fbshipit-source-id: 17d6de5900443ea46c7a6ee9c7614fe6f2d92890
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44400
This diff does the same thing as D23549149 (398409f072). A fix is included for OSS CI: pytorch_windows_vs2019_py36_cuda10.1_test1
ghstack-source-id: 111679745
Test Plan:
- CI
- OSS CI
Reviewed By: xcheng16
Differential Revision: D23601050
fbshipit-source-id: 8ebdcd8fdc5865078889b54b0baeb397a90ddc40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44163
In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.
ghstack-source-id: 111637788
Test Plan:
CI/Sandcastle. We will turn on this env var by default in
torchelastic and HPC trainer soon.
Reviewed By: jiayisuse
Differential Revision: D23517895
fbshipit-source-id: e7cd244b2ddf2dc0800ff7df33c73a6f00b63dcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41054
**This Commit:**
ProcessGroupNCCL destructor now blocks until all WorkNCCL objects have either been aborted or completed and removed from the work vector.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614314
Test Plan:
1. **DDP Sanity Check**: First we have a sanity check based on the PyTorch DDP benchmark. This verifies that the baseline DDP training with NCCL for standard CU workloads works well (esp. with standard models like Resnet50 and BERT). Here is a sample Flow: f213293473
1. **HPC Performance Benchmarks**: This stack has undergone thorough testing and profiling on the Training Cluster with varying number of nodes. This introduces 1-1.5% QPS regression only (~200-400 QPS regression for 8-64 GPUs).
1. **HPC Accuracy Benchmarks**: We've confirmed NE parity with the existing NCCL/DDP stack without this change.
1. **Kernel-Specific Benchmarks**: We have profiled other approaches for this system (such as cudaStreamAddCallback) and performed microbenchmarks to confirm the current solution is optimal.
1. **Sandcastle/CI**: Apart from the recently fixed ProcessGroupNCCL tests, we will also introduce a new test for desynchronization scenarios.
Reviewed By: jiayisuse
Differential Revision: D22054298
fbshipit-source-id: 2b95a4430a4c9e9348611fd9cbcb476096183c06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41053
**This Commit:**
Some minor refactoring - added helper to check if `WorkNCCL` objects have timed out. Adding a new finish function to ProcessGroupNCCL::WorkNCCL that avoids notifying CV and uses `lock_guard`. Also renaming the timeoutCVMutex mutex to be more descriptive.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614315
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21943520
fbshipit-source-id: b27ee329f0da6465857204ee9d87953ed6072cbb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41052
**This Commit:**
The watchdog thread checks for errored or timed-out `WorkNCCL` objects and aborts all associated NCCL communicators. For now, we also process these aborted communicators as with the existing watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store).
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614313
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21943151
fbshipit-source-id: 337bfcb8af7542c451f1e4b3dcdfc5870bdec453
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41051
**This Commit:**
In the workCleanupThread, we process completion and exception handling for workNCCL objects corresponding to collective calls that have either completed GPU Execution, or have already thrown an exception. This way, we throw an exception from the workCleanupThread for failed GPU operations. This approach replaces the previous (and lower performance) approach of enqueuing a callback on the CUDA stream to process failures.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614319
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21938498
fbshipit-source-id: df598365031ff210afba57e0c7be865e3323ca07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41050
**This Commit:**
We introduce a workVector to track live workNCCL objects corresponding to collective operations. Further, we introduce a workCleanupLoop, which busy-polls the vector of workNCCL objects and removes them upon completion.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21916637
fbshipit-source-id: f8cadaab0071aaad1c4e31f9b089aa23cba0cfbe
Summary:
Helpful for later analysis on the build time trends
Also, same .whl files out of regular linux build job
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44390
Reviewed By: walterddr
Differential Revision: D23602049
Pulled By: malfet
fbshipit-source-id: 4d55c9aa2d161a7998ad991a3da0436da83f70ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44331
Summary of issues (just to have a clear list), as reported by Tal Cherckez:
* std::clamp forces the user to use C++17
* using `settings` without `given` fails the test
* avoid using max_examples for tests
(Note: this ignores all push blocking failures!)
Test Plan: https://www.internalfb.com/intern/testinfra/testconsole/testrun/6192449509073222/
Reviewed By: hyuen
Differential Revision: D23581440
fbshipit-source-id: fe9fbc341f8fca02352f531cc622fc1035d0300c
Summary:
This should prevent torch_python from linking the entire cudnn library statically just to query its version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44402
Reviewed By: seemethere
Differential Revision: D23602720
Pulled By: malfet
fbshipit-source-id: 185b15b789bd48b1df178120801d140ea54ba569
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42488
Currently, ProcessGroupGloo tests do not emit logs if the test was
skipped due to CUDA not being available or there not being enough CUDA devices. This PR clarifies
the reason for skipping in these logs.
ghstack-source-id: 111638111
Test Plan: tested on devvm and devgpu
Reviewed By: jiayisuse
Differential Revision: D22879396
fbshipit-source-id: d483ca46b5e22ed986521262c11a1c6dbfbe7efd
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:
- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases
The functions moved are:
- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2
In a follow-up PR most or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277
Reviewed By: ngimel
Differential Revision: D23568330
Pulled By: mruberry
fbshipit-source-id: 03e69fccdbfd560217c34ce4e9a5f20e10d05a5e
Summary:
This can be taken from the system, in which case it is not used from the submodule. Hence the check here limits the usage unnecessarily.
cc malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44278
Reviewed By: malfet
Differential Revision: D23568552
Pulled By: ezyang
fbshipit-source-id: 7fd2613251567f649b12eca0b1fe7663db9cb58d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44315
I find it more intuitive to dump the optimized graph if we have one;
when I first saw the unoptimized graph being dumped I thought we had failed to
apply any optimizations.
Test Plan: Observe output by hand
Reviewed By: Lilyjjo
Differential Revision: D23578813
Pulled By: bertmaher
fbshipit-source-id: e2161189fb0e1cd53aae980a153aea610871662a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44162
This diff exports Node::isBefore/isAfter method to PythonAPI.
Test Plan: Tested locally. Please let me know if there is a set of unit tests to be passed.
Reviewed By: soumith
Differential Revision: D23514448
fbshipit-source-id: 7ef709b036370217ffebef52fd93fbd68c464e89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42932
Follow up from https://github.com/pytorch/pytorch/pull/41769, rename `test_distributed` to `test_distributed_fork` to make it explicit that it forks.
New command to run test:
`python test/run_test.py -i distributed/test_distributed_fork -v`
ghstack-source-id: 111632568
Test Plan: `python test/run_test.py -i distributed/test_distributed_fork -v`
Reviewed By: izdeby
Differential Revision: D23072201
fbshipit-source-id: 48581688b6c5193a309e803c3de38e70be980872
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41769
Currently the tests in `test_distributed` only work with the `fork` mode multiprocessing, this PR introduces support for `spawn` mode multiprocessing as well (while keeping the `fork` mode intact).
Motivations for the change:
1) Spawn multiprocessing is the default on MacOS, so it better emulates how MacOS users would use distributed
2) With python 3.8+, spawn is the default on linux, so we should have test coverage for this
3) PT multiprocessing suggests using spawn/forkserver over fork, for sharing cuda tensors: https://pytorch.org/docs/stable/multiprocessing.html
4) Spawn is better supported with respect to certain sanitizers such as TSAN, so adding this sanitizer coverage may help us uncover issues.
How it is done:
1) Move `test_distributed` tests in `_DistTestBase` class to a shared file `distributed_test` (similar to how the RPC tests are structured)
2) For `Barrier`, refactor the setup of temp directories, as the current version did not work with spawn: each process would get a different randomly generated directory and thus would write to different barriers.
3) Add all the relevant builds to run internally and in OSS.
Running test_distributed with spawn mode in OSS can be done with:
`python test/run_test.py -i distributed/test_distributed_spawn -v`
Reviewed By: izdeby
Differential Revision: D22408023
fbshipit-source-id: e206be16961fd80438f995e221f18139d7e6d2a9
Summary:
1) Ports nonzero from THC to ATen
2) replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point, communicating the number of nonzero elements from GPU to CPU
3) slightly changes the algorithm: we now first compute the number of nonzeros and then allocate a correctly-sized output, instead of allocating a full-sized output as was done before to account for possibly all elements being non-zero (see the sketch below)
4) unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point; however, it is a step towards a future without thrust
5) hard-limits the number of elements in the input tensor to MAX_INT. The previous implementation allocated a Long tensor of size ndim*nelements, which would be at least 16 GB for a tensor with MAX_INT elements. It is reasonable to say that larger tensors could not be used anyway.
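As a quick sketch of the user-visible contract after the change (the output is sized to the exact nonzero count):
```python
import torch

t = (torch.rand(1024, 1024) > 0.5).float()
n = int((t != 0).sum())                 # the count computed in the first pass
out = torch.nonzero(t, as_tuple=False)
assert out.shape == (n, t.dim())        # exactly n rows, one column per dim
```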
Benchmarking is done for tensors with approximately half non-zeros
<details><summary>Benchmarking script</summary>
<p>
```
import torch
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys

device = "cuda"
results = []
for numel in (1024 * 128,):  # , 1024 * 1024, 1024 * 1024 * 128):
    inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float)
    for ndim in range(2, 3):  # (1, 4):
        if ndim == 1:
            shape = (numel,)
        elif ndim == 2:
            shape = (1024, numel // 1024)
        else:
            shape = (1024, 128, numel // 1024 // 128)
        inp = inp.reshape(shape)
        repeats = 3
        timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero", sub_label=f"number of elts {numel}",
                      description=f"ndim {ndim}", globals=globals())
        for i in range(repeats):
            results.append(timer.blocked_autorange())
        print(f"\rnumel {numel} ndim {ndim}", end="")
        sys.stdout.flush()
comparison = Compare(results)
comparison.print()
```
</p>
</details>
### Results
Before:
```
[--------------------------- Nonzero ---------------------------]
| ndim 1 | ndim 2 | ndim 3
1 threads: ------------------------------------------------------
number of elts 131072 | 55.2 | 71.7 | 90.5
number of elts 1048576 | 113.2 | 250.7 | 497.0
number of elts 134217728 | 8353.7 | 23809.2 | 54602.3
Times are in microseconds (us).
```
After:
```
[-------------------------- Nonzero --------------------------]
| ndim 1 | ndim 2 | ndim 3
1 threads: ----------------------------------------------------
number of elts 131072 | 48.6 | 79.1 | 90.2
number of elts 1048576 | 64.7 | 134.2 | 161.1
number of elts 134217728 | 3748.8 | 7881.3 | 9953.7
Times are in microseconds (us).
```
There's a real regression for smallish 2D tensors due to the added work of computing the number of nonzero elements; however, for other sizes there are significant gains, and the memory requirements are drastically lower. Perf gains would be even larger for tensors with fewer nonzeros.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259
Reviewed By: izdeby
Differential Revision: D23581955
Pulled By: ngimel
fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44217
Move the tests to static ones as well
Test Plan:
python test/test_quantization.py TestStaticQuantizedModule.test_embedding_bag_api
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23547386
fbshipit-source-id: 41f81c31e1613098ecf6a7eff601c7dcd4b09c76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44208
Add quantized module in static quantization namespace. Embedding
quantization requires only weights to be quantized so it is static.
Internally it calls the embedding_bag_byte op with the offsets set corresponding to the
indices.
Future PR will move EmbeddingBag quantization from dynamic to static as well.
Test Plan:
python test/test_quantization.py test_embedding_api
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23547384
fbshipit-source-id: eddc6fb144b4a771060e7bab5853656ccb4443f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42735
We use reduced precision only for embedding table (not for momentum) in RowWiseSparseAdagrad
Test Plan: .
Reviewed By: jianyuh
Differential Revision: D23003939
fbshipit-source-id: 062290d94b160100bc4c2f48b797833819f8e88a
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator, by inserting the CUDA-specific cast to float during handling of the Cast node rather than as a wrapper around printing Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast which immediately precedes a Load.
Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus C++ tests obv.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209
Reviewed By: izdeby
Differential Revision: D23575577
Pulled By: nickgg
fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44042
Missed one case last time
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23479345
fbshipit-source-id: 30e6713120c494e9fab5584de4df9b25bec83d32
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43890
1. auto-detect `CXX` default compiler type in oss, and `clang` as default compiler type in fbcode (because auto-detecting will say `gcc` is the default compiler on devserver).
2. change `compiler type` from str `"CLANG" "GCC"` to enum type
3. rename function `get_cov_type` to `detect_compiler_type`
4. auto-set the default pytorch folder for users in oss
Test Plan:
on devserver:
```
buck run :coverage //caffe2/c10:
```
on oss:
```
python oss_coverage.py --run-only=atest
```
Reviewed By: malfet
Differential Revision: D23420034
fbshipit-source-id: c0ea88188578bb1343a286f2090eb8a74cdf3982
Summary:
When the backward ops execute via the autograd engine's evaluate_function(), fn.release_variables() is called to release the SavedVariables. For eager mode ops, this releases the saved inputs that were required for the backward grad function. However, with TorchScript, we get a DifferentiableGraph, and DifferentiableGraphBackward() doesn't implement release_variables(). This causes the SavedVariables to stay alive longer. Implement release_variables() for DifferentiableGraphBackward to release these SavedVariables early.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42994
Reviewed By: izdeby
Differential Revision: D23503172
Pulled By: albanD
fbshipit-source-id: d87127498cfa72883ae6bb31d0e6c7056c4c36d4
Summary:
This test is failing consistently on linux-bionic-rocm3.7-py3.6-test2. Relevant log snippet:
```
03:43:11 FAIL: test_addcmul_cuda_float16 (__main__.TestForeachCUDA)
03:43:11 ----------------------------------------------------------------------
03:43:11 Traceback (most recent call last):
03:43:11 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 818, in wrapper
03:43:11 method(*args, **kwargs)
03:43:11 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 258, in instantiated_test
03:43:11 result = test(self, *args)
03:43:11 File "test_foreach.py", line 83, in test_addcmul
03:43:11 self._test_pointwise_op(device, dtype, torch._foreach_addcmul, torch._foreach_addcmul_, torch.addcmul)
03:43:11 File "test_foreach.py", line 58, in _test_pointwise_op
03:43:11 self.assertEqual(tensors, expected)
03:43:11 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1153, in assertEqual
03:43:11 exact_dtype=exact_dtype, exact_device=exact_device)
03:43:11 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1127, in assertEqual
03:43:11 self.assertTrue(result, msg=msg)
03:43:11 AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0.001 and atol=1e-05, found 10 element(s) (out of 400) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.00048828125 (-0.46484375 vs. -0.46533203125), which occurred at index (11, 18).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44304
Reviewed By: malfet, izdeby
Differential Revision: D23578316
Pulled By: mruberry
fbshipit-source-id: 558eecf42677383e7deaa4961e12ef990ffbe28c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44149
Thanks Christian Puhrsch for reporting.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: izdeby
Differential Revision: D23574739
Pulled By: ezyang
fbshipit-source-id: 8c9d0d78e6970139e0103cd1e0004b743e3c7f9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44233
**Summary**
By default, scripting tries to share concrete and JIT types across
compilations. However, this can lead to incorrect results if a module
extends `torch.jit.ScriptModule`, and injects instance variables into
methods defined using `define`.
This commit detects when this has happened and disables type sharing
for the compilation of the module that uses `define` in `__init__`.
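A rough sketch of the pattern this affects (hypothetical module, illustrative only):
```python
import torch

class Const(torch.jit.ScriptModule):
    def __init__(self, val):
        super().__init__()
        # The defined method bakes a per-instance value into its body.
        self.define("def forward(self) -> int: return " + str(val))

# With type sharing disabled for modules that call define() in __init__,
# each instance keeps its own compiled forward: Const(1)() == 1, Const(2)() == 2.
```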
**Test Plan**
This commit adds a test to TestTypeSharing that tests this scenario.
**Fixes**
This commit fixes #43580.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23553870
Pulled By: SplitInfinity
fbshipit-source-id: d756e87fcf239befa0012998ce29eeb25728d3e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44312
This test is failing now when running on card. Let's disable it while Intel is investigating the issue.
Test Plan: Sandcastle
Reviewed By: hyuen
Differential Revision: D23577475
fbshipit-source-id: 84f957c69ed75e0e0f563858b8b8ad7a2158da4e
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: d5ace7ca70
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44177
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D23533561
fbshipit-source-id: 9e580f8dbfb83e57bebc28f8e459caa0c5fc7317
Summary: Exports the operator to PyTorch, to be made into a low-level module.
Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_learning_rate
```
Reviewed By: yf225
Differential Revision: D23545582
fbshipit-source-id: 6b6d9aa6a47b2802ccef0f87c1263c6cc2d2fdf6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```
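For reference, a quick usage sketch of these private APIs:
```python
import torch

tensors = [torch.rand(3, 3) for _ in range(4)]
outs = torch._foreach_exp(tensors)   # out-of-place: returns a list of new tensors
torch._foreach_sqrt_(tensors)        # in-place: mutates every tensor in the list
```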
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors
**Plan for the next PRs**
1. APIs
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331889
Pulled By: izdeby
fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
Summary:
Check return code of `nvcc --version` and if it's not zero, print warning and mark CUDA as not found.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44236
Test Plan: Run `CUDA_NVCC_EXECUTABLE=/foo/bar cmake ../`
Reviewed By: ezyang
Differential Revision: D23552336
Pulled By: malfet
fbshipit-source-id: cf9387140a8cdbc8dab12fcc4bfaf55ae8e6a502
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44270
The previous PR (#44212) was reverted since I didn't update the
`upload_scribe.py` script and it was looking for 'executor_and_fuser'
field in the json which now is replaced with two separate fields:
'executor' and 'fuser'.
Differential Revision: D23561500
Test Plan: Imported from OSS
Reviewed By: ngimel
Pulled By: ZolotukhinM
fbshipit-source-id: 7fe86d34afa488a0e43d5ea2aaa7bc382337f470
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)
torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div_(TensorList self, Scalar scalar)
```
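A brief usage sketch of the list-list and list-scalar variants (again private APIs, subject to change); each call replaces a Python loop over the individual tensors:
```
import torch

a = [torch.rand(3) for _ in range(4)]
b = [torch.rand(3) for _ in range(4)]

diffs = torch._foreach_sub(a, b)     # a[i] - b[i] for each pair, as a new list
scaled = torch._foreach_mul(a, 2.0)  # a[i] * 2.0 for every tensor in the list
torch._foreach_div_(a, b)            # in-place: a[i] /= b[i]
```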
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors
**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331891
Pulled By: izdeby
fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
Summary:
Adding a calibration module called histogram binning:
Divide the prediction range (e.g., [0, 1]) into B bins. In each bin, use two parameters to store the number of positive examples and the number of examples that fall into this bucket. So we basically have a histogram for the model prediction.
As a result, for each bin we have a statistical value for the real CTR (num_pos / num_example). We use this statistical value as the final calibrated prediction if the pre-calibration prediction falls into the corresponding bin.
In this way, the predictions within each bin should be well-calibrated if we have sufficient examples. That is, we have a fine-grained calibrated model by this calibration module.
Theoretically, this calibration layer can fix any uncalibrated model or prediction if we have sufficient bins and examples. It provides the potential to use any kind of training weight allocation to our training data, without worrying about the calibration issue.
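A minimal, purely illustrative sketch of the idea described above (not the actual dper3 module; the class name and bin handling here are assumptions):
```
import torch

class HistogramBinningCalibration:
    def __init__(self, num_bins=100):
        self.num_bins = num_bins
        self.pos = torch.zeros(num_bins)     # positive examples per bin
        self.total = torch.zeros(num_bins)   # total examples per bin

    def update(self, predictions, labels):
        # Accumulate statistics from uncalibrated predictions in [0, 1] and 0/1 labels.
        bins = (predictions * self.num_bins).long().clamp_(0, self.num_bins - 1)
        self.total += torch.bincount(bins, minlength=self.num_bins).float()
        self.pos += torch.bincount(bins, weights=labels.float(), minlength=self.num_bins).float()

    def calibrate(self, predictions):
        # Replace each prediction with the empirical CTR of its bin.
        bins = (predictions * self.num_bins).long().clamp_(0, self.num_bins - 1)
        ctr = self.pos / self.total.clamp(min=1.0)
        return ctr[bins]
```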
Test Plan:
buck test dper3/dper3/modules/calibration/tests:calibration_test -- test_histogram_binning_calibration
buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_histogram_binning_calibration
All tests passed.
Example workflows:
f215431958
{F326445092}
f215445048
{F326445223}
Reviewed By: chenshouyuan
Differential Revision: D23356450
fbshipit-source-id: c691b66c51ef33908c17575ce12e5bee5fb325ff
Summary:
When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR:
- Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction
- Fixes var's docs, which listed its arguments in the incorrect order
- Adds new tests comparing var and std with their NumPy counterparts
Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints:
- torch.randn((8000, 8000))
- var measured 0.0022215843200683594s on CUDA before the change
- var measured 0.0020322799682617188s on CUDA after the change
- torch.randn((8000, 8000)).T
- var measured .015128850936889648 on CUDA before the change
- var measured 0.001912832260131836 on CUDA after the change
- torch.randn(8000 ** 2)
- std measured 0.11031460762023926 on CUDA before the change
- std measured 0.0017833709716796875 on CUDA after the change
Timings for var and std are, as expected, similar.
On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change:
```
import torch
import numpy as np
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys
base = 8
multiplier = 1
def stdfn(a):
    meanv = a.mean()
    ac = a - meanv
    return torch.sqrt(((ac * ac).sum()) / a.numel())

results = []
num_threads = 1
for _ in range(7):
    size = base * multiplier
    input = torch.randn(size)
    tasks = [("torch.var(input)", "torch_var"),
             ("torch.var(input, dim=0)", "torch_var0"),
             ("stdfn(input)", "stdfn"),
             ("torch.sum(input, dim=0)", "torch_sum0")]
    timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}",
                    description=label, globals=globals()) for stmt, label in tasks]
    repeats = 3
    for i, timer in enumerate(timers * repeats):
        results.append(timer.blocked_autorange())
        print(f"\r{i + 1} / {len(timers) * repeats}", end="")
        sys.stdout.flush()
    multiplier *= 10
print()
comparison = Compare(results)
comparison.print()
```
The TH timings using this script on my devfair are:
```
[------------------------------ Index ------------------------------]
| torch_var | torch_var0 | stdfn | torch_sum0
1 threads: ----------------------------------------------------------
8 | 16.0 | 5.6 | 40.9 | 5.0
80 | 15.9 | 6.1 | 41.6 | 4.9
800 | 16.7 | 12.0 | 42.3 | 5.0
8000 | 27.2 | 72.7 | 51.5 | 6.2
80000 | 129.0 | 715.0 | 133.0 | 18.0
800000 | 1099.8 | 6961.2 | 842.0 | 112.6
8000000 | 11879.8 | 68948.5 | 20138.4 | 1750.3
```
and the ATen timings are:
```
[------------------------------ Index ------------------------------]
| torch_var | torch_var0 | stdfn | torch_sum0
1 threads: ----------------------------------------------------------
8 | 4.3 | 5.4 | 41.4 | 5.4
80 | 4.9 | 5.7 | 42.6 | 5.4
800 | 10.7 | 11.7 | 43.3 | 5.5
8000 | 69.3 | 72.2 | 52.8 | 6.6
80000 | 679.1 | 676.3 | 129.5 | 18.1
800000 | 6770.8 | 6728.8 | 819.8 | 109.7
8000000 | 65928.2 | 65538.7 | 19408.7 | 1699.4
```
which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows the performance is mixed when using multiple threads, too:
```
import torch
import time
# Benchmarking var and std, 1D with varying sizes
base = 8
multiplier = 1
op = torch.var
reps = 1000
for _ in range(7):
    size = base * multiplier
    t = torch.randn(size)
    elapsed = 0
    for _ in range(reps):
        start = time.time()
        op(t)
        end = time.time()
        elapsed += end - start
    multiplier *= 10
    print("Size: ", size)
    print("Avg. elapsed time: ", elapsed / reps)
```
```
var cpu TH vs ATen timings
Size: 8
Avg. elapsed time: 1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins)
Size: 80
Avg. elapsed time: 1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins)
Size: 800
Avg. elapsed time: 1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins)
Size: 8000
Avg. elapsed time: 2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins)
Size: 80000
Avg. elapsed time: 0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins)
Size: 800000
Avg. elapsed time: 0.0010556647777557374 vs 0.00030616092681884767 (ATen wins)
Size: 8000000
Avg. elapsed time: 0.009990205764770508 vs 0.002938544034957886 (ATen wins)
std cpu TH vs ATen timings
Size: 8
Avg. elapsed time: 1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins)
Size: 80
Avg. elapsed time: 1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins)
Size: 800
Avg. elapsed time: 1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins)
Size: 8000
Avg. elapsed time: 2.7791500091552735e-05 vs 7.031106948852539e-05 (TH wins)
Size: 80000
Avg. elapsed time: 0.00018650460243225096 vs 0.00024368906021118164 (TH wins)
Size: 800000
Avg. elapsed time: 0.0010522041320800782 vs 0.0003039860725402832 (ATen wins)
Size: 8000000
Avg. elapsed time: 0.009976618766784668 vs. 0.0029211788177490234 (ATen wins)
```
These results show the TH solution still performs better than the ATen solution with default threading for some sizes.
It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858
Reviewed By: zou3519
Differential Revision: D23498981
Pulled By: mruberry
fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050
Summary:
This PR adds the following aliases:
- not_equal for torch.ne
- greater for torch.gt
- greater_equal for torch.ge
- less for torch.lt
- less_equal for torch.le
These aliases are consistent with NumPy's naming for these functions; a short example follows.
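For example, the aliases behave identically to the existing comparison ops:
```
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([3, 2, 1])

assert torch.equal(torch.not_equal(a, b), torch.ne(a, b))
assert torch.equal(torch.greater(a, b), torch.gt(a, b))
assert torch.equal(torch.less_equal(a, b), torch.le(a, b))
```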
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43870
Reviewed By: zou3519
Differential Revision: D23498975
Pulled By: mruberry
fbshipit-source-id: 78560df98c9f7747e804a420c1e53fd1dd225002
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43985
Added
```
def(detail::SelectiveStr<true>, ...)
impl(detail::SelectiveStr<true>, ...)
```
in torch/library, which can also be used for other templated selective registration.
Size savings for this diff:
fbios-pika: 78 KB
igios: 87 KB
Test Plan: Imported from OSS
Reviewed By: ljk53, smessmer
Differential Revision: D23459774
Pulled By: iseeyuan
fbshipit-source-id: 86d34cfe8e3f852602f203db06f23fa99af2c018
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44048
Inline the fork-wait calls to make sure we can see the ops to be quantized in the main graph.
Also fix the InlineForkWait JIT pass to account for the case where the aten::wait call isn't present in the main graph and we return a future tensor from the subgraph.
Example
```
graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_6325.DperModuleWrapper,
%argument_1.1 : Tensor,
%argument_2.1 : Tensor):
%3 : Future[Tensor[]] = prim::fork_0(%self.1, %argument_1.1, %argument_2.1) # :0:0
return (%3)
with prim::fork_0 = graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_5396.DperModuleWrapper,
%argument_1.1 : Tensor,
%argument_2.1 : Tensor):
%3 : __torch__.dper3.core.interop.___torch_mangle_6330.DperModuleWrapper = prim::GetAttr[name="x"](%self.1)
%4 : __torch__.dper3.core.interop.___torch_mangle_5397.DperModuleWrapper = prim::GetAttr[name="y"](%self.1)
%5 : __torch__.dper3.core.interop.___torch_mangle_6327.DperModuleWrapper = prim::GetAttr[name="z"](%4)
%6 : Tensor = prim::CallMethod[name="forward"](%5, %argument_1.1, %argument_2.1) # :0:0
%7 : None = prim::CallMethod[name="forward"](%3, %6) # :0:0
%8 : Tensor[] = prim::ListConstruct(%6)
return (%8)
```
Test Plan:
python test/test_quantization.py test_interface_with_fork
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23481003
fbshipit-source-id: 2e756be73c248319da38e053f021888b40593032
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44008
embedding_bag requires only quantization of weights (no dynamic quantization of inputs),
so the type of quantization is essentially static (without calibration).
This will enable pyper to do fc and embedding_bag quantization using the same API call.
Test Plan:
python test/test_quantization.py test_embedding_bag
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23467019
fbshipit-source-id: 41a61a17ee34bcb737ba5b4e19fb7a576d4aeaf9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43989
When we trace the model it produces an aten::embedding_bag node in the graph.
Add the necessary passes in graph mode to help support quantizing it as well.
Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23460485
fbshipit-source-id: 328c5e1816cfebb10ba951113f657665b6d17575
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44137
We only insert guards on Tensor types, so we rely on the output
of a node being uniquely determined by its input types.
Bail if any non-Tensor input affects the output type
and cannot be reasoned about statically.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23543602
Pulled By: eellison
fbshipit-source-id: abd6fe0b1fd7fe6fc251694d4cd442b19c032dd7
Summary:
- test beta=0, self=nan
- test transposes
- fixes broadcasting of addmv
- not supporting tf32 yet, will do it in future PR together with other testing fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43980
Reviewed By: mruberry
Differential Revision: D23507559
Pulled By: ngimel
fbshipit-source-id: 14ee39d1a0e13b9482932bede3fccb61fe6d086d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44125
In `Quantizer._prepare`, `observed` was used for two different variables
with different types. Making the names a bit cleaner and removing the
name conflict.
Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```
Imported from OSS
Reviewed By: dskhudia
Differential Revision: D23504109
fbshipit-source-id: 0f73eac3d6dd5f72ad5574a4d47d33808a70174a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44165
Allows convolutions to be quantized if the `torch.backends.cudnn.benchmark`
flag was set.
Not for land yet, just testing.
Test Plan:
in the gist below, the resulting graph now has quantized convolutions
https://gist.github.com/vkuzo/622213cb12faa0996b6700b08d6ab2f0
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23518775
fbshipit-source-id: 294f678c6afbd3feeb89b7a6655bc66ac9f8bfbc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44227
As title
ghstack-source-id: 111490242
Test Plan: CI
Reviewed By: xcheng16
Differential Revision: D23549149
fbshipit-source-id: fad742a8d4e6f844f83495514cd60ff2bf0d5bcb
Summary:
Update the repeat op so that the inputs to the sizes argument can be a mixture of dynamic and constant inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43430
Reviewed By: houseroad
Differential Revision: D23494257
Pulled By: bzinodev
fbshipit-source-id: 90c5e90e4f73e98f3a9d5c8772850e72cecdf0d4
Summary: Integrate aot flow with model exporter.
Test Plan:
buck test dper3/dper3_backend/delivery/tests:dper3_model_export_test
replayer test see D23407733
Reviewed By: ipiszy
Differential Revision: D23313689
fbshipit-source-id: 39ae8d578ed28ddd6510db959b65974a5ff62888
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43906
This method returns a list of RRefs of remote parameters that can be fed into the DistributedOptimizer.
Original PR issue: RemoteModule enhancements #40550
Test Plan: buck test caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: rohan-varma
Differential Revision: D23399586
fbshipit-source-id: 4b0f1ccf2e47c8a9e4f79cb2c8668f3cdbdff820
Summary: Disable unroll hints when COMPILING_FOR_MIN_SIZE is on. We were seeing hundreds of errors in the build because the optimization was not being performed.
Test Plan: Smoke builds
Differential Revision: D23513255
fbshipit-source-id: 87da2fdc3c1146e8ffcacf14a49d5151d313f367
Summary:
Duplicate of https://github.com/pytorch/pytorch/issues/41413
This PR initiates the process of updating the TorchScript backend interface used by the ONNX exporter.
Replace jit lower graph pass by freeze module pass
Enable ScriptModule tests for ONNX operator tests (ORT backend) and model tests by default.
Replace the jit remove_inplace_ops pass with remove_mutation and consolidate all passes for handling inplace ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43791
Reviewed By: houseroad
Differential Revision: D23421872
Pulled By: bzinodev
fbshipit-source-id: a98710c45ee905748ec58385e2a232de2486331b
Summary: Fix the potential divide by zero error in CostInferenceForRowWiseSparseAdagrad, when n has zero elements
Test Plan:
Ran buck test caffe2/caffe2/python/operator_test:adagrad_test
Result: https://our.intern.facebook.com/intern/testinfra/testrun/562950122086369
Reviewed By: idning
Differential Revision: D23520763
fbshipit-source-id: 191345bd24f5179a9dbdb41c6784eab102cfe89c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44092
Instead, submodules and weights are installed directly on the
graph_module by transferring the original modules. This makes it more
likely that scripting will succeed (since we no longer have submodules
that are not used in the trace). It also prevents layered transforms
from having to special case handling of the `root` module. GraphModules
can now be re-traced as part of the input to other transforms.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23504210
Pulled By: zdevito
fbshipit-source-id: f79e5c4cbfc52eb0ffb5d6ed89b37ce35a7dc467
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44148
Automatically remove the build_code_analyzer folder each time build.sh is run
ghstack-source-id: 111458413
Test Plan:
Run build.sh with different options and compare the outputs (should be different).
Ex:
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=OFF' tools/code_analyzer/build.sh `
should produce a shorter file than
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`
Reviewed By: iseeyuan
Differential Revision: D23503886
fbshipit-source-id: 9b95d4365540da0bd2d27760e1315caed5f44eec
Summary:
Because those jobs are running in a Docker2XLarge+ container that has 20 cores.
Unfortunately `nproc` returns the number of cores available on the host rather than the number of cores available to the container.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44168
Reviewed By: walterddr, ngimel
Differential Revision: D23539558
Pulled By: malfet
fbshipit-source-id: 3df858722e153a8fcbe8ef6370b1a9c1993ada5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44052
Summary
=======
This PR registers the following backwards functions as operators:
- slice_backward
- select_backward
- gather_backward
- index_select_backward (the backward function for index_select)
- select_index_backward (previously known as index_select_backward, but is actually the backward function for max.dim, min.dim, etc)
In the future, I'd like to register more backward functions as operators
so that we can write batching rules for the backward functions. Batching
rules for backward functions makes it so that we can compute batched
gradients.
Motivation
==========
The rationale behind this PR is that a lot of backwards functions (27 in total)
are incompatible with BatchedTensor due to using in-place operations.
Sometimes we can allow the in-place operations, but other times we can't.
For example, consider select_backward:
```
Tensor select_backward(const Tensor& grad, IntArrayRef input_sizes, int64_t dim, int64_t index) {
  auto grad_input = at::zeros(input_sizes, grad.options());
  grad_input.select(dim, index).copy_(grad);
  return grad_input;
}
```
and consider the following code:
```
x = torch.randn(5, requires_grad=True)
def select_grad(v):
    return torch.autograd.grad(x[0], x, v)
vs = torch.randn(B0)
batched_grads = vmap(select_grad)(vs)
```
For the batched gradient use case, `grad` is a BatchedTensor.
The physical version of `grad` has size `(B0,)`.
However, select_backward creates a `grad_input` of shape `(5)`, and
tries to copy `grad` to a slice of it.
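For illustration, a hypothetical batched version of select_backward would have to allocate per-example zero gradients and scatter the batched grad into them (this is only a sketch of what a batching rule needs to produce, not the implementation in this PR):
```
import torch

def batched_select_backward(grad, input_sizes, dim, index):
    # grad: (B0,) -- one gradient value per example in the batch
    B0 = grad.shape[0]
    grad_input = torch.zeros(B0, *input_sizes)      # (B0, 5) instead of (5,)
    grad_input.select(dim + 1, index).copy_(grad)   # shift dim past the batch dim
    return grad_input

out = batched_select_backward(torch.randn(8), (5,), dim=0, index=0)  # shape (8, 5)
```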
Other approaches
================
I've considered the following:
- register select_backward as an operator (this PR)
- have a branch inside select_backward for if `grad` is batched.
- this is OK, but what if we have more tensor extensions that want to override this?
- modify select_backward to work with BatchedTensor, by creating a new operator for the "select + copy_ behavior".
- select + copy_ isn't used elsewhere in derivative formulas so this doesn't seem useful
Test Plan
=========
- `pytest test/test_autograd.py -v`
- Registering backward functions may impact performance. I benchmarked
select_backward to see if registering it as an operator led to any noticeable
performance overheads: https://gist.github.com/zou3519/56d6cb53775649047b0e66de6f0007dc.
The TL;DR is that the overhead is pretty minimal.
Test Plan: Imported from OSS
Reviewed By: ezyang, fbhuba
Differential Revision: D23481183
Pulled By: zou3519
fbshipit-source-id: 125af62eb95824626dc83d06bbc513262ee27350
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40785
The main goal of this change is to support creating Tensors specifying blob in NHWC (ChannelsLast) format.
ChannelsLast is supported only for 4-dim tensors; this is enforced on the LibTorch side. I have not added asserts on the Java side, in case this limitation changes in the future, and to avoid duplicate asserts.
Additional changes in `aten/src/ATen/templates/Functions.h`:
`from_blob` creates an `at::empty({0}, options)` tensor first and sets its Storage with sizes and strides afterwards.
But as ChannelsLast is only for 4-dim tensors, it fails on that creation, as dim==1.
I've added `zero_sizes()` function that returns `{0, 0, 0, 0}` for ChannelsLast and ChannelsLast3d.
Test Plan: Imported from OSS
Reviewed By: dreiss
Differential Revision: D22396244
Pulled By: IvanKobzarev
fbshipit-source-id: 02582d748a554e0f859aefe71cd2c1e321fb8979
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This has an issue when transformations are applied to the LoopNest: the function body can be different from what appears in the root_stmt and result in inlining that a) fails, b) reverses other transformations or c) a weird unpredictable combination of the two.
This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand` and we handle calls to `rand()` in all branches.
This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (ie. they are vars not exprs).
This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor not the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g: `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable at the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier and we would aggregate or cancel calls to `rand()`. Have fixed the hasher to hash all calls to `rand()` distinctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885
Reviewed By: gmagogsfm
Differential Revision: D23503636
Pulled By: nickgg
fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44139
Also, make sure that we're checking that condition when we're starting a
new fusion group, not only when we merge a node into an existing fusion
group. Oh, and one more: add a test checking that we're rejecting graphs
with unspecified shapes.
Differential Revision: D23507510
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: 9c268825ac785671d7c90faf2aff2a3e5985ac5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44115
Fixes device affinity in the FX prepare pass for QAT. Before this PR, observers
were always created on CPU. After this PR, observers are created on the
same device as the rest of the model. This will enable QAT prepare to
work regardless of whether users move the model to cuda before or after
calling this pass.
Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_qat_prepare_device_affinity
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23502291
fbshipit-source-id: ec4ed20c21748a56a25e3395b35ab8640d71b5a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43298
IR emitter uses `ModuleValue` to represent ScriptModules and emit IR for
attribute access, submodule access, etc.
`ModuleValue` relies on two pieces of information, the JIT type of the
module, and the `ConcreteModuleType`, which encapsulates Python-only
information about the module.
ScriptModules loaded from a package used to create a dummy
ConcreteModuleType without any info in it. This led to divergences in
behavior during compilation.
This PR makes the two ways of constructing a ConcreteModuleType equivalent,
modulo any py-only information (which, by definition, is never present in
packaged files anyway).
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23228738
Pulled By: suo
fbshipit-source-id: f6a660f42272640ca1a1bb8c4ee7edfa2d1b07cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43284
The IR emitter looks for attributes on modules like:
1. Check the JIT type for the attribute
2. Check the originating Python class, in order to fulfill requests for, e.g. static methods or ignored methods.
In the case where you do:
```
inner_module = torch.jit.load("inner.pt")
wrapped = Wrapper(inner_module) # wrap the loaded ScriptModule in an nn.Module
torch.jit.script(wrapped)
```
The IR emitter may check for attributes on `inner_module`. There is no
originating Python class for `inner_module`, since it was directly
compiled from the serialized format.
Due to a bug in the code, we don't guard for this case and a segfault
results if the wrapper asks for an undefined attribute. The lookup in
this case looks like:
1. Check the JIT type for the attribute (not there!)
2. Check the originating Python class (this is a nullptr! segfault!)
This PR guards this case and properly just raises an attribute missing
compiler error instead of segfaulting.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23224337
Pulled By: suo
fbshipit-source-id: 0cf3060c427f2253286f76f646765ec37b9c4c49
Summary:
Some tests for alias analysis.
The first aliases at the module level and the second at the input level.
Please let me know if there are other alias situations!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44110
Reviewed By: nickgg
Differential Revision: D23509473
Pulled By: bwasti
fbshipit-source-id: fbfe71a1d40152c8fbbd8d631f0a54589b791c34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44083
Match on the complete schema of a node instead of its node kind when deciding to fuse it. Previously we matched on node kind, which could fail for something like `aten::add(int, int)`; if a new overload was added to an op without corresponding NNC support, we would still fuse it.
Follow ups are:
- bail when an output tensor type isn't uniquely determined by the input types (e.g. aten::add and the second input could be either a float or an int)
- remove NNC lowering for _tanh_backward & _sigmoid_backward
- Validate that we support all of the overloads here. I optimistically added ops that included Tensors, it's possible that we do not support every overload here. This isn't a regression, and this PR is at least improving our failures in that regard.
I can do any of these as part of this PR if desired, but there are a number of failures people have run into that this PR fixes so I think it would be good to land this sooner than later.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23503704
Pulled By: eellison
fbshipit-source-id: 3ce971fb1bc3a7f1cbaa38f1ed853e2db3d67c18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43965
As part of a larger effort to unify the API between the lite interpreter and full JIT:
- implement torch::jit::mobile::Method, a proxy for torch::jit::mobile::Function
- add support for overloaded operator() to mobile Method and Function
- mobile find_method now returns a c10::optional<Method> (so signature matches full jit)
- moves some implementation of Function from module.cpp to function.cpp
ghstack-source-id: 111161942
Test Plan: CI
Reviewed By: iseeyuan
Differential Revision: D23330762
fbshipit-source-id: bf0ba0d711d9566c92af31772057ecd35983ee6d
Summary:
Polishes DDP join api docstrings and makes a few minor cosmetic changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43973
Reviewed By: zou3519
Differential Revision: D23467238
Pulled By: rohan-varma
fbshipit-source-id: faf0ee56585fca5cc16f6891ea88032336b3be56
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44036
Running replaceAtenConvolution on older traced models won't work as
the _convolution signature has changed and replaceAtenConvolution was
changed to account for that.
But we did not preserve the old behavior during that change. This change
restores the old behavior while keeping the new one.
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23476775
fbshipit-source-id: 73a0c2b7387f2a8d82a8d26070d0059972126836
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44035
change
Also added a test to capture such cases in the future.
Test Plan:
python test/test_xnnpack_integration.py
Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D23476773
fbshipit-source-id: a62c4429351c909245106a70b4c60b1bacffa817
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44060
Right now it skips grad checks as well.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23484018
Pulled By: gchanan
fbshipit-source-id: 24a8f1af41f9918aaa62bc3cd78b139b2f8de1e1
Summary:
Bucketize returns integers; currently this triggers an internal assert, so we apply the mechanism for this case that is also used for argmax etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44102
Reviewed By: zou3519
Differential Revision: D23500048
Pulled By: albanD
fbshipit-source-id: fdd869cd1feead6616b532b3e188bd5512adedea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44054
**Summary**
This commit improves the error message that is printed when an
`Optional` type annotation with an unsupported contained type is
encountered. At present, the `Optional` is printed as-is, and
`Optional[T]` is syntatic sugar for `Union[T, None]`, so that is what
shows up in the error message and can be confusing. This commit modifies
the error message so that it prints `T` instead of `Union[T, None]`.
**Test Plan**
Continuous integration.
Example of old message:
```
AssertionError: Unsupported annotation typing.Union[typing.List, NoneType] could not be resolved.
```
Example of new message:
```
AssertionError: Unsupported annotation typing.Union[typing.List, NoneType] could not be resolved because typing.List could not be resolved.
```
**Fixes**
This commit fixes #42859.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23490365
Pulled By: SplitInfinity
fbshipit-source-id: 2aa9233718e78cf1ba3501ae11f5c6f0089e29cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44078
When PyTorch mobile inference fails and throws an exception, if the caller catches it and does not crash the app, we are not able to track all the inference failures.
So we are adding native soft error reporting to capture all the failures occurring during module loading and running, including both crashing and non-crashing failures. Since c10::Error has good error message and stack handling (D21202891 (a058e938f9)), we are utilizing it for the error handling and message print out.
ghstack-source-id: 111307080
Test Plan:
Verified that the soft error reporting is sent through module.cpp when an operator is missing; made sure a logview mid is generated with a stack trace: https://www.internalfb.com/intern/logview/details/facebook_android_softerrors/5dd347d1398c1a9a73c804b20f7c2179/?selected-logview-tab=latest.
Error message with context is logged below:
```
soft_error.cpp [PyTorchMobileInference] : Error occured during model running entry point: Could not run 'aten::embedding' with arguments from the 'CPU' backend. 'aten::embedding' is only available for these backends: [BackendSelect, Named, Autograd, Autocast, Batched, VmapMode].
BackendSelect: fallthrough registered at xplat/caffe2/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at xplat/caffe2/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Autograd: fallthrough registered at xplat/caffe2/aten/src/ATen/core/VariableFallbackKernel.cpp:31 [backend fallback]
Autocast: fallthrough registered at xplat/caffe2/aten/src/ATen/autocast_mode.cpp:253 [backend fallback]
Batched: registered at xplat/caffe2/aten/src/ATen/BatchingRegistrations.cpp:317 [backend fallback]
VmapMode: fallthrough registered at xplat/caffe2/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Exception raised from reportError at xplat/caffe2/aten/src/ATen/core/dispatch/OperatorEntry.cpp:261 (m
```
Reviewed By: iseeyuan
Differential Revision: D23428636
fbshipit-source-id: 82d5d9c054300dff18d144f264389402d0b55a8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44082
The automated submodule update is running into some test failures and I am not sure how I can rebase it.
automated submodule update:
https://github.com/pytorch/pytorch/pull/43817
Test Plan: CI tests
Reviewed By: jianyuh
Differential Revision: D23489240
fbshipit-source-id: a49b01786ebf0a59b719a0abf22398e1eafa90af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43734
Following the additional GH comments on the original PR https://github.com/pytorch/pytorch/pull/43307.
ghstack-source-id: 111327130
Test Plan: Run `python test/distributed/test_c10d.py`
Reviewed By: smessmer
Differential Revision: D23380288
fbshipit-source-id: 4b8889341c57b3701f0efa4edbe1d7bbc2a82ced
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44055
There is no functional change here. Another patch will rename NewCriterionTest to CriterionTest.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23482572
Pulled By: gchanan
fbshipit-source-id: de364579067e2cc9de7df6767491f8fa3a685de2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44050
We don't actually turn on the CTCLoss tests since they fail, but this allows you to toggle check_forward_only and for the code to actually run.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23481091
Pulled By: gchanan
fbshipit-source-id: f2a3b0a2dee27341933c5d25f1e37a878b04b9f6
Summary:
This PR adds a new test suite, test_ops.py, designed for generic tests across all operators with OpInfos. It currently has two kinds of tests:
- it validates that the OpInfo has the correct supported dtypes by verifying that unsupported dtypes throw an error and supported dtypes do not
- it runs grad and gradgrad checks on each op and its variants (method and inplace) that has an OpInfo
This is a significant expansion and simplification of the current autogenerated autograd tests, which spend considerable time processing their inputs. As an alternative, this PR extends OpInfos with "SampleInputs" that are much easier to use. These sample inputs are analogous to the existing tuples in `method_tests()`.
Future PRs will extend OpInfo-based testing to other uses of `method_tests()`, like test_jit.py, to ensure that new operator tests can be implemented entirely using an OpInfo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43451
Reviewed By: albanD
Differential Revision: D23481723
Pulled By: mruberry
fbshipit-source-id: 0c2cdeacc1fdaaf8c69bcd060d623fa3db3d6459
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44073
We don't have proper support for it yet on the NNC and JIT IR->NNC lowering side.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23487905
Pulled By: ZolotukhinM
fbshipit-source-id: da0da7478fc8ce7b455176c95d8fd610c94352c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43961
Currently we're removing prim::profile nodes and embed the type info
directly in the IR right before the fuser, because it is difficult to
fuse in a presence of prim::profile nodes. It turns out that BatchMM has
a similar problem: it doesn't work when there are prim::profile nodes in
the graph. These two passes run next to each other, so we could simply
remove prim::profile nodes slightly earlier: before the BatchMM pass.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23453266
Pulled By: ZolotukhinM
fbshipit-source-id: 92cb50863962109b3c0e0112e56c1f2cb7467ff1
Summary:
- This test is very fast and very important, so it makes no sense to mark it as slowTest
- This test should also run on CUDA
- This test should check alpha and beta support
- This test should check `out=` support
- manual computation should use list instead of index_put because list is much faster
- precision for TF32 needs to be fixed. Will do it in future PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43831
Reviewed By: ailzhang
Differential Revision: D23435032
Pulled By: ngimel
fbshipit-source-id: d1b8350addf1e2fe180fdf3df243f38d95aa3f5a
Summary:
Move `multigpu`, `noavx` and `slow` test configs to CUDA-10.2, but keep them a master only tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44057
Reviewed By: walterddr, seemethere
Differential Revision: D23482732
Pulled By: malfet
fbshipit-source-id: a6b050701cbc1d8f176ebb302f7f5076a78f1f58
Summary:
I usually get this extra "legacy_conv2d.pt" file in my git "changed files". I found that this is from tests with `download_file`
42c895de4d/test/test_nn.py (L410-L426)
and its definition (see `data_dir` for download output location)
f17d7a5556/torch/testing/_internal/common_utils.py (L1338-L1357)
I assume a file "generated" by test should not be tracked in VCS? Also, if the file is updated on the server, users may still use the old version of it if they have already downloaded that before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43941
Reviewed By: anjali411
Differential Revision: D23451264
Pulled By: ezyang
fbshipit-source-id: 7fcdfb24685a7e483914cc46b3b024df798bf7f7
Summary:
To avoid conflicts, this PR does not remove all imports. More are coming in further PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43808
Reviewed By: wanchaol
Differential Revision: D23436675
Pulled By: ailzhang
fbshipit-source-id: ccc21a1955c244f0804277e9e47e54bfd23455cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43972
It is useful when debugging a bug to disable NNC backend to see whether
the bug is there or in the fuser logic.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23455624
Pulled By: ZolotukhinM
fbshipit-source-id: f7c0452a29b860afc806e2d58acf35aa89afc060
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44001
This is to align with the naming in numpy and in
https://github.com/pytorch/pytorch/pull/43092
Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_aminmax_cpu_float32
python test/test_torch.py TestTorchDeviceTypeCUDA.test_aminmax_cuda_float32
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23465298
fbshipit-source-id: b599035507156cefa53942db05f93242a21c8d06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42894
Continuing the min_max kernel implementation, this PR adds the
CPU path when a dim is specified. Next PR will replicate for CUDA.
Note: after a discussion with ngimel, we are taking the fast path
of calculating the values only and not the indices, since that is what
is needed for quantization, and calculating indices would require support
for reductions on 4 outputs which is additional work. So, the API
doesn't fully match `min.dim` and `max.dim`.
Flexible on the name, let me know if something else is better.
Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_minmax_cpu_float32
```
performance: seeing a 49% speedup on a min+max tensor with similar shapes
to what we care about for quantization observers (bench:
https://gist.github.com/vkuzo/b3f24d67060e916128a51777f9b89326). For
other shapes (more dims, different dim sizes, etc), I've noticed a
speedup as low as 20%, but we don't have a good use case to optimize
that so perhaps we can save that for a future PR.
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23086798
fbshipit-source-id: b24ce827d179191c30eccf31ab0b2b76139b0ad5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42868
Adds a CUDA kernel for the _min_max function.
Note: this is a re-submit of https://github.com/pytorch/pytorch/pull/41805,
was faster to resubmit than to resurrect that one. Thanks to durumu
for writing the original implementation!
Future PRs will add index support, docs, and hook this up to observers.
Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```
Basic benchmarking shows a 50% reduction in time to calculate min + max:
https://gist.github.com/vkuzo/b7dd91196345ad8bce77f2e700f10cf9
TODO
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23057766
fbshipit-source-id: 70644d2471cf5dae0a69343fba614fb486bb0891
Summary: Add cost inference for AdaGrad and RowWiseSparseAdagrad
Test Plan:
Ran `buck test caffe2/caffe2/python/operator_test:adagrad_test`
Result: https://our.intern.facebook.com/intern/testinfra/testrun/5629499567799494
Reviewed By: bwasti
Differential Revision: D23442607
fbshipit-source-id: 67800fb82475696512ad19a43067774247f8b230
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43270
`torch.conj` is a very commonly used operator for complex tensors, but it's mathematically a no op for real tensors. Switching to tensorflow gradients for complex tensors (as discussed in #41857) would involve adding `torch.conj()` to the backward definitions for a lot of operators. In order to preserve autograd performance for real tensors and maintain numpy compatibility for `torch.conj`, this PR updates `torch.conj()` which behaves the same for complex tensors but performs a view/returns `self` tensor for tensors of non-complex dtypes. The documentation states that the returned tensor for a real input shouldn't be mutated. We could perhaps return an immutable tensor for this case in future when that functionality is available (zdevito ezyang ).
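For example, after this change (illustrative; real inputs return `self`/a view rather than a newly conjugated tensor):
```
import torch

c = torch.tensor([1. + 2.j, 3. - 4.j])
torch.conj(c)   # tensor([1.-2.j, 3.+4.j]) -- true conjugation for complex dtypes

r = torch.tensor([1., 2.])
torch.conj(r)   # mathematically a no-op for real dtypes; the returned tensor
                # shares memory with `r` and should not be mutated
```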
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460493
Pulled By: anjali411
fbshipit-source-id: 3b3bf0af55423b77ff2d0e29f5d2c160291ae3d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43927
Adds uninitialized placeholders for various state
used throughout the Quantizer object, with documentation
on what they are. No logic change.
Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFx
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23439473
fbshipit-source-id: d4ae83331cf20d81a7f974f88664ccddca063ffc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43248
We add support for __torch_function__ overrides for C++ custom ops. The logic is the same as for the other components, like torch.nn.Module.
Refactored some code a little bit to make it reusable.
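As a reminder of how the protocol looks from Python, here is a generic sketch of `__torch_function__` (the class and op used are illustrative; the same mechanism is what this change extends to custom ops exposed under `torch.ops`):
```
import torch

class LoggingBox:
    """Wraps a tensor and logs every torch function applied to it."""
    def __init__(self, t):
        self.t = t

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"intercepted {func.__name__}")
        # Unwrap any LoggingBox arguments and dispatch to the original op.
        unwrapped = [a.t if isinstance(a, LoggingBox) else a for a in args]
        return func(*unwrapped, **kwargs)

x = LoggingBox(torch.randn(3))
torch.add(x, torch.ones(3))   # prints "intercepted add", returns a plain tensor
```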
Test Plan: buck test //caffe2/test:fx -- test_torch_custom_ops
Reviewed By: bradleyhd
Differential Revision: D23203204
fbshipit-source-id: c462a86e407e46c777171da32d7a40860acf061e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In some cases we can't process an op with a fast list CUDA kernel, but we can still fall back to a regular for-loop in which the op is applied to each tensor individually through the dispatch mechanism. A few checks decide whether the op is performed via the 'fast' or 'slow' path.
To take the fast route:
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
----------------
**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23331894
Pulled By: izdeby
fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
Summary:
Previously when merging a node without a subgraph, we would merge the node's outputs to the corresponding subgraph values, but when merging a node with a subgraph the node's outputs would be absent in the value mapping. This PR makes it so they are included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43988
Reviewed By: ZolotukhinM
Differential Revision: D23462116
Pulled By: eellison
fbshipit-source-id: 232c081261e9ae040df0accca34b1b96a5a5af57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43974
We already install devtoolset7 in our docker images for binary builds
and tclsh shouldn't be needed since we're not relying on unbuffer
anymore
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D23462531
Pulled By: seemethere
fbshipit-source-id: 83cbb8b0782054f0b543dab8d11fa6ac57685272
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43711
This makes them available in forward if needed.
No change to the file content, just a copy-paste.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23454146
Pulled By: albanD
fbshipit-source-id: 6269a4aaf02ed53870fadf8b769ac960e49af195
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43914
Renames `matches` function to `is_match`, since there is also
a list named `matches` we are passing around in `Quantizer`,
and would be good to decrease name conflicts.
Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23435601
fbshipit-source-id: 394af11e0120cfb07dedc79d5219247330d4dfd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43910
Adds a debug function to get a representation of all nodes in the
graph, such as
```
name           op       target          args                kwargs
x              plchdr   x               ()                  {}
linear_weight  gt_prm   linear.weight   ()                  {}
add_1          cl_fun   <bi_fun add>    (x, linear_weight)  {}
linear_1       cl_mod   linear          (add_1,)            {}
relu_1         cl_meth  relu            (linear_1,)         {}
sum_1          cl_fun   <bi_meth sum>   (relu_1,)           {'dim': -1}
topk_1         cl_fun   <bi_meth topk>  (sum_1, 3)          {}
```
using only Python STL. This is useful for printing internal state of
graphs when working on FX code.
Has some on-by-default logic to shorten things so that node reprs for
toy models and unit tests fit into 80 chars.
Flexible on function name and location, I care more that this is
accessible from both inside PT as well as from debug scripts which
are not checked in.
Test Plan:
see
https://gist.github.com/vkuzo/ed0a50e5d6dc7442668b03bb417bd603 for
example usage
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23435029
fbshipit-source-id: 1a2df797156a19cedd705e9e700ba7098b5a1376
Summary:
MKLDNN linear incorrectly assumes that bias is defined and will fail for no-bias calls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43703
Reviewed By: glaringlee
Differential Revision: D23373182
Pulled By: bwasti
fbshipit-source-id: 1e817674838a07d237c02eebe235c386cf5b191e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43954
CriterionTest is basically dead -- see https://github.com/pytorch/pytorch/pull/43769 and https://github.com/pytorch/pytorch/pull/43776.
The only exception is the cpp parity test, but the difference there doesn't actually have any effect -- the get_target has unpack=True, but all examples don't require unpacking (I checked).
As a pre-requisite for merging these tests, have the cpp parity test start using the NewCriterionTest.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23452144
Pulled By: gchanan
fbshipit-source-id: 5dca1eb0878b882c93431d3b0e880b5bb1764522
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43853
Add QPL logging for mobile module's metadata
ghstack-source-id: 111113492
(Note: this ignores all push blocking failures!)
Test Plan:
- CI
- Load the model trained by `mobile_model_util.py`
- Local QPL logger standard output.
{F319012106}
Reviewed By: xcheng16
Differential Revision: D23417304
fbshipit-source-id: 7bc834f39e616be1eccfae698b3bccdf2f7146e5
Summary:
`is_complex_t` is a bad name. For example, in std there is `std::is_same` but not `std::is_same_t`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39906
Reviewed By: mrshenli
Differential Revision: D22665013
Pulled By: anjali411
fbshipit-source-id: 4b71745f5e2ea2d8cf5845d95ada4556c87e040d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887
As part of addressing #23232, this PR adds support for `broadcast_object_list`, which is an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to natively support this.
The implementation approach follows a similar approach as https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to be broadcast and the operation is in place, meaning all ranks that are part of the group will have their input list modified to contain the broadcast objects from the src rank.
Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.
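A usage sketch, assuming a process group has already been initialized (e.g. via `init_process_group`):
```
import torch.distributed as dist

if dist.get_rank() == 0:
    objects = [{"lr": 0.1}, "checkpoint-42", 3]   # arbitrary picklable objects
else:
    objects = [None, None, None]                  # placeholder list of the same length

dist.broadcast_object_list(objects, src=0)
# After the call, every rank's `objects` holds the values broadcast from rank 0.
```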
ghstack-source-id: 111180436
Reviewed By: mrshenli
Differential Revision: D23422577
fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
Summary:
This is the common behavior when one builds PyTorch (or any other CUDA project) using CMake, so it should hold true for Torch CUDA extensions as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43931
Reviewed By: ezyang, seemethere
Differential Revision: D23441793
Pulled By: malfet
fbshipit-source-id: 1af392107a94840331014fda970ef640dc094ae4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43892
Run weight observer in the convert function, so user do not need to run calibration
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23429758
fbshipit-source-id: 5bc222e3b731789ff7a86463c449690a58dffb7b
Summary: Update `README.md` for OSS to explain the usage of `--run`, `--export`, and `--summary`.
Test Plan: Test locally.
Reviewed By: malfet
Differential Revision: D23431508
fbshipit-source-id: 368b8dd8cd5099f39c7f5bc985203c417bf7af39
Summary:
No need for compatibility wrapper in Python3+ world
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43981
Reviewed By: seemethere
Differential Revision: D23458325
Pulled By: malfet
fbshipit-source-id: 00f822895625f4867c22376fe558c50316f5974d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43893
Update `README.md` in OSS: provide more examples, starting from the most common use and moving to more specialized uses. Make `README.md` friendlier and more specific.
Test Plan: `README.md` doesn't need test.
Reviewed By: malfet, seemethere
Differential Revision: D23420203
fbshipit-source-id: 1a4c146393fbcaf2893321e7892740edf5d0c248
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43540
selected_mobile_ops.h, which contains the whitelist of root operators, is generated at BUCK build time. It's used for templated selective build when XPLAT_MOBILE_BUILD is defined.
ghstack-source-id: 111014372
Test Plan: CI and BSB
Reviewed By: ljk53
Differential Revision: D22618309
fbshipit-source-id: ddf813904892f99c3f4ae0cd14ce8b27727be5a2
Summary:
This PR needs discussion as it changes the behavior of `DataLoader`. It can be closed if it's not considered a good practice.
Currently, the `DataLoader` spawns a new `_BaseDataLoaderIter` object every epoch.
In the case of the multiprocess DataLoader, every epoch the worker processes are re-created and they make a copy of the original `Dataset` object.
If users want to cache data or do some tracking on their datasets, all their data will be wiped out every epoch. Notice that this doesn't happen when the number of workers is 0, which leads to inconsistencies between the multiprocess and serial data loaders.
This PR keeps the `_BaseDataLoaderIter` object alive and just resets it between epochs, so the workers stay alive along with their own `Dataset` objects. People seem to file issues about this often.
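A sketch of the behavior this enables; in current PyTorch the switch is exposed on `DataLoader` as `persistent_workers` (the flag name is not stated in this summary, and the toy dataset below is just for illustration):
```
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.arange(100).float())
    loader = DataLoader(dataset, batch_size=10, num_workers=2, persistent_workers=True)

    for epoch in range(3):
        for (batch,) in loader:
            pass   # worker processes (and their Dataset copies) survive across epochs
```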
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35795
Reviewed By: ailzhang
Differential Revision: D23426612
Pulled By: VitalyFedyunin
fbshipit-source-id: e16950036bae35548cd0cfa78faa06b6c232a2ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43889
1. rename the input argument `interested-folder` to `interest-only` -- to be consistent with `run-only` and `coverage-only`, and to be shorter
Test Plan: Test on devserver and linux docker.
Reviewed By: malfet
Differential Revision: D23417338
fbshipit-source-id: ce9711e75ca3a1c30801ad6bd1a620f3b06819c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43938
resubmit
Test Plan: unit test included
Reviewed By: mruberry
Differential Revision: D23443493
fbshipit-source-id: 7b68f8f7d1be58bee2154e9a498b5b6a09d11670
Summary:
This PR moves `DispatchKey::Autograd` to an alias dispatch key mapping to `AutogradCPU, AutogradCUDA, AutogradXLA, AutogradOther, AutogradPrivate*` keys.
A few things are handled in this PR:
- Update alias dispatch key mapping and precompute dispatchTable logic
- Move `Autograd` key from `always_included` set to TensorImpl constructor.
- Update `dummyTensor` constructor to take `requires_grad` as optional argument so that it's closer to the real application in op_registration_test.
- Use `BackendSelect` key for both backend select before and after autograd layer. (1 liner in backend_select codegen)
A few planned followups ordered by priority:
- [cleanup] Update `test_dispatch.py` to include testing `Autograd`.
- [cleanup] Add Math alias key and move catchAll to Math. (to remove 2.2 in `computeDispatchTableEntryWithDebug`)
- [new feature] Add support for Math in native_functions.yaml
- [cleanup] Add iterator like functionality to DispatchKeySet
- [cleanup/large] Only add Autograd backend keys when tensor requires grad. (cc: ljk53 ?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43070
Reviewed By: ezyang
Differential Revision: D23281535
Pulled By: ailzhang
fbshipit-source-id: 9ad00b17142e9b83304f63cf599f785500f28f71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43780
The general strategy (illustrated after the list) is:
- unsqueeze the physical inputs enough
- pass the unsqueezed physical inputs to at::matmul
- squeeze any extra dimensions
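A plain-PyTorch sketch of this strategy for the matrix-vector case (this illustrates the idea, it is not the actual batching rule code):
```
import torch

def batched_mv(batched_mat, batched_vec):
    # batched_mat: (B, n, m), batched_vec: (B, m)
    unsqueezed = batched_vec.unsqueeze(-1)          # (B, m, 1) so at::matmul applies
    result = torch.matmul(batched_mat, unsqueezed)  # (B, n, 1)
    return result.squeeze(-1)                       # drop the extra dimension -> (B, n)

B, n, m = 4, 3, 5
mats, vecs = torch.randn(B, n, m), torch.randn(B, m)
expected = torch.stack([torch.mv(a, b) for a, b in zip(mats, vecs)])
assert torch.allclose(batched_mv(mats, vecs), expected)
```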
Test Plan: - `pytest test/test_vmap.py -v`
Reviewed By: ezyang
Differential Revision: D23400842
Pulled By: zou3519
fbshipit-source-id: c550eeb935747c08e3b083609ed307a4374b9096
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43731
After this PR, for each test in TestVmapOperators, we check that the test
never invokes the slow vmap fallback path. The rationale behind this
change is that TestVmapOperators is used for testing batching rules and
we want confidence that the batching rules actually get invoked.
We set this up using a similar mechanism to the CUDA memory leak check:
(bff741a849/torch/testing/_internal/common_utils.py (L506-L511))
This PR also implements the batching rule for `to.dtype_layout`; the new
testing caught that we were testing vmap on `to.dtype_layout` but it
didn't actually have a batching rule implemented!
Test Plan: - New tests in `pytest test/test_vmap.py -v` that test the mechanism.
Reviewed By: ezyang
Differential Revision: D23380729
Pulled By: zou3519
fbshipit-source-id: 6a4b97a7fa7b4e1c5be6ad80d6761e0d5b97bb8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43386
Resolves #43178
ghstack-source-id: 111109716
Test Plan: Added checks to existing unit test and ran it on gpu devserver.
Reviewed By: rohan-varma
Differential Revision: D23216393
fbshipit-source-id: fed5e37fbabbd2ac4a9055b20057fffe3c416c0b
Summary:
During scripting, the combination of shape (or size()) and slice (e.g. x.shape[2:]) produces the following error:
slice() missing 1 required positional argument: 'step'
This happens because aten::slice has 2 signatures:
- aten::slice(Tensor self, int dim, int start, int end, int step) -> Tensor
- aten::slice(t[] l, int start, int end, int step) -> t[]
and when a list is passed instead of a tensor, the second of the two slice signatures is called; since it has 4 instead of 5 arguments, it produces the above exception.
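A minimal repro sketch of the pattern in question; after this fix it scripts and runs (the helper name below is just for illustration):
```
from typing import List

import torch

@torch.jit.script
def trailing_dims(x: torch.Tensor) -> List[int]:
    return x.shape[2:]

print(trailing_dims(torch.zeros(2, 3, 4, 5)))  # [4, 5]
```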
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42935
Reviewed By: houseroad
Differential Revision: D23398435
Pulled By: bzinodev
fbshipit-source-id: 4151a8f878c520cea199b265973fb476b17801fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684
This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.
As part of this change, there is a significant change to the Future API: we
now only accept an exception_ptr as part of setError.
For the example in #42560, the exception trace would now look like:
```
> Traceback (most recent call last):
> File "test_autograd.py", line 6914, in test_preserve_backtrace
> Foo.apply(t).sum().backward()
> File "torch/tensor.py", line 214, in backward
> torch.autograd.backward(self, gradient, retain_graph, create_graph)
> File "torch/autograd/__init__.py", line 127, in backward
> allow_unreachable=True) # allow_unreachable flag
> File "torch/autograd/function.py", line 87, in apply
> return self._forward_cls.backward(self, *args)
> File "test_autograd.py", line 6910, in backward
> raise ValueError("something")
> ValueError: something
```
ghstack-source-id: 111109637
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D23365408
fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43902
Trace back from the weight node until we hit getattr, reconstruct the graph module with the traced nodes,
and run the graph module to pack the weight. Then replace the original chain of ops with the packed weight.
Test Plan:
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23432431
fbshipit-source-id: 657f21a8287494f7f87687a9d618ca46376d3aa3
Summary:
`torch.range` still hasn't been removed way after version 0.5. This PR fixes the warning message. Alternatively, we can remove `torch.range`.
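For context, the behavioral difference that motivates the deprecation (torch.range includes the end point, unlike Python's range and torch.arange):
```
import torch

print(torch.arange(0, 4))  # tensor([0, 1, 2, 3])
print(torch.range(0, 4))   # tensor([0., 1., 2., 3., 4.]), emits a deprecation warning
```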
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43569
Reviewed By: ngimel
Differential Revision: D23408233
Pulled By: mruberry
fbshipit-source-id: 86c4f9f018ea5eddaf80b78a3c54dfa41cfc6fa6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43901
Add similar APIs like eager and graph mode on torchscript
- fuse_fx
- quantize_fx (for both post training static and qat)
- quantize_dynamic_fx (for post training dynamic)
- prepare_fx (for both post training static and qat)
- prepare_dynamic_fx (for post training dynamic)
- convert_fx (for all modes)
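A hedged usage sketch of how these pieces are typically wired together; the import path, signatures, and qconfig dict format below are assumptions and may not match exactly what this diff exposes:
```
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx  # assumed import path

float_model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU()).eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}   # assumed qconfig format

prepared = prepare_fx(float_model, qconfig_dict)     # insert observers
prepared(torch.randn(2, 4))                          # calibrate on sample data
quantized = convert_fx(prepared)                     # produce the quantized module
print(quantized(torch.randn(2, 4)).shape)
```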
Test Plan:
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23432430
fbshipit-source-id: fc99eb75cbecd6ee7a3aa6c8ec71cd499ff7e3c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43789
Since it's a single element. In some cases we may not be able to resize the
buffers.
Test Plan: unit tests
Reviewed By: supriyar
Differential Revision: D23393108
fbshipit-source-id: 46cd7f73ed42a05093662213978a01ee726433eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43847
It seems to slow down two fastRNN benchmarks and does not speed up the
others.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23416197
Pulled By: ZolotukhinM
fbshipit-source-id: 598144561979e84bcf6bccf9b0ca786f5af18383
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43900
The original code assumed that the versioning if was inserted in the
beginning of the graph while in fact it was inserted in the end. We're
now also not removing `profile_optional` nodes and rely on DCE to clean
it up later (the reason we're not doing it is that deletion could
invalidate the insertion point being used).
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23432175
Pulled By: ZolotukhinM
fbshipit-source-id: 1bf55affaa3f17af1bf71bad3ef64edf71a3e3fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43846
We are looking for tensors that are expected to be undefined (according
to the profile info) and should be checking for them to satisfy the
following condition: "not(have any non-zero)", which is equivalent to
"tensor is all zeros". The issue was that we've been checking tensors
that were expected *not* to be undefined.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23416198
Pulled By: ZolotukhinM
fbshipit-source-id: 71e22f552680f68f2af29f427b7355df9b1a4278
Summary:
- Add `torch._C` bindings from `torch/csrc/autograd/init.cpp`
- Renamed `torch._C.set_grad_enabled` to `torch._C._set_grad_enabled`
so it doesn't conflict with torch.set_grad_enabled anymore
This is a continuation of gh-38201. All I did was resolve merge conflicts and finish the annotation of `_DecoratorContextManager.__call__` that ezyang started in the first commit.
~Reverts commit b5cd3a80bbc, which was only motivated by not having `typing_extensions` available.~ (JIT can't be made to understand `Literal[False]`, so keep as is).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43415
Reviewed By: ngimel
Differential Revision: D23301168
Pulled By: malfet
fbshipit-source-id: cb5290f2e556b4036592655b9fe54564cbb036f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43883
Check the result of GCC coverage in OSS is reasonable and ready to ship.
The number of executable lines is not the same between `gcc` and `clang` because of the following reasons:
* Lines like the following are counted in `clang` but not in `gcc`:
1. empty lines or lines with only "{" or "}"
2. some comments (counted in clang but not in gcc)
3. `#define ...` -- not supported by gcc according to the official documentation
* Besides, a statement that spans more than one line is counted as only one executable line in gcc, but as several lines in clang
## Advantage of `gcc` coverage
1. Much faster
- the code coverage tool's runtime is only **4 min** (*amazing!*) with `gcc`, compared to **3 hours!!** with `clang`, to analyze all the tests' artifacts
2. Uses less disk
- `Clang`'s artifacts take up as much as 170G, while `GCC`'s take only 980M
Besides, also update `README.md`.
Test Plan:
Compare the result in OSS `clang` and OSS `gcc` with the same command:
```
python oss_coverage.py --run-only atest test_nn.py --interested-folder=aten
```
----
## GCC
**Summary**
> time: 0:15:45
summary percentage: 44.85%
**Report and Log**
[File Coverage Report](P140825162)
[Line Coverage Report](P140825196)
[Log](P140825385)
------
## CLANG
**Summary**
> time: 0:21:35
summary percentage: 44.08%
**Report and Log**
[File Coverage Report](P140825845)
[Line Coverage Report](P140825923)
[Log](P140825950)
----------
# Run all tests
```
# run all tests and get coverage over Pytorch
python oss_coverage.py
```
**Summary**
> time: 1:27:20. ( time to run tests: 1:23:33)
summary percentage: 56.62%
**Report and Log**
[File Coverage Report](P140837175)
[Log](P140837121)
Reviewed By: malfet
Differential Revision: D23416772
fbshipit-source-id: a6810fa4d8199690f10bd0a4f58a42ab2a22182b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43800
1. Move fbcode-related coverage code to the fb/ folder and add TARGETS so that we can use buck run to run the tool; this also solves the import problem.
2. Write `README.md` to give users guidance about the tool
Test Plan:
On devserver:
```
buck run //caffe2/fb/code_coverage/tool:coverage -- //caffe2/c10:
```
More examples in README.md
Reviewed By: malfet
Differential Revision: D23404988
fbshipit-source-id: 4942cd0e0fb7bd28a5e884d9835b93f00adb7b92
Summary:
Python code coverage tests should not rely on target determination as it will negatively impact the coverage score
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43899
Reviewed By: seemethere
Differential Revision: D23432069
Pulled By: malfet
fbshipit-source-id: 341fcadafaab6bd96d33d23973e01f7d421a6593
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42577
Closes https://github.com/pytorch/pytorch/issues/38174. Implements a join-based API to support training with the DDP module in the scenario where different processes have different no. of inputs. The implementation follows the description in https://github.com/pytorch/pytorch/issues/38174. Details are available in the RFC, but as a summary, we make the following changes:
#### Approach
1) Add a context manager `torch.nn.parallel.distributed.join`
2) In the forward pass, we schedule a "present" allreduce where non-joined processes contribute 1 and joined processes contribute 0. This lets us keep track of joined processes and know when all procs are joined.
3) When a process depletes its input and exits the context manager, it enters "joining" mode and attempts to "shadow" the collective comm. calls made in the model's forward and backward pass. For example, we schedule the same allreduces in the same order as the backward pass, but with zeros.
4) We adjust the allreduce division logic to divide by the effective world size (no. of non-joined procs) rather than the absolute world size to maintain correctness.
5) At the end of training, the last joined process is selected to be the "authoritative" model copy
We also make some misc. changes such as adding a `rank` argument to `_distributed_broadcast_coalesced` and exposing some getters/setters on `Reducer` to support the above changes.
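A hedged usage sketch of the uneven-inputs flow, assuming the context manager ends up reachable as a method on the wrapped module; the exact entry point and arguments are assumptions, not necessarily the API added here:
```
import torch

def train_uneven(ddp_model, optimizer, local_batches):
    # Each rank may have a different number of batches; joined ranks shadow the
    # collective calls so the ranks that still have data do not hang.
    with ddp_model.join():
        for inputs, targets in local_batches:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
            loss.backward()
            optimizer.step()
```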
#### How is it tested?
We have tests covering the following models/scenarios:
- [x] Simple linear model
- [x] Large convolutional model
- [x] Large model with module buffers that are broadcast in the forward pass (resnet). We verify this with a helper function `will_sync_module_buffers` and ensure this is true for ResNet (due to batchnorm)
- [x] Scenario where a rank calls join() without iterating at all, so without rebuilding buckets (which requires collective comm)
- [x] Model with unused params (with find unused parameters=True)
- [x] Scenarios where different processes iterate for varying numbers of iterations.
- [x] Test consistency in tie-breaking when multiple ranks are the last ones to join
- [x] Test that we divide by the effective world_size (no. of unjoined processes)
#### Performance implications
###### Trunk vs PR patched, 32 GPUs, batch size = 32
P50, forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 369/s vs 0.087 368/s
###### join(enable=True) vs without join, 32 GPUs, batch size = 32, even inputs
P50, forward + backward + optimizer batch latency & total QPS: 0.120 265/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.088 364/s vs 0.087 368/s
###### join(enable=False) vs without join, 32 GPUs, batch size = 32, even inputs
P50 forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 368/s vs 0.087 368/s
###### join(enable=True) with uneven inputs (offset = 2000), 32 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.183 174/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.150 213/s vs 0.087 368/s
###### join(enable=True) with uneven inputs ((offset = 2000)), 8 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.104 308/s vs 0.104 308/s
P50 backwards only batch latency & total QPS: 0.070 454/s vs 0.070 459/s
The two uneven-inputs benchmarks above were conducted with 32 GPUs, with 4 GPUs immediately depleting their inputs and entering "join" mode (i.e. not iterating at all) while the other 28 iterated as normal. It looks like there is a pretty significant perf hit for this case when there are uneven inputs and multi-node training. Strangely, when there is a single node (8 GPUs), this does not reproduce.
#### Limitations
1) This is only implemented for MPSD, not SPMD. Per a discussion with mrshenli we want to encourage the use of MPSD over SPMD for DDP.
2) This does not currently work with SyncBN or custom collective calls made in the model's forward pass. This is because the `join` class only shadows the `broadcast` for buffers in the forward pass, the gradient allreduces in the bwd pass, unused parameters reduction, and (optionally) the rebuild buckets broadcasting in the backwards pass. Supporting this will require additional design thought.
3) Has not been tested with the [DDP comm. hook](https://github.com/pytorch/pytorch/issues/39272) as this feature is still being finalized/in progress. We will add support for this in follow up PRs.
ghstack-source-id: 111033819
Reviewed By: mrshenli
Differential Revision: D22893859
fbshipit-source-id: dd02a7aac6c6cd968db882c62892ee1c48817fbe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43742
We can remove all prim::profiles, update the values to their specialized profiled types, and then later guard the input graphs based on the input types of the fusion group. After that we remove specialized tensor types from the graph. This gets rid of having to update the vmap and removes all of the profile nodes in fusing.
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision: D23385206
Pulled By: eellison
fbshipit-source-id: 2c84bd1d1c38df0d7585e523c30f7bd28f399d7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43636
We weren't running inlining in the forward graph of differentiable subgraphs, and we weren't getting rid of all profiles as part of optimization.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23358804
Pulled By: eellison
fbshipit-source-id: 05ede5fa356a15ca385f899006cb5b35484ef620
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635
Intern the symbol, no functional changes. Aliasing needs to be looked at, but that should be done in a separate PR; this PR just changes the symbol.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358806
Pulled By: eellison
fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43634
Because differentiable graphs detach the gradients of input Tensors, creating and inlining differentiable graphs changes the requires_grad property of tensors in the graph. In the legacy executor this was not a problem, as the Fuser would simply ignore the gradient property because it was an invariant that the LegacyExecutor only passed tensors with grad = False. This is not the case with the profiler, as the Fuser does its own guarding.
Updating the type also helps with other typechecks, e.g. the ones specializing the backward, and with debugging the graph.
Other possibilities considered were:
- Fuser/Specialize AutogradZero always guards against requires_grad=False regardless of the profiled type
- Re-profile forward execution of differentiable graph
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23358803
Pulled By: eellison
fbshipit-source-id: b106998accd5d0f718527bc00177de9af5bad5fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43657
We didn't have a test that ensures functions run over RPC while being profiled can use `with record_function()` to profile specific blocks in the function execution. This is useful if, for example, the user wants information about specific blocks in a function run over RPC that is composed of many torch ops and some custom logic.
Currently, this will not work if the function is TorchScripted, since `with record_function()` is not torchscriptable yet. We can add support for this in future PRs so that TorchScript RPC functions can also be profiled like this.
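A sketch of the pattern being tested (RPC initialization is omitted and "worker1" is an illustrative worker name):
```
import torch
from torch.autograd.profiler import record_function

def rpc_target(x):
    with record_function("my_matmul_block"):   # profile just this block
        y = torch.matmul(x, x)
    return y.sum()

# On the caller side, roughly:
# with torch.autograd.profiler.profile() as prof:
#     fut = torch.distributed.rpc.rpc_async("worker1", rpc_target, args=(torch.ones(4, 4),))
#     fut.wait()
# print(prof.key_averages().table())
```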
ghstack-source-id: 111033981
Reviewed By: mrshenli
Differential Revision: D23355215
fbshipit-source-id: 318d92e285afebfeeb2a7896b4959412c5c241d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43655
Pure, unadulterated bikeshed. The good stuff.
This makes things more consistent with ScriptModule.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23401528
Pulled By: suo
fbshipit-source-id: 7dd8396365f118abcd045434acd9348545314f44
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.
First up the good stuff, benchmark before:
```
Column sum Caffe2 NNC Simple Better
(10, 100) 5.7917 9.7037 6.9386 6.0448
(100, 100) 5.9338 14.972 7.1139 6.3254
(100, 10000) 21.453 741.54 145.74 12.555
(1000, 1000) 8.0678 122.75 22.833 9.0778
Row sum Caffe2 NNC Simple Better
(10, 100) 5.4502 7.9661 6.1469 5.5587
(100, 100) 5.7613 13.897 21.49 5.5808
(100, 10000) 21.702 82.398 75.462 22.793
(1000, 1000) 22.527 129 176.51 22.517
```
After:
```
Column sum Caffe2 NNC Simple Better
(10, 100) 6.0458 9.4966 7.1094 6.056
(100, 100) 5.9299 9.1482 7.1693 6.593
(100, 10000) 21.739 121.97 162.63 14.376
(1000, 1000) 9.2374 29.01 26.883 10.127
Row sum Caffe2 NNC Simple Better
(10, 100) 5.9773 8.1792 7.2307 5.8941
(100, 100) 6.1456 9.3155 24.563 5.8163
(100, 10000) 25.384 30.212 88.531 27.185
(1000, 1000) 26.517 32.702 209.31 26.537
```
Speedup is about 3-8x depending on the size of the data (increasing with bigger inputs).
The gap between NNC and Simple is closed or eliminated; the remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.
It required a lot of refactoring and bug fixes on the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made AtomicAddFuser less pessimistic, it will now recognize that if an Add to a buffer is dependent on all used Block and Thread vars then it has no overlap and does not need to be atomic. This allows registerization to apply to these stores.
* Fixed PrioritizeLoad mutator so that it does not attempt to separate the Store and Load to the same buffer (i.e. reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where when adding a default initializer statement it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where Allocate statements logic was inverted to be replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878
Reviewed By: glaringlee
Differential Revision: D23382499
Pulled By: nickgg
fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43728
Trace back from the weight node until we hit getattr, reconstruct the graph module with the traced nodes,
and run the graph module to pack the weight. Then replace the original chain of ops with the packed weight.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23385090
fbshipit-source-id: 11341f0af525a02ecec36f163a9cd35dee3744a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43581
Add similar APIs like eager and graph mode on torchscript
- fuse_fx
- quantize_fx (for both post training static and qat)
- quantize_dynamic_fx (for post training dynamic)
- prepare_fx (for both post training static and qat)
- prepare_dynamic_fx (for post training dynamic)
- convert_fx (for all modes)
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23385091
fbshipit-source-id: b789e54e1a0f3af6b026fd568281984e253e0433
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42629
How to approach reviewing this diff:
- The new codegen itself lives in `tools/codegen`. Start with `gen.py`, then read `model.py` and then the `api/` folder. The comments at the top of the files describe what is going on. The CLI interface of the new codegen is similar to the old one, but (1) it is no longer necessary to explicitly specify cwrap inputs (and now we will error if you do so) and (2) the default settings for source and install dir are much better; to the extent that if you run the codegen from the root source directory as just `python -m tools.codegen.gen`, something reasonable will happen.
- The old codegen is (nearly) entirely deleted; every Python file in `aten/src/ATen` was deleted except for `common_with_cwrap.py`, which now permanently finds its home in `tools/shared/cwrap_common.py` (previously cmake copied the file there), and `code_template.py`, which now lives in `tools/codegen/code_template.py`. We remove the copying logic for `common_with_cwrap.py`.
- All of the inputs to the old codegen are deleted.
- Build rules now have to be adjusted to not refer to files that no longer exist, and to abide by the (slightly modified) CLI.
- LegacyTHFunctions files have been generated and checked in. We expect these to be deleted as these final functions get ported to ATen. The deletion process is straightforward; just delete the functions of the ones you are porting. There are 39 more functions left to port.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D23183978
Pulled By: ezyang
fbshipit-source-id: 6073ba432ad182c7284a97147b05f0574a02f763
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43405.
This pull request adds a feature of printing all tracebacks if a `detect_anomaly` mode detects `nan` in nested backward operations.
The way I did it is by assigning a node as a parent to all nodes it produces during its backward calculation. Then if one of the children produces `nan`, it will print the traceback from the parent and grandparents (if any).
The parent is assigned in `parent_node_` member in `Node` class which is accessible in C++ by function `node->parent()` and in Python by `node.parent_function`.
A node has a parent iff:
1. it is created from a backward operation, and
2. created when anomaly mode and grad mode are both enabled.
An example of this feature:
```
import torch

def example():
    x = torch.tensor(1.0, requires_grad=True)
    y = torch.tensor(1e-8, requires_grad=True)  # small to induce nan in n-th backward
    a = x * y
    b = x * y
    z1 = a / b  # can produce nan in n-th backward as long as https://github.com/pytorch/pytorch/issues/43414 is unsolved
    z = z1 * z1
    gy , = torch.autograd.grad( z , (y,), create_graph=True)
    gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
    gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
    gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
    return gy4

with torch.autograd.detect_anomaly():
    gy4 = example()
```
with output:
```
example.py:16: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
with torch.autograd.detect_anomaly():
/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning: Error detected in DivBackward0. Traceback of forward call that caused the error:
File "example.py", line 17, in <module>
gy4 = example()
File "example.py", line 12, in example
gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
return Variable._execution_engine.run_backward(
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:61.)
return Variable._execution_engine.run_backward(
/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:
Traceback of forward call that induces the previous calculation:
File "example.py", line 17, in <module>
gy4 = example()
File "example.py", line 11, in example
gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
return Variable._execution_engine.run_backward(
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
return Variable._execution_engine.run_backward(
/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:
Traceback of forward call that induces the previous calculation:
File "example.py", line 17, in <module>
gy4 = example()
File "example.py", line 8, in example
z1 = a / b # can produce nan in n-th backward as long as https://github.com/pytorch/pytorch/issues/43414 is unsolved
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
return Variable._execution_engine.run_backward(
Traceback (most recent call last):
File "example.py", line 17, in <module>
gy4 = example()
File "example.py", line 13, in example
gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
return Variable._execution_engine.run_backward(
RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.
```
cc & thanks to albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43626
Reviewed By: malfet
Differential Revision: D23397499
Pulled By: albanD
fbshipit-source-id: aa7435ec2a7f0d23a7a02ab7db751c198faf3b7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43664
This PR implements the test runner for batched gradient computation with
vmap. It also implements the batching rule for sigmoid_backward and
tests that one can compute batched gradients with sigmoid (and batched
2nd gradients).
Test Plan: - New tests: `python test/test_vmap.py -v`
Reviewed By: ezyang
Differential Revision: D23358555
Pulled By: zou3519
fbshipit-source-id: 7bb05b845a41b638b7cca45a5eff1fbfb542a51f
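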
Summary:
The conversion from a torch operator to an ONNX operator often requires the input rank/dtype/shape to be known. Previously, the conversion depended on the tracer to provide this info, leaving a gap in the conversion of scripted modules.
We are extending the export with support from onnx shape inference. If enabled, onnx shape inference will be called whenever an onnx node is created. This is the first PR introducing the initial look of the feature. More and more cases will be supported following this PR.
* Added pass to run onnx shape inference on a given node. The node has to have namespace `onnx`.
* Moved helper functions from `export.cpp` to a common place for re-use.
* This feature is currently experimental, and can be turned on through flag `onnx_shape_inference` in internal api `torch.onnx._export`.
* Currently skipping ONNX Sequence ops, If/Loop and ConstantOfShape due to limitations. Support will be added in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40628
Reviewed By: mrshenli
Differential Revision: D22709746
Pulled By: bzinodev
fbshipit-source-id: b52aeeae00667e66e0b0c1144022f7af9a8b2948
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43539
Move the two source files out of the base internal mobile library to the app level. Make it ready for app-based selective build. Opensource build should not be affected. The file list change in build_variables.bzl affects internal build only.
ghstack-source-id: 111006135
Test Plan: CI
Reviewed By: ljk53
Differential Revision: D23287661
fbshipit-source-id: 9b2d688544e79e0fca9c84730ef0259952cd8abe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43676
This is one part of https://github.com/pytorch/pytorch/issues/41574 to
ensure we consolidate everything around ivalue::Future.
I've removed the use of torch/csrc/utils/future.h from the autograd engines and
used ivalue::Future instead.
ghstack-source-id: 110895545
Test Plan: waitforbuildbot.
Reviewed By: albanD
Differential Revision: D23362415
fbshipit-source-id: aa109b3f8acf0814d59fc5264a85a8c27ef4bdb6
Summary:
This PR adds API to package unoptimized/fallback blocks as function calls. It's mainly meant to be used by TensorExpressionsFuser and SpecializeAutogradZero passes as both specialize the original graph but would also like to provide a fallback path in case the assumptions under which the graph was specialized do not hold for some inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43274
Reviewed By: malfet
Differential Revision: D23406961
Pulled By: Krovatkin
fbshipit-source-id: ef21fc9ad886953461b09418d02c75c58375490c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647
Nothing fancy, just a basic implementation of the graph executor without using a stack machine.
Reviewed By: bwasti
Differential Revision: D23208413
fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43481
Apply OperatorGenerator for prim and special operator registration. It does not affect the existing build by default. However, if a whitelist of operators exists, only the operators in the whitelist will be registered. It has the potential to save up to 200 KB of binary size, depending on the usage.
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D23287251
Pulled By: iseeyuan
fbshipit-source-id: 3ca39fbba645bad8d69e69195f3680e4f6d633c5
Summary:
Fixed grammatical errors and punctuation so that it can be more understandable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43779
Reviewed By: ZolotukhinM
Differential Revision: D23407849
Pulled By: malfet
fbshipit-source-id: 09c064ce68d0f37f8023c2ecae8775fc00541a2c
Summary:
This is useful in case we want to store binary files using the `ScriptModule.save(..., _extra_files=...)` functionality. With python3 we can just use bytes and not bother about it.
I had to do a copy-paste from the pybind sources; maybe we should upstream it, but that would mean adding a bunch of template arguments to `bind_map`, which is a bit untidy.
Let me know if there's a better place to park this function (it seems to be the only invocation of `bind_map`, so I put it in the same file).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43241
Reviewed By: zdevito
Differential Revision: D23205244
Pulled By: dzhulgakov
fbshipit-source-id: 8f291eb4294945fe1c581c620d48ba2e81b3dd9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43310
In this diff, we prepared some example DDP communication hooks [#40848](https://github.com/pytorch/pytorch/pull/40848):
1. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If a user registers this hook, DDP results are expected to be the same as in the case where no hook was registered. Hence, this won't change the behavior of DDP, and users can use it as a reference or modify this hook to log useful information or for any other purpose without affecting DDP behavior. (A hedged sketch of this hook's shape follows this list.)
2. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.
3. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, called ``decompress``, converts the aggregated result back to ``float32`` and takes the mean.
4. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send scale and zero_point (two floats per rank) before the quantized tensors.
5. `quantization_perchannel_hook` does quantization per channel, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors, such as a tensor that contains 6 million floats, dividing it into smaller channels (512-float chunks) and quantizing each independently may significantly increase the resolution and result in lower error.
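A hedged re-sketch of the allreduce hook's shape, referenced from item 1 above; the `GradBucket` accessor name used here is an assumption and may differ from the real API:
```
import torch.distributed as dist

def allreduce_mean_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    tensor = bucket.get_tensors()[0]  # assumed accessor for the bucket's flattened gradients
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    # Once the allreduce finishes, divide by the world size to take the mean.
    return fut.then(lambda f: [f.value()[0].div_(world_size)])

# Registration on a DDP-wrapped model:
# ddp_model.register_comm_hook(state=None, hook=allreduce_mean_hook)
```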
ghstack-source-id: 110923269
Test Plan:
```
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
Couldn't download test skip set, leaving all tests enabled...
.....
----------------------------------------------------------------------
Ran 4 tests in 26.724s
OK
```
Internal testing:
```
buck run mode/dev-nosan //caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks
```
Reviewed By: malfet
Differential Revision: D22937999
fbshipit-source-id: 274452e7932414570999cb978ae77a97eb3fb0ec
Summary:
It's useful if we add additional attributes to nodes in the graph - it's easier to set the attribute on all nodes, even if the value would happen to be None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43432
Reviewed By: jamesr66a
Differential Revision: D23276433
Pulled By: dzhulgakov
fbshipit-source-id: c69e7cb723bbbb4dba3b508a3d6c0e456fe610df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43447
Two main better-engineering motivations to run all FutureNCCL callbacks on a dedicated stream:
1. Each time a then callback was called, we would get a stream from the pool and run the callback on that stream. If we observe the stream traces using that approach, we would see a lot of streams and debugging would become more complicated. If we have a dedicated stream to run all then callback operations, the trace results will be much cleaner and easier to follow.
2. getStreamFromPool may eventually return the default stream or a stream that is used for other operations. This can cause slowdowns.
Unless then callback takes longer than preceding allreduce, this approach will be as performant as the previous approach.
ghstack-source-id: 110909401
Test Plan:
Perf trace runs to validate the desired behavior:
See the dedicated stream 152 is running the then callback operations:
{F299759342}
I run pytorch.benchmark.main.workflow using resnet50 and 32 GPUs registering allreduce with then hook.
See f213777896 [traces](https://www.internalfb.com/intern/perfdoctor/results?run_id=26197585)
After updates, same observation: see f214890101
Reviewed By: malfet
Differential Revision: D23277575
fbshipit-source-id: 67a89900ed7b70f3daa92505f75049c547d6b4d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43086
This PR changes the format of `ConvPackedParam` in a nearly backwards-compatible way:
* a new format is introduced which has more flexibility and a lower on-disk size
* custom pickle functions are added to `ConvPackedParams` which know how to load the old format
* the custom pickle functions are **not** BC because the output type of `__getstate__` has changed. We expect this to be acceptable as no user flows are actually broken (loading a v1 model with v2 code works), which is why we whitelist the failure.
Test plan (TODO finalize):
```
// adhoc testing of saving v1 and loading in v2: https://gist.github.com/vkuzo/f3616c5de1b3109cb2a1f504feed69be
// test that loading models with v1 conv params format works and leads to the same numerics
python test/test_quantization.py TestSerialization.test_conv2d_graph
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph
// test that saving and loading models with v2 conv params format works and leads to same numerics
python test/test_quantization.py TestSerialization.test_conv2d_graph_v2
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph_v2
// TODO before land:
// test numerics for a real model
// test legacy ONNX path
```
Note: this is a newer copy of https://github.com/pytorch/pytorch/pull/40003
Test Plan: Imported from OSS
Reviewed By: dreiss
Differential Revision: D23347832
Pulled By: vkuzo
fbshipit-source-id: 06bbe4666421ebad25dc54004c3b49a481d3cc92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43524
1. adds support for testing BC on data format and numerics for graph mode
quantized modules
2. using the above, adds coverage for quantized conv2d on graph mode
Test Plan:
```
python test/test_quantization.py TestSerialization.test_conv2d_nobias
python test/test_quantization.py TestSerialization.test_conv2d_graph
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23335222
fbshipit-source-id: 0c9e93a940bbf6c676c2576eb62fcc725247588b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42531
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.
**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.
**Broadcasting**
At this point we don't support broadcasting.
**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can do it with a regular for-loop where the op will be applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.
---------------
**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from previous [PR](https://github.com/pytorch/pytorch/pull/41554).
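A small usage sketch, assuming the Python binding for the new op is exposed as `torch._foreach_add_` (underscore-prefixed, so private and subject to change):
```
import torch

tensors = [torch.zeros(3) for _ in range(4)]
torch._foreach_add_(tensors, 1.0)   # in-place scalar add across the whole list in one call
print(tensors[0])                   # tensor([1., 1., 1.])
```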
**Tests**
Tested via unit tests
**TODO**
1. Properly handle empty lists
**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23331892
Pulled By: izdeby
fbshipit-source-id: c585b72e1e87f6f273f904f75445618915665c4c
Summary:
Adds two more "missing" NumPy aliases: arctanh and arcsinh, and simplifies the dispatch of other arc* aliases.
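A quick check of the intended behavior (the aliases should dispatch to the existing functions):
```
import torch

x = torch.tensor([0.5, -0.25])
assert torch.equal(torch.arcsinh(x), torch.asinh(x))
assert torch.equal(torch.arctanh(x), torch.atanh(x))
```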
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43762
Reviewed By: ngimel
Differential Revision: D23396370
Pulled By: mruberry
fbshipit-source-id: 43eb0c62536615fed221d460c1dec289526fb23c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43792
This fixes the issue we had with the nightlies not being uploaded
properly, basically what was happening was that `aws s3 cp` doesn't
automatically distinguish between prefixes that are already
"directories" vs a single file with the same name.
This means that if you'd like to upload a file to a "directory" in S3
you need to suffix your destination with a slash.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D23402074
Pulled By: seemethere
fbshipit-source-id: 6085595283fcbbbab0836ccdfe0f8aa2a6abd7c8
Summary:
Add a max/min operator that only return values.
## Some important decision to discuss
| **Question** | **Current State** |
|---------------------------------------|-------------------|
| Expose torch.max_values to python? | No |
| Remove max_values and only keep amax? | Yes |
| Should amax support named tensors? | Not in this PR |
## Numpy compatibility
Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html
| Parameter | PyTorch Behavior |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| `axis`: None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. | Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137) |
| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output. | Same |
| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array. | implemented as `keepdim` |
| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice. | Not implemented in this PR. Better to implement for all reductions in the future. |
| `where`: array_like of bool, optional. Elements to compare for the maximum. | Not implemented in this PR. Better to implement for all reductions in the future. |
**Note from numpy:**
> NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax.
PyTorch has the same behavior
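A small usage sketch of the value-only reduction described above:
```
import torch

x = torch.tensor([[1., 5., 3.],
                  [4., 2., 6.]])
print(torch.amax(x, dim=1))        # tensor([5., 6.]) -- values only, no indices
print(torch.amax(x, dim=(0, 1)))   # tensor(6.) -- multiple dims, like numpy.amax
print(torch.max(x, dim=1).values)  # torch.max also computes indices
```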
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092
Reviewed By: ngimel
Differential Revision: D23360705
Pulled By: mruberry
fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43732.
Requires importing the fft namespace in the C++ API, just like the Python API does, to avoid clobbering the torch::fft function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43749
Reviewed By: glaringlee
Differential Revision: D23391544
Pulled By: mruberry
fbshipit-source-id: d477d0b6d9a689d5c154ad6c31213a7d96fdf271
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43719
Accidentally this slipped through: the with guard did not update the current context.
Test Plan: cpu_caching_allocator_test
Reviewed By: linbinyu
Differential Revision: D23374453
fbshipit-source-id: 1d3ef21cc390d0a8bde98fb1b5c2175b40ab571b
Summary:
1) Manifold raises StorageException when it sees an error: https://fburl.com/diffusion/kit3me8a
2) torch re-raises the exception: https://fburl.com/diffusion/zbw9wmpu
The issue here is that StorageException's first argument is a bool (canRetry), while the re-raise passes a str as the first argument, as in all Python exceptions.
Test Plan:
Existing tests should pass. +
```
In [1]: from manifold.clients.python import StorageException
In [2]: getattr(StorageException, "message", None)
Out[2]: <attribute 'message' of 'manifold.blobstore.blobstore.types.StorageException' objects>
In [3]: getattr(Exception, "message", None) is None
Out[3]: True
```
Reviewed By: haijunz
Differential Revision: D23195514
fbshipit-source-id: baa1667dbba4086db6ec93f009e400611ac9b938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43233
XNNPack is already being used for the convolution2d operation. Add the
ability for it to be used with transpose convolution.
Test Plan: buck run caffe2/test:xnnpack_integration
Reviewed By: kimishpatel
Differential Revision: D23184249
fbshipit-source-id: 3fa728ce1eaca154d24e60f800d5e946d768c8b7
Summary:
This prevents confusing errors when the interpreter encounters some
syntax errors in the middle.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42870
Reviewed By: albanD
Differential Revision: D23269265
Pulled By: ezyang
fbshipit-source-id: 61f62cbe294078ad4a909fa87aa93abd08c26344
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43705
This was causing fb-internal flakiness. I'm surprised that the ASAN
builds don't catch this behavior.
The problem is that dereferencing the end() pointer of a vector is
undefined behavior. This PR fixes one callsite where BatchedFallback
dereferences the end() pointer and adds an assert to make sure another
callsite doesn't do that.
Test Plan:
- Make sure all tests pass (`pytest test/test_vmap.py -v`)
- It's hard to write a new test for this because most of the time this
doesn't cause a crash. It really depends on what lives at the end()
pointer.
Reviewed By: ezyang
Differential Revision: D23373352
Pulled By: zou3519
fbshipit-source-id: 61ea0be80dc006f6d4e73f2c5badd75096f63e56
Summary:
Move `code_coverage_tool` from `experimental` folder to `caffe2/tools` folder.
Not sure if the fb-related code is something we don't want to share with the OSS side. Can reviewers please help me check `fbcode_coverage.py` and the files in the `fbcode/` folder?
Test Plan: Test locally
Reviewed By: malfet
Differential Revision: D23379383
fbshipit-source-id: f6782389ebb1b147eaf6d3664b5955db79d24ff3
Summary:
These started failing after **https://github.com/pytorch/pytorch/pull/43633** for indecipherable reasons; temporarily disable them. The errors on the PRs were
```
Downloading workspace layers
workflows/workspaces/3ca9ca71-7449-4ae1-bb7b-b7612629cc62/0/8607ba99-5ced-473b-b60a-0025b48739a6/0/105.tar.gz - 8.4 MB
Applying workspace layers
8607ba99-5ced-473b-b60a-0025b48739a6
```
which is not too helpful...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43750
Reviewed By: ZolotukhinM
Differential Revision: D23388060
Pulled By: eellison
fbshipit-source-id: 96afa0160ec948049f3e194787a0a7ddbeb5124a
Summary:
`torch.scatter` allows `src` to be of a different type when `src` is a scalar. This requires an explicit cast op to be inserted in the ONNX graph, because ONNX `ScatterElements` does not allow different types. This PR updates the export of `torch.scatter` with this logic.
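A sketch of the eager-mode semantics the exported graph has to match (an int scalar src written into a float destination):
```
import torch

dst = torch.zeros(3, 5)               # float32 destination
index = torch.tensor([[0, 1, 2]])
out = dst.scatter(1, index, 2)        # scalar src of a different (int) type
print(out.dtype)                      # torch.float32
print(out[0, :3])                     # tensor([2., 2., 2.])
```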
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43440
Reviewed By: hl475
Differential Revision: D23352317
Pulled By: houseroad
fbshipit-source-id: c9eeddeebb67fc3c40ad01def134799ef2b4dea6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41371
**Summary**
This commit enables the use of `torch.no_grad()` in a with item of a
with statement within JIT. Note that the use of this context manager as
a decorator is not supported.
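A sketch of the newly supported pattern (the function below is just an example; the decorator form remains unsupported):
```
import torch

@torch.jit.script
def scaled_copy(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():   # with item inside a scripted function
        y = x * 2
    return y

x = torch.ones(3, requires_grad=True)
print(scaled_copy(x).requires_grad)  # False: computed under no_grad
```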
**Test Plan**
This commit adds a test case to the existing with statements tests for
`torch.no_grad()`.
**Fixes**
This commit fixes #40259.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D22649519
Pulled By: SplitInfinity
fbshipit-source-id: 7fa675d04835377666dfd0ca4e6bc393dc541ab9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42708
Add rowwise prune pytorch op.
This operator introduces sparsity to the 'weights' matrix with the help
of the importance indicator 'mask'.
A row is considered important and not pruned if the mask value for that
particular row is 1 (True), and not important otherwise.
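A hedged sketch of the pruning semantics described above, written with plain indexing rather than the new operator (whose exact name and signature are not restated here):
```
import torch

weights = torch.randn(4, 3)
mask = torch.tensor([True, False, True, True])  # per-row importance indicator
pruned = weights[mask]                          # keep only the important rows
print(pruned.shape)                             # torch.Size([3, 3])
```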
Test Plan:
buck test caffe2/torch/fb/sparsenn:test -- rowwise_prune
buck test caffe2/test:pruning
Reviewed By: supriyar
Differential Revision: D22849432
fbshipit-source-id: 456f4f77c04158cdc3830b2e69de541c7272a46d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564
Static dispatch was originally introduced for mobile selective build.
Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23324452
Pulled By: ljk53
fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43570
Add the default op dependency graph to the source tree - use it if the user runs a
custom build in dynamic dispatch mode without providing the graph.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23326988
Pulled By: ljk53
fbshipit-source-id: 5fefe90ca08bb0ca20284e87b70fe1dba8c66084
Summary:
Suggested by Shoaib Meenai, we should use mode/ndk_libcxx to replace mode/gnustl.
This diff updated all build flags for caffe2 and pytorch in aibench. For easy management, I created two mode files in xplat/caffe2/mode and deleted buckconfig.ptmobile.pep.
Test Plan:
caffe2
```
buck run aibench:run_bench -- -b aibench/specifications/models/caffe2/squeezenet/squeezenet.json --remote --devices s9f
```
https://our.intern.facebook.com/intern/aibench/details/433604719423848
full jit
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --platform android/full_jit --framework pytorch --remote --devices SM-G960F-8.0.0-26
```
https://our.intern.facebook.com/intern/aibench/details/189359776958060
lite interpreter
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --platform android --framework pytorch --remote --devices s9f
```
https://our.intern.facebook.com/intern/aibench/details/568178969092066
Reviewed By: smeenai
Differential Revision: D23338089
fbshipit-source-id: 62f4ae2beb004ceaab1f73f4de8ff9e0c152d5ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43633
In the backward graph, _grad_sum_to_size is inserted whenever a possibly broadcasting op is called:
`"aten::_grad_sum_to_size(Tensor(a) self, int[]? size) -> Tensor(a)"`
If a broadcast occurred, a sum is called, otherwise the second input is None and it is a no-op. Most of the time, it's a no-op (in the fast RNNs benchmark > 90% of the time).
We can get rid of this op by profiling the optionality of the second input. I added `prim::profile_optional` to do this, which counts the number of times it saw a None value and the number of times it saw a value present. When specializing the backward graph, we insert checks for values we profiled as None, and in the optimized block can remove the grad_sum_to_size calls that use those values.
In the future we may revisit this when NNC supports reductions and we want to replace grad_sum_to_size with sums as well, but I think this is worth landing now.
Test Plan: Imported from OSS
Reviewed By: bwasti, ZolotukhinM
Differential Revision: D23358809
Pulled By: eellison
fbshipit-source-id: a30a148ca581370789d57ba082d23cbf7ef2cd4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43632
Specialize the backward graph by guarding on the undefinedness of the input tensors. The graph will look like:
```
ty1, ty2, succesful_checks = prim::TypeCheck(...)
if (succesful_checks)
-> optimized graph
else:
-> fallback graph
```
Specializing on the undefinedness of tensors allows us to clean up the
```
if any_defined(inputs):
outputs = <original_computation>
else:
outputs = autograd zero tensors
```
blocks that make up the backward graph, so that we can fuse the original_computation nodes together.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23358808
Pulled By: eellison
fbshipit-source-id: f5bb28f78a4a3082ecc688a8fe0345a8a098c091
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43631
I added a new test for just profiler stuff - I don't think the test should go in test_jit.py. Maybe this should just go in test_tensorexpr_fuser, but I'm not really testing tensorexpr stuff either... LMK
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358810
Pulled By: eellison
fbshipit-source-id: 074238e1b60e4c4a919a052b7a5312b790ad5d82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43630
No functional changes here - just refactoring specialize autograd zero to a class, and standardizing its API to take in a shared_ptr<Graph>
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358805
Pulled By: eellison
fbshipit-source-id: 42e19ef2e14df66b44592252497a47d03cb07a7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43629
We have a few places where we count the size of a block / subgraph - it's nice to have a shared API to ignore operators that are not executed in the optimized graph (will be used when I add a new profiling node in PR ^^)
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358807
Pulled By: eellison
fbshipit-source-id: 62c745d9025de94bdafd9f748f7c5a8574cace3f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43721
We can combine the optimization pass and save_for_mobile to reduce friction. Since the lite interpreter model can also be used in full JIT, I don't think we need the option to save it as a full JIT model.
Also
- improved usage message
- print op list before and after optimization pass
Test Plan:
```
buck run //xplat/caffe2:optimize_for_mobile -- --model=/home/linbin/sparkspot.pt
Building: finished in 12.4 sec (100%) 2597/2597 jobs, 2 updated
Total time: 12.5 sec
pt_operator_library(
name = "old_op_library",
ops = [
"aten::_convolution",
"aten::adaptive_avg_pool2d",
"aten::add_.Tensor",
"aten::batch_norm",
"aten::mul.Tensor",
"aten::relu_",
"aten::softplus",
"aten::sub.Tensor",
],
)
pt_operator_library(
name = "new_op_library",
ops = [
"aten::adaptive_avg_pool2d",
"aten::add_.Tensor",
"aten::batch_norm",
"aten::mul.Tensor",
"aten::relu_",
"aten::softplus",
"aten::sub.Tensor",
"prepacked::conv2d_clamp_run",
],
)
The optimized model for lite interpreter was saved to /home/linbin/sparkspot_mobile_optimized.bc
```
```
buck run //xplat/caffe2:optimize_for_mobile -- --model=/home/linbin/sparkspot.pt --backend=vulkan
```
Reviewed By: kimishpatel
Differential Revision: D23363533
fbshipit-source-id: f7fd61aaeda5944de5bf198e7f93cacf8368babd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43653
When nodes are created without an explicit name, a name is generated for
them based on the target. In these cases, we need to avoid shadowing
builtin names. Otherwise, code like:
```
a.foo.bar
```
results in pretty-printed code like:
```
getattr = a.foo
getattr_1 = getattr.bar
```
While this is technically allowed in Python, it's probably a bad idea,
and more importantly is not supported by TorchScript (where `getattr` is
hardcoded).
This PR changes the name generation logic to avoid shadowing all
builtins and language keywords. We already do this for PyTorch
built-ins, so we just extend that logic. Now the generated code will
look like:
```
getattr_1 = a.foo
getattr_2 = getattr_1.bar
```
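For illustration, here is a minimal Python sketch of that naming rule; the helper name and suffix scheme are hypothetical, not the actual FX implementation:
```python
import builtins
import keyword

def sanitize_name(candidate: str, used: set) -> str:
    # Hypothetical sketch: never emit a bare name that shadows a Python
    # builtin or language keyword; fall back to numbered suffixes until unique.
    if keyword.iskeyword(candidate) or hasattr(builtins, candidate):
        candidate = f"{candidate}_1"
    base, i = candidate, 1
    while candidate in used:
        i += 1
        candidate = f"{base.rsplit('_', 1)[0]}_{i}"
    used.add(candidate)
    return candidate

used: set = set()
print(sanitize_name("getattr", used))  # getattr_1
print(sanitize_name("getattr", used))  # getattr_2
```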
Fixes #43522
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23357420
Pulled By: suo
fbshipit-source-id: 91e9974adc22987eca6007a2af4fb4fe67f192a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43587
Add tests for graph mode quantization on torchvision and make sure it matches
current eager mode quantization
Test Plan:
Imported from OSS
Reviewed By: z-a-f
Differential Revision: D23331253
fbshipit-source-id: 0445a44145d99837a2c975684cd0a0b7d965c8f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43456
Introduce the template OperatorGenerator, which returns an optional Operator. It is null if the templated bool value is false.
RegisterOperators() is updated to take the optional Operator. A null will not be registered.
With this update the selective operator registration can be done at compile time. Tests are added to show an operator can be registered if it's in a whitelist and it will not be registered if it's not in the whitelist.
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D23283563
Pulled By: iseeyuan
fbshipit-source-id: 456e0c72b2f335256be800aeabb797bd83bcf0b3
Summary:
To reduce the chance of conflicts, not all ops are fixed. Ops starting with the letter `f` will be fixed in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43583
Reviewed By: ZolotukhinM
Differential Revision: D23330347
Pulled By: mruberry
fbshipit-source-id: 3387cb1e495faebd16fb183039197c6d90972ad4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42259
**Summary**
This commit modifies IR generation to insert explicit casts that cast
each return value to `Any` when a function is annotated as returning `Any`.
This precludes the failure in type unification (see below) that caused
this issue.
Issue #41962 reported that the use of an `Any` return type in
combination with different code paths returning values of different
types causes a segmentation fault. This is because the exit transform
pass tries to unify the different return types, fails, but silently sets
the type of the if node to c10::nullopt. This causes problems later in
shape analysis when that type object is dereferenced.
**Test Plan**
This commit adds a unit test that checks that a function similar to the
one in #41962 can be scripted and executed.
**Fixes**
This commit fixes #41962.
Differential Revision: D22883244
Test Plan: Imported from OSS
Reviewed By: eellison, yf225
Pulled By: SplitInfinity
fbshipit-source-id: 523d002d846239df0222cd07f0d519956e521c5f
Summary:
fmax/fmin propagate the number if one argument is NaN, which doesn't match the eager mode behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43590
Reviewed By: mruberry
Differential Revision: D23338664
Pulled By: bertmaher
fbshipit-source-id: b0316a6f01fcf8946ba77621efa18f339379b2d0
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
Implement NumPy-like functions `maximum` and `minimum`.
The `maximum` and `minimum` functions compare input tensors element-wise, returning a new tensor with the element-wise maxima/minima.
If one of the elements being compared is a NaN, then that element is returned. Neither `maximum` nor `minimum` supports complex inputs.
This PR also promotes the overloaded versions of torch.max and torch.min, by re-dispatching binary `torch.max` and `torch.min` to `torch.maximum` and `torch.minimum`.
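For example (matching the NaN-propagation behavior described above):
```python
import torch

a = torch.tensor([1.0, float("nan"), 3.0])
b = torch.tensor([2.0, 0.0, float("nan")])

print(torch.maximum(a, b))  # tensor([2., nan, nan])
print(torch.minimum(a, b))  # tensor([1., nan, nan])
```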
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42579
Reviewed By: mrshenli
Differential Revision: D23153081
Pulled By: mruberry
fbshipit-source-id: 803506c912440326d06faa1b71964ec06775eac1
Summary:
Replace `test` with `coverage_test` stage for `pytorch-linux-bionic-py3.8-gcc9` configuration
Add `coverage.xml` to the list of ignored files
Add `codecov.yml` that maps installed pytorch folders back to original locations
Cleanup coverage option utilization in `run_test.py` and adapt it towards combining coverage reports across the runs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43600
Reviewed By: seemethere
Differential Revision: D23351877
Pulled By: malfet
fbshipit-source-id: acf78ae4c8f3e23920a76cce1d50f2821b83eb06
Summary:
These tests are failing on one of my systems that does not have LAPACK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43566
Reviewed By: ZolotukhinM
Differential Revision: D23325378
Pulled By: mruberry
fbshipit-source-id: 5d795e460df0a2a06b37182d3d4084d8c5c8e751
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43554
Move function implementations in the TensorIteratorConfig Class from TensorIterator.h to TensorIterator.cpp to avoid this issue: https://github.com/pytorch/pytorch/issues/43300
Reviewed By: malfet
Differential Revision: D23319007
fbshipit-source-id: 6cc3474994ea3094a294f795ac6998c572d6fb9b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43640
+ added a `self.checkGraphModule` utility function to wrap the common
test assert pattern.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23356262
Pulled By: suo
fbshipit-source-id: a50626dcb01246d0dbd442204a8db5958cae23ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43559
- remove the MKL strided gemm since it was acting weird in some cases and use the plain for-loop gemm for now; this has performance implications but closes the gap for the ctr_instagram_5x model
- reproduced the failure scenario of batchmatmul on ctr_instagram_5x by increasing the dimensions of the inputs
- added an option in netrunner to skip bmm if needed
Test Plan:
- net runner passes with ctr_instagram 5x
- bmm unit test repros the discrepancy fixed
Reviewed By: amylittleyang
Differential Revision: D23320857
fbshipit-source-id: 7d5cfb23c1b0d684e1ef766f1c1cd47bb86c9757
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43307
I identified a bug with the DDP communication hook while I was running accuracy benchmarks: I was getting `loss=nan`.
It looks like when we re-`initialize_bucketviews` with the value of `future_work`, the `bucket_view.copy_(grad)` in `Reducer::mark_variable_ready_dense` was no longer copying the `grads` back into the contents, since `bucket_view` loses any relationship with `contents` after being re-initialized with something else. Because we run multiple iterations, this was causing problems.
I solved this by adding two states for `bucket_view`:
```
// bucket_views_in[i].copy_(grad) and
// grad.copy_(bucket_views_out[i])
// provide convenient ways to move grad data in/out of contents.
std::vector<at::Tensor> bucket_views_in;
std::vector<at::Tensor> bucket_views_out;
```
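As a minimal Python-side illustration of the aliasing issue described above (not the reducer code itself): a view keeps writing through to its base tensor only as long as it is not rebound to a new tensor.
```python
import torch

contents = torch.zeros(4)
bucket_view = contents[:2]             # shares storage with `contents`
bucket_view.copy_(torch.ones(2))
print(contents)                        # tensor([1., 1., 0., 0.])

bucket_view = torch.full((2,), 5.0)    # rebound: no longer aliases `contents`
bucket_view.copy_(torch.ones(2) * 2)
print(contents)                        # still tensor([1., 1., 0., 0.])
```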
I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.
ghstack-source-id: 110728299
Test Plan:
Run `python test/distributed/test_c10d.py`, some perf&accuracy benchmarks.
New tests:
`test_accumulate_gradients_no_sync_allreduce_hook`
`test_accumulate_gradients_no_sync_allreduce_with_then_hook`
Acc benchmark results look okay:
f214188350
Reviewed By: agolynski
Differential Revision: D23229309
fbshipit-source-id: 329470036cbc05ac12049055828495fdb548a082
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43584
1. add `metadata.pkl` to the `.bc` file, which includes the model info that we are interested in
2. load `metadata.pkl` as an attribute `unordered_map<string, string>` in the module
ghstack-source-id: 110730013
Test Plan:
- CI
```
buck build //xplat/caffe2:jit_module_saving
buck build //xplat/caffe2:torch_mobile_core
```
Reviewed By: xcheng16
Differential Revision: D23330080
fbshipit-source-id: 5d65bd730b4b566730930d3754fa1bf16aa3957e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43247
The `torch.cuda.nccl` APIs didn't throw appropriate errors when called
with inputs/outputs of the wrong type, which resulted in some cryptic
errors instead.
Adding some error checks with explicit error messages for these APIs.
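A hedged sketch of the kind of validation this adds (the helper name and message are illustrative, not the exact `torch.cuda.nccl` code):
```python
import torch

def _check_sequence_type(inputs) -> None:
    # Reject anything that is not a list/tuple of tensors with an explicit
    # message instead of letting a cryptic downstream error surface.
    if not isinstance(inputs, (list, tuple)) or not all(
        isinstance(t, torch.Tensor) for t in inputs
    ):
        raise TypeError("Inputs should be a collection of tensors")

_check_sequence_type([torch.zeros(1)])   # OK
# _check_sequence_type(torch.zeros(1))   # raises TypeError
```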
ghstack-source-id: 110683546
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D23206069
fbshipit-source-id: 8107b39d27f4b7c921aa238ef37c051a9ef4d65b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43573
We recently updated the Stella NLU model in D23307228, and the App started to crash with `Following ops cannot be found:{aten::str, }`.
Test Plan: Verified by installing the assistant-playground app on Android.
Reviewed By: czlx0701
Differential Revision: D23325409
fbshipit-source-id: d670242868774bb0aef4be5c8212bc3a3f2f667c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43591
Using 50 randomized inputs instead of 100 doesn't change the balance that much but speeds up test runtime
Test Plan: CI
Reviewed By: orionr, seemethere
Differential Revision: D23332393
fbshipit-source-id: 7a8ff9127ee3e045a83658a7a670a844f3862987
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43468
Closes https://github.com/pytorch/pytorch/issues/41378.
https://github.com/pytorch/pytorch/pull/41973 enhanced the skip decorators to
report the right number of GPUs required, but this information was not passed to
the main process where the message is actually displayed. This PR uses a
`multiprocessing.Manager()` so that the dictionary modification is reflected
correctly in the main process.
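A minimal sketch of the mechanism (the dict keys and message are illustrative): a managed dict is shared between processes, so writes made in a child are visible to the parent.
```python
import multiprocessing as mp

def worker(skip_reasons, rank):
    skip_reasons[rank] = "Need at least 4 CUDA devices"

if __name__ == "__main__":
    with mp.Manager() as manager:
        skip_reasons = manager.dict()   # proxy object shared across processes
        p = mp.Process(target=worker, args=(skip_reasons, 0))
        p.start()
        p.join()
        print(dict(skip_reasons))       # {0: 'Need at least 4 CUDA devices'}
```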
ghstack-source-id: 110684228
Test Plan:
With this diff, we can run a test in such as in https://github.com/pytorch/pytorch/pull/42577 that requires 4 GPUs on a 2 GPU machine, and we get the expected message:
```
test_ddp_uneven_inputs_replicated_error (test_distributed.TestDistBackend) ... skipped 'Need at least 4 CUDA devices'
```
Reviewed By: mrshenli
Differential Revision: D23285790
fbshipit-source-id: ac32456ef3d0b1d8f1337a24dba9f342c736ca18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43182
We should avoid using `deepcopy` on the module because it involves copying the weights.
Comparing the implementation of `c10::ivalue::Object::copy()` vs `c10::ivalue::Object::deepcopy()`, the only difference is `deepcopy` copies the attributes (slots) while `copy` does not.
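The PR itself changes the C++ `IValue` path, but the distinction is the same as in Python-level module copying: a shallow copy shares the parameter storage, while a deep copy clones it. A small Python-side analogy:
```python
import copy
import torch

m = torch.nn.Linear(2, 2)
shallow = copy.copy(m)          # shares the parameter objects with `m`
deep = copy.deepcopy(m)         # clones the weights (extra memory + copy cost)

print(shallow.weight is m.weight)  # True
print(deep.weight is m.weight)     # False
```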
Reviewed By: bwasti
Differential Revision: D23171770
fbshipit-source-id: 3cd711c6a2a19ea31d1ac1ab2703a0248b5a4ef3
Summary:
Fixes gh-42282
This adds a device-mismatch check to `addmm` on CPU and CUDA. It seems the dispatcher always selects the CUDA version here if any of the inputs are on the GPU, so in theory the CPU check is unnecessary, but it's probably better to err on the side of caution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43505
Reviewed By: mruberry
Differential Revision: D23331651
Pulled By: ngimel
fbshipit-source-id: 8eb2f64f13d87e3ca816bacec9d91fe285d83ea0
Summary:
openssh should already be installed on either the CircleCI machines or the
Jenkins workers, so we shouldn't need to install it ourselves in order to
get ssh functionality.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43597
Reviewed By: ezyang
Differential Revision: D23333479
Pulled By: seemethere
fbshipit-source-id: 17a1ad0200a9df7d4818ab1ed44c8488ec8888fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43603
We are in the midst of landing a big rework of the profiling executor, and
benchmarks are expected to fail while we are in this transitional state.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23334818
Pulled By: ZolotukhinM
fbshipit-source-id: 99ff17c6f8ee18d003f6ee76ff0e719cea68c170
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43173
With this change the fuser starts to generate typechecks for inputs of
fusion group. For each fusion group we generate a typecheck and an if
node: the true block contains the fused subgraph, the false block
contains unoptimized original subgraph.
Differential Revision: D23178230
Test Plan: Imported from OSS
Reviewed By: eellison
Pulled By: ZolotukhinM
fbshipit-source-id: f56e9529613263fb3e6575869fdb49973c7a520b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43235
This functionality is needed when we want to not lose track of
nodes/values as we merge and unmerge them into other nodes. For
instance, if we have a side data structure with some meta information
about values or nodes, this new functionality would allow us to keep that
metadata up to date after merging and unmerging nodes.
Differential Revision: D23202648
Test Plan: Imported from OSS
Reviewed By: eellison
Pulled By: ZolotukhinM
fbshipit-source-id: 350d21a5d462454166f8a61b51d833551c49fcc9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43365
We don't have shape inference for them yet.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23253418
Pulled By: ZolotukhinM
fbshipit-source-id: 9c38778b8a616e70f6b2cb5aab03d3c2013b34b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43557
Back out the diff that caused some errors in PyText distributed training
Test Plan: Tested by rayhou who verified reverting the diff works
Differential Revision: D23320238
fbshipit-source-id: caa0fe74404059e336cd95fdb41373f58ecf486e
Summary:
Original commit changeset: f368d00f7bae
Back out "[2/3][lite interpreter] add metadata when saving and loading models for mobile"
D23047144 (e37f871e87)
Pull Request: https://github.com/pytorch/pytorch/pull/43516
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: xcheng16
Differential Revision: D23304639
fbshipit-source-id: 970ca3438c1858f8656cbcf831ffee2c4a551110
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43526
Add tests for graph mode quantization on torchvision and make sure it matches
current eager mode quantization
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23306683
fbshipit-source-id: 30d27e225d4557bfc1d9aa462086e416aa9a9c0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43354
Instead of assigning a thread to an input index for repeating that index, we assign a warp to an index. This helps us avoid the costly uncoalesced memory accesses and branch divergence that occur when each thread repeats the index.
Test Plan: Run trainer to test
Reviewed By: ngimel
Differential Revision: D23230917
fbshipit-source-id: 731e912c844f1d859b0384fcaebafe69cb4ab56a
Summary:
PyTorch uses f-strings in its Python code.
Python support for f-strings started with version 3.6, so using Python 3.5 or older fails the build with the latest release/master.
This patch checks the version of the Python interpreter used for the build and requires it to be 3.6 or higher.
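A minimal sketch of the kind of guard described (the real check lives in the build scripts, not in this exact snippet):
```python
import sys

# Refuse to build with Python older than 3.6, since the codebase uses f-strings.
if sys.version_info < (3, 6):
    sys.exit("PyTorch requires Python 3.6 or later (f-strings are used in the build).")
```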
Signed-off-by: Parichay Kapoor <kparichay@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43105
Reviewed By: glaringlee
Differential Revision: D23301481
Pulled By: malfet
fbshipit-source-id: e9b4f7bffce7384c8ade3b7d131b10cf58f5e8a0
Summary:
https://github.com/pytorch/pytorch/issues/22990 added a multiprocessing_context argument to DataLoader, but a typo in the test causes the wrong DataLoader class to be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43343
Reviewed By: glaringlee
Differential Revision: D23299452
Pulled By: malfet
fbshipit-source-id: 9489c48b83bce36f46d350cad902f7ad96e1eec4
Summary:
- `torch._VF` is a hack to work around the lack of support for `torch.functional` in the JIT
- that hack hides `torch._VF` functions from Mypy
- could be worked around by re-introducing a stub file for `torch.functional`, but that's undesirable
- so instead try to make both happy at the same time: the type ignore comments are needed for Mypy, and don't seem to affect the JIT after excluding them from the `get_type_line()` logic
Encountered this issue while trying to make `mypy` run on `torch/functional.py` in gh-43446.
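A sketch of the pattern in question (the wrapper and its signature are illustrative, not the exact `torch/functional.py` code): the call goes through `torch._VF` for the JIT, and the trailing comment keeps mypy quiet about the hidden module.
```python
import torch

def two_norm(t: torch.Tensor) -> torch.Tensor:
    # torch._VF forwards to torch._C._VariableFunctions, which mypy cannot
    # see; the JIT, however, understands calls routed through it.
    return torch._VF.norm(t, 2)  # type: ignore[attr-defined]

print(two_norm(torch.ones(3)))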
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43454
Reviewed By: glaringlee
Differential Revision: D23305579
Pulled By: malfet
fbshipit-source-id: 50e490693c1e53054927b57fd9acc7dca57e88ca
Summary:
Take care of the state of autocast in `parallel_apply`, so there is no need to decorate model implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43102
Reviewed By: ngimel
Differential Revision: D23294610
Pulled By: mrshenli
fbshipit-source-id: 0fbe0c79de976c88cadf2ceb3f2de99d9342d762
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43531
It's useful for building some tooling out of tree to manipulate zip files in PyTorch-y way
Test Plan: contbuild
Reviewed By: houseroad
Differential Revision: D23277361
fbshipit-source-id: e15fad20e792d1e41018d32fd48295cfe74bea8c
Summary: Per the title, this makes the c2 wrappers safer, as contiguity of torch inputs is not guaranteed
Test Plan: covered by existing tests
Reviewed By: dzhulgakov
Differential Revision: D23310137
fbshipit-source-id: 3fe12abc7e394b8762098d032200778018e5b591
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43027
Format db.h and db.cc using the default formatter.
This change was split off of D22705434.
Test Plan: Wait for sandcastle.
Reviewed By: rohithmenon, marksantaniello
Differential Revision: D23113765
fbshipit-source-id: 3f02d55bfb055bda0fcba5122336fa001562d42e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43445
changed the interface for checkGraphModule to make the arguments more explicit
as requested in https://github.com/pytorch/pytorch/pull/43437
Test Plan:
TestQuantizeFx
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23280586
fbshipit-source-id: 5b5859e326d149a5aacb1d15cbeee69667cc9109
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43155
Update the code_analyzer build.sh script to be able to take additional build flags in the mobile build/analysis
Test Plan:
Checkout associated PR or copy contents of build.sh into PyTorch repo (must be run from root of PyTorch repo)
To run with inclusion of autograd dependencies (note BUILD_MOBILE_AUTOGRAD is still an experimental build flag): `ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseopsfile MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`
Reviewed By: ljk53
Differential Revision: D23065754
fbshipit-source-id: d83a7ad62ad366a84725430ed020adf4d56687bd
Summary:
1. add `metadata.pkl` to the `.bc` file, which includes the model info that we are interested in
2. load `metadata.pkl` as an attribute `unordered_map<string, string>` in the module
Test Plan:
- CI
```
buck build //xplat/caffe2:jit_module_saving
buck build //xplat/caffe2:torch_mobile_core
```
Reviewed By: xcheng16
Differential Revision: D23047144
fbshipit-source-id: f368d00f7baef2d3d15f89473cdb146467aa1e0b
Summary:
Reland of the benchmark code that broke the slow tests because the GPUs were running out of memory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43428
Reviewed By: ngimel
Differential Revision: D23296136
Pulled By: albanD
fbshipit-source-id: 0002ae23dc82f401604e33d0905d6b9eedebc851
Summary:
This doesn't fix any reported issue. We validate ROCm PyTorch on Ubuntu and CentOS. For CentOS, we must modify the test.sh script to let it run on CentOS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42197
Reviewed By: ezyang, ngimel
Differential Revision: D23175669
Pulled By: malfet
fbshipit-source-id: 0da435de6fb17d2ca48e924bec90ef61ebbb5042
Summary:
[Re-review tips: nothing changed other than a type in python_ir.cpp to fix a windows build failure]
* Adds code printing for enum type
* Enhances enum type to include all contained enum names and values
* Adds code parsing for enum type in deserialization
* Enables serialization/deserialization tests in most TestCases (with a few dangling issues to be addressed in later PRs to avoid this PR growing too large)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43460
Reviewed By: albanD
Differential Revision: D23284929
Pulled By: gmagogsfm
fbshipit-source-id: e3e81d6106f18b7337ac3ff5cd1eeaff854904f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43362
Batching rules implemented for addition, subtraction, division, and multiplication.
I refactored the original `mul_batching_rule` into a templated function
so that one can insert arbitrary binary operations into it.
add, sub, rsub, mul, and div all work the same way. However, other
binary operations work slightly differently (I'm still figuring out the
differences and why they're different) so those may need a different
implementation.
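Conceptually, the batching rule for these elementwise binary ops just has to match applying the op per batch element; a quick Python check of that equivalence (not the C++ batching rule itself):
```python
import torch

x = torch.randn(3, 5)   # batch dimension of size 3
y = torch.randn(3, 5)

per_example = torch.stack([xi + yi for xi, yi in zip(x, y)])
assert torch.allclose(per_example, x + y)   # what the add batching rule must produce
```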
Test Plan: - "pytest test/test_vmap.py -v": new tests
Reviewed By: ezyang
Differential Revision: D23252317
Pulled By: zou3519
fbshipit-source-id: 6d36cd837a006a2fd31474469323463c1bd797fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43083
This adds type annotations to all classes, arguments, and returns
for fx. This should make it easier to understand the code, and
encourage users of the library to also write typed code.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23145853
Pulled By: zdevito
fbshipit-source-id: 648d91df3f9620578c1c51408003cd5152e34514
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43082
Fixes all present errors in mypy. Does not try to add annotations everywhere.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23145854
Pulled By: zdevito
fbshipit-source-id: 18e483ed605e89ed8125971e84da1a83128765b7
Summary:
Separate user embeddings and ad embeddings in blobsOrder. New order:
1. meta_net_def
2. preload_blobs
3. user_embeddings (embeddings in remote request only net)
4. ad_embeddings (embeddings in remote other net)
Add a field requestOnlyEmbeddings in meta_net_def to record user_embeddings.
This is for flash verification.
Test Plan:
buck test dper3/dper3_backend/delivery/tests:blob_reorder_test
Run a flow with canary package f211282476
Check the net: n326826, request_only_embeddings are recorded as expected
Reviewed By: ipiszy
Differential Revision: D23008305
fbshipit-source-id: 9360ba3d078f205832821005e8f151b8314f0cf2
Summary:
As part of our continued refactoring of test_torch.py, this takes tests for tensor creation ops like torch.eye, torch.randint, and torch.ones_like and puts them in test_tensor_creation_ops.py. There are three test classes in the new test suite: TestTensorCreation, TestRandomTensorCreation, TestLikeTensorCreation. TestViewOps and tests for construction of tensors from NumPy arrays have been left in test_torch.py. These might be refactored separately into test_view_ops.py and test_numpy_interop.py in the future.
Most of the tests ported from test_torch.py were left as is or received a signature change to make them nominally "device generic." Future work will need to review test coverage and update the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43104
Reviewed By: ngimel
Differential Revision: D23280358
Pulled By: mruberry
fbshipit-source-id: 469325dd1a734509dd478cc7fe0413e276ffb192
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42840
By caching input/output pointers and input parameters we enable the use
of caching allocator and check if we get the same input/output pointers.
If so we skip setup steps.
Test Plan:
python test/test_xnnpack_integration.py
Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D23044585
fbshipit-source-id: ac676cff77f264d8ccfd792d1a540c76816d5359
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42178
This otherwise introduces unnecessary calls to contiguous in the rest of
the network, where certain ops want channels last format.
Test Plan:
Quantization tests.
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22796479
fbshipit-source-id: f1ada1c2eeed84991b9b195120699b943ef6e421
Summary:
Optimize the exported graph to export slice nodes for aten::split when the number of split outputs is fixed. Previously, in some cases these were exported as onnx::SplitToSequence, which is dynamic in the tensor output count.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42744
Reviewed By: houseroad
Differential Revision: D23172465
Pulled By: bzinodev
fbshipit-source-id: 11e432b4ac1351f17e48356c16dc46f877fdf7da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43239
This is an incremental step as part of the process to migrate caffe2 random number generator off of std::mt19937 and to instead use at::mt19937+at::CPUGeneratorImpl. The ATen variants are much more performant (10x faster).
This adds a way to get the CPUContext RandSeed for tail use cases that require a std::mt19937 and borrow the CPUContext one.
Test Plan: This isn't used anywhere within the caffe2 codebase. Compile should be sufficient.
Reviewed By: dzhulgakov
Differential Revision: D23203280
fbshipit-source-id: 595c1cb447290604ee3ef61d5b5fc079b61a4e14
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42008
With the caching allocator we have increased the likelihood of getting the
same input pointer. Given that, we can cache the QNNPACK operator and input
pointer and check whether the input pointer is the same. If so, we can skip
the setup step.
Test Plan:
Ran one of the quantized models to observe
1. No pagefaults due to indirection buffer reallocation.
2. Much less time spent in indirection buffer population.
Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D22726973
fbshipit-source-id: 2dd2a6a6ecf1b5cfa7dde65e384b36a6eab052d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42007
The zero buffer and indirection pointers are allocated on every iteration.
With this refactor we create the op once for the qnnpack conv struct and keep
repopulating the indirection pointer as necessary.
For deconv, much of the op creation was moved outside so that we can avoid creating and
destroying ops every time.
Test Plan:
CI quantization tests.
deconvolution-test
Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D22726972
fbshipit-source-id: 07c03a4e90b397c36aae537ef7c0b7d81d4adc1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42006
This PR introduces a simple CPU caching allocator. This is specifically
intended for mobile use cases and for inference. Nothing in the
implementation prevents it from being used in other contexts;
however, its simplicity may not be suitable everywhere.
It simply tracks allocation by sizes and relies on deterministic
repeatable behavior where allocation of same sizes are made on every
inference.
Thus after the first allocation when the pointer is returned, instead of
returning it to system, allocator caches it for subsequent use.
Memory is freed automatically at the end of the process, or it can be
explicitly freed.
This is enabled at the moment in DefaultMobileCPUAllocator only.
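A minimal Python sketch of the caching scheme described above (the class and method names are hypothetical; the real allocator is C++):
```python
from collections import defaultdict

class SizeCachingAllocator:
    """Cache freed blocks by size and hand them back on the next request of
    the same size, relying on the workload making identical allocations on
    every inference iteration."""

    def __init__(self, raw_alloc, raw_free):
        self._raw_alloc = raw_alloc
        self._raw_free = raw_free
        self._free_blocks = defaultdict(list)   # size -> cached blocks
        self._sizes = {}                         # block id -> size

    def allocate(self, size):
        if self._free_blocks[size]:
            return self._free_blocks[size].pop()
        block = self._raw_alloc(size)
        self._sizes[id(block)] = size
        return block

    def free(self, block):
        # Instead of returning memory to the system, cache it for reuse.
        self._free_blocks[self._sizes[id(block)]].append(block)

    def release_cached(self):
        # Explicitly return everything to the system.
        for blocks in self._free_blocks.values():
            for b in blocks:
                self._raw_free(b)
        self._free_blocks.clear()

alloc = SizeCachingAllocator(raw_alloc=bytearray, raw_free=lambda b: None)
a = alloc.allocate(1024)
alloc.free(a)
assert alloc.allocate(1024) is a   # same block handed back on the next iteration
```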
Test Plan:
android test: cpu_caching_allocator_test
Imported from OSS
Reviewed By: dreiss
Differential Revision: D22726976
fbshipit-source-id: 9a38b1ce34059d5653040a1c3d035bfc97609e6c
Summary:
This patch allows freezing a model that utilizes interfaces. Freezing works
under the user's assumption that the interface module does not alias
any value used in the model.
To enable freezing of such modules, an extra parameter was added:
torch._C._freeze_module(module, ignoreInterfaces = True)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41860
Reviewed By: eellison
Differential Revision: D22670566
Pulled By: bzinodev
fbshipit-source-id: 41197a724bc2dca2e8495a0924c224dc569f62a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42963
* Adds code printing for enum type
* Enhance enum type to include all contained enum names and values
* Adds code parsing for enum type in deserialization
* Enabled serialization/deserialization tests in most TestCases (with a few dangling issues to be addressed in later PRs to avoid this PR growing too large)
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23223281
Pulled By: gmagogsfm
fbshipit-source-id: 716d1866b7770dfb7bd8515548cfe7dc4c4585f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43355
There seem to be some bugs where we cannot guarantee that blobs in `PARAMETERS_BLOB_TYPE_FULLY_REMOTE_REQUEST_ONLY` and `PARAMETERS_BLOB_TYPE_DISAGG_ACC_REMOTE_OTHER` are disjoint, hence we need to work around this.
Also make the message more informative.
Test Plan:
```
flow-cli test-locally --mode opt dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/shared/yinghai/v0_ctr_mbl_feed_1120_onnx.json
```
Reviewed By: ehsanardestani
Differential Revision: D23141538
fbshipit-source-id: 8e311f8fc0e40eff6eb2c778213f78592e6bf079
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42413
When a default argument is added, it does not break backward compatibility (BC) for full-jit, but does break BC for mobile bytecode. For example, https://github.com/pytorch/pytorch/pull/40737. To make bytecode BC in this case, we
1. Introduce kMinSupportedBytecodeVersion. The loaded model version should be between kMinSupportedBytecodeVersion and kProducedBytecodeVersion.
2. If an operator is updated, and we can handle BC, bump the kProducedBytecodeVersion (for example, from 3 to 4).
3. If model version is at the older version of the operator, add an adapter function at loading. For the added default arg, we push this default arg to stack before calling the actual operator function.
Test Plan: Imported from OSS
Reviewed By: xcheng16
Differential Revision: D22898314
Pulled By: iseeyuan
fbshipit-source-id: 90d339f8e1365f4bb178db8db7c147390173372b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43372
So that adding more binary op tests are easier
Test Plan: Imported from OSS
Reviewed By: z-a-f
Differential Revision: D23257046
fbshipit-source-id: 661acd4c38abdc892c9db8493b569226b13e0d0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43293
'docker run' has the capability to use a file for environment variables,
we should prefer to use that instead of having it be sourced per command
in the docker container.
Also opens the door for cutting down on the total number of commands we
need to echo into a script to then execute as a 'docker exec' command.
Plus side of this approach is that the BASH_ENV is persisted through all
of the steps so there's no need to do any exports / worry about
environment variables not persisting through jobs.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23227059
Pulled By: seemethere
fbshipit-source-id: be425aa21b420b9c6e96df8b2177f508ee641a20
Summary:
This diff normalizes for-loops that have non-zero loop starts so that they always start from 0. Given a for-loop, this normalization changes the loop start to 0 and adjusts the loop end and all accesses to the index variable within the loop body appropriately.
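A small Python illustration of the transformation (the IR pass operates on tensor-expression loop nests, not Python code):
```python
B = list(range(20))

A_before = [0] * 20
for i in range(5, 15):            # loop with a non-zero start
    A_before[i] = B[i] * 2

A_after = [0] * 20
for i in range(0, 10):            # normalized: start at 0, shift accesses by the old start
    A_after[i + 5] = B[i + 5] * 2

assert A_before == A_after
```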
This diff also adds tests for several cases of normalization and also tests normalization in conjunction with `splitwithTail` transformation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43179
Reviewed By: nickgg
Differential Revision: D23220534
Pulled By: navahgar
fbshipit-source-id: 64be0c72e4dbc76906084f7089dea81ae07d6020
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43090
To match the floating point module
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23167518
fbshipit-source-id: 29db596e10731be4cfed7efd18f33a0b3dbd0ca7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43088
Create quantized module that the user can use to perform embedding bag quantization
The module uses the EmbeddingPackedParams to store the weights which can be serialized /deserialized
using TorchBind custom classes (C++ get/setstate code)
Following PR will add support for `from_float` to convert from float to quantized module
Test Plan:
python test/test_quantization.py TestDynamicQuantizedModule.test_embedding_bag_api
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23167519
fbshipit-source-id: 029d7bb44debf78c4ef08bfebf267580ed94d033
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43317
Previous version was returning the path with a prefix so subsequent `getRecord` would fail.
There's only one place in PyTorch codebase that uses this function (introduced in https://github.com/pytorch/pytorch/pull/29339 ) and it's unlikely that anyone else is using it - it's not a public API anyway.
Test Plan: unittest
Reviewed By: houseroad
Differential Revision: D23235241
fbshipit-source-id: 6f7363e6981623aa96320f5e39c54e65d716240b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43242
This causes "std::isnan" to produce confusing error messages (std::std has not been declared).
Instead, simply let isnan be exposed in the global namespace.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23214374
Pulled By: ezyang
fbshipit-source-id: 9615116a980340e36376a20f2e546e4d36839d4b
Summary:
They were likely copied from some macro definition, but they do not
belong to macro definitions here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43318
Reviewed By: pbelevich
Differential Revision: D23241526
Pulled By: mrshenli
fbshipit-source-id: e0b5eddfde2c882bb67f56d84ee79281cc5fc941
Summary:
This PR:
- ports the tests in TestTorchMathOps to test_unary_ufuncs.py
- removes duplicative tests for the tested unary ufuncs from test_torch.py
- adds a new test, test_reference_numerics, that validates the behavior of our unary ufuncs vs. reference implementations on empty, scalar, 1D, and 2D tensors that are contiguous, discontiguous, and that contain extremal values, for every dtype the unary ufunc supports
- adds support for skipping tests by regex, this behavior is used to make the test suite pass on Windows, MacOS, and ROCm builds, which have a variety of issues, and on Linux builds (see https://github.com/pytorch/pytorch/issues/42952)
- adds a new OpInfo helper, `supports_dtype`, to facilitate test writing
- extends unary ufunc op info to include reference, domain, and extremal value handling information
- adds OpInfos for `torch.acos` and `torch.sin`
These improvements reveal that our testing has been incomplete on several systems, especially with larger float values and complex values, and several TODOs have been added for follow-up investigations. Luckily when writing tests that cover many ops we can afford to spend additional time crafting the tests and ensuring coverage.
Follow-up PRs will:
- refactor TestTorchMathOps into test_unary_ufuncs.py
- continue porting tests from test_torch.py to test_unary_ufuncs.py (where appropriate)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42965
Reviewed By: pbelevich
Differential Revision: D23238083
Pulled By: mruberry
fbshipit-source-id: c6be317551453aaebae9d144f4ef472f0b3d08eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43286
We need to use this in graph mode quantization on fx
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23221734
fbshipit-source-id: 7c3c3840ce5bdc185b962e081aff1618f4c58e85
Summary:
These were added accidentally (probably by an IDE) during a refactor.
These files have always been Open Source.
Test Plan: CI
Reviewed By: xcheng16
Differential Revision: D23250761
fbshipit-source-id: 4974430c0e28dd3269424d38edb36f4f71508157
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39401
This uses the technique proposed by smessmer in D16451848 to selectively
register operators without codegen. See the Note inside for more
details.
This PR has feature parity with the old selective build apparatus:
it can whitelist schema def()s and impl()s, including on a per-dispatch-key
basis. It has expanded dispatch key whitelisting, whereas previously
manually written registrations were not whitelisted at all. (This
means we may be dropping dispatch keys where we weren't previously!)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D21905593
Pulled By: ezyang
fbshipit-source-id: d4870f800c66be5ce57ec173c9b6e14a52c4a48b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43341
This removes the empty pretty_print(), since it overrides the implementation in the Module base class, which is not the intended design here.
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D23244616
Pulled By: glaringlee
fbshipit-source-id: 94b8dfd3697dfc450f53b3b4eee6e9c13cafba7b
Summary:
Add a ComplexHalf case to toValueType, which fixes how view_as_real and view_as_complex slice a complex tensor into a floating-point one; this is used to generate tensors of random complex values, see:
018b4d7abb/aten/src/ATen/native/DistributionTemplates.h (L200)
Also add the ability to convert a Python complex object to `c10::complex<at::Half>`
Add `torch.half` and `torch.complex32` to the list of `test_randn` dtypes
Fixes https://github.com/pytorch/pytorch/issues/43143
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43279
Reviewed By: mrshenli
Differential Revision: D23230296
Pulled By: malfet
fbshipit-source-id: b4bb66c4c81dd867e72ab7c4563d73f6a4d80a44
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43175
This PR added graph mode quantization on fx: https://github.com/pytorch/pytorch/pull/42741
Currently it matches eager mode quantization for torchvision with static/dynamic/QAT.
The DDP/SyncBN test is still WIP.
Test Plan:
python test/test_quantization.py TestQuantizeFx
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23178602
fbshipit-source-id: 8e7e0322846fbda2cfa79ad188abd7235326f879
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43218
Previously, `vmap(lambda x: x * 0.1)(torch.ones(3))` would return a
float64 tensor(!!). This is because there is a subtle bug in the
batching rule: the batching rule receives:
- A batched tensor for x
- a scalar tensor: tensor(0.1, dtype=torch.float64).
The batching rule decides to expand the scalar tensor to be the same
size as x and then multiplies the two tensors, promoting the output to
be a float64 tensor. However, this isn't correct: we should treat the
scalar tensor like a scalar tensor. When adding a FloatTensor to a
Double scalar tensor, we don't promote the type usually.
Another example of a bug this PR fixes is the following:
`vmap(torch.mul)(torch.ones(3), torch.ones(3, dtype=torch.float64))`
Multiplying a scalar float tensor with a scalar double tensor produces a
float tensor, but the above produced a float64 before this PR due to
mistakenly type-promoting the tensors.
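The expected (non-vmap) promotion behavior that the fixed batching rule now matches:
```python
import torch

x = torch.ones(3)                               # float32
s = torch.tensor(0.1, dtype=torch.float64)      # 0-dim "scalar" tensor

print((x * s).dtype)                                    # torch.float32: scalar tensors don't win promotion
print((x * torch.ones(3, dtype=torch.float64)).dtype)   # torch.float64: full-size tensors do
```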
Test Plan:
- new test: `pytest test/test_vmap.py -v`
- I refactored some tests a bit.
Reviewed By: cpuhrsch
Differential Revision: D23195418
Pulled By: zou3519
fbshipit-source-id: 33b7da841e55b47352405839f1f9445c4e0bc721
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43154
Adds the build flag `BUILD_MOBILE_AUTOGRAD` which toggles whether autograd files should be included for a PyTorch mobile build (default off).
ghstack-source-id: 110369406
Test Plan: CI
Reviewed By: ljk53
Differential Revision: D23061913
fbshipit-source-id: bc3d6683ab17f158990d83e4fae0a011d5adeca1
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41314 among other things.
This PR streamlines layout propagation logic in TensorIterator and removes almost all cases of channels-last hardcoding. The new rules and changes are as follows:
1) behavior of undefined `output` and defined output of the wrong (e.g. 0) size is always the same (before this PR the behavior was divergent)
2) in obvious cases (unary operation on memory-dense tensors, binary operations on memory-dense tensors with the same layout) strides are propagated (before propagation was inconsistent) (see footnote)
3) in other cases the output permutation is obtained as inverse permutation of sorting inputs by strides. Sorting is done with comparator obeying the following rules: strides of broadcasted dimensions are set to 0, and 0 compares equal to anything. Strides of not-broadcasted dimensions (including dimensions of size `1`) participate in sorting. Precedence is given to the first input, in case of a tie in the first input, first the corresponding dimensions are considered, and if that does not indicate that swap is needed, strides of the same dimension in subsequent inputs are considered. See changes in `reorder_dimensions` and `compute_strides`. Note that first inspecting dimensions of the first input allows us to better recover it's permutation (and we select this behavior because it more reliably propagates channels-last strides) but in some rare cases could result in worse traversal order for the second tensor.
These rules are enough to recover previously hard-coded behavior related to channels last, so all existing tests are passing.
In general, these rules will produce intuitive results, and in most cases permutation of the full size input (in case of broadcasted operation) will be recovered, or permutation of the first input (in case of same sized inputs) will be recovered, including cases with trivial (1) dimensions. As an example of the latter, the following tensor
```
x=torch.randn(2,1,3).permute(1,0,2)
```
will produce output with the same stride (3,3,1) in binary operations with 1d tensor. Another example is a tensor of size N1H1 that has strides `H,H,1,1` when contiguous and `H, 1, 1, 1` when channels-last. The output retains these strides in binary operations when another 1d tensor is broadcasted on this one.
Footnote: for ambiguous cases where all inputs are memory dense and have the same physical layout that nevertheless can correspond to different permutations, such as e.g. NC11-sized physically contiguous tensors, regular contiguous tensor is returned, and thus permutation information of the input is lost (so for NC11 channels-last input had the strides `C, 1, C, C`, but output will have the strides `C, 1, 1, 1`). This behavior is unchanged from before and consistent with numpy, but it still makes sense to change it. The blocker for doing it currently is performance of `empty_strided`. Once we make it on par with `empty` we should be able to propagate layouts in these cases. For now, to not slow down common contiguous case, we default to contiguous.
The table below shows how in some cases current behavior loses permutation/stride information, whereas new behavior propagates permutation.
| code | old | new |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|------------------------------------------------------|
| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) | (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1) | (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42922
Reviewed By: ezyang
Differential Revision: D23148204
Pulled By: ngimel
fbshipit-source-id: 670fb6188c7288e506e5ee488a0e11efc8442d1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43277
Docker added native support for GPUs with the release of 19.03 and
CircleCI's infrastructure is all on Docker 19.03 as of now.
This also removes all references to `nvidia-docker` in the `.circleci` fodler.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23217570
Pulled By: seemethere
fbshipit-source-id: af297c7e82bf264252f8ead10d1a154354b24689
Summary:
Globs don't get expanded if you quote them in a bash script...
apparently.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43297
Reviewed By: malfet
Differential Revision: D23227626
Pulled By: seemethere
fbshipit-source-id: d124025cfcaacbfb68167a062ca487c08f7f6bc9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43272
Was missing kwarg-onlyness.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D23215506
Pulled By: ezyang
fbshipit-source-id: 2c282c9a534fa8ea1825c31a24cb2441f0d6b234
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43205
A number of tests that forward to `TestLoadSaveBase.load_save` are all marked as flaky due to them regularly taking much longer to start up than hypothesis' default timeout of 200ms. This diff fixes the problem by removing the timeout for `load_save`. This is alright as these tests aren't meant to be testing the performance of these operators.
I would set the deadline to 60s if I could; however, it appears that the caffe2 GitHub CI uses a different version of hypothesis that doesn't allow using `dateutil.timedelta`, so instead of trying to figure out an approach that works on both, I've just removed the deadline entirely.
I've also tagged all existing tasks WRT these failures.
Differential Revision: D23175752
fbshipit-source-id: 324f9ff034df1ac4874797f04f50067149a6ba48
Summary:
According to the correlation analysis, CUDA-10.1 vs CUDA-11 test failures are quite dependent on each other
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43234
Reviewed By: ezyang, seemethere
Differential Revision: D23204289
Pulled By: malfet
fbshipit-source-id: c53c5f87e55f2dabbb6735a0566c314c204ebc69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869
We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.
The main problem was that by calling `work.wait()` before invoking the `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609), stalling the backward computation.
In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than the default stream.
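A rough Python-level sketch of the stream-handling idea (assuming a CUDA build; variable names are illustrative and the side stream stands in for `at::cuda::getStreamFromPool` — this is not the C++ FutureNCCL code):
```python
import torch

world_size = 2
grads = [torch.ones(4, device="cuda")]

default_stream = torch.cuda.current_stream()
callback_stream = torch.cuda.Stream()           # side stream for the hook's callback
callback_stream.wait_stream(default_stream)     # wait on the collective's work, not vice versa
with torch.cuda.stream(callback_stream):
    for g in grads:
        g.div_(world_size)                      # the `then` callback's work
        g.record_stream(callback_stream)        # keep the memory alive for this stream
default_stream.wait_stream(callback_stream)     # later consumers see the divided grads
```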
ghstack-source-id: 110208431
Test Plan: Run performance benchmark tests to validate performance issue is resolved. Also, `python test/distributed/test_c10d.py` to avoid any odd issues.
Reviewed By: pritamdamania87
Differential Revision: D23055807
fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40084
This is just a nit diff (got merge conflict) while writing some unit-tests.
This move was nit as part of D21628596 (655f1ea176).
Test Plan: buck test test:quantization -- test_qlinear_legacy
Reviewed By: supriyar
Differential Revision: D22065463
fbshipit-source-id: 96ceaa53355349af7157f38b3a6366c550eeec6f
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 685149bbc0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43251
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: YazhiGao
Differential Revision: D23207016
fbshipit-source-id: 54e13b246bb5189260ed11316ddf3d26d52c6b24
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43206
The batching rule is the same as the unary pointwise batching rules:
given a BatchedTensor, we unwrap it, call Tensor.to, and then re-wrap
it.
Test Plan: - `pytest test/test_vmap.py -v -k`
Reviewed By: ezyang
Differential Revision: D23189053
Pulled By: zou3519
fbshipit-source-id: 51b4e41b1cd34bd082082ec4fff3c643002edbaf
Summary:
Should fix binary build issue on Windows, and promptly error out if images are updated to a different version of VC++
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43220
Reviewed By: ezyang
Differential Revision: D23198530
Pulled By: malfet
fbshipit-source-id: 0c80361ad7dcfb7aaffccc306b7d741671bedc11
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43217
Dataclasses is part of standard library in Python 3.7 and there
is a backport for it in Python 3.6. Our code generation will
start using it, so add it to the default library set.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D23214028
Pulled By: ezyang
fbshipit-source-id: a2ae20b9fa8f0b22966ae48506d4ddea203e7459
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4812
- If no compilation options are passed, default to C-step
- Fixed the FC and batchmatmul implementations to match C-step
- Fixed the fakelowp map calling to make sure we use the fp32 substitution of operators
- Updated the accumulator test to make it pass with fp32
Test Plan:
fakelowp tests
glow/test/numerics
net_runner
Reviewed By: jfix71
Differential Revision: D23086534
fbshipit-source-id: 3fbb8c4055bb190becb39ce8cdff6671f8558734
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43077
2 Bit Embedding weight conversion operation is quite similar to
4 bit embedding weight conversion.
The diff contains both the
1. 2bit packing op `embedding_bag_2bit_prepack`.
2. 2bit unpacking op `embedding_bag_2bit_unpack`.
Comments about the op are inline with the op definition.
Test Plan: buck test caffe2/test:quantization -- test_embedding_bag_2bit_unpack
Reviewed By: supriyar
Differential Revision: D23143262
fbshipit-source-id: fd8877f049ac1f7eb4bc580e588dc95f8b1edef0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43156
- supports_named_tensor no longer does anything, so I have removed
it. I'm guessing these were cargo culted from some old occurrences
of it in native_functions.yaml
- comma, not period, in variants
In my upcoming codegen rewrite, there will be strict error checking
for these cases (indeed, that is how I found these problems), so
I do not add error testing here.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23183977
Pulled By: ezyang
fbshipit-source-id: a47d342152badfb8aea248a819ad94fd93dd6ab2
Summary:
This relaxes the assumption that test.sh will be run in the CI environment by the CI user.
CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43236
Reviewed By: colesbury
Differential Revision: D23205981
Pulled By: ezyang
fbshipit-source-id: 302743cb03c9e9c6bfcdd478a6cd920b536dc29b
Summary:
Make CUDA-10.1 configs build-only, as the CUDA-10.1 and CUDA-10.2 test matrices are almost identical, and now that CUDA-11 is out it's perhaps time to stop testing CUDA-10.1.
Make CUDA-9.2+GCC_5.4 an important (i.e. running on PR) build-only config, because of the big overlap between the CUDA-9.2-GCC7 and CUDA-9.2-GCC5.4 test coverage.
Make the CUDA-11 libtorch tests important rather than the CUDA-10.2 ones.
As result of the change, every PR will be built against CUDA-9.2, CUDA-10.2 and CUDA-11 and tested against CUDA-10.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43240
Reviewed By: ezyang
Differential Revision: D23205129
Pulled By: malfet
fbshipit-source-id: 70932e8b2167cce9fd621115c8bf24b1c81ed621
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by missed copy elision.
This regression was introduced by https://github.com/pytorch/pytorch/pull/43129
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43223
Reviewed By: albanD, seemethere
Differential Revision: D23198330
Pulled By: malfet
fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43097
Boolean arguments weren't promoted, so if you tried to write a comparison with
types such as `Tensor(Bool) == Int` you'd fail typechecking inside the TE
engine.
Test Plan: Imported from OSS
Reviewed By: protonu, zheng-xq
Differential Revision: D23167926
Pulled By: bertmaher
fbshipit-source-id: 47091a815d5ae521637142a5c390e8a51a776906
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42884
I did some additional research and considering the first few lines of the docs (`Creates a criterion that measures the loss given inputs x1, x2, two 1D mini-batch Tensors, and a label 1D mini-batch tensor y (containing 1 or -1`) and the provided tests, this loss should be used primarily with 1-D tensors. More advanced users (that may use this loss in non-standard ways) can easily check the source and see that the definition accepts inputs/targets of arbitrary dimension as long as they match in shape or are broadcastable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43131
Reviewed By: colesbury
Differential Revision: D23192011
Pulled By: mrshenli
fbshipit-source-id: c412c28daf9845c0142ea33b35d4287e5b65fbb9
Summary:
Should close https://github.com/pytorch/pytorch/issues/36428.
The cudnn RNN API expects weights to occupy a flat buffer in memory with a particular layout. This PR implements a "speed of light" fix: [`_cudnn_rnn_cast_reflatten`](https://github.com/pytorch/pytorch/pull/42385/files#diff-9ef93b6a4fb5a06a37c562b83737ac6aR327) (the autocast wrapper assigned to `_cudnn_rnn`) copies weights to the right slices of a flat FP16 buffer with a single read/write per weight (as opposed to casting them to FP16 individually then reflattening the individual FP16 weights, which would require 2 read/writes per weight).
It isn't pretty but IMO it doesn't make rnn bindings much more tortuous than they already are.
The [test](https://github.com/pytorch/pytorch/pull/42385/files#diff-e68a7bc6ba14f212e5e7eb3727394b40R2683) tries a forward under autocast and a backward for the full cross product of RNN options and input/weight/hidden dtypes. As for all FP16list autocast tests, forward output and backward grads are checked against a control where inputs (including RNN module weights in this case) are precasted to FP16 on the python side.
Not sure who to ask for review, tagging ezyang and ngimel because Ed wrote this file (almost 2 years ago) and Natalia did the most recent major [surgery](https://github.com/pytorch/pytorch/pull/12600).
Side quests discovered:
- Should we update [persistent RNN heuristics](dbdd28207c/aten/src/ATen/native/cudnn/RNN.cpp (L584)) to include compute capability 8.0? Could be another PR but seems easy enough to include.
- Many (maybe all?!) the raw cudnn API calls in [RNN.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/RNN.cpp) are deprecated in cudnn 8. I don't mind taking the AI to update them since my mental cache is full of rnn stuff, but that would be a substantial separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42385
Reviewed By: zhangguanheng66
Differential Revision: D23077782
Pulled By: ezyang
fbshipit-source-id: a2afb1bdab33ba0442879a703df13dc87f03ec2e
Summary:
I think you want to push the rewrapped `rets`, not `args`, back to the stack.
It doesn't matter for test purposes because the tests only check if/when fallbacks were called, not output correctness, but it avoids reader confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42990
Reviewed By: mrshenli
Differential Revision: D23168277
Pulled By: ezyang
fbshipit-source-id: 2559f0707acdca2e3deac09006bc66ce3c788ea3
Summary:
A small change that adds a docstring that can be found with
`getattr(nn.Module, nn.Module.forward.__name__, None).__doc__`
Fixes https://github.com/pytorch/pytorch/issues/43057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43063
Reviewed By: mrshenli
Differential Revision: D23161782
Pulled By: ezyang
fbshipit-source-id: 95456f858e2b6a0e41ae551ea4ec2e78dd35ee3f
Summary:
The changes are minor.
1. Add back the external links so that readers can find out more about external tools on how to accelerate PyTorch.
2. Fix typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43100
Reviewed By: colesbury
Differential Revision: D23192251
Pulled By: mrshenli
fbshipit-source-id: dde54b7942ebff5bbe3d58ad95744c6d95fe60fe
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39968
Tested with `TORCH_CUDA_ARCH_LIST='3.5 5.2 6.0 6.1 7.0 7.5 8.0+PTX'`: before this PR the build was failing, and with this PR it succeeds.
With `TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0+PTX'`, `libtorch_cuda.so` with symbols changes from 2.9GB -> 2.2GB
cc: ptrblck mcarilli jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43074
Reviewed By: mrshenli
Differential Revision: D23176095
Pulled By: malfet
fbshipit-source-id: 7b3e6d049fc080e519f21e80df05ef68e7bea57e
Summary:
Had a bunch of merged commits that shouldn't have been there; reverted them to prevent conflicts. Lots of new features; highlights are listed below.
**Overall:**
- Enables pointwise fusion, single (but N-D) broadcast -- pointwise fusion, single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion.
**Integration:**
- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic
**Code Generation:**
- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43129
Reviewed By: mrshenli
Differential Revision: D23162207
Pulled By: soumith
fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
Summary:
The ONNX spec for the Squeeze operator:
> Remove single-dimensional entries from the shape of a tensor. Takes a parameter axes with a list of axes to squeeze. If axes is not provided, all the single dimensions will be removed from the shape. If an axis is selected with shape entry not equal to one, an error is raised.
Currently, as explained in issue https://github.com/pytorch/pytorch/issues/36796, it is possible to export such a model to ONNX, and this results in an exception from ONNX runtime.
Fixes https://github.com/pytorch/pytorch/issues/36796.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38476
Reviewed By: hl475
Differential Revision: D22158024
Pulled By: houseroad
fbshipit-source-id: bed625f3c626eabcbfb2ea83ec2f992963defa19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43151
Using `torch.all` instead of `torch.sum` plus a length check.
It's unclear whether the increase in perf (~5% for small inputs) is
real, but it should be a net benefit, especially for larger channel inputs.
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23170426
fbshipit-source-id: ee5c25eb93cee1430661128ac9458a9c525df8e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43150
The current logic was expensive because it created tensors on CUDA.
Switching to clamp since it can work without needing to create tensors.
Test Plan:
benchmarks
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23170427
fbshipit-source-id: 6fe3a728e737aca9f6c2c4d518c6376738577e21
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43149
This value doesn't change, so make it a buffer so that we only pay
the cost of creating a tensor once.
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23170428
fbshipit-source-id: 6b963951a573efcc5b5a57649c814590b448dd72
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42956
In preparation for observer perf improvement, cleans up the
micro benchmarks:
* disable CUDA for histogram observers (it's too slow)
* add larger shapes for better representation of real workloads
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qobserver_test
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23093996
fbshipit-source-id: 5dc477c9bd5490d79d85ff8537270cd25aca221a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42511
DistEngine currently only has a single thread to execute GPU to CPU
continuations as part of the backward pass. This would be a significant
performance bottleneck in cases where we have such continuations and would like
to execute these using all CPU cores.
To alleviate this in this PR, we have the single thread in DistEngine only
dequeue work from the global queue, but then hand off execution of that work to
the c10 threadpool where we call "execute_graph_task_until_ready_queue_empty".
For more context please see:
https://github.com/pytorch/pytorch/issues/40255#issuecomment-663298062.
ghstack-source-id: 109997718
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D22917579
fbshipit-source-id: c634b6c97f3051f071fd7b994333e6ecb8c54155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43093
Without this it's hard to tell which module is going wrong.
Test Plan:
```
> TypeError:
> 'numpy.int64' object in attribute 'Linear.in_features' is not a valid constant.
> Valid constants are:
> 1. a nn.ModuleList
> 2. a value of type {bool, float, int, str, NoneType, torch.device, torch.layout, torch.dtype}
> 3. a list or tuple of (2)
```
Reviewed By: eellison
Differential Revision: D23148516
fbshipit-source-id: b86296cdeb7b47c9fd69b5cfa479914c58ef02e6
Summary:
This PR:
- Adds a method variant to movedim
- Fixes the movedim docs so it will actually appear in the documentation
- Fixes three view doc links which were broken
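A quick, hedged illustration of the new method variant from the first item (shapes chosen arbitrarily):
```
import torch

x = torch.randn(2, 3, 4)
# The new method variant is equivalent to the existing function form
# torch.movedim(x, 0, -1): dimension 0 is moved to the last position.
y = x.movedim(0, -1)
assert y.shape == (3, 4, 2)
```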
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43122
Reviewed By: ngimel
Differential Revision: D23166222
Pulled By: mruberry
fbshipit-source-id: 14971585072bbc04b5366d4cc146574839e79cdb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43059
This PR implements batching rules for some unary ops. In particular, it
implements the batching rules for the unary ops that take a single
tensor as input (and nothing else).
The batching rule for a unary op is:
(1) grab the physical tensor straight out of the BatchedTensor
(2) call the unary op
(3) rewrap the physical tensor in a BatchedTensor
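As a hedged illustration of why this rule is valid (not the actual C++ batching-rule code): for an elementwise unary op, one call on the underlying physical tensor matches applying the op per-example along the batch dimension.
```
import torch

x = torch.randn(3, 5)  # pretend dim 0 is the vmapped (batch) dimension

# Per-example application along the batch dimension...
per_example = torch.stack([torch.exp(x[i]) for i in range(x.size(0))])
# ...matches a single call on the whole physical tensor, which is all the
# batching rule needs to do before rewrapping the result.
batched = torch.exp(x)

assert torch.allclose(per_example, batched)
```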
Test Plan: - new tests `pytest test/test_vmap.py -v -k "Operators"`
Reviewed By: ezyang
Differential Revision: D23132277
Pulled By: zou3519
fbshipit-source-id: 24b9d7535338207531d767155cdefd2c373ada77
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43028
There was a bug where we always tried to grab the `__name__` attribute of
the function passed in by the user. Not all Callables have the
`__name__` attribute, an example being a Callable produced by
functools.partial.
This PR modifies the error-checking code to use `repr` if `__name__` is
not available. Furthermore, it moves the "get the name of this function"
functionality to the actual error sites as an optimization so we don't
spend time trying to compute `__repr__` for the Callable if there is no
error.
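A minimal sketch of the fallback behavior described above (the helper name is hypothetical, not the actual internal function):
```
import functools
import torch

def callable_name(fn):
    # Hypothetical helper: fall back to repr() when __name__ is missing,
    # e.g. for callables produced by functools.partial.
    return getattr(fn, "__name__", repr(fn))

print(callable_name(torch.sum))                            # 'sum'
print(callable_name(functools.partial(torch.sum, dim=0)))  # 'functools.partial(<built-in ...>, dim=0)'
```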
Test Plan: - `pytest test/test_vmap.py -v`, added new tests.
Reviewed By: yf225
Differential Revision: D23130235
Pulled By: zou3519
fbshipit-source-id: 937f3640cc4d759bf6fa38b600161f5387a54dcf
Summary:
LLVM builds took a large amount of time and bogged down docker builds in
general. Since we build it the same for everything let's just copy it
from a pre-built image instead of building it from source every time.
Builds are defined in https://github.com/pytorch/builder/pull/491
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43038
Reviewed By: malfet
Differential Revision: D23119513
Pulled By: seemethere
fbshipit-source-id: f44324439d45d97065246caad07c848e261a1ab6
Summary:
Since OpenMP is not available on some platforms, or might be disabled by the user, set the default `ATEN_THREADING` based on the USE_OPENMP and USE_TBB options
Fixes https://github.com/pytorch/pytorch/issues/43036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43067
Reviewed By: houseroad
Differential Revision: D23138856
Pulled By: malfet
fbshipit-source-id: cc8f9ee59a5559baeb3f19bf461abbc08043b71c
Summary:
During the cleanup phase, calling recordReferencedAttrs records
the attributes which are referenced and hence kept.
However, if you have two instances of the same type which are preserved
through the freezing process, as the added testcase shows, then while
recording the referenced attributes we iterate through the
type INSTANCES that we have seen so far and record those.
Thus, if we have another instance of the same type, we will just look at
the first instance in the list and record that instance.
This PR fixes that by traversing the getattr chains and getting the
actual instance of the getattr output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42457
Test Plan:
python test/test_jit.py TestFreezing
Fixes #{issue number}
Reviewed By: gchanan
Differential Revision: D23106921
Pulled By: kimishpatel
fbshipit-source-id: ffff52876938f8a1fedc69b8b24a3872ea66103b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43069
The transformer C++ impl needs to put TransformerEncoderLayer/DecoderLayer and TransformerEncoder/TransformerDecoder in different headers, since TransformerEncoder/Decoder's options class needs TransformerEncoderLayer/DecoderLayer as an input parameter. Split the header files to avoid cyclic inclusion.
Test Plan: Imported from OSS
Reviewed By: yf225
Differential Revision: D23139437
Pulled By: glaringlee
fbshipit-source-id: 3c752ed7702ba18a9742e4d47d049e62d2813de0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42880
Enable switching between and checking for training and eval mode for torch::jit::mobile::Module using train(), eval(), and is_training(), as already exists for torch::jit::Module.
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D23063006
Pulled By: ann-ss
fbshipit-source-id: b79002148c46146b6e961cbef8aaf738bbd53cb2
Summary:
This adds the torch.arccosh alias and updates alias testing to validate the consistency of the aliased and original operations. The alias testing is also updated to run on CPU and CUDA, which revealed a memory leak when tracing (see https://github.com/pytorch/pytorch/issues/43119).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43107
Reviewed By: ngimel
Differential Revision: D23156472
Pulled By: mruberry
fbshipit-source-id: 6155fac7954fcc49b95e7c72ed917c85e0eabfcd
Summary:
This changes profiled types from being represented as:
`%23 : Float(4:256, 256:1, requires_grad=0, device=cpu) = prim::profile(%0)`
->
`%23 : Tensor = prim::profile[profiled_type=Float(4:256, 256:1, requires_grad=0, device=cpu)](%0)`
Previously, by representing the profiled type in the IR directly it was very easy for optimizations to accidentally use profiled types without inserting the proper guards that would ensure that the specialized type would be seen.
It would be a nice follow up to extend this to prim::Guard as well, however we have short term plans to get rid of prim::Guard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43035
Reviewed By: ZolotukhinM
Differential Revision: D23120226
Pulled By: eellison
fbshipit-source-id: c78d7904edf314dd65d1a343f2c3a947cb721b32
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42822
These ops aren't supported with the NCCL backend and used to silently error.
We disabled them as part of addressing https://github.com/pytorch/pytorch/issues/41362, so
document that here.
ghstack-source-id: 109957761
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23023046
fbshipit-source-id: 45d69028012e0b6590c827d54b35c66cd17e7270
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637
This commit enables sending non-CPU tensors through RPC using
TensorPipe backend. Users can configure device mappings by calling
set_map_location on `TensorPipeRpcBackendOptions`. Internally,
the `init_rpc` API verifies the correctness of device mappings. It
will shut down RPC if the check fails, or proceed and pass global
mappings to `TensorPipeAgent` if the check was successful. For serde,
we added a device indices field to TensorPipe read and write buffers,
which should be either empty (all tensors must be on CPU) or match
the tensors in order and number in the RPC message. This commit
does not yet achieve zero-copy: the tensor is always moved to CPU
on the sender and then moved to the specified device on the receiver.
Test Plan: Imported from OSS
Reviewed By: izdeby
Differential Revision: D23011572
Pulled By: mrshenli
fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
Summary:
Add "asan" node to a `CONFIG_TREE_DATA` rather than hardcoded that non-xla clang-5 is ASAN
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43048
Reviewed By: houseroad
Differential Revision: D23126296
Pulled By: malfet
fbshipit-source-id: 22f02067bb2f5435a0e963a6c722b9c115ccfea4
Summary:
https://github.com/pytorch/pytorch/issues/40980
I have a few questions about implementing the Polygamma function, so I made this PR prior to completing it.
1. Some code blocks are brought in from the cephes library (and I did the same):
```
/*
* The following function comes with the following copyright notice.
* It has been released under the BSD license.
*
* Cephes Math Library Release 2.8: June, 2000
* Copyright 1984, 1987, 1992, 2000 by Stephen L. Moshier
*/
```
Is it okay for me to use cephes code with this same copyright notice (which is already in the PyTorch codebase)?
2. There is no linting for the internal ATen library (as far as I know; I read https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md).
How do I make sure my code follows the appropriate guidelines of this library?
3. Actually, there are already digamma and trigamma functions.
digamma is needed; however, the trigamma function becomes redundant if the polygamma function is added.
Is it okay for trigamma to stay, or should it be removed?
By the way, the CPU version now works fine with 3rd-order polygamma (which is what we need to play with variational inference with beta/gamma distributions), and I'm going to finish the GPU version soon.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42499
Reviewed By: gchanan
Differential Revision: D23110016
Pulled By: albanD
fbshipit-source-id: 246f4c2b755a99d9e18a15fcd1a24e3df5e0b53e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42991
Having Node be both a record of the operator in the graph and the
way we _build_ the graph made it difficult to keep the IR data structure
separate from the proxying logic used while building.
Among other issues, this meant that typos when using nodes would add
things to the graph:
```
for node in graph.nodes:
node.grph # does not error, returns a node.Attribute object!
```
This separates the builder into a Proxy object. Graph/Node no longer
need to understand `delegate` objects since they are now just pure IR.
This separates the `symbolic_trace` (proxy.py/symbolic_trace.py) from
the IR (node.py, graph.py).
This also allows us to add `create_arg` to the delegate object,
allowing the customization of how aggregate arguments are handled
when converting to a graph.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23099786
Pulled By: zdevito
fbshipit-source-id: 6f207a8c237e5eb2f326b63b0d702c3ebcb254e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42881
This enables serialization/de-serialization of embedding packed params using getstate/setstate calls.
Added version number to deal with changes to serialization formats in future.
This can be extended in the future to support 4-bit/2-bit once we add support for that.
Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBag
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23070634
fbshipit-source-id: 2ca322ab998184c728be6836f9fd12cec98b2660
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42762
Use a prepack function that accepts qtensor as an input. The output is a byte tensor with packed data.
This is currently implemented only for 8-bit. In the future, once we add 4-bit support, this function will be extended to support that too.
Note - in the following change I will add TorchBind support for this to enable serialization of packed weights.
Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBag
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23070632
fbshipit-source-id: 502aa1302dffec1298cdf52832c9e2e5b69e44a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42927
Added fp16 fusion to the net transforms.
Refactored the transforms as well as glow_transform to move them out of opt/custom so that the OSS builds pass.
Test Plan: added net runner tests for this
Reviewed By: yinghai
Differential Revision: D23080881
fbshipit-source-id: ee6451811fedfd07c6560c178229854bca29301f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42389
**Summary**
This commit adds support for properties to TorchScript classes,
specifically for getters and setters. They are implemented essentially
as pointers to the methods that the corresponding decorators decorate,
which are treated like regular class methods. Deleters for properties
are considered to be out of scope (and probably useless for TorchScript
anyway).
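A minimal, hedged sketch of the kind of class this enables (see TestClassType.test_properties for the real coverage; the class below is illustrative only):
```
import torch

@torch.jit.script
class Counter(object):
    def __init__(self):
        self._count = 0

    @property
    def count(self) -> int:
        # Getter: compiled like a regular method and looked up via the property.
        return self._count

    @count.setter
    def count(self, value: int):
        # Setter: also compiled as a regular method.
        self._count = value
```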
**Test Plan**
This commit adds a unit test for a class with a property that has both
getter and setter and one that has only a getter.
`python test/test_jit.py TestClassType.test_properties`
Test Plan: Imported from OSS
Reviewed By: eellison, ppwwyyxx
Differential Revision: D22880232
Pulled By: SplitInfinity
fbshipit-source-id: 4828640f4234cb3b0d4f3da4872a75fbf519e5b0
Summary:
No type annotations can be added to the script, as it still has to be Python-2 compliant.
Make changes to avoid variable type redefinition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43062
Reviewed By: zou3519
Differential Revision: D23132991
Pulled By: malfet
fbshipit-source-id: 360c02e564398f555273e5889a99f834a5467059
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42153
As [documented](https://docs.nvidia.com/cuda/curand/device-api-overview.html) (search for `curand_uniform` on the page), `curand_uniform` returns "from 0.0 to 1.0, where 1.0 is included and 0.0 is excluded." These endpoints are different from the CPU equivalent, which made the calculation fixed by this PR fail when the value was 1.0.
The test from the issue is added; it failed for me consistently before the PR even though I cut the number of samples by 10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42702
Reviewed By: gchanan
Differential Revision: D23107451
Pulled By: ngimel
fbshipit-source-id: 3575d5b8cd5668e74b5edbecd95154b51aa485a1
Summary:
Closes gh-42998
The issue is marked for 1.6.1; if there's anything I need to do for a backport, please tell me what that is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43053
Reviewed By: izdeby
Differential Revision: D23131708
Pulled By: malfet
fbshipit-source-id: 2744bacce6bdf6ae463c17411b672f09707e0887
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42860
The `__cuda_array_interface__` tensor specification is missing the appropriate datatypes for the newly merged complex64 and complex128 tensors. This PR addresses this issue by mapping:
* `torch.complex64` to 'c8'
* `torch.complex128` to 'c16'
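A hedged check of the exported interface on a CUDA build (the full typestr includes a byte-order prefix, so only the suffix is asserted here):
```
import torch

# Requires a CUDA device; complex tensors now export a valid typestr.
t64 = torch.zeros(4, dtype=torch.complex64, device="cuda")
assert t64.__cuda_array_interface__["typestr"].endswith("c8")    # complex64 -> 'c8'

t128 = torch.zeros(4, dtype=torch.complex128, device="cuda")
assert t128.__cuda_array_interface__["typestr"].endswith("c16")  # complex128 -> 'c16'
```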
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42918
Reviewed By: izdeby
Differential Revision: D23130219
Pulled By: anjali411
fbshipit-source-id: 5f8ee8446a71cad2f28811afdeae3a263a31ad11
Summary:
The **`torch.nn.Hardsigmoid`** and **`torch.nn.Hardswish`** classes currently do not support `inplace` operation, as they call the `torch.nn.functional.hardsigmoid` and `torch.nn.functional.hardswish` functions with the default `inplace` argument, which is `False`.
So, I added an `inplace` argument to the `torch.nn.Hardsigmoid` and `torch.nn.Hardswish` classes so that the forward operation can also be done in place while using these layers.
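A minimal sketch of the new argument in use:
```
import torch
import torch.nn as nn

m = nn.Hardswish(inplace=True)   # the same pattern applies to nn.Hardsigmoid
x = torch.randn(4)
y = m(x)
# With inplace=True the module writes into its input buffer, so the output
# aliases the input instead of allocating a new tensor.
assert y.data_ptr() == x.data_ptr()
```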
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42346
Reviewed By: izdeby
Differential Revision: D23108487
Pulled By: albanD
fbshipit-source-id: 0767334fa10e5ecc06fada2d6469f3ee1cacd957
Summary:
test_e2e_tensorpipe depends on ProcessGroupGloo, therefore it cannot be tested with Gloo disabled.
Otherwise, it re-introduces https://github.com/pytorch/pytorch/issues/42776.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43041
Reviewed By: lw
Differential Revision: D23122101
Pulled By: malfet
fbshipit-source-id: a8a088b6522a3bc888238ede5c2d589b83c6ea94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43037
In the previous version of mish_op.cc, the output would be 'nan' for large inputs. We rewrote mish_op.cc to solve this problem.
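The Caffe2 kernel itself isn't shown here; as a hedged illustration of the general failure mode, expanding tanh(softplus(x)) through raw exponentials overflows for large x, while a stable softplus keeps the result finite:
```
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 20.0, 100.0])

def mish_expanded(t):
    # tanh(softplus(t)) written out via exponentials: (1 + e^t)^2 overflows
    # to inf for large t in float32, and the resulting inf/inf yields nan.
    w = (1.0 + torch.exp(t)) ** 2
    return t * (w - 1.0) / (w + 1.0)

def mish_stable(t):
    # F.softplus is computed in a numerically stable way, so no overflow.
    return t * torch.tanh(F.softplus(t))

print(mish_expanded(x))  # last entry is nan
print(mish_stable(x))    # finite everywhere
```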
Test Plan:
Unit test
buck test //dper3/dper3/modules/tests:core_modules_test -- test_linear_compress_embedding_with_attention_with_activation_mish
{F284052906}
buck test mode/opt //dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_with_mish
{F284224158}
## Workflow
f212113434
{F285281318}
Differential Revision: D23102644
fbshipit-source-id: 98f1ea82f8c8e05b655047b4520c600fc1a826f4
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 29d5eb9f3c
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42834
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: jspark1105
Differential Revision: D23040145
fbshipit-source-id: 1d7209ea1910419b7837703122b8a4c76380ca4a
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42133
Test Plan:
We save a module with module debugging information as follows.
```
import torch
m = torch.jit.load('./detect.pt')
# Save module without debug info
m._save_for_lite_interpreter('./detect.bc')
# Save module with debug info
m._save_for_lite_interpreter('./detect.bc', _save_debug_info_in_bytecode=True)
```
Size of the file without module debugging information: 4.508 MB
Size of the file with module debugging information: 4.512 MB
Reviewed By: kimishpatel
Differential Revision: D22803740
Pulled By: taivu1998
fbshipit-source-id: c82ea62498fde36a1cfc5b073e2cea510d3b7edb
Summary:
1. Fix an illegal memory access issue for the SplitByLengths operator in the CUDA context.
2. Add support for scaling lengths vectors to the SplitByLengths operator.
3. Add support for testing the SplitByLengths operator in the CUDA context.
Example for SplitByLengths operator processing scaling lengths vector:
value vector A = [1, 2, 3, 4, 5, 6]
length vector B = [1, 2]
after execution of SplitByLengths operator,
the output should be [1,2] and [3,4,5,6]
Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test
Reviewed By: kennyhorror
Differential Revision: D23079841
fbshipit-source-id: 3700e7f2ee0a5a2791850071fdc16e5b054f8400
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42938
1. Structure the logic in a more straight-forward way: instead of magic
tricks with node iterators in a block we now have a function that
tries to create a fusion group starting from a given node (and pull
everything it can into it).
2. The order in which we're pulling nodes into a fusion group is now
more apparent.
3. The new pass structure automatically allows us to support fusion
groups of size=1.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23084409
Pulled By: ZolotukhinM
fbshipit-source-id: d59fc00c06af39a8e1345a4aed8d829494db084c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42602
In this diff, clearer semantics and naming are introduced by splitting the original `init_dynamic_qrange` into 2 separate `Optional[int]` parameters, `qmin` and `qmax`, to avoid confusing these parameters with dynamic quantization.
The `qmin` and `qmax` parameters allow users to specify their own custom quantization range and enable specific use cases for lower-bit quantization.
Test Plan:
To assert the correctness and compatibility of the changes with existing observers, on a devvm, execute the following command to run the unit tests:
`buck test //caffe2/test:quantization -- observer`
Reviewed By: vkuzo, raghuramank100
Differential Revision: D22948334
fbshipit-source-id: 275bc8c9b5db4ba76fc2e79ed938376ea4f5a37c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43015
Currently, activation_post_process modules are inserted by default in QAT modules, which is not
friendly to automatic quantization tools; this PR removes them.
Test Plan:
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23105059
fbshipit-source-id: 3439ac39e718ffb0390468163bcbffd384802b57
Summary:
This PR whitelists and simplifies graphs to help with development later on. Key to note in this PR is the use of both a pattern substitution and the registration of custom operators. This will likely be one of the main optimization types done in this folder.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43024
Reviewed By: hlu1
Differential Revision: D23114262
Pulled By: bwasti
fbshipit-source-id: e25aa3564dcc8a2b48cfd1561b3ee2a4780ae462
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42723
This PR is addressing https://github.com/pytorch/pytorch/issues/39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:
1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.
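A hedged single-process sketch of what this enables (addresses/ports are illustrative):
```
import os
import torch.distributed.rpc as rpc

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

# With this change the init/shutdown cycle can be repeated within one process;
# each init_rpc call uses a fresh PrefixStore prefix under the hood.
for _ in range(2):
    rpc.init_rpc("worker0", rank=0, world_size=1)
    rpc.shutdown()
```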
ghstack-source-id: 109805368
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D22993909
fbshipit-source-id: 9f1c1e0a58b58b97125f41090601e967f96f70c6
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40829
This is cross-platform but I have only tried it on linux, personally. Also, I am not fully certain of the usage pattern, so if there are any additional features / adjustments / tests that you want me to add, please just let me know!
CC ezyang rgommers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42635
Reviewed By: zhangguanheng66
Differential Revision: D23078663
Pulled By: ezyang
fbshipit-source-id: 5c8c8abebd1d462409c22dc4301afcd8080922bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43014
Changing this behavior mimics the behavior of the old hypothesis
testing library.
Test Plan: ran all tests on devserver
Reviewed By: hl475
Differential Revision: D23085949
fbshipit-source-id: 433fdfbb04b6a609b738eb7c319365049a49579b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43018
In this diff, a fix is added for the case where the original non-learnable fake quantize is provided with a trainable scale and zero point, whereas requires_grad for both parameters should be completely disabled.
Test Plan:
Use the following command to execute the benchmark test:
`buck test mode/dev-nosan pt:quantization_test`
Reviewed By: vkuzo
Differential Revision: D23107846
fbshipit-source-id: d2213983295f69121e9e6ae37c84d1f37d78ef39
Summary:
Fix typos in torch.utils/_benchmark/README.md.
Add an empty __init__.py to the examples folder to make the example invocations from README.md correct.
Fix the uniform distribution generation logic when minval and maxval are None.
Fixes https://github.com/pytorch/pytorch/issues/42984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42960
Reviewed By: seemethere
Differential Revision: D23095399
Pulled By: malfet
fbshipit-source-id: 0546ce7299b157d9a1f8634340024b10c4b7e7de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42946
There are 3 options for the executor and fuser, and some of them aren't
super interesting, so I've combined the options into a single parameter but
made it fairly easy to expand the set if there are other configs we might care
about.
Test Plan:
Benchmark it
Imported from OSS
Reviewed By: zheng-xq
Differential Revision: D23090177
fbshipit-source-id: bd93a93c3fc64e5a4a847d1ce7f42ce0600a586e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42612
Add a new Quantizer that supports an input zero point (bias) that can be float.
The quantization equation in this case is
Xq = (Xf - bias) * inv_scale, where bias is the float zero_point value.
We start with a per-row implementation and can extend to per-tensor in the future, if necessary.
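A hedged per-row sketch of the equation above (the choice of bias/scale here is illustrative, not the actual Quantizer implementation):
```
import torch

xf = torch.randn(2, 4)

# Per-row float zero point (bias) and scale, chosen here just for illustration.
bias = xf.min(dim=1, keepdim=True).values
inv_scale = 255.0 / (xf.max(dim=1, keepdim=True).values - bias)

# Xq = (Xf - bias) * inv_scale, rounded and clamped to the quantized range.
xq = torch.clamp(torch.round((xf - bias) * inv_scale), 0, 255).to(torch.uint8)
```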
Test Plan:
python test/test_quantization.py TestQuantizedTensor
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22960142
fbshipit-source-id: ca9ab6c5b45115d3dcb1c4358897093594313706
Summary:
When working on the Cuda Codegen, I found that running the IRSimplifier before generating code led to test failures. This was due to a bug in Round+Mod simplification (e.g. (x / y * y) + (x % y) => x) having to do with the order in which the terms appeared. After fixing it and writing a few tests around those cases, I found another bug in simplification of the same pattern and fixed it as well (with some more test coverage).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42934
Reviewed By: zhangguanheng66
Differential Revision: D23085548
Pulled By: nickgg
fbshipit-source-id: e780967dcaa7a5fda9f6d7d19a6b7e7b4e94374b
Summary:
A simple differentiable abstraction to allow testing of full training graphs.
Included in this 1st PR is an example of trivial differentiation.
If approved, I can add a full MLP and demonstrate convergence using purely NNC (for performance testing) in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42548
Reviewed By: ZolotukhinM
Differential Revision: D23057920
Pulled By: bwasti
fbshipit-source-id: 4a239852c5479bf6bd20094c6c35f066a81a832e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42924
offsets is currently an optional parameter in the python module, so we update the operator to follow suit
in order to avoid a bad optional access.
Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag
Imported from OSS
Reviewed By: radkris-git
Differential Revision: D23081152
fbshipit-source-id: 847b58f826f5a18e8d4978fc4afc6f3a96dc4230
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42343
Currently, activation_post_process modules are inserted by default in QAT modules, which is not
friendly to automatic quantization tools; this PR removes them.
Test Plan: Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D22856816
fbshipit-source-id: 988a43bce46a992b38fd0d469929f89e5b046131
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42841
There is nothing using those APIs anymore. While we still have ops that require an unboxedOnly implementation (i.e. that aren't c10-full yet), those are all already migrated to the new op registration API and use `.impl_UNBOXED()`.
ghstack-source-id: 109693705
Test Plan: waitforsandcastle
Reviewed By: bhosmer
Differential Revision: D23045335
fbshipit-source-id: d8e15cea1888262135e0d1d94c515d8a01bddc45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42712
Previously, operators taking Tensor& as arguments or returning it couldn't be c10-full because the unboxing logic didn't support it.
This adds temporary support for that. We're planning to remove this again later, but for now we need it to make those ops c10-full.
See https://docs.google.com/document/d/19thMVO10yMZA_dQRoB7H9nTPw_ldLjUADGjpvDmH0TQ for the full plan.
This PR also makes some ops c10-full that now can be.
ghstack-source-id: 109693706
Test Plan: unit tests
Reviewed By: bhosmer
Differential Revision: D22989242
fbshipit-source-id: 1bd97e5fa2b90b0860784da4eb772660ca2db5a3
Summary:
A small clarity improvement to the cuda init docstring
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42923
Reviewed By: zhangguanheng66
Differential Revision: D23080693
Pulled By: mrshenli
fbshipit-source-id: aad5ed9276af3b872c1def76c6175ee30104ccb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42563
Moved logic for non-named unflatten from python nn module to aten/native to be reused by the nn module later. Fixed some inconsistencies with doc and code logic.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23030301
Pulled By: heitorschueroff
fbshipit-source-id: 7c804ed0baa5fca960a990211b8994b3efa7c415
Summary:
The premise of this approach is that a small subset of neural networks is well represented by a data flow graph. The README contains more information.
The name is subject to change, but I thought it was a cute reference to fire.
suo let me know if you'd prefer this in a different spot. Since it lowers a JIT'd module directly, I assumed the JIT folder would be appropriate. There is no exposed Python interface yet (but one is mocked up in `test_accelerant.py`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42753
Reviewed By: zou3519
Differential Revision: D23043771
Pulled By: bwasti
fbshipit-source-id: 5353731e3aae31c08b5b49820815da98113eb551
Summary:
Introducing the `//xplat/caffe2:aten_vulkan` target, which contains the PyTorch Vulkan backend and its ops.
`//xplat/caffe2:aten_vulkan` depends on `//xplat/caffe2:aten_cpu`.
Simply including it in linking registers the Vulkan backend and its ops.
**Code generation:**
1. `VulkanType.h`, `VulkanType.cpp`
Tensor Types for Vulkan backend are generated by `//xplat/caffe2:gen_aten_vulkan` which runs aten code generation (`aten/src/ATen/gen.py`) with `--vulkan` argument.
2. Shaders compilation
`//xplat/caffe2:gen_aten_vulkan_spv` genrule runs `//xplat/caffe2:gen_aten_vulkan_spv_bin` which is a wrapper on `aten/src/ATen/native/vulkan/gen_spv.py`
GLSL files are listed in `aten/src/ATen/native/vulkan/glsl/*`, and compiling them requires `glslc` (the GLSL compiler).
`glslc` is open source (https://github.com/google/shaderc), but it has a few dependencies on other libraries, so porting its build to BUCK would take a significant amount of time.
To use `glslc` in BUCK we introduce the
dotslash `xplat/caffe2/fb/vulkan/dotslash/glslc`, which stores on manifold the latest prebuilt binaries of `glslc` from the ANDROID_NDK for linux, macos and windows.
Not using it from the ANDROID_NDK directly allows updating it without a dependency on the ndk.
Test Plan:
Building aten_vulkan target:
```
buck build //xplat/caffe2:aten_vulkan
```
Building vulkan_test that contains vulkan unittests for android:
```
buck build //xplat/caffe2:pt_vulkan_test_binAndroid#android-armv7
```
And running it on the device with vulkan support.
Reviewed By: iseeyuan
Differential Revision: D22770299
fbshipit-source-id: 843af8df226d4b5395b8e480eb47b233d57201df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42876
Previously, the error messages were pretty bad. This PR adds nice
error messages for the following cases:
- user attempts to call .backward() inside vmap for any reason
whatsoever
- user attempts to call autograd.grad(outputs, inputs, grad_outputs),
where outputs or inputs is being vmapped over (so they are
BatchedTensors).
The case we do support is calling autograd.grad(outputs, inputs,
grad_outputs) where `grad_outputs` is being vmapped over. This is the
case for batched gradient support (e.g., user passes in a batched
grad_output).
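A hedged sketch of the supported case, vmapping over `grad_outputs` to get batched gradients (the exact entry point for the prototype vmap API may differ):
```
import torch
from torch import vmap  # prototype API at the time; the entry point may differ

x = torch.randn(3, requires_grad=True)
y = x.sin()

def vjp(v):
    # v plays the role of grad_outputs; only v is vmapped over here.
    return torch.autograd.grad(y, x, v, retain_graph=True)[0]

batched_v = torch.eye(3)         # one grad_output per row
jacobian_rows = vmap(vjp)(batched_v)
```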
Test Plan: - new tests: `pytest test/test_vmap.py -v`
Reviewed By: ezyang
Differential Revision: D23059836
Pulled By: zou3519
fbshipit-source-id: 2fd4e3fd93f558e67e2f0941b18f0d00d8ab439f
Summary:
As the name suggests, this function should always return a writable path.
Call `mkdtemp` to create a temp folder if the path is not writable.
This fixes `TestNN.test_conv_backcompat` if PyTorch is installed in a non-writable location.
Fixes #{issue number}
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42895
Reviewed By: dzhulgakov
Differential Revision: D23070320
Pulled By: malfet
fbshipit-source-id: ed6a681d46346696a0de7e71f0b21cba852a964e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42691
Fix quantization of FC bias to match nnpi: quantize biases to fp16.
Test Plan: improved the unit test to have input tensors in fp32
Reviewed By: tracelogfb
Differential Revision: D22941521
fbshipit-source-id: 00afb70610f8a149110344d52595c39e3fc988ab
Summary:
This PR aims at improving `LayerNorm` performance on CPU for both forward and backward.
Results on Xeon 6248:
1. single socket inference **1.14x** improvement
2. single core inference **1.77x** improvement
3. single socket training **6.27x** improvement
For fine-tuning GPT2 on the WikiText2 dataset, the time per iteration on dual socket is reduced from **4.69s/it** to **3.16s/it**, a **1.48x** improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35750
Reviewed By: zhangguanheng66
Differential Revision: D20810026
Pulled By: glaringlee
fbshipit-source-id: c5801bd76eb944f2e46c2fe4991d9ad4f40495c3
Summary:
This is a follow-up PR for https://github.com/pytorch/pytorch/issues/37091, fixing some of the quirks of that PR as that one was landed early to avoid merge conflicts.
This PR addresses the following action items:
- [x] Use error-handling macros instead of a `try`-`catch`.
- [x] Renamed and added comments to clarify the use of `HANDLED_FUNCTIONS_WRAPPERS` in tests. `HANDLED_FUNCTIONS_NAMESPACES` was already removed in the last PR as we had a way to test for methods.
This PR does NOT address the following action item, as it proved to be difficult:
- [ ] Define `__module__` for whole API.
Single-line repro-er for why this is hard:
```python
>>> torch.Tensor.grad.__get__.__module__ = "torch.Tensor.grad"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'method-wrapper' object has no attribute '__module__'
```
Explanation: Methods defined in C/properties don't always have a `__dict__` attribute or a mutable `__module__` slot for us to modify.
The documentation action items were addressed in the following commit, with the additional future task of adding the rendered RFCs to the documentation: 552ba37c05
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42806
Reviewed By: smessmer
Differential Revision: D23031501
Pulled By: ezyang
fbshipit-source-id: b781c97f7840b8838ede50a0017b4327f96bc98a
Summary:
Addresses some comments that were left unaddressed after PR https://github.com/pytorch/pytorch/issues/41377 was merged:
* Use `check_output` instead of `Popen` to run each subprocess sequentially
* Use f-strings rather than old python format string style
* Provide environment variables to subprocess through the `env` kwarg
* Check for correct error behavior inside the subprocess, and raise another error if incorrect. Then the main process fails the test if any error is raised
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42627
Reviewed By: malfet
Differential Revision: D22969231
Pulled By: ezyang
fbshipit-source-id: 38d5f3f0d641c1590a93541a5e14d90c2e20acec
Summary:
During the cleanup phase, calling recordReferencedAttrs records
the attributes which are referenced and hence kept.
However, if you have two instances of the same type which are preserved
through the freezing process, as the added testcase shows, then while
recording the referenced attributes we iterate through the
type INSTANCES that we have seen so far and record those.
Thus, if we have another instance of the same type, we will just look at
the first instance in the list and record that instance.
This PR fixes that by traversing the getattr chains and getting the
actual instance of the getattr output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42457
Test Plan:
python test/test_jit.py TestFreezing
Fixes #{issue number}
Reviewed By: zou3519
Differential Revision: D22898051
Pulled By: kimishpatel
fbshipit-source-id: 8b1d80f0eb40ab99244f931d4a1fdb28290a4683
Summary:
This PR:
- updates test_op_normalization.py, which verifies that aliases are correctly translated in the JIT
- adds torch.linalg.det as an alias for torch.det
- moves the torch.linalg.outer alias to torch.outer (to be consistent with NumPy)
The torch.linalg.outer alias was put in the linalg namespace erroneously as a placeholder since it's a "linear algebra op" according to NumPy but is actually still in the main NumPy namespace.
The updates to test_op_normalization are necessary. Previously it was using method_tests to generate tests, and method_tests assumes test suites using it also use the device generic framework, which test_op_normalization did not. For example, some ops require decorators like `skipCPUIfNoLapack`, which only works in device generic test classes. Moving test_op_normalization to the device generic framework also lets these tests run on CPU and CUDA.
Continued reliance on method_tests() is excessive since the test suite is only interested in testing aliasing, and a simpler and more readable `AliasInfo` class is used for the required information. An example impedance mismatch between method_tests and the new tests, for example, was how to handle ops in namespaces like torch.linalg.det. In the future this information will likely be folded into a common 'OpInfo' registry in the test suite.
The actual tests performed are similar to what they were previously: a scripted and traced version of the op is run and the test verifies that both graphs do not contain the alias name and do contain the aliased name.
The guidance for adding an alias has been updated accordingly.
cc mattip
Note:
ngimel suggests:
- deprecating and then removing the `torch.ger` name
- reviewing the implementation of `torch.outer`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42802
Reviewed By: zou3519
Differential Revision: D23059883
Pulled By: mruberry
fbshipit-source-id: 11321c2a7fb283a6e7c0d8899849ad7476be42d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42576
Previously we had a qconfig propagation list and we only attached qconfig for modules
in the list; this works when everything is quantized in the form of modules.
But now that we are expanding quantization to functional/torch ops, we'll need to attach qconfig
to all modules.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D22939453
fbshipit-source-id: 7d6a1f73ff9bfe461b3afc75aa266fcc8f7db517
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42769
Some of the quantized add and mul ops can have the same name.
Test Plan: Imported from OSS
Reviewed By: supriyar
Differential Revision: D23054822
fbshipit-source-id: c1300f3f0f046eaf0cf767d03b957835e22cfb4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42871
The old version of hypothesis.testing was not enforcing deadlines.
After the library got updated, the default deadline is 200ms, but even with 1s or
more, the tests are flaky. Change the deadline to non-enforced, which is the same
behavior as the old version.
Test Plan: tested fakelowp/tests
Reviewed By: hl475
Differential Revision: D23059033
fbshipit-source-id: 79b6aec39a2714ca5d62420c15ca9c2c1e7a8883
Summary:
aten::sorted.str output type was incorrectly set to bool[] due to a copy-paste error. This PR fixes it.
Fixes https://fburl.com/0rv8amz7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42853
Reviewed By: yf225
Differential Revision: D23054907
Pulled By: gmagogsfm
fbshipit-source-id: a62968c90f0301d4a5546e6262cb9315401a9729
Summary: I found that, without exporting to public format, the IDEEP transpose operator in the middle of a convolution net produces incorrect results (probably reading some out-of-bounds memory). Exporting to public format might not be the most efficient solution, but at least it ensures correct behavior.
Test Plan: Running ConvFusion followed by transpose should give identical results on CPU and IDEEP
Reviewed By: bwasti
Differential Revision: D22970872
fbshipit-source-id: 1ddca16233e3d7d35a367c93e72d70632d28e1ef
Summary:
They are double, but they are supposed to be of accscalar_t or a faster type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42846
Reviewed By: zou3519
Differential Revision: D23049405
Pulled By: mruberry
fbshipit-source-id: 29bb5d5419dc7556b02768f0ff96dfc28676f257
Summary:
This PR adds:
- an "OpInfo" class in common_method_invocations that can contain useful information about an operator, like what dtypes it supports
- a more specialized "UnaryUfuncInfo" class designed to help test the unary ufuncs
- the `ops` decorator, which can generate test variants from lists of OpInfos
- test_unary_ufuncs.py, a new test suite stub that shows how the `ops` decorator and operator information can be used to improve the thoroughness of our testing
The single test in test_unary_ufuncs.py simply ensures that the dtypes associated with a unary ufunc operator in its OpInfo entry are correct. Writing a test like this previously, however, would have required manually constructing test-specific operator information and writing a custom test generator. The `ops` decorator and a common place to put operator information make writing tests like this easier and allows what would have been test-specific information to be reused.
The `ops` decorator extends and composes with the existing device generic test framework, allowing its decorators to be reused. For example, the `onlyOnCPUAndCUDA` decorator works with the new `ops` decorator. This should keep the tests readable and consistent.
Future PRs will likely:
- continue refactoring the too large test_torch.py into more verticals (unary ufuncs, binary ufuncs, reductions...)
- add more operator information to common_method_invocations.py
- refactor tests for unary ufuncs into test_unary_ufunc
Examples of possible future extensions are [here](616747e50d), where an example unary ufunc test is added, and [here](d0b624f110), where example autograd tests are added. Both tests leverage the operator info in common_method_invocations to simplify testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41662
Reviewed By: ngimel
Differential Revision: D23048416
Pulled By: mruberry
fbshipit-source-id: ecce279ac8767f742150d45854404921a6855f2c
Summary:
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write.
For example it can replace:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
A[0] = (A[0]) + x;
}
```
with:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
A_ = x + A_;
}
A[0] = A_;
```
This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicate this can speed them up by ~5x.
This diff got a bit unwieldy with the integration code so that will come in a follow up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42606
Reviewed By: bertmaher
Differential Revision: D22970969
Pulled By: nickgg
fbshipit-source-id: 831fd213f486968624b9a4899a331ea9aeb40180
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42767
Same as previous PR, forcing the qlinear benchmark to follow the fp one
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23013937
fbshipit-source-id: fffaa7cfbfb63cea41883fd4d70cd3f08120aaf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761
Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23012533
fbshipit-source-id: af30ee585389395569a6322f5210828432963077
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42837
Originally we use
```
list(APPEND CMAKE_C_FLAGS -fprofile-instr-generate -fcoverage-mapping)
list(APPEND CMAKE_CXX_FLAGS -fprofile-instr-generate -fcoverage-mapping)
```
But when compiling the project on mac with Coverage on, it produces the error:
`clang: error: no input files
/bin/sh: -fprofile-instr-generate: command not found
/bin/sh: -fcoverage-mapping: command not found`
The reason behind it is that `list(APPEND CMAKE_CXX_FLAGS ...)` adds an additional `;` to the variable. This means that if we do `list(APPEND foo a)` and then `list(APPEND foo b)`, then `foo` will be `a;b` -- with the additional `;`. Since we have `CMAKE_CXX_FLAGS` defined earlier in the `CMakeLists.txt`, we can only use `set(...)` here.
After changing it to
```
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
```
Tested successfully on a local mac machine.
Test Plan: Test locally on mac machine
Reviewed By: malfet
Differential Revision: D23043057
fbshipit-source-id: ff6f4891b35b7f005861ee2f8e4c550c997fe961
Summary:
fixes https://github.com/pytorch/pytorch/issues/39566
`typing.Final` exists since python 3.8, and on python 3.8, `typing_extensions.Final` is an alias of `typing.Final`; therefore, `ann.__module__ == 'typing_extensions'` becomes False when using 3.8 with `typing_extensions` installed.
~~I don't know why the test is skipped, seems like due to a historical reason when python 2.7 was still a thing?~~ Edit: I know now, the `Final` for `<3.7` doesn't have `__origin__`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39568
Reviewed By: smessmer
Differential Revision: D23043388
Pulled By: malfet
fbshipit-source-id: cc87a9e4e38090d784e9cea630e1c543897a1697
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42810
In this diff, the original backward pass implementation is sped up by merging the 3 iterations that computed dX, dScale, and dZeroPoint separately. In this case, a native loop is used directly at the byte level (referenced by `strides`). In addition, vectorization is used such that scale and zero point are expanded to share the same shape as X, with element-wise corresponding values along the channel axis.
In the benchmark test on the operators, for an input of shape `3x3x256x256`, we have observed the following improvement in performance:
**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~5.4x
**Speedup from non-backprop kernel**: ~1.8x
Test Plan:
To assert correctness of the new kernel, on a devvm, enter the command
`buck test //caffe2/test:quantization -- learnable_backward_per_channel`
To benchmark the operators, on a devvm, enter the command
1. Set the kernel size to 3x3x256x256 or a reasonable input size.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs for CPU are as follows:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 989024.686
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 95654.079
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 176948.970
```
4. The relevant outputs for GPU are as follows:
**Pre-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6795.173
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 4321.351
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1052.066
```
**Post-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6737.106
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 2112.484
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1078.79
```
Reviewed By: vkuzo
Differential Revision: D22946853
fbshipit-source-id: 1a01284641480282b3f57907cc7908d68c68decd
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.
Fixes https://github.com/pytorch/pytorch/issues/41780
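A hedged Python-level illustration of the new behavior (torch.nn.functional.embedding dispatches to at::native::embedding on CPU):
```
import torch
import torch.nn.functional as F

idx = torch.tensor([0])
weight = torch.tensor(1.0)   # a 0-D "weight"

try:
    F.embedding(idx, weight)
except RuntimeError as e:
    # Previously this path could segfault; now it raises a RuntimeError.
    print("embedding rejected 0-D weight:", e)
```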
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550
Reviewed By: smessmer
Differential Revision: D23040744
Pulled By: albanD
fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42617
While we figure out the random plan, I want to initially disable
support for random operations. This is because there is an ambiguity in
what randomness means. For example,
```
tensor = torch.zeros(B0, 1)
vmap(lambda t: t.normal_())(tensor)
```
in the above example, should tensor[0] and tensor[1] be equal (i.e.,
use the same random seed), or should they be different?
The mechanism for disabling random support is as follows:
- We add a new dispatch key called VmapMode
- Whenever we're inside vmap, we enable VmapMode for all tensors.
This is done via at::VmapMode::increment_nesting and
at::VmapMode::decrement_nesting.
- DispatchKey::VmapMode's fallback kernel is the fallthrough kernel.
- We register kernels that raise errors for all random functions on
DispatchKey::VmapMode. This way, whenever someone calls a random
function on any tensor (not just BatchedTensors) inside of a vmap block,
an error gets thrown.
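A minimal sketch of the user-visible effect; the `torch._vmap_internals.vmap` import path is an assumption and may differ from the public entry point:
```
import torch
from torch._vmap_internals import vmap  # internal entry point; path is an assumption

tensor = torch.zeros(2, 1)
try:
    vmap(lambda t: t.normal_())(tensor)
except RuntimeError as e:
    # Random ops are blocked under VmapMode, so this raises instead of
    # silently picking one of the two ambiguous semantics above.
    print(e)
```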
Test Plan: - pytest test/test_vmap.py -v -k "Operators"
Reviewed By: ezyang
Differential Revision: D22954840
Pulled By: zou3519
fbshipit-source-id: cb8d71062d4087e10cbf408f74b1a9dff81a226d
Summary:
Added a new option in AutogradContext to tell autograd to not materialize output grad tensors, that is, don't expand undefined/None tensors into tensors full of zeros before passing them as input to the backward function.
This PR is the second part that closes https://github.com/pytorch/pytorch/issues/41359. The first PR is https://github.com/pytorch/pytorch/pull/41490.
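For illustration, a minimal sketch using the Python-side analogue of this option (assuming it is exposed as `ctx.set_materialize_grads`, mirroring the C++ AutogradContext API):
```
import torch

class TwoOut(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Opt out of materializing undefined grads as zero-filled tensors.
        ctx.set_materialize_grads(False)
        return 2 * x, 3 * x

    @staticmethod
    def backward(ctx, g1, g2):
        # Grads of unused outputs now arrive as None instead of zero tensors,
        # so the corresponding work can be skipped.
        grad = None
        if g1 is not None:
            grad = 2 * g1
        if g2 is not None:
            grad = 3 * g2 if grad is None else grad + 3 * g2
        return grad

x = torch.randn(3, requires_grad=True)
a, b = TwoOut.apply(x)
a.sum().backward()  # b is unused, so g2 is None rather than a zero tensor
```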
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41821
Reviewed By: albanD
Differential Revision: D22693163
Pulled By: heitorschueroff
fbshipit-source-id: a8d060405a17ab1280a8506a06a2bbd85cb86461
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42763
Add the fp16 fusions as net transforms:
- layernorm fused with mul+add
- swish int8
Test Plan: added unit test, ran flows
Reviewed By: yinghai
Differential Revision: D23002043
fbshipit-source-id: f0b13d51d68c240b05d2a237a7fb8273e996328b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42706
Different backends accept different length types for collectives like MPI_Alltoallv, ncclSend/Recv(), and gloo::alltoallv(), so computeLengthsAndOffsets() is made a template.
Test Plan:
Sandcastle
CI
HPC: ./trainer_cmd.sh -p 16 -n 8 -d nccl
Reviewed By: osalpekar
Differential Revision: D22961459
fbshipit-source-id: 45ec271f8271b96f2dba76cd9dce3e678bcfb625
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42750
All of these tests fail under TSAN since we fork in a multithreaded
environment.
ghstack-source-id: 109566396
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D23007746
fbshipit-source-id: 65571607522b790280363882d61bfac8a52007a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42700
I was about to use `isBatched` somewhere not in the files used to
implement vmap but then realized how silly that sounds due to
ambiguity. This PR renames some of the BatchedTensor APIs to make a bit
more sense to onlookers.
- isBatched(Tensor) -> isBatchedTensor(Tensor)
- unsafeGetBatched(Tensor) -> unsafeGetBatchedImpl(Tensor)
- maybeGetBatched(Tensor) -> maybeGetBatchedImpl(Tensor)
Test Plan: - build Pytorch, run tests.
Reviewed By: ezyang
Differential Revision: D22985868
Pulled By: zou3519
fbshipit-source-id: b8ed9925aabffe98085bcf5c81d22cd1da026f46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42628
This PR extends the BatchedTensor fallback to support operators with
multiple Tensor returns. If an operator has multiple returns, we stack
shards of each return to create the full outputs.
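A minimal sketch of a multi-return op under vmap, matching the new test; the `torch._vmap_internals.vmap` import path is an assumption:
```
import torch
from torch._vmap_internals import vmap  # internal entry point; path is an assumption

x = torch.randn(5, 3)
var, mean = vmap(torch.var_mean)(x)  # each return is stacked along the batch dim
print(var.shape, mean.shape)         # torch.Size([5]) torch.Size([5])
```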
Test Plan:
- `pytest test/test_vmap.py -v`. Added a new test for an operator with
multiple returns (torch.var_mean).
Reviewed By: izdeby
Differential Revision: D22957095
Pulled By: zou3519
fbshipit-source-id: 5c0ec3bf51283cc4493b432bcfed1acf5509e662
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42725
This diff changes pt_defs.bzl to pt_defs.py, so that it can be included as python source file.
The reason is if we remove base ops, pt_defs.bzl becomes too big (8k lines) and we cannot pass its content to gen_oplist (python library). The easy solution is to change it to a python source file so that it can be used in gen_oplist.
Test Plan: sandcastle
Reviewed By: ljk53, iseeyuan
Differential Revision: D22968258
fbshipit-source-id: d720fe2e684d9a2bf5bd6115b6e6f9b812473f12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42400
mcarilli spotted that in the original DDP communication hook design described in [39272](https://github.com/pytorch/pytorch/issues/39272), the hooks receive grads that are already predivided by world size.
It makes sense to skip the divide completely if a hook is registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea.
We also included a warning in the register_comm_hook API as:
> GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce.
ghstack-source-id: 109548696
**Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`.
Test Plan: python test/distributed/test_c10d.py and perf benchmark tests.
Reviewed By: ezyang
Differential Revision: D22883905
fbshipit-source-id: 3277323fe9bd7eb6e638b7ef0535cab1fc72f89e
Summary:
`torch.scatter` supports two overloads – one where the `src` input tensor is the same size as the `index` tensor input, and a second where `src` is a scalar. Currently, the ONNX exporter only supports the first overload. This PR adds export support for the second overload of `torch.scatter`.
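For reference, a small eager-mode sketch of the scalar-`src` overload whose export this PR enables:
```
import torch

x = torch.zeros(3, 4)
index = torch.tensor([[0, 1], [2, 0], [1, 3]])
out = x.scatter(1, index, 2.0)  # scalar src: every indexed position receives 2.0
print(out)
```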
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42765
Reviewed By: hl475
Differential Revision: D23025189
Pulled By: houseroad
fbshipit-source-id: 5c2a3f3ce3b2d69661a227df8a8e0ed7c1858dbf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42766
**Summary**
Some Python tests are missing in `caffe2/test/TARGETS`; add them to make the coverage more comprehensive.
According to [run_test.py](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L125), some tests are slower than others. Slow tests are added as independent targets and the rest are put together into one `others` target. The reason is that we want to reduce overhead, especially for code coverage collection: tests in one target can be run as a bundle, and then coverage can be collected together. Since coverage collection is typically time-expensive, this helps us save time.
Test Plan:
Run all the new test targets locally in dev server and record the time they cost.
**Statistics**
```
# jit target
real 33m7.694s
user 653m1.181s
sys 58m14.160s
--------- Compare to Initial Jit Target runtime: ----------------
real 32m13.057s
user 613m52.843s
sys 54m58.678s
```
```
# others target
real 9m2.920s
user 164m21.927s
sys 12m54.840s
```
```
# serialization target
real 4m21.090s
user 23m33.501s
sys 1m53.308s
```
```
# tensorexpr
real 11m28.187s
user 33m36.420s
sys 1m15.925s
```
```
# type target
real 3m36.197s
user 51m47.912s
sys 4m14.149s
```
Reviewed By: malfet
Differential Revision: D22979219
fbshipit-source-id: 12a30839bb76a64871359bc024e4bff670c5ca8b
Summary:
Add a CentOS Dockerfile and support for it to the CircleCI docker builds, and allow generic image names to be parsed by build.sh, so both hardcoded and custom images can be built.
Currently this only adds a ROCm CentOS Dockerfile.
CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41255
Reviewed By: mrshenli
Differential Revision: D23003218
Pulled By: malfet
fbshipit-source-id: 562c53533e7fb9637dc2e81edb06b2242afff477
Summary:
Not sure what happened, but possibly I landed a PR on PyTorch which updated the TensorPipe submodule to a commit hash of a *PR* of TensorPipe. Now that the latter PR has been merged, that same commit has a different hash. The commit referenced by PyTorch has therefore become orphaned, which is causing some issues.
Hence I am updating the commit here, which however does not change a single line of code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42789
Reviewed By: houseroad
Differential Revision: D23023238
Pulled By: lw
fbshipit-source-id: ca2dcf6b7e07ab64fb37e280a3dd7478479f87fd
Summary:
Always promote type casts for comparison operators, regardless of whether the input is a tensor or a scalar. This is unlike arithmetic operators, where scalars are implicitly cast to the same type as tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37787
Reviewed By: hl475
Differential Revision: D21440585
Pulled By: houseroad
fbshipit-source-id: fb5c78933760f1d1388b921e14d73a2cb982b92f
Summary:
Per title. Also updates our guidance for adding aliases to clarify interned_string and method_test requirements. The alias is tested by extending test_clamp to also test clip.
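A quick sketch of the alias in use (the method form is assumed to be included alongside the function):
```
import torch

t = torch.tensor([-2.0, 0.5, 3.0])
assert torch.equal(torch.clip(t, -1, 1), torch.clamp(t, -1, 1))
assert torch.equal(t.clip(-1, 1), t.clamp(-1, 1))  # method alias, assumed included
```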
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42770
Reviewed By: ngimel
Differential Revision: D23020655
Pulled By: mruberry
fbshipit-source-id: f1d8e751de9ac5f21a4f95d241b193730f07b5dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42740
Adds a pass to hoist conv packed params to root module.
The benefit is that if there is nothing else in the conv module,
subsequent passes will delete it, which will reduce module size.
For context, freezing does not handle this because conv packed
params is a custom object.
Test Plan:
```
PYTORCH_JIT_LOG_LEVEL=">hoist_conv_packed_params.cpp" python test/test_mobile_optimizer.py TestOptimizer.test_hoist_conv_packed_params
```
Imported from OSS
Reviewed By: kimishpatel
Differential Revision: D23005961
fbshipit-source-id: 31ab1f5c42a627cb74629566483cdc91f3770a94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42739
This is a test case which fails with ASAN on at the module freezing
step.
Test Plan:
```
USE_ASAN=1 USE_CUDA=0 python setup.py develop
LD_PRELOAD=/usr/lib64/libasan.so.4 python test/test_mobile_optimizer.py TestOptimizer.test_optimize_for_mobile_asan
// output tail: https://gist.github.com/vkuzo/7a0018b9e10ffe64dab0ac7381479f23
```
Imported from OSS
Reviewed By: kimishpatel
Differential Revision: D23005962
fbshipit-source-id: b7d4492e989af7c2e22197c16150812bd2dda7cc
Summary:
add a fuse path for deq->swish->quant
update swish fake op interface to take arguments accordingly
Test Plan:
net_runner passes
unit tests need to be updated
Reviewed By: venkatacrc
Differential Revision: D22962064
fbshipit-source-id: cef79768db3c8af926fca58193d459d671321f80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42619
Added missing entries to `DispatchKey::toString()` and reordered to match declaration order in `DispatchKey.h`
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D22963407
Pulled By: bhosmer
fbshipit-source-id: 34a012135599f497c308ba90ea6e8117e85c74ac
Summary:
This function was always expecting to return a `size_t` value
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42454
Reviewed By: ezyang
Differential Revision: D22993168
Pulled By: ailzhang
fbshipit-source-id: 044df8ce17983f04681bda8c30cd742920ef7b1e
Summary:
Backout D22800959 (f30ac66e79). This one is causing the timeout (machine stuck) issues for dedup kernels. Reverting it makes the unit test pass. Still need to investigate why this is the culprit...
Original commit changeset: 641d52a51070
Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```
Reviewed By: jspark1105
Differential Revision: D23008389
fbshipit-source-id: 4f1b9a41c78eaa5541d57b9d8aa12401e1d495f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42714
Change two unit tests for the lite trainer to register two instances/objects of the same submodule type instead of the same submodule object twice.
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D22990736
Pulled By: ann-ss
fbshipit-source-id: 2bf56b5cc438b5a5fc3db90d3f30c5c431d3ae77
Summary:
This diff adds FakeQuantizeWithBackward. This works the same way as the regular FakeQuantize module, allowing QAT to occur in the forward pass, except it has an additional quantize_backward parameter. When quantize_backward is enabled, the gradients are fake quantized as well (dynamically, using hard-coded values). This allows the user to see whether there would be a significant loss of accuracy if the gradients were quantized in their model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40532
Test Plan: The relevant test for this can be run using `python test/test_quantization.py TestQATBackward.test_forward_and_backward`
Reviewed By: supriyar
Differential Revision: D22217029
Pulled By: durumu
fbshipit-source-id: 7055a2cdafcf022f1ea11c3442721ae146d2b3f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42694
The old implementation allowed calling SmallVector constructor and operator= for any type without restrictions,
but then failed with a compiler error when the type wasn't a collection.
Instead, we should only use it if Container follows a container concept and just not match the constructor otherwise.
This fixes an issue kimishpatel was running into.
ghstack-source-id: 109370513
Test Plan: unit tests
Reviewed By: kimishpatel, ezyang
Differential Revision: D22983020
fbshipit-source-id: c31264f5c393762d822f3d64dd2a8e3279d8da44
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42756
Similar to ELU, CELU was also broken in the quantized benchmark; this fixes it.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23010863
fbshipit-source-id: 203e63f9cff760af6809f6f345b0d222dc1e9e1b
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: a989b99279
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42713
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: amylittleyang
Differential Revision: D22990108
Pulled By: jspark1105
fbshipit-source-id: 3252a0f5ad9546221ef2fe908ce6b896252e1887
Summary: Add Python type annotations for the `caffe2.distributed.python` module.
Test Plan: Will check sandcastle results.
Reviewed By: jeffdunn
Differential Revision: D22994012
fbshipit-source-id: 30565cc41dd05b5fbc639ae994dfe2ddd9e56cb1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42611
**Summary**
This commit modifies the Python frontend to ignore static functions on
Torchscript classes when compiling them. They are currently included
along with methods, which causes the first argument of the
static function to be unconditionally inferred to be of the type of the
class it belongs to (regardless of how it is annotated or whether it is
annotated at all). This can lead to compilation errors depending on
how that argument is used in the body of the function.
Static functions are instead imported and scripted as if they were
standalone functions.
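A minimal sketch of the fixed behavior with a hypothetical class; the static function is compiled as a standalone function, so its first argument keeps its annotated type:
```
import torch

@torch.jit.script
class Scaler(object):
    def __init__(self, factor: int):
        self.factor = factor

    @staticmethod
    def double(x: int) -> int:
        # `x` stays an int; previously it could be inferred as Scaler,
        # which broke compilation of expressions like `2 * x`.
        return 2 * x
```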
**Test Plan**
This commit augments the unit test for static methods in `test_class_types.py`
to test that static functions can call each other and the class
constructor.
**Fixes**
This commit fixes #39308.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D22958163
Pulled By: SplitInfinity
fbshipit-source-id: 45c3c372792299e6e5288e1dbb727291e977a2af
Summary:
I noticed that `TensorIteratorDynamicCasting.h` defines a helper meta-function `CPPTypeToScalarType` which does exactly the same thing as the `c10::CppTypeToScalarType` meta-function I added in gh-40927. No need for two identical definitions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42640
Reviewed By: malfet
Differential Revision: D22969708
Pulled By: ezyang
fbshipit-source-id: 8303c7f4a75ae248f393a4811ae9d2bcacab44ff
Summary:
Awhile back when commonizing the Let and LetStmt nodes, I ended up removing both and adding a separate VarBinding section the Block. At the time I couldn't find a counter example, but I found it today: Local Vars and Allocations dependencies may go in either direction and so we need to support interleaving of those statements.
So, I've removed all the VarBinding logic and reimplemented Let statements. ZolotukhinM I think you get to say "I told you so". No new tests, existing tests should cover this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42634
Reviewed By: mruberry
Differential Revision: D22969771
Pulled By: nickgg
fbshipit-source-id: a46c5193357902d0f59bf30ab103fe123b1503f1
Summary:
This PR adds the `torch.linalg` namespace as part of our continued effort to be more compatible with NumPy. The namespace is tested by adding a single function, `torch.linalg.outer`, and testing it in a new test suite, test_linalg.py. It follows the same pattern that https://github.com/pytorch/pytorch/pull/41911, which added the `torch.fft` namespace, did.
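A usage sketch of the single function added here; `torch.linalg.outer` mirrors `torch.ger`, and since the namespace's contents are expected to evolve in follow-up PRs, treat this as illustrative rather than a description of the final API:
```
import torch
import torch.linalg  # explicit import is harmless; mirrors the torch.fft pattern

a = torch.arange(1., 4.)        # [1., 2., 3.]
b = torch.arange(1., 5.)        # [1., 2., 3., 4.]
out = torch.linalg.outer(a, b)  # same result as torch.ger(a, b)
print(out.shape)                # torch.Size([3, 4])
```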
Future PRs will likely:
- add more functions to torch.linalg
- expand the testing done in test_linalg.py, including legacy functions, like torch.ger
- deprecate existing linalg functions outside of `torch.linalg` in preference to the new namespace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42664
Reviewed By: ngimel
Differential Revision: D22991019
Pulled By: mruberry
fbshipit-source-id: 39258d9b116a916817b3588f160b141f956e5d0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42383
Test Plan - Updated existing tests to run for complex dtypes as well.
Also added tests for `torch.addmm`, `torch.badmm`
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D22960339
Pulled By: anjali411
fbshipit-source-id: 0805f21caaa40f6e671cefb65cef83a980328b7d
Summary:
If arguments in set_target_properties are not separated by whitespace, cmake raises a warning:
```
CMake Warning (dev) at cmake/public/cuda.cmake:269:
Syntax Warning in cmake code at column 54
Argument not separated from preceding token by whitespace.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42707
Reviewed By: ailzhang
Differential Revision: D22988055
Pulled By: malfet
fbshipit-source-id: c3744f23b383d603788cd36f89a8286a46b6c00f
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4787
Resurrect ONNX as a backend through onnxifiGlow (was killed as part of D16215878). Then look for the `use_glow_aot` argument in the Onnxifi op. If it's there and true, then we override whatever `backend_id` is set and use the ONNX backend.
Reviewed By: yinghai, rdzhabarov
Differential Revision: D22762123
fbshipit-source-id: abb4c3458261f8b7eeae3016dda5359fa85672f0
Summary:
This PR canonicalizes our (current) pattern for adding aliases to PyTorch. That pattern is:
- Copy the original functions native_functions.yaml entry, but replace the original function's name with their own.
- Implement the corresponding functions and have them redispatch to the original function.
- Add docstrings to the new functions that reference the original function.
- Update the alias_map in torch/csrc/jit/passes/normalize_ops.cpp.
- Update the op_alias_mappings in torch/testing/_internal/jit_utils.py.
- Add a test validating the alias's behavior is the same as the original function's.
An alternative pattern would be to use Python and C++ language features to alias ops directly. For example in Python:
```
torch.absolute = torch.abs
```
Let the pattern in this PR be the "native function" pattern, and the alternative pattern be the "language pattern." There are pros/cons to both approaches:
**Pros of the "Language Pattern"**
- torch.absolute is torch.abs.
- no (or very little) overhead for calling the alias.
- no native_functions.yaml redundancy or possibility of "drift" between the original function's entries and the alias's.
**Cons of the "Language Pattern"**
- requires manually adding doc entries
- requires updating Python alias and C++ alias lists
- requires hand writing alias methods on Tensor (technically this should require a C++ test to validate)
- no single list of all PyTorch ops -- have to check native_functions.yaml and one of the separate alias lists
**Pros of the "Native Function" pattern**
- alias declarations stay in native_functions.yaml
- doc entries are written as normal
**Cons of the "Native Function" pattern**
- aliases redispatch to the original functions
- torch.absolute is not torch.abs (requires writing test to validate behavior)
- possibility of drift between original's and alias's native_functions.yaml entries
While either approach is reasonable, I suggest the "native function" pattern since it preserves "native_functions.yaml" as a source of truth and minimizes the number of alias lists that need to be maintained. In the future, entries in native_functions.yaml may support an "alias" argument and replace whatever pattern we choose now.
Ops that are likely to use aliasing are:
- div (divide, true_divide)
- mul (multiply)
- bucketize (digitize)
- cat (concatenate)
- clamp (clip)
- conj (conjugate)
- rad2deg (degrees)
- trunc (fix)
- neg (negative)
- deg2rad (radians)
- round (rint)
- acos (arccos)
- acosh (arccosh)
- asin (arcsin)
- asinh (arcsinh)
- atan (arctan)
- atan2 (arctan2)
- atanh (arctanh)
- bartlett_window (bartlett)
- hamming_window (hamming)
- hann_window (hanning)
- bitwise_not (invert)
- gt (greater)
- ge (greater_equal)
- lt (less)
- le (less_equal)
- ne (not_equal)
- ger (outer)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42586
Reviewed By: ngimel
Differential Revision: D22991086
Pulled By: mruberry
fbshipit-source-id: d6ac96512d095b261ed2f304d7dddd38cf45e7b0
Summary: Put user embedding before ads embedding in blobReorder, for flash verification reasons.
Test Plan:
```
buck run mode/opt-clang -c python.package_style=inplace sigrid/predictor/scripts:enable_large_model_loading -- --model_path_src="/home/$USER/models/" --model_path_dst="/home/$USER/models_modified/" --model_file_name="182560549_0.predictor"
```
https://www.internalfb.com/intern/anp/view/?id=320921 to check blobsOrder
Reviewed By: yinghai
Differential Revision: D22964332
fbshipit-source-id: 78b4861476a3c889a5ff62492939f717c307a8d2
Summary:
[5/N] Implement Enum JIT support
Implement Enum class iteration
Add aten.ne for EnumType
Supported (see the sketch below):
- Enum-typed function arguments
- Using Enum type and comparing enum values
- Getting name/value attrs of enums
- Using Enum values as constants
- Enum-typed return values
- Iterating through an Enum class (enum value list)
TODO:
- Support serialization and deserialization
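A minimal sketch of the supported usage with a hypothetical enum (serialization is still TODO, as noted above):
```
from enum import Enum

import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def describe(c: Color) -> int:
    total = 0
    for member in Color:           # iterating through the Enum class
        if member != c:            # aten.ne for EnumType
            total += member.value  # value attribute access
    return total

print(describe(Color.RED))  # 2 (only GREEN differs from RED)
```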
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42661
Reviewed By: SplitInfinity
Differential Revision: D22977364
Pulled By: gmagogsfm
fbshipit-source-id: 1a0216f91d296119e34cc292791f9aef1095b5a8
Summary:
In `_jit_pass_onnx`, symbolic functions are called for each node for conversion. However, there are nodes that cannot be converted without additional context. For example, the number of outputs from split (and whether it is static or dynamic) is unknown until the point where it is unpacked by a listUnpack node. This pass does a preprocessing step and prepares the nodes so that enough context can be received by the symbolic function.
* After preprocessing, `_jit_pass_onnx` should have enough context to produce valid ONNX nodes, instead of half-baked nodes that rely on fixes from later post-passes.
* `_jit_pass_onnx_peephole` should be a pass that does ONNX specific optimizations instead of ONNX specific fixes.
* Producing more valid ONNX nodes in `_jit_pass_onnx` enables better utilization of the ONNX shape inference https://github.com/pytorch/pytorch/issues/40628.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41832
Reviewed By: ZolotukhinM
Differential Revision: D22968334
Pulled By: bzinodev
fbshipit-source-id: 8226f03c5b29968e8197d242ca8e620c6e1d42a5
Summary:
`torch.where` supports `ByteTensor` and `BoolTensor` types for the first input argument (`condition` predicate). Currently, ONNX exporter assumes that the first argument is `BoolTensor`. This PR updates the export for `torch.where` to correctly support export when first argument is a `ByteTensor`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42264
Reviewed By: houseroad
Differential Revision: D22968473
Pulled By: bzinodev
fbshipit-source-id: 7306388c8446ef3faeb86dc89d72d1f72c1c2314
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42565
After recent changes to the record function we record more
ranges in profiler output and also keep emitting sequence numbers for
all ranges.
Sequence numbers are used by external tools to correlate forward
and autograd ranges and with many ranges having the same sequence number
it becomes impossible to do this.
This PR ensures that we set sequence numbers only for the top-level
ranges and only in case when autograd is enabled.
Test Plan:
nvprof -fo trace.nvvp --profile-from-start off python test_script.py
test_script
https://gist.github.com/ilia-cher/2baffdd98951ee2a5f2da56a04fe15d0
then examining ranges in nvvp
Reviewed By: ngimel
Differential Revision: D22938828
Pulled By: ilia-cher
fbshipit-source-id: 9a5a076706a6043dfa669375da916a1708d12c19
Summary:
The test takes 5 min to finish and 5 min to spin up the environment, so it doesn't make much sense to keep it as a separate config.
Limit those tests to run only when the `USE_CUDA` environment variable is set to true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42650
Reviewed By: ailzhang
Differential Revision: D22967817
Pulled By: malfet
fbshipit-source-id: c6c26df140059491e7ff53ee9cbbc93433d2f36f
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41656
For the CPU version, this is a regression introduced in https://github.com/pytorch/pytorch/issues/10980 which vectorized the `grid_sampler_2d` implementation. It uses the AVX2 gather intrinsic which for `float` requires 32-bit indexing to match the number of floats in the AVX register. There is also an `i64gather_ps` variant but this only utilizes half of the vector width so would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.
For the CUDA version, this operation has never supported 32-bit indexing so this isn't a regression. I've templated the kernel on index type and added 64-bit variants. Although I gather in some places a simple `TORCH_CHECK(canUse32BitIndexMath(...))` is used instead. So, there is a decision to be made here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41923
Reviewed By: glaringlee
Differential Revision: D22925931
Pulled By: zou3519
fbshipit-source-id: 920816107aae26360c5e7f4e9c729fa9057268bb
Summary:
Per title. ROCm CI doesn't have MKL so this adds a couple missing test annotations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42701
Reviewed By: ngimel
Differential Revision: D22986273
Pulled By: mruberry
fbshipit-source-id: efa717e2e3771562e9e82d1f914e251918e96f64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42688
Both the profiling executor and the legacy executor now have debug logging.
Ideally, if we had a pass manager, this could be done as a part of it,
but since we have none, I had to insert the debug statements manually.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D22981675
Pulled By: ZolotukhinM
fbshipit-source-id: 22b8789e860aa90d5802fc72a4113b22c6fc4da5
Summary:
Previously, when inferring Int8FC, we failed to carry over the scale and zero point properly.
Also fixed int8 FC weight data type to be int8 instead of uint8 as that's what C2 actually uses.
Test Plan: Use net_runner to lower a single Int8Dequantize op. Previous scale and bias would always be 1 and 0. Now the proper value is set.
Reviewed By: yinghai
Differential Revision: D22912186
fbshipit-source-id: a6620c3493e492bdda91da73775bfc9117db12d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42263
Allow a way to get a reference to the stored string in a `List<optional<string>>` without having to copy the string.
This for example improves perf of the map_lookup op by 3x.
ghstack-source-id: 109162026
Test Plan: unit tests
Reviewed By: ezyang
Differential Revision: D22830381
fbshipit-source-id: e6af2bc8cebd6e68794eb18daf183979bc6297ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42189
Rehash of https://github.com/pytorch/pytorch/pull/28811, which was several months old.
As part of addressing https://github.com/pytorch/pytorch/issues/23232, this PR adds support for the following APIs:
`allgather_object` and `gather_object` to support gather/allgather of generic, pickable Python objects. This has been a long-requested feature so PyTorch should provide these helpers built-in.
The methodology is what is proposed in the original issue:
1) Pickle object to ByteTensor using torch.save
2) Comm. tensor sizes
3) Copy local ByteTensor into a tensor of maximal size
4) Call tensor-based collectives on the result of (3)
5) Unpickle back into object using torch.load
Note that the API is designed to match the existing tensor collectives, except that `async_op` is not supported; for now, it is a blocking call. If we see demand for `async_op`, we will have to make more progress on merging work/future to support it.
If this is a suitable approach, we can support `scatter`, `broadcast` in follow up PRs.
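A minimal usage sketch, assuming the released spelling `all_gather_object` and an already-initialized process group:
```
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on every rank.
obj = {"rank": dist.get_rank(), "payload": ["any", "picklable", "object"]}
gathered = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(gathered, obj)  # blocking; async_op is not supported
```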
ghstack-source-id: 109322433
Reviewed By: mrshenli
Differential Revision: D22785387
fbshipit-source-id: a265a44ec0aa3aaffc3c6966023400495904c7d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40719
This is a follow-up patch that turns on this feature in order to handle breaking forward compatibility.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D22457952
Pulled By: bzinodev
fbshipit-source-id: fac0dfed8b8b5fa2d52d342ee8cf06742959b3c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42656
This change will allow us to more freely experiment with pass pipelines
in the profiling executor without affecting passes in the legacy
executor. Also, it somewhat helps to keep all passes in one place to be
able to tell what's going on.
Currently this change should not affect any behavior as I copied the
passes exactly as they've been invoked before, but we will probably want
to change these pipelines in a near future.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D22971050
Pulled By: ZolotukhinM
fbshipit-source-id: f5bb60783a553c7b51c5343eec7f8fe40037ff99
Summary:
Run the fastrnns benchmark using the pytest-benchmark infra, then parse its JSON output and upload it to Scribe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42030
Reviewed By: malfet
Differential Revision: D22970270
Pulled By: wconstab
fbshipit-source-id: 87da9b7ddf741da14b80d20779771d19123be3c5
Summary:
This diff NVMifies the NE Eval Flow.
- It defines a `LoadNVM` operator which either
  - receives a list of NVM blobs, or
  - extracts the blobs that could be NVMified from the model,
  then dumps the NVMified blobs into NVM and deallocates them from DRAM.
- It NVMifies the Eval net on the dper and C2 backends.
Specific NVMOp for SLS is pushed through different diffs.
Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/public/ehsaardestani/temp/small_model.json 2>&1 | tee log
Reviewed By: yinghai, amylittleyang
Differential Revision: D22469973
fbshipit-source-id: ed8379ad404e96d04ac05e580176d3aca984575b
Summary:
Because 2.7.3 has a bug on GA100 which is fixed in 2.7.6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42645
Reviewed By: malfet
Differential Revision: D22977280
Pulled By: mrshenli
fbshipit-source-id: 74779eff90d7d660a988ff33659f3a2237ca7e29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42522
Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.
There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which however doesn't contain the auto-generated header we now added. We fix that by linking those targets against the `tensorpipe` CMake target instead, so that the include paths defined by TensorPipe (which contain that auto-generated header) are picked up.
I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.
Test Plan: CI
Reviewed By: malfet
Differential Revision: D22959472
fbshipit-source-id: 1959a41c4a66ef78bf0f3bd5e3964969a2a1bf67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42521
PyTorch's usage of TensorPipe is entirely wrapped within the RPC agent, which means we only need access to TensorPipe within the implementation (the .cpp file) and not in the interface (the .h file). We were however including the TensorPipe headers from the public PyTorch headers, which meant that PyTorch's downstream users had to have the TensorPipe include directories for that to work. By forward-declaring the symbols we need in the PyTorch header, and then including the TensorPipe header in the PyTorch implementation, we avoid "leaking" the dependency on TensorPipe, thus effectively keeping it private.
Test Plan: Imported from OSS
Reviewed By: beauby
Differential Revision: D22944238
Pulled By: lw
fbshipit-source-id: 2b12d59bd5beeaa439e50f9088a792c9d9bae9e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42570
ProfiledType doesn't do anything and is not used at the moment, so it is being removed.
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D22938664
Pulled By: ilia-cher
fbshipit-source-id: 037c512938028f44258b702bbcde3f8c144f4aa0
Summary:
This PR creates a new namespace, torch.fft (torch::fft) and puts a single function, fft, in it. This function is a simplified version of NumPy's [numpy.fft.fft](https://numpy.org/doc/1.18/reference/generated/numpy.fft.fft.html?highlight=fft#numpy.fft.fft) that accepts no optional arguments. It is intended to demonstrate how to add and document functions in the namespace, and is not intended to deprecate the existing torch.fft function.
Adding this namespace was complicated by the existence of the torch.fft function in Python. Creating a torch.fft Python module makes this name ambiguous: does it refer to a function or module? If the JIT didn't exist, a solution to this problem would have been to make torch.fft refer to a callable class that mimicked both the function and module. The JIT, however, cannot understand this pattern. As a workaround it's required to explicitly `import torch.fft` to access the torch.fft.fft function in Python:
```
import torch.fft
t = torch.randn(128, dtype=torch.cdouble)
torch.fft.fft(t)
```
See https://github.com/pytorch/pytorch/issues/42175 for future work. Another possible future PR is to get the JIT to understand torch.fft as a callable class so it need not be imported explicitly to be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41911
Reviewed By: glaringlee
Differential Revision: D22941894
Pulled By: mruberry
fbshipit-source-id: c8e0b44cbe90d21e998ca3832cf3a533f28dbe8d
Summary:
This PR introduces a variant of `gpu_kernel` for functions that return multiple values with `thrust::tuple`.
With this I simplified `prelu_cuda_backward_share_weights_kernel`.
### Why using `thrust::tuple`?
Because `std::tuple` does not support `operator=` on device code which makes the implementation complicated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37969
Reviewed By: paulshaoyuqiao
Differential Revision: D22868670
Pulled By: ngimel
fbshipit-source-id: eda0a29ac0347ad544b24bf60e3d809a7db1a929
Summary:
Unroll a loop with constant boundaries, replacing it with multiple
instances of the loop body. For example:
```
for x in 0..3:
A[x] = x*2
```
becomes:
```
A[0] = 0
A[1] = 2
A[2] = 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42465
Test Plan: `test_tensorexpr` unit tests.
Reviewed By: agolynski
Differential Revision: D22914418
Pulled By: asuhan
fbshipit-source-id: 72ca10d7c0b1ac7f9a3688ac872bd94a1c53dc51
Summary:
According to pytorch/rfcs#3
From the goals in the RFC:
1. Support subclassing `torch.Tensor` in Python (done here)
2. Preserve `torch.Tensor` subclasses when calling `torch` functions on them (done here)
3. Use the PyTorch API with `torch.Tensor`-like objects that are _not_ `torch.Tensor`
subclasses (done in https://github.com/pytorch/pytorch/issues/30730)
4. Preserve `torch.Tensor` subclasses when calling `torch.Tensor` methods. (done here)
5. Propagating subclass instances correctly also with operators, using
views/slices/indexing/etc. (done here)
6. Preserve subclass attributes when using methods or views/slices/indexing. (done here)
7. A way to insert code that operates on both functions and methods uniformly
(so we can write a single function that overrides all operators). (done here)
8. The ability to give external libraries a way to also define
functions/methods that follow the `__torch_function__` protocol. (will be addressed in a separate PR)
This PR makes the following changes:
1. Adds the `self` argument to the arg parser.
2. Dispatches on `self` as well if `self` is not `nullptr`.
3. Adds a `torch._C.DisableTorchFunction` context manager to disable `__torch_function__`.
4. Adds a `torch::torch_function_enabled()` and `torch._C._torch_function_enabled()` to check the state of `__torch_function__`.
5. Dispatches all `torch._C.TensorBase` and `torch.Tensor` methods via `__torch_function__`.
TODO:
- [x] Sequence Methods
- [x] Docs
- [x] Tests
Closes https://github.com/pytorch/pytorch/issues/28361
Benchmarks in https://github.com/pytorch/pytorch/pull/37091#issuecomment-633657778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37091
Reviewed By: ngimel
Differential Revision: D22765678
Pulled By: ezyang
fbshipit-source-id: 53f8aa17ddb8b1108c0997f6a7aa13cb5be73de0
Summary:
Enforce double type for the counter value in rowwise_counter.
**Context:**
The existing implementation uses float type for the counter value. But due to the precision limit of a single-precision floating-point number [1], we observed in earlier experiments that the counter value can't increment beyond 16777216.0 (i.e., the max value is 16777216.0). We decided to enforce double type to avoid this issue.
[1] https://stackoverflow.com/questions/12596695/why-does-a-float-variable-stop-incrementing-at-16777216-in-c
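For reference, a small demonstration of why a single-precision counter stalls at 2**24 while a double does not:
```
import torch

c = torch.tensor(16777216.0, dtype=torch.float32)
print((c + 1) == c)  # tensor(True): the +1 is lost to float32 rounding
print(torch.tensor(16777216.0, dtype=torch.float64) + 1)  # 16777217.0 as expected
```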
Test Plan:
op test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python/operator_test(f0b0b48c)$ buck test :rowwise_counter_test
Trace available for this run at /tmp/testpilot.20200728-083200.729292.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - test_rowwise_counter (caffe2.caffe2.python.operator_test.rowwise_counter_test.TestRowWiseCounter) 0.265 1/1 (passed)
✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - main 14.414 (passed)
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
Summary (total time 18.51s):
PASS: 2
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
optimizer test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python(7d66fbb9)$ buck test :optimizer_test
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7036874434841896
Summary (total time 64.87s):
PASS: 48
FAIL: 0
SKIP: 24
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestMomentumSgd)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestGFtrl)
caffe2/caffe2/python:optimizer_test - test_caffe2_cpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestSparseRAdam)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagradWithCounter)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestAdagrad)
caffe2/caffe2/python:optimizer_test - test_caffe2_gpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
caffe2/caffe2/python:optimizer_test - testDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagrad)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestFtrl)
caffe2/caffe2/python:optimizer_test - testSparse (caffe2.caffe2.python.optimizer_test.TestRmsProp)
...and 14 more not shown...
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
param download test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/fb/net_transforms/tests(7ef20a38)$ sudo buck test :param_download_test
Finished test run: Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924481526935
```
e2e flow:
f208394929
f207991149
f207967273
ANP notebook to check the counter value loaded from the flows
https://fburl.com/anp/5fdcbnoi
screenshot of the loaded counter (note that counter max is larger than 16777216.0)
{F250926501}
Reviewed By: ellie-wen
Differential Revision: D22711514
fbshipit-source-id: 426fed7415270aa3f276dda8141907534734337f
Summary:
Make `_forward_unimplemented` a free function instead of a method.
Since _forward_unimplemented is defined within the nn.Module class,
pylint (correctly) complains about subclasses not implementing this method.
Fixes https://github.com/pytorch/pytorch/issues/42305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42356
Reviewed By: mruberry
Differential Revision: D22867255
Pulled By: ezyang
fbshipit-source-id: ccf3e45e359d927e010791fadf70b2ef231ddb0b
Summary:
This should fix the CUDA 11 on Windows build issue:
`defined` is not a function, and so it cannot be used in macro substitution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42643
Reviewed By: pbelevich, xw285cornell
Differential Revision: D22963420
Pulled By: malfet
fbshipit-source-id: cccf7db0d03cd62b655beeb154db9e628aa749f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41934
The model exported from online training workflow with int8 quantization contains FCs with 4 inputs. The extra input is the quant_param blob. This diff is to adjust the bound_shape_inferencer and int8 op schema to get shape info for the quant_param input.
Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```
Reviewed By: yinghai
Differential Revision: D22683554
fbshipit-source-id: 684d1433212a528120aba1c37d27e26b6a31b403
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42491
Hooks up quantized batchnorm_1d to the quantized_bn kernel. Eager mode
hookup will be in a future PR, and graph mode should work after this PR.
Note: currently the implementation is ~2x slower on the benchmark than q_batch_norm2d
because we convert back to contiguous memory format at the end, since
channels_last is only defined for rank >= 4. If further optimization is
needed, that can be a separate PR (will need the NHWC folks to see if
there is a workaround). Meanwhile, having this is better than not having anything.
Context: There have been both internal and external requests for various
quantized BN1d use cases.
Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d_relu
python test/test_quantization.py TestQuantizeJitOps.test_qbatch_norm
// performance:
// https://gist.github.com/vkuzo/73a07c0f24c05f5804990d9ebfaecf5e
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22926254
fbshipit-source-id: 2780e6a81cd13a7455f6ab6e5118c22850a97a12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41603
Pull Request resolved: https://github.com/pytorch/glow/pull/4704
Previously in the glow onnxifi path, when an error is encountered, we log it to stderr then just return ONNXIFI_STATUS_INTERNAL_ERROR to C2. C2 then does CAFFE2_ENFORCE_EQUAL(return_code, ONNXIFI_STATUS_SUCCESS). The error message that eventually went to the user is something like
[enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0
This diff adds plumbing to get human readable error message out of glow into C2.
Test Plan:
Run ads replayer. Overload it with traffic. Now the error message sent back to the client used to be
E0707 00:57:45.697196 3709559 Caffe2DisaggAcceleratorTask.cpp:493] During running REMOTE_OTHER net: [enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0 (Error from operator:....
Now it's
```
E0707 16:46:48.366263 1532943 Client.cpp:966] Exception when calling caffe2_run_disagg_accelerator on remote predictor for model 190081310_0 : apache::thrift::TApplicationException: c10::Error: [enforce fail at onnxifi_op.cc:556] .
Error code: RUNTIME_REQUEST_REFUSED
Error message: The number of allowed queued requests has been exceeded. queued requests: 100 allowed requests: 100
Error return stack:
glow/glow/lib/Runtime/HostManager/HostManager.cpp:673
glow/glow/lib/Onnxifi/HostMana (Error from operator:...
```
Reviewed By: gcatron, yinghai
Differential Revision: D22416857
fbshipit-source-id: 564bc7644d9666eb660725c2dca5637affae9b73
Summary:
Refactor common pattern of (torch.cuda.version and [int(x) for x in torch.cuda.version.split(".")] >= [a, b]) into a `_get_torch_cuda_version()` function.
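A hypothetical sketch of such a helper; the name and return format follow the description above, and it reads `torch.version.cuda`:
```
import torch

def _get_torch_cuda_version():
    # Hypothetical sketch: returns a comparable version tuple, (0, 0) on CPU-only builds.
    if torch.version.cuda is None:
        return (0, 0)
    return tuple(int(x) for x in torch.version.cuda.split("."))

# Replaces inlined checks like: torch.version.cuda and [int(x) for x in ...] >= [a, b]
if _get_torch_cuda_version() >= (10, 2):
    pass  # e.g. enable a test that requires CUDA >= 10.2
```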
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42626
Reviewed By: seemethere
Differential Revision: D22956149
Pulled By: malfet
fbshipit-source-id: 897c55965e53b477cd20f69e8da15d90489035de
Summary: This breaks if we cut the net at certain int8 op boundaries.
Test Plan: Use net_runner to lower a single Int8Quantize op. It used to break; now it works.
Reviewed By: yinghai
Differential Revision: D22912178
fbshipit-source-id: ca306068c9768df84c1cfa8b34226a1330e19912
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42528
It seems it was an oversight that they weren't run. This allows us to simplify our auto-generation logic, as now all test suites are run in both modes.
ghstack-source-id: 109229969
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D22922151
fbshipit-source-id: 0766a6970c927efb04eee4894b73d4bcaf60b97f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40823
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
As it is now easier to spot that the TensorPipe agent wasn't being run on some test suite, we fix that. We keep this change for last so that if those tests turn out to be flaky and must be reverted this won't affect the rest of the stack.
ghstack-source-id: 109229469
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22309432
fbshipit-source-id: c433a6a49a7b6737e0df4cd953f3dfde290f20b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40822
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This is the last step of removing TEST_CONFIG. As there was no one left using it, there is really not much to it.
ghstack-source-id: 109229471
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22307778
fbshipit-source-id: 0d9498d9367eec671e0a964ce693015f73c5638c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40821
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on an equal footing, by not having any of them be the default, thus making it easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This change continues the work towards removing TEST_CONFIG, by taking a few functions that were accepting the agent name (as obtained from TEST_CONFIG) and then did a bunch of if/elses on it, and replace them by new abstract methods on the fixtures, so that these functions become "decentralized".
ghstack-source-id: 109229472
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22307776
fbshipit-source-id: 9e1f6edca79aacf0bcf9d83d50ce9e0d2beec0dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40820
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on an equal footing, by not having any of them be the default, thus making it easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
Now that no one is using the generic fixture anymore (i.e., the fixture that looks up the agent's name in the global TEST_CONFIG) we can make it abstract, i.e., have its methods become no-ops and add decorators that will require all subclasses to provide new implementations of those methods. This is a first step towards removing TEST_CONFIG.
ghstack-source-id: 109229475
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22307777
fbshipit-source-id: e52abd915c37894933545eebdfdca3ecb9559926
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40819
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on an equal footing, by not having any of them be the default, thus making it easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This diff removes the two decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which were used to skip tests. They were only used to prevent the TensorPipe agent from running tests that were using the process group agent's options. The converse (preventing the PG agent from using the TP options) is achieved by having those tests live in a `TensorPipeAgentRpcTest` class. So here we're doing the same for process group, by moving those tests to a `ProcessGroupAgentRpcTest` class.
ghstack-source-id: 109229473
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22283179
fbshipit-source-id: b9315f9fd67f35e88fe1843faa161fc53a4133c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40818
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on an equal footing, by not having any of them be the default, thus making it easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This diff does the changes described above for the process group agent. It defines a fixture for it (instead of using the generic fixture in its default behavior) and then merges all the entry points into a single script. Note that after this change there won't be anymore a "vanilla" RPC test: all test scripts now specify what agent they are using. This puts all agents on equal standing.
ghstack-source-id: 109229474
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22283182
fbshipit-source-id: 7e3626bbbf37d88b892077a03725f0598576b370
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40817
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on an equal footing, by not having any of them be the default, thus making it easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This diff does the changes described above for the faulty agent, which is its own strange beast. It merges all the test entry points (i.e., the combinations of agent, suite and fork/spawn) into a single file. It also modifies the test suites that are intended to be run only on the faulty agent, which used to inherit from its fixture, to inherit from the generic fixture, as they will be mixed in with the faulty fixture at the very end, inside the entry point script.
ghstack-source-id: 109229477
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22283178
fbshipit-source-id: 72659efe6652dac8450473642a578933030f2c74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40816
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on an equal footing, by not having any of them be the default, thus making it easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This diff does the changes described above for the TensorPipe agent. It fixes its fixture (making it inherit from the generic fixture) and merges all the entry point scripts into a single one, so that it's easier to have a clear overview of all the test suites which we run on TensorPipe (you'll notice that many are missing: the JIT ones, the remote module one, ...).
ghstack-source-id: 109229476
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22283180
fbshipit-source-id: d5e9f9f4e6d4bfd6fbcae7ae56eed63d2567a02f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42567
Before this change we didn't expand arguments, and thus in an expr
`sigmoid(sigmoid(x))` only the outer call was expanded.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D22936177
Pulled By: ZolotukhinM
fbshipit-source-id: 9c05dc96561225bab9a90a407d7bcf9a89b078a1
Summary:
For CUDA >= 10.2, the `CUBLAS_WORKSPACE_CONFIG` environment variable must be set to either `:4096:8` or `:16:8` to ensure deterministic CUDA stream usage. This PR adds some logic inside `torch.set_deterministic()` to raise an error if this environment variable is not set properly and CUDA >= 10.2.
Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41377
Reviewed By: malfet
Differential Revision: D22758459
Pulled By: ezyang
fbshipit-source-id: 4b96f1e9abf85d94ba79140fd927bbd0c05c4522
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42322
Our current type checking rules are rather lax, and for
example don't force users to make sure they annotate all functions
with types. For code generation code, it would be better to force
100% typing. This PR introduces a new mypy configuration
mypy-strict.ini which applies rules from --strict. We extend
test_type_hints.py to test for this case. It only covers
code_template.py, which I have made strict clean in this PR.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D22846120
Pulled By: ezyang
fbshipit-source-id: 8d253829223bfa0d811b6add53b7bc2d3a4356b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42591
We don't support lowering with 2-input Int8Quantize and 4-input Int8FC. Just do a conversion to absorb the quantization params into the op itself.
Test Plan:
```
buck test caffe2/caffe2/quantization/server:quantize_dnnlowp_op_test
```
Reviewed By: benjibc
Differential Revision: D22942673
fbshipit-source-id: a392ba2afdfa39c05c5adcb6c4dc5f814c95e449
Summary:
1. Fix an illegal memory access issue in the SplitByLengths operator in the CUDA context.
2. Add support for scaling lengths vectors in the SplitByLengths operator.
3. Add support for testing the SplitByLengths operator in the CUDA context.
Example of the SplitByLengths operator processing a scaling lengths vector:
value vector A = [1, 2, 3, 4, 5, 6]
length vector B = [1, 2]
After execution of the SplitByLengths operator, the output should be [1, 2] and [3, 4, 5, 6].
Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test
Reviewed By: kennyhorror
Differential Revision: D22780307
fbshipit-source-id: c5ca60ae16b24032cedfa045a421503b713daa6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249
Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.
Basic logic:
| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning | throw exception with ASAN message |
Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.
Other clean up changes:
* always cache device_count() in a static variable
* move all ASAN macros into c10
Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):
```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```
Reviewed By: ngimel
Differential Revision: D22824329
fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42569
We do not need to allocate buffers for Vulkan tensors if they are not the forward input or output.
This change removes the default allocate_storage() call for the outputs of operations; their image representation will hold the result instead.
A buffer is allocated only if an operation requests it (for some ops like concatenate or transpose) or for a copy to host.
If the buffer was not allocated, `VulkanTensor.image()` just allocates the texture, skipping the copy from buffer to texture.
Since allocate_storage was previously done for all operations, we save a buffer allocation and a buffer_to_image call.
MobileNetV2 on my Pixel 4:
```
flame:/data/local/tmp $ ./speed_benchmark_torch --model=mnfp32-vopt.pt --input_type=float --input_dims=1,3,224,224 --warmup=3 --iter=20 --vulkan=true
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 305818. Iters per second: 3.26991
Segmentation fault
```
```
139|flame:/data/local/tmp $ ./speed_benchmark_torch_noas --model=mnfp32-vopt.pt --input_type=float --input_dims=1,3,224,224 --warmup=3 --iter=20 --vulkan=true
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 236768. Iters per second: 4.22355
Segmentation fault
```
Test Plan: Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D22946552
Pulled By: IvanKobzarev
fbshipit-source-id: ac0743bb316847632a22cf9aafb8938e50b2fb7b
Summary:
This PR removes manual registration in the aten/native codebase.
It also separates manual device/catchall kernel registration from manual VariableType kernel registration.
The first one remains as manual_kernel_registration in native_functions.yaml.
The second one is moved to the tools/ codegen.
Difference in generated TypeDefault.cpp: https://gist.github.com/ailzhang/897ef9fdf0c834279cd358febba07734
No difference in generated VariableType_X.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42386
Reviewed By: agolynski
Differential Revision: D22915649
Pulled By: ailzhang
fbshipit-source-id: ce93784b9b081234f05f3343e8de3c7a704a5783
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 4abc34af1a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42584
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D22941475
fbshipit-source-id: 29863cad7f77939edb44d337918693879b35cfaa
Summary:
This PR intends to fix https://github.com/pytorch/pytorch/issues/32983.
The initial (one-line) diff causes statically linked cudnn symbols in `libtorch_cuda.so` to have local linkage (such that they shouldn't be visible to external libraries during dynamic linking at load time), at least in my source build on Ubuntu 20.04.
Procedure I used to verify:
```
export USE_STATIC_CUDNN=ON
python3 setup.py install
...
```
then
```
mcarilli@mcarilli-desktop:~/Desktop/mcarilli_github/pytorch/torch/lib$ nm libtorch_cuda.so | grep cudnnCreate
00000000031ff540 t cudnnCreate
00000000031fbe70 t cudnnCreateActivationDescriptor
```
Before the diff they were marked with capital `T`s indicating external linkage.
Caveats:
- The fix is gcc-specific afaik. I have no idea how to enable it for Windows or other compilers.
- Hiding the cudnn symbols will break external C++ applications that rely on linking `libtorch.so` to supply cudnn symbol definitions. IMO this is "off menu" usage so I don't think it's a major concern. Hiding the symbols _won't_ break applications that call cudnn indirectly through torch functions, which IMO is the "on menu" way.
- I know _very little_ about the build system. The diff's intent is to add a link option that applies to any Pytorch `.so`s that statically link cudnn, and does so on Linux only. I'm blindly following soumith 's recommendation https://github.com/pytorch/pytorch/issues/32983#issuecomment-662056151, and post-checking the built libs (I also added `set(CMAKE_VERBOSE_MAKEFILE ON)` to the top-level CMakeLists.txt at one point to confirm `-Wl,--exclude-libs,libcudnn_static.a` was picked up by the command that linked `libtorch_cuda.so`).
- https://github.com/pytorch/pytorch/issues/32983 (which used a Pytorch 1.4 binary build) complained about `libtorch.so`, not `libtorch_cuda.so`:
```
nvpohanh@ubuntu:~$ nm /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch.so | grep ' cudnnCreate'
000000000f479c30 T cudnnCreate
000000000f475ff0 T cudnnCreateActivationDescriptor
```
In my source build, `libtorch.so` ends up small, containing no cudnn symbols (this is true with or without the PR's diff), which contradicts https://github.com/pytorch/pytorch/issues/32983. Maybe the symbol organization (what goes in `libtorch.so` vs `libtorch_cuda/cpu/whatever.so`) changed since 1.4. Or maybe the symbol organization is different for source vs binary builds, in which case I have no idea if this PR's diff has the same effect for a binary build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41986
Reviewed By: glaringlee
Differential Revision: D22934926
Pulled By: malfet
fbshipit-source-id: 711475834e0f8148f0e5f2fe28fca5f138ef494b
Summary: The current OutputColumnMaxHistogramObserver outputs 2048 bins for each column, so the file is extremely large and the dumping time is quite long, even though in the end we only use the min and max. This diff makes the number of bins configurable via an argument, with the default set to 16 to reduce dumping overhead. When more bins are needed to analyze the results, only this argument needs to be changed.
Test Plan:
buck run caffe2/caffe2/quantization/server:observer_test
{F263843430}
Reviewed By: hx89
Differential Revision: D22918202
fbshipit-source-id: bda34449355b269b24c55802012450ebaa4d280c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42486
**Summary**
This commit fixes a small bug in which `torch.jit.is_tracing()` returns
`torch._C.is_tracing`, the function object, instead of calling the
function and returning the result.
**Test Plan**
Continuous integration?
**Fixes**
This commit fixes #42448.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D22911062
Pulled By: SplitInfinity
fbshipit-source-id: b94eca0c1c65ca6f22acc6c5542af397f2dc37f0
Summary:
A custom layer defined via `torch.autograd.Function` appears to the lower_tuple pass as `prim::PythonOp`. Adding this op type to the allowed list enables the lower_tuple pass for such nodes, which helps with exporting custom layers that have tuple outputs.
E.g.
```python
import torch
class CustomFunction(torch.autograd.Function):
    @staticmethod
    def symbolic(g, input):
        return g.op('CustomNamespace::Custom', input, outputs=2)

    @staticmethod
    def forward(ctx, input):
        return input, input

class Custom(torch.nn.Module):
    def forward(self, input):
        return CustomFunction.apply(input)

model = Custom()
batch = torch.FloatTensor(1, 3)
torch.onnx.export(model, batch, "test.onnx", verbose=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41548
Reviewed By: glaringlee
Differential Revision: D22926143
Pulled By: bzinodev
fbshipit-source-id: ce14d1d3c70a920154a8235d635ab31ddf0c46f3
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 87c378172a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42496
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D22911638
fbshipit-source-id: f20c83908b51ff56d8bf1d8b46961f70d023c81a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42418.
The problem was that the non-contiguous batched matrices were passed to `gemmStridedBatched`.
The following code fails on master and works with the proposed patch:
```python
import torch
x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0')
c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1])
torch.einsum('...ab,...bc->...ac', c, c)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42425
Reviewed By: glaringlee
Differential Revision: D22925266
Pulled By: ngimel
fbshipit-source-id: a72d56d26c7381b7793a047d76bcc5bd45a9602c
Summary:
Add a space between the double backquotes and the left curly bracket.
Otherwise doc generation fails with `Inline literal start-string without end-string.`
This regression was introduced by b56db305cf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42559
Reviewed By: glaringlee
Differential Revision: D22931527
Pulled By: malfet
fbshipit-source-id: 11c04a92dbba48592505f704d77222cf92a81055
Summary:
Initial PR for the Tensor List functionality.
**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.
**In this PR**
- Adding `multi_tensor_apply` mechanism which will help to efficiently apply passed functor on a given list of tensors on CUDA.
- Adding a first private API - `std::vector<Tensor> _foreach_add(TensorList tensors, Scalar scalar)` (see the usage sketch below)
**Tests**
Tested via unit tests
**Plan for the next PRs**
1. Cover these ops with `multi_tensor_apply` support
- exponent
- division
- mul_
- add_
- addcmul_
- addcdiv_
- Sqrt
2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41554
Reviewed By: cpuhrsch
Differential Revision: D22829724
Pulled By: izdeby
fbshipit-source-id: 47febdbf7845cf931958a638567b7428a24782b1
Summary:
Was getting an error when attempting to push to master for
pytorch/pytorch.github.io since the main branch on that repository is
actually site and not master.
Also get rid of the loop, since it wasn't going to work with a
conditional, and a conditional on a two-variable loop just isn't worth the
readability concerns.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42552
Reviewed By: malfet
Differential Revision: D22929503
Pulled By: seemethere
fbshipit-source-id: acdd26b86718304eac9dcfc81761de0b3e609004
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42211
Helper functions for launching CUDA Sleep and Tensor Value Initialization for the collective test functions.
This is more of a code cleanup fix compared to the previous diffs.
ghstack-source-id: 109097243
Test Plan: working on devGPU and devvm
Reviewed By: jiayisuse
Differential Revision: D22782671
fbshipit-source-id: 7d88f568a4e08feae778669affe69c8d638973db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42209
This PR adds a TearDown function to the testing superclass to ensure that the NCCL_BLOCKING_WAIT environment variable is reset after each test case.
ghstack-source-id: 109097247
Test Plan: Working on devGPU and devvm.
Reviewed By: jiayisuse
Differential Revision: D22782672
fbshipit-source-id: 8f919a96d7112f9f167e90ce3df59886c88f3514
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42208
ProcessGroupNCCLTest is currently written without any testing framework, and all tests are simply called from the main function and throw exceptions upon failure. As a result, it is hard to debug and pinpoint which tests have succeeded/failed.
This PR moves ProcessGroupNCCLTest to gtest with appropriate setup and skipping functionality in the test superclass.
ghstack-source-id: 109097246
Test Plan: Working Correctly on devGPU and devvm.
Reviewed By: jiayisuse
Differential Revision: D22782673
fbshipit-source-id: 85bd407f4534f3d339ddcdd65ef3d2022aeb7064
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42494
Note that we're currently assuming that the dtypes of all the arguments and
the return value are the same.
Test Plan: Imported from OSS
Reviewed By: nickgg
Differential Revision: D22910755
Pulled By: ZolotukhinM
fbshipit-source-id: 7f899692065428fbf2ad05d22b4ca39cab788ae5
Summary:
We have code snippet like below in VariableType_X.cpp
```
Tensor __and___Scalar(const Tensor & self, Scalar other) {
  auto result = TypeDefault::__and___Scalar(self, other);
  return result;
}
TORCH_LIBRARY_IMPL(aten, Autograd, m) {
  m.impl("__and__.Scalar",
         c10::impl::hacky_wrapper_for_legacy_signatures(TORCH_FN(VariableType::__and___Scalar))
  );
}
```
We already register TypeDefault kernels as catchAll, so they don't need to be wrapped and registered to the Autograd key in VariableType.cpp. This PR removes the wrapper and registration in VariableType.cpp. (The ones in other files like TracedType.cpp remain the same.)
Here's a [diff in generated VariableTypeEverything.cpp](https://gist.github.com/ailzhang/18876edec4dad54e43a1db0c127c5707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42031
Reviewed By: agolynski
Differential Revision: D22903507
Pulled By: ailzhang
fbshipit-source-id: 04e6672b6c79e079fc0dfd95c409ebca7f9d76fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42514
Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.
Reviewed By: mrshenli
Differential Revision: D22917967
fbshipit-source-id: 402f2870915bc237845864a4a27c97df4351d975
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42480
These are grouped together because they all return a tuple of multiple
tensors.
This PR implements batching rules for chunk, split, and unbind. It also
updates the testing logic: previously, reference_vmap was not able to
handle multiple outputs; now it does.
Test Plan: - `pytest test/test_vmap.py -v -k "Operators"`
Reviewed By: ezyang
Differential Revision: D22905401
Pulled By: zou3519
fbshipit-source-id: 9963c943d035e9035c866be74dbdf7ab1989f8c4
Summary:
A segfault happens when one tries to deallocate an uninitialized generator.
Make `THPGenerator_dealloc` UBSAN-safe by moving the implicit cast in the struct definition to a reinterpret_cast.
Add `TestTorch.test_invalid_generator_raises`, which validates that a Generator created on an invalid device is handled correctly.
Fixes https://github.com/pytorch/pytorch/issues/42281
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42510
Reviewed By: pbelevich
Differential Revision: D22917469
Pulled By: malfet
fbshipit-source-id: 5eaa68eef10d899ee3e210cb0e1e92f73be75712
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42516
att. We need it for some scripts.
Reviewed By: houseroad
Differential Revision: D22918112
fbshipit-source-id: 8a1696ceeeda67a34114bc57cb52c925711cfb4c
Summary:
* move both under new file `fixup_onnx_controlflow`
* move the fixup to where the ONNX loop/if node is created, as opposed to running the fixup as a post-pass. This will help with enabling ONNX shape inference later.
* move `fuseSequenceSplitConcat` to `Peephole`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40943
Reviewed By: mrshenli
Differential Revision: D22709999
Pulled By: bzinodev
fbshipit-source-id: 51d316991d25dc4bb4047a6bb46ad1e2401d3d2d
Summary:
The ONNX pass `torch._C._jit_pass_onnx_function_substitution(graph)` inlines the function into the compiled torch graph. But while it removes all connections to the compiled function node (e.g. see below - `%6 : Function = prim::Constant[name="f"]()`), it does not remove the function node itself. For example, if the input graph is:
```
graph(%0 : Long(requires_grad=0, device=cpu),
%1 : Long(requires_grad=0, device=cpu)):
%6 : Function = prim::Constant[name="f"]()
%7 : Tensor = prim::CallFunction(%6, %0, %1)
return (%7)
```
The output graph is:
```
graph(%0 : Long(requires_grad=0, device=cpu),
%1 : Long(requires_grad=0, device=cpu)):
%6 : Function = prim::Constant[name="f"]()
%8 : int = prim::Constant[value=1]()
%z.1 : Tensor = aten::sub(%0, %1, %8) # test/onnx/test_utility_funs.py:790:20
%10 : Tensor = aten::add(%0, %z.1, %8) # test/onnx/test_utility_funs.py:791:23
return (%10)
```
Note that the `%6 : Function = prim::Constant[name="f"]()` has not been removed (though it is not being used).
This PR updates the pass to remove the function node completely. The updated graph looks as follows:
```
graph(%0 : Long(requires_grad=0, device=cpu),
%1 : Long(requires_grad=0, device=cpu)):
%8 : int = prim::Constant[value=1]()
%z.1 : Tensor = aten::sub(%0, %1, %8) # test/onnx/test_utility_funs.py:790:20
%10 : Tensor = aten::add(%0, %z.1, %8) # test/onnx/test_utility_funs.py:791:23
return (%10)
```
A test point has also been added for this scenario.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42146
Reviewed By: VitalyFedyunin
Differential Revision: D22845314
Pulled By: bzinodev
fbshipit-source-id: 81fb351f0a36f47204e5327b60b84d7a91d3bcd9
Summary:
`as_strided` creates a view of an existing tensor with specified `sizes`, `strides`, and `storage_offsets`. This PR supports the export of `as_strided` with static argument `strides`. The following scenarios will not be supported:
* Calling on tensor of dynamic shape, i.e. the tensor shape differs between model runs and different model inputs.
* In-place operations, i.e. updates to the original tensor that are expected to be reflected in the `as_strided` output, and vice versa.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41569
Reviewed By: VitalyFedyunin
Differential Revision: D22845295
Pulled By: bzinodev
fbshipit-source-id: 7d1aa88a810e6728688491478dbf029f17ae7201
Summary:
This PR initiates the process of updating the TorchScript backend interface used by the ONNX exporter.
- Replace jit lower graph pass by freeze module pass
- Enable ScriptModule tests for ONNX operator tests (ORT backend) and model tests by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41413
Reviewed By: VitalyFedyunin
Differential Revision: D22845258
Pulled By: bzinodev
fbshipit-source-id: d57fd4086f27bd0c3bf5f70af7fd0daa39a2814a
Summary:
Export dynamic torch.eye, i.e. where the shape commonly comes from another tensor and is not known at export time.
Static torch.eye, where n and m are constants, is exported directly as a constant tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41357
Reviewed By: VitalyFedyunin
Differential Revision: D22845220
Pulled By: bzinodev
fbshipit-source-id: 6e5c331fa28ca542022ea16f9c88c69995a393b2
Summary:
* It originally failed to check for cases where the highlight token appears more than once.
* Now, if a token doesn't seem correctly highlighted, it repeatedly tries to find the highlight token until the end of the error message is reached.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42417
Reviewed By: SplitInfinity
Differential Revision: D22889411
Pulled By: gmagogsfm
fbshipit-source-id: 994835db32849f3d7e98ab7f662bd5c6b8a1662e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42137
This PR implements an SGD optimizer class similar to torch::optim::SGD, but it doesn't inherit from torch::optim::Optimizer, for use on mobile devices (or other lightweight use case).
Adding Martin's comment for visibility: "SGD may be the only optimizer used in near future. If more client optimizers are needed, refactoring the full optim codes and reusing the existing code would be an option."
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D22846514
Pulled By: ann-ss
fbshipit-source-id: f5f46804aa021e7ada7c0cd3f16e24404d10c7eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42492
There's a potential for multiple tags to be created for the same digest,
so we should iterate through all potential tags to avoid deleting digests
that are associated with tags we actually want to keep.
Also, reduced the number of prints in this script to only the absolutely
necessary ones (i.e. only the deleted images).
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D22909248
Pulled By: seemethere
fbshipit-source-id: 7f2e540d133485ed6464e413b01ef67aa73df432
Summary:
A segfault happens when one tries to deallocate an uninitialized generator.
Add `TestTorch.test_invalid_generator_raises`, which validates that a Generator created on an invalid device is handled correctly.
Fixes https://github.com/pytorch/pytorch/issues/42281
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42490
Reviewed By: seemethere
Differential Revision: D22908795
Pulled By: malfet
fbshipit-source-id: c5b6a35db381738c0fc984aa54e5cab5ef2cbb76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42421
Previously, we could only feed shape info from Python with float dtype and batch-based dim type when doing onnxifi from Python. This diff removes this limitation and uses the TensorBoundShapes protobuf as a generic shape info struct. This will make the onnxifi interface in Python more flexible.
Reviewed By: ChunliF
Differential Revision: D22889781
fbshipit-source-id: 1a89f3a68c215a0409738c425b4e0d0617d58245
Summary:
This is to unify how the output scale calculation is done between
fbgemm and qnnpack (servers vs. mobile).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41342
Test Plan: Quantization tests.
Reviewed By: vkuzo
Differential Revision: D22506347
Pulled By: kimishpatel
fbshipit-source-id: e14d22f13c6e751cafa3e52617e76ecd9d39dad5
Summary:
The `ninputs` variable was always used as a `size_t` but declared as an `int32_t`.
This fixes some annoying warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42453
Reviewed By: agolynski
Differential Revision: D22898282
Pulled By: mrshenli
fbshipit-source-id: b62d6b07f0bc3717482906df6010d88762ae0ccd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38697
Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):
```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                 (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```
Before:
```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```
After:
```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D22291236
Pulled By: VitalyFedyunin
fbshipit-source-id: 134dd08b77b11e631d914b5500ee4285b5d0591e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42276
This commit converts `_wait_all_workers()` to `_all_gather()` by
allowing each worker to provide its own data object. The `_all_gather()`
function blocks and returns the gathered results. This API can be
converted to `rpc.barrier()` later.
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D22853480
Pulled By: mrshenli
fbshipit-source-id: 9d506813b9fd5b7c144885e2b76a863cbd19466a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42248
Including:
- torch.diagonal
- torch.t
- torch.select
- Tensor.expand_as
- Tensor slicing.
Please let me know in the future if it would be easier to review these
separately (I put five operators into this PR because each
implementation is relatively simple).
Test Plan:
- new tests in `test/test_vmap.py`.
- I would like to have a more structured/automated way of testing but
my previous attempts at making something resulted in something very
complicated.
Reviewed By: ezyang
Differential Revision: D22846273
Pulled By: zou3519
fbshipit-source-id: 8e45ebe11174512110faf1ee0fdc317a25e8b7ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41943
If an operator doesn't have a batching rule implemented, then we fall back
to this implementation. The fallback only works on out-of-place operators
that return only tensors with new memory (e.g., no in-place operators and
no view operations).
The fallback effectively takes all of the BatchedTensors in `stack`,
slices them, and runs `op` on all of the corresponding slices to produce slices
of the outputs. The output slices then get `torch.stack`ed to create the
final returns.
The performance of the fallback is not very good because it introduces
an extra copy from stacking the sliced outputs. Because of this, we prefer
to write batching rules for operators whenever possible.
In the future, I'd like to disable the fallback kernel for random
functions until we have a better random story for vmap. I will probably
add a blocklist of operators to support that.
Test Plan: - `pytest test/test_vmap.py -v`
Reviewed By: ezyang
Differential Revision: D22764103
Pulled By: zou3519
fbshipit-source-id: b235833f7f27e11fb76a8513357ac3ca286a638b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42381
Introduce new tag to support distributed hogwild.
Reviewed By: boryiingsu
Differential Revision: D20484099
fbshipit-source-id: 5973495589e0a7ab185d3867b37437aa747f408a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42397
Since the autograd registration is unified to code-gen, we don't need to keep a manual registration file for mobile.
Remove it to avoid extra maintenance.
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D22883153
Pulled By: iseeyuan
fbshipit-source-id: 6db0bd89369beab9eed6e9a9692dd46f5bd1ff48
Summary:
`abs` doesn't have an overload for unsigned types across all compilers, so applying abs on uint8_t can be ambiguous: https://en.cppreference.com/w/cpp/numeric/math/abs
This may cause unexpected issues when the input is uint8 and is greater
than 128. For example, on MSVC, applying `std::abs` to an unsigned char
variable
```c++
#include <cmath>
unsigned char a(unsigned char x) {
  return std::abs(x);
}
```
gives the following warning:
warning C4244: 'return': conversion from 'int' to 'unsigned char',
possible loss of data
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42254
Reviewed By: VitalyFedyunin
Differential Revision: D22860505
Pulled By: mruberry
fbshipit-source-id: 0076d327bb6141b2ee94917a1a21c22bd2b7f23a
Summary:
Many ufuncs (mostly unary ufuncs) in NumPy promote integer inputs to float. This typically occurs when the results of the function are not representable as integers.
For example:
```
a = np.array([1, 2, 3], dtype=np.int64)
np.sin(a)
: array([0.84147098, 0.90929743, 0.14112001])
```
In PyTorch we only have one function, `torch.true_divide`, which exhibits this behavior today, and it does so by explicitly pre-casting its inputs to the default (float) scalar type where necessary before calling TensorIterator.
This PR lets TensorIterator understand and implement this behavior directly, and it updates `torch.true_divide` to verify the behavior is properly implemented. This will be convenient when implementing more integer->float promotions later (like with `torch.sin`), and also saves copies on CUDA, where the cast from one dtype to another is fused with the computation.
The mechanism for this change is simple. A new flag, `promote_integer_inputs_to_float_`, is added to TensorIteratorConfig, and it requires that inputs be promoted to a common dtype when it's set. When the new flag is set, after the TensorIterator's "common dtype" (AKA "computation type") is computed, it's checked for being an integral (boolean included) type and, if it is, changed to the default (float) scalar type instead. Only `torch.true_divide` sets this flag (for now).
In the future we'll likely...
- provide helpers (`binary_float_op`, `unary_float_op`) to more easily construct functions that promote int->float instead of requiring they build their own TensorIteratorConfigs.
- update torch.atan2 to use `binary_float_op`
- update many unary ufuncs, like `torch.sin` to use `unary_float_op` and support unary ops having different input and result type (this will also require a small modification to some of the "loops" code)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42359
Reviewed By: ngimel
Differential Revision: D22878394
Pulled By: mruberry
fbshipit-source-id: b8de01e46be859321522da411aed655e2c40e5b9
Summary:
Raise and assert used to have a hard-coded error message of "Exception"; the user-provided error message was ignored. This PR adds support for representing the user's error message in TorchScript.
This breaks backward compatibility because now we actually need to script the user's error message, which can potentially contain unscriptable expressions. Such programs can break when scripting, but saved models can still continue to work.
Increased an op count in test_mobile_optimizer.py because now we need aten::format to form the actual exception message.
This is built upon a WIP PR: https://github.com/pytorch/pytorch/pull/34112 by driazati
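A minimal sketch of what this enables from the user's perspective; the function and message below are made up for illustration:
```python
import torch

@torch.jit.script
def check_nonempty(x: torch.Tensor) -> torch.Tensor:
    if x.numel() == 0:
        # The user-provided message (including the formatted value) is now preserved,
        # instead of being replaced by a generic "Exception".
        raise ValueError("expected a non-empty tensor, got numel={}".format(x.numel()))
    return x

try:
    check_nonempty(torch.empty(0))
except Exception as e:
    print(e)  # prints the custom message above
```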
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41907
Reviewed By: ngimel
Differential Revision: D22778301
Pulled By: gmagogsfm
fbshipit-source-id: 2b94f0db4ae9fe70c4cd03f4048e519ea96323ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42380
[Caffe2] Remove explicit division by zero in SpatialBN training mode
Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test
Reviewed By: houseroad
Differential Revision: D22873214
fbshipit-source-id: 70b505391b5db02b45fc46ecd7feb303e50c6280
Summary:
Otherwise numba linking by clang-9 fails with:
```
ld: in /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/numpy/core/lib/libnpymath.a(npy_math.o), could not parse object file /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/numpy/core/lib/libnpymath.a(npy_math.o): 'Unknown attribute kind (61) (Producer: 'LLVM10.0.0' Reader: 'LLVM APPLE_1_902.0.39.2_0')', using libLTO version 'LLVM version 9.1.0, (clang-902.0.39.2)' for architecture x86_64
```
Because conda's numpy-1.19.1 is compiled with clang-10
This should fix macOS regressions in CircleCI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42409
Reviewed By: xw285cornell
Differential Revision: D22887683
Pulled By: malfet
fbshipit-source-id: d58ee9bf53772b57c59e18f71151916d4f0a3c7d
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40986.
TensorIterator's test for a CUDA kernel getting too many CPU scalar inputs was too permissive. This update limits the check to not consider outputs and to only be performed if the kernel can support CPU scalars.
A test is added to verify the appropriate error message is thrown in a case where the old error message was thrown previously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42360
Reviewed By: ngimel
Differential Revision: D22868536
Pulled By: mruberry
fbshipit-source-id: 2bc8227978f8f6c0a197444ff0c607aeb51b0671
Summary:
**BC-Breaking Note:**
BC breaking changes in the case where keepdim=True. Before this change, when calling `torch.norm` with keepdim=True and p='fro' or p=number, leaving all other optional arguments as their default values, the keepdim argument would be ignored. Also, any time `torch.norm` was called with p='nuc', the result would have one fewer dimension than the input, and the dimensions could be out of order depending on which dimensions were being reduced. After the change, for each of these cases, the result has the same number and order of dimensions as the input.
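A small example of the fixed keepdim behavior described above:
```python
import torch

x = torch.randn(3, 4)
# keepdim=True is now honored for these cases; both results keep the reduced dim as size 1.
print(torch.norm(x, p='fro', dim=1, keepdim=True).shape)  # torch.Size([3, 1])
print(torch.norm(x, p=2, dim=1, keepdim=True).shape)      # torch.Size([3, 1])
```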
**PR Summary:**
* Fix keepdim behavior
* Throw descriptive errors for unsupported sparse norm args
* Increase unit test coverage for these cases and for complex inputs
These changes were taken from part of PR https://github.com/pytorch/pytorch/issues/40924. That PR is not going to be merged because it overrides `torch.norm`'s interface, which we want to avoid. But these improvements are still useful.
Issue https://github.com/pytorch/pytorch/issues/24802
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41956
Reviewed By: albanD
Differential Revision: D22837455
Pulled By: mruberry
fbshipit-source-id: 509ecabfa63b93737996f48a58c7188b005b7217
Summary:
A heavy refactor of bounds inference to fix some issues and bugs blocking using it to analyze cross thread interactions:
* We were merging all accesses to a Buf into a single bounds info entry, even if they did not overlap. E.g. if we accessed a[0:2] and a[5:6] we would merge that into a bound of a[0:6]. I've changed this behaviour to merge only overlapping bounds.
* We were not separating bounds of different kinds (e.g. Load vs Store) and would merge a Store bounds into a Load bounds, losing the information about what kind of access it was. E.g. this loop would produce bounds: [{Load, 0, 10}] and now produces bounds [{Load, 0, 9}, {Store, 1, 10}]:
```
for i in 1 to 10...
  x[i] = x[i-1]
```
* Both ComputeAt and Rfactor relied on the overzealous merging and only used a single entry in the bounds list to determine the bounds of temporary buffers they created, which could result in temporary buffers allocated smaller than accesses to them. I've fixed Rfactor, but *not* ComputeAt - however all ComputeAt tests still pass (may require loop fusion to trigger this issue) - I will come back to it.
Being more precise about bounds is more complex, rather than taking the minimum of starts and maximum of stops we now need to determine if two bounds overlap or are adjacent. There are many edge cases and so I've added a bunch of test coverage of the merging method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42185
Reviewed By: mruberry
Differential Revision: D22870391
Pulled By: nickgg
fbshipit-source-id: 3ee34fcbf0740a47259defeb44cba783b54d0baa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42202
Currently we use the template in order to be able to take both
`std::vector<ExprHandle>` and `std::vector<VarHandle>`. However, the
semantics of this function dictate that the only allowed option should be
the former one: we're specifying indices for the tensor access we want
to generate. While it could be convenient to avoid converting a vector
of vars to a vector of exprs at the callsites, it makes the code
less explicit and thus more difficult to reason about.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D22806429
Pulled By: ZolotukhinM
fbshipit-source-id: 8403af5fe6947c27213050a033e79a09f7075d4c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41451
Since TE operates on a limited subset of ops with a well-defined
semantics, we can easily infer shapes of intermediate and output tensors
given shapes of the inputs.
There are a couple of ops that are not yet supported in the shape
inference; once we add them we can relax the shape info requirements
in the TE fuser: currently it requires all values in the fusion group to
have known shapes, and we can change it to require them only for the inputs.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22543470
Pulled By: ZolotukhinM
fbshipit-source-id: 256bae921028cb6ec3af91977f12bb870c385f40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42201
Previously, we've been using operators <, >, ==, et al. and relied on
the dtype being picked automatically. That led to a wrong dtype being
picked for the result, but the choice was overwritten by the type
explicitly specified in the JIT IR we were lowering. Now that we are
moving towards using shape inference instead of relying on all types
being specified in the IR, this issue immediately popped up.
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision: D22806428
Pulled By: ZolotukhinM
fbshipit-source-id: 89d2726340efa2bb3da45d1603bedc53955e14b9
Summary:
[3/N] Implement Enum JIT support
* Add enum value as constant support
* Add sugared value for EnumClass
Supported (see the sketch after this list):
* Enum-typed function arguments
* Using Enum type and comparing them
* Support getting name/value attrs of enums
* Using Enum value as constant
TODO:
* Add Python sugared value for Enum
* Support Enum-typed return values
* Support serialization and deserialization
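A minimal sketch of the kind of usage this stack is building toward (enum-typed arguments, comparison, and value access); the enum below is made up for illustration and assumes the full Enum support is in place:
```python
from enum import Enum

import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def is_red(c: Color) -> bool:
    # enum-typed arguments, comparisons, and .value access are supported
    return c == Color.RED and c.value == 1

print(is_red(Color.RED))    # True
print(is_red(Color.GREEN))  # False
```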
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42085
Reviewed By: eellison
Differential Revision: D22758042
Pulled By: gmagogsfm
fbshipit-source-id: 5c6e571686c0b60d7fbad59503f5f94b3b3cd125
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42255
Changes to match Fused Op: Dequantize->Swish->Quantize
* Changes to scale handling
Results showing matching intermediate and final Swish_Int8 Op.
P137389801
Test Plan: test case test_deq_swish_quant_nnpi.py
Reviewed By: hyuen
Differential Revision: D22827499
fbshipit-source-id: b469470ca66f6405ccc89696694af372ce6ce89e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41947
Previously, if an op took an optional `Tensor?` argument, the C++ frontend (i.e. `at::op()` and `Tensor::op()`)
was generated to take `Tensor`. A previous PR (https://github.com/pytorch/pytorch/pull/41610) changed the kernels
to be written with `c10::optional<Tensor>` instead of `Tensor`, but that did not touch the C++ frontend yet.
This PR changes the C++ frontend API to take `c10::optional<Tensor>` instead of `Tensor` as well.
This should be mostly bc conserving. Since `Tensor` implicitly converts to `c10::optional<Tensor>`, any old code
calling an op with a `Tensor` would still work. There are likely corner cases that get broken though.
For example, C++ only ever does *one* implicit conversion. So if you call an op with a non-tensor object
that gets implicitly converted to a `Tensor`, then that previously worked since the API took a `Tensor` and
C++ allows one implicit conversion. Now it wouldn't work anymore because it would require two implicit conversions
(to `Tensor` and then to `c10::optional<Tensor>`) and C++ doesn't do that.
The main reasons for doing this are
- Make the C++ API more sane. Those arguments are optional and that should be visible from the signature.
- Allow easier integration for XLA and Autocast. Those backends generate code to wrap operators and forward
operator arguments to calls to at::op(). After https://github.com/pytorch/pytorch/pull/41610, there was
a mismatch because they had to implement operators with `optional<Tensor>` but call `at::op()` with `Tensor`,
so they had to manually convert between those. After this PR, they can just forward the `optional<Tensor>`
in their call to `at::op()`.
ghstack-source-id: 108873705
Test Plan: unit tests
Reviewed By: bhosmer
Differential Revision: D22704832
fbshipit-source-id: f4c00d457b178fbc124be9e884a538a3653aae1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41610
Previously, operators that have a `Tensor?` (i.e. optional tensor) in their schema implemented it using `Tensor` in C++ and filled in an undefined tensor for the None case.
The c10 operator library, however, expects `Tensor?` to be represented as `optional<Tensor>`, so those operators couldn't be c10-full yet and still had to use codegenerated unboxing instead of templated unboxing.
This PR changes that. It extends the `hacky_wrapper_for_legacy_signatures` to not only take care of TensorOptions, but now also map between signatures taking `Tensor` and `optional<Tensor>`.
For this, it requires an additional template parameter, the expected signature, and it uses that to go argument-by-argument and unwrap any optionals it finds.
ghstack-source-id: 108873701
Test Plan: waitforsandcastle
Reviewed By: bhosmer
Differential Revision: D22607879
fbshipit-source-id: 57b2fb01a294b804f82cd55cd70f0ef4a478e14f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42141
Update the alias db in place instead of constructing it from scratch on each change, which caused O(n^2) behavior.
Description from https://github.com/pytorch/pytorch/pull/37106 holds pretty well:
"""
Recomputing the aliasdb on every fusion iteration + in every subblock
is hugely expensive. Instead, update it in-place when doing fusion.
The graph fuser pass operates by pushing nodes into a fusion group. So
we start with
`x, y = f(a, b, c)`
and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
x_in, y_in = f(a_in, b_in, c_in)
-> x_in, y_in
```
We destroy the x and y Value*s in the process. This operation is
easy to express as an update to the aliasDb--x_out just takes on all
the aliasing information x used to have. In particular, since we know
f and prim::fusionGroup are purely functional, we don't have to mess
with any write information.
"""
The one difficulty here is that mapping x, y to x_out, y_out is not trivial when merging nodes into the autodiff subgraph node.
There are a few options:
- attempt to make all subgraph utils & ir cloning logic update a map
- mirror the subgraph utils implementation in create_autodiff_subgraph
- uniquely map x, y and x_in, y_in so you can back out the correspondence.
I went with the third option.
This shouldn't affect the results of the pass at all. LMK if you think there's anything else I should be doing to test; I was thinking about maybe exposing an option to run create_autodiff_subgraphs without the post-processor and checking that the alias db was correctly updated.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D22798377
Pulled By: eellison
fbshipit-source-id: 9a133bcaa3b051c0fb565afb23a3eed56dbe71f9
Summary:
This Diff provides an option for the DC++ module to use the squeezed sparse feature embeddings to generate attention weights, with the purpose of reducing the network size to achieve QPS gains. There are 3 squeeze options (sum, max, and mean) along the embedding dimension, and they are provided for both the attention weights and resnet generation.
Example workflow: f208474456
{F257199459}
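For intuition, a tiny PyTorch sketch of the three squeeze options along the embedding dimension (the actual module is a DPER component; the shapes below are made up for illustration):
```python
import torch

emb = torch.randn(8, 16)              # 8 sparse-feature embeddings, embedding dim 16
squeezed_sum = emb.sum(dim=1)         # "sum" squeeze along the embedding dimension
squeezed_max = emb.max(dim=1).values  # "max" squeeze
squeezed_mean = emb.mean(dim=1)       # "mean" squeeze
print(squeezed_sum.shape)             # torch.Size([8])
```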
Test Plan:
1. Test single ops
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_reduce_back_mean
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_reduce_back_max
2. Test DC++ module
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_arch_one_layer_compressed_embeddings_only_squeeze_input
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_arch_shared_input_squeeze_input
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_input_compress_embeddings_squeeze_input
3. Test Arch
buck test dper3/dper3_models/ads_ranking/model_impl/sparse_nn/tests:sparse_nn_lib_test -- test_dense_sparse_interaction_compress_dot_arch_dot_compress_pp_squeezed_input
4. e2e test
buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_compress_dot_attention_fm_max_fc_size_squeeze_input
Reviewed By: taiqing
Differential Revision: D22825069
fbshipit-source-id: 29269ea22cb47d487a1c92a1f6daae1055f54cfc
Summary:
For mobile custom build, we only generate code for ops that are used by
specific models to reduce binary size.
There are multiple places where we apply the op filtering:
- generated_unboxing_wrappers_*.cpp
- autograd/VariableType*.cpp
- c10 op registration (in aten/gen.py)
For c10 op registration, we filter by the main op name - all overloads
that match the main op name part will be kept.
For generated_unboxing_wrappers_*, we filter by the full op name - only
those having exactly the same overload name will be kept.
This PR changes generated_unboxing_wrappers_* and autograd/VariableType*.cpp
codegen to also filter by the main op name.
The reasons are:
- keeping all overloads can have better backward compatibility;
- generated_unboxing_wrappers_* are relatively small as they only contain
thin wrappers for root ops.
- generated_unboxing_wrappers_* will be replaced by c10 op registration
soon anyway.
- autograd/VariableType*.cpp are not included in OSS build.
Why does this offer better backward compatibility? #40737 is an example:
It introduced a new `_convolution` overload and renamed the original one
to `_convolution.deprecated`. Before this PR, the model prepared by the
old version PyTorch won't be able to run on the custom mobile build
generated on the PR because `_convolution.deprecated` won't be kept in
the custom build due to full op name matching policy. By relaxing it to
partial matching policy, the mobile custom build CI on the PR can pass.
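A tiny sketch of the relaxed (main-name) matching policy, using a hypothetical helper rather than the real codegen:
```python
def keep_op(full_op_name: str, root_op_main_names: set) -> bool:
    # Keep an op if its main name (before the overload suffix) matches a root op
    # dumped from the model, so all overloads of a used op survive the filter.
    main_name = full_op_name.split('.', 1)[0]
    return main_name in root_op_main_names

assert keep_op('aten::_convolution.deprecated', {'aten::_convolution'})
assert not keep_op('aten::add.Tensor', {'aten::_convolution'})
```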
Will test the size impact for FB production build before landing.
Differential Revision: D22809564
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Pulled By: ljk53
fbshipit-source-id: e2fc017da31f38b9430cc2113f33e6d21a0eaf0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42348
Use the dtype info in placeholderObserver to decide what ops to insert in the graph
In the next PR we can delete NoopObserver
Test Plan:
python test/test_quantization.py
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22859457
fbshipit-source-id: a5c618f22315534ebd9a2df77b14a0aece196989
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42222
This change adds the necessary passes to perform FP16 dynamic quantization.
We skip inserting observers for activations based on the dtype (torch.float16) and only insert the Fp16Observer for weights
Test Plan:
python test/test_quantization.py TestQuantizeJitOps
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22849220
fbshipit-source-id: 2c53594ecd2485e9e3dd0b380eceaf7c5ab5fc50
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42221
Adds a new observer that emits a warning if the range of tensor is beyond fp16 range. This will be further used in graph mode quantization to insert the cast to fp16 ops in the graph
Test Plan:
python test/test_quantizaton.py TestObserver.test_fp16_observer
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22849222
fbshipit-source-id: a301281ce38ba4d4e7a009308400d34a08c113d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42147
Op to check the range of a tensor and clamp the values to fp16 range
This operator will be inserted into the graph in subsequent diffs.
Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_fp16_saturate_op
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22849221
fbshipit-source-id: 0da3298e179750f6311e3a09596a7b8070509096
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42121
This PR changes the Module API to allow registering a module with a module
interface type, and therefore allows Module::clone to work in the case
where a module interface type is shared by two submodules.
The interface type will be shared by the new cloned instance in the same
compilation unit because it only contains a list of FunctionSchema, which
does not involve any attributes, unlike a ClassType.
fixes https://github.com/pytorch/pytorch/issues/41882
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D22781205
Pulled By: wanchaol
fbshipit-source-id: f97f4b75970f0b434e38b5a1f778eda2c4e5109b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42313
Changes the tests in `ProcessGroupGlooAsyncTest.cpp` to use the Gtest testing framework.
Reviewed By: malfet
Differential Revision: D22821577
fbshipit-source-id: 326b24a334ae84a16434d0d5ef27d16ba4b90d5d
Summary:
`resize_` only requires manual registration to `Autograd` key and its device kernels can safely live together with our normal device dispatch in `native_functions.yaml`.
But currently we do manual registration for `CPU/CUDA` kernels (leaving no dispatch in native_functions.yaml), which makes `resize_` non-overrideable from a backend's point of view. While it indeed should dispatch at the device level, this caused xla to whitelist `resize_` and register a lowering to the XLA key. This PR moves the device dispatch of `resize_` back to `native_functions.yaml` so that it properly shows up as an `abstract` method for downstream extensions.
Note that we also do manual registration for `copy_/detach_/resize_as_/etc` in aten but they are slightly different than `resize_` since for them we only register `catchAll` kernels instead of device kernels. I'll need to investigate and send a followup PR for those ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42240
Reviewed By: VitalyFedyunin
Differential Revision: D22846311
Pulled By: ailzhang
fbshipit-source-id: 10b6cf99c4ed3d62fc4e1571f4a2a463d1b88c81
Summary:
See https://github.com/pytorch/pytorch/issues/41027.
This adds a helper to resize output to ATen/native/Resize.* and updates TensorIterator to use it. The helper throws a warning if a tensor with one or more elements needs to be resized. This warning indicates that these resizes will become an error in a future PyTorch release.
There are many functions in PyTorch that will resize their outputs and don't use TensorIterator. For example,
985fd970aa/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu (L243)
And these functions will need to be updated to use this helper, too. This PR avoids their inclusion since the work is separable, and this should let us focus on the function and its behavior in review. A TODO appears in the code to reflect this.
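A small example of the behavior the helper produces when a non-empty `out=` tensor has the wrong shape (it warns and resizes; the warning says this will become an error in a future release):
```python
import torch

out = torch.empty(2)                               # deliberately the wrong shape
torch.add(torch.ones(3), torch.ones(3), out=out)   # emits the resize warning described above
print(out.shape)                                   # torch.Size([3])
```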
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42079
Reviewed By: VitalyFedyunin
Differential Revision: D22846851
Pulled By: mruberry
fbshipit-source-id: d1a413efb97e30853923bce828513ba76e5a495d
Summary:
After being deprecated in 1.5 and throwing a runtime error in 1.6, we can now enable torch.full inferring its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this.
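A quick example of the inference this enables:
```python
import torch

print(torch.full((2,), 7).dtype)     # torch.int64 -- inferred from the integer fill value
print(torch.full((2,), True).dtype)  # torch.bool  -- inferred from the bool fill value
print(torch.full((2,), 1.0).dtype)   # torch.float32 (default dtype), unchanged behavior
```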
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912
Reviewed By: albanD
Differential Revision: D22836802
Pulled By: mruberry
fbshipit-source-id: 33dfbe4d4067800c418b314b1f60fab8adcab4e7
Summary:
* Make c10::cuda functions regular non-inlined functions
* Add driver_version() and device_synchronize() functions
With this change I no longer see direct calls to the CUDA API when looking at Modules.cpp.obj
FYI malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42251
Reviewed By: malfet
Differential Revision: D22826505
Pulled By: ziab
fbshipit-source-id: 8dc2f3e209d3710e2ce78411982a10e8c727573c
Summary: To avoid repeating to() casts for every argument of the function
Test Plan: CI
Reviewed By: malfet
Differential Revision: D22833521
fbshipit-source-id: ae0a8f70339cd6adfeea2f552d35bbcd48b11cf7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42289
`sigmoid_backward` and `fmod` are covered neither in `test/cpp/api` nor in `Aten/test`. Add test functions to cover them
Test Plan:
1. Test locally and check new lines are covered
2. CI
Reviewed By: malfet
Differential Revision: D22804912
fbshipit-source-id: ea50ef0ef3dcf3940ac950d74f6f1cb38d8547a7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42318
We forgot to update this benchmark when quantized elu's signature
changed to require observation, fixing.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D22845251
fbshipit-source-id: 1443f6f0deac695715b1f2bd47f0f22b96dc72ca
Summary:
Also add a __prepare__ method to the metaclass created by `with_metaclass` to conform with PEP 3115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42232
Reviewed By: ezyang
Differential Revision: D22816936
Pulled By: malfet
fbshipit-source-id: a47d054b2f061985846d0db6b407f4e5df97b0d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42220
Mobile custom build CI jobs need desktop version libtorch to prepare
models and dump root ops.
Ideally we should use the libtorch built on the PR so that backward
incompatible changes won't break this script - but it will significantly
slow down mobile CI jobs.
This PR changed it to install nightly instead of stable so that we have
an option to temporarily skip mobile CI jobs on BC-breaking PRs until
they are in nightly.
Test Plan: Imported from OSS
Reviewed By: seemethere
Differential Revision: D22810484
Pulled By: ljk53
fbshipit-source-id: eb5f7b762a969d1cfeeac2648816be546bd291b6
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: e04b9ce034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42302
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: efiks
Differential Revision: D22841424
fbshipit-source-id: 211463b0207da986fc5b451242ae99edf32b9f68
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42225
Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.
There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which however doesn't contain the auto-generated header we now added. We fix that by adding the `tensorpipe` CMake target as a dependency, so that the include paths defined by TensorPipe are used, which contain that auto-generated header.
I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.
Test Plan: CircleCI is all green.
Reviewed By: beauby
Differential Revision: D22812445
fbshipit-source-id: e6d824bb28f5afe75fd765de0430968174f3531f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42286
One more bug to fix. Operators such as If and AsyncIf need special treatment not just in `onnx::SsaRewrite`, but also in `RemoveOpsByType`. The solution needs two steps:
1) add external inputs/outputs of the subnets of If/AsyncIf op to the inputs/outputs of the op
2) if the inputs/outputs of the If/AsyncIf op need to be renamed as a result, the same inputs/outputs of the subnets need to be renamed as well.
I also added unit tests to cover this corner case.
Test Plan:
```
buck test //caffe2/caffe2/fb/predictor:black_box_predictor_test
mkdir /tmp/models
rm -rf /tmp/$USER/snntest
rm -rf /tmp/snntest
buck run mode/opt admarket/lib/ranking/prediction_replayer/snntest_replayer_test/tools:snntest_replay_test -- --serving_paradigm=USER_AD_PRECOMPUTATION_DSNN
```
Differential Revision: D22834028
fbshipit-source-id: c070707316cac694f452a96e5c80255abf4014bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42287
We shouldn't use block_size for thread dimensions in linear_index_weight_offsets_dedup_kernel, since the kernel doesn't iterate the embedding dimensions.
ghstack-source-id: 108834058
Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```
Reviewed By: jspark1105
Differential Revision: D22800959
fbshipit-source-id: 641d52a51070715c04f9fd286e7e22ac62001f61
Summary: The function `cross_kernel_scalar` is not covered in `Aten/native/cpu/CrossKernel.cpp`; add tests to cover it
Test Plan:
1. Test locally to check new lines are covered
2. CI
https://pxl.cl/1fZjG
Reviewed By: malfet
Differential Revision: D22834122
fbshipit-source-id: 0d50f3a3e6aee52cb6fdee2b9f5883f542c7b6e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42266
The functions `lerp_kernel_scalar` and `lerp_kernel_tensor` are not covered in `Aten/native/cpu/LerpKernel.cpp`; add tests to cover them
Test Plan:
1. Test locally to check new lines are covered
2. CI
https://pxl.cl/1fXPd
Reviewed By: malfet
Differential Revision: D22832164
fbshipit-source-id: b1eaabbf8bfa08b4dedc1a468abfdfb619a50e3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139
A bunch of tests were failing with buck since we would output to
stdout and buck would fail parsing stdout in some cases.
Moving these print statements to stderr fixes this issue.
ghstack-source-id: 108606579
Test Plan: Run the offending unit tests.
Reviewed By: mrshenli
Differential Revision: D22779135
fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42219
Introduce a new extra info that is tagged on the forward net for operators sharing the same input. The effect is that the auto-generated sum of gradients for the input will not follow the device tags of those operators in the forward net. This allows more flexible device allocation.
Test Plan:
# unit test
`./buck-out/gen/caffe2/caffe2/python/core_gradients_test#binary.par -r testMultiUseInputAutoGenSumDevice`
Reviewed By: xianjiec, boryiingsu
Differential Revision: D22609080
fbshipit-source-id: d558145e5eb36295580a70e1ee3a822504dd439a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41283
In optimizer.zero_grad(), detach_ is only needed to avoid a memory leak when the grad has a grad_fn, so add a check in zero_grad() to call grad.detach_() only when the grad has a grad_fn.
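A minimal sketch of the guarded logic, mirroring the shape of the zero_grad() loop (illustrative only, not the exact optimizer source):
```python
import torch

def zero_grad(params):
    for p in params:
        if p.grad is not None:
            if p.grad.grad_fn is not None:
                p.grad.detach_()              # only break autograd history when there is one
            else:
                p.grad.requires_grad_(False)
            p.grad.zero_()

w = torch.randn(3, requires_grad=True)
(w * 2).sum().backward()
zero_grad([w])
print(w.grad)  # tensor([0., 0., 0.])
```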
ghstack-source-id: 108702289
Test Plan: unit test
Reviewed By: mrshenli
Differential Revision: D22487315
fbshipit-source-id: 861909b15c8497f1da57f092d8963d4920c85e38
Summary:
In preparation for creating the new torch.fft namespace and NumPy-like fft functions, as well as supporting our goal of refactoring and reducing the size of test_torch.py, this PR creates a test suite for our spectral ops.
The existing spectral op tests from test_torch.py and test_cuda.py are moved to test_spectral_ops.py and updated to run under the device generic test framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42157
Reviewed By: albanD
Differential Revision: D22811096
Pulled By: mruberry
fbshipit-source-id: e5c50f0016ea6bb8b093cd6df2dbcef6db9bb6b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38999
Adds boxing for inplace and outplace kernels, itemizes
remaining unsupported cases, and fails compilation when
new unsupported types are introduced in op signatures.
Test Plan: Imported from OSS
Differential Revision: D21718547
Pulled By: bhosmer
fbshipit-source-id: 03295128b21d1843e86789fb474f38411b26a8b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42151
Previously our Caffe2 SpatialBN op impl computed running_var incorrectly, without the unbias coefficient. It should have failed the test because the output differs from cuDNN's output; however, our tests were too weak to catch this bug. This diff fixes all of them.
Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test
Reviewed By: houseroad
Differential Revision: D22786127
fbshipit-source-id: db80becb67d60c44faae180c7e4257cb136a266d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41974
In this diff, 2 new sets of benchmark tests are added to the `quantization` benchmark suite where operator-level benchmarking is conducted for the learnable Python operators, the learnable c++ kernels, and the original non-backprop c++ kernels.
Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`
Benchmark Results (On devGPU with 0% volatile utilization -- all GPUs are free):
Each sample has dimensions **3x256x256**;
### In **microseconds** (`1e-6` second),
| | Python Module | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|---------------|------------|-------------------------|
| Per Tensor CPU Forward | 3112.666 | 3270.740 | 3596.864 |
| Per Tensor Cuda Forward | 797.258 | 258.961 | 133.953 |
| Per Channel CPU Forward | 6587.693 | 6931.461 | 6352.417 |
| Per Channel Cuda Forward | 1579.576 | 555.723 | 479.016 |
| Per Tensor CPU Backward | 72278.390 | 22466.648 | 12922.195 |
| Per Tensor Cuda Backward | 6512.280 | 1546.218 | 652.942 |
| Per Channel CPU Backward | 74138.545 | 41212.777 | 14131.576 |
| Per Channel Cuda Backward | 6795.173 | 4321.351 | 1052.066 |
Reviewed By: z-a-f
Differential Revision: D22715683
fbshipit-source-id: 8be528b790663413cbeeabd4f68bbca00be052dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42198
1. Add tests for the max/min kernels
Test Plan:
1. Run locally to check that the corresponding code in BinaryOpsKernel.cpp is covered.
2. CI
Reviewed By: malfet
Differential Revision: D22796019
fbshipit-source-id: 84c8d7df509de453c4ec3c5e38977733b0ef3457
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41969
In this diff, the `_LearnableFakeQuantize` module is extended to provide support for gradient scaling, where the gradients for both scale and zero point are multiplied by a constant `g` (which in some cases can help with quicker convergence). In addition, it is also augmented to provide a factory method via `_with_args` such that a partial constructor of the module can be built.
Test Plan:
For correctness of the fake quantizer operators, on a devvm, enter the following command:
```
buck test //caffe2/torch:quantization -- learnable_py_module
```
Reviewed By: z-a-f
Differential Revision: D22715629
fbshipit-source-id: ff8e5764f81ca7264bf9333789f57e0b0cec7a72
Summary:
View ops as outputs of differentiable subgraphs can cause incorrect differentiation. For now, do not include them in the subgraph. This was observed with our autograd tests for MultiheadAttention and nn.Transformer, which currently fail with the legacy executor. This commit fixes those test failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42027
Reviewed By: pbelevich
Differential Revision: D22798133
Pulled By: eellison
fbshipit-source-id: 2f6c08953317bbe013933c6faaad20100376c039
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: cad1c21404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42205
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D22806731
Pulled By: efiks
fbshipit-source-id: 779a9f7f00645e7e65f183e2832dc79117eae5fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42215
Specifically on https://github.com/pytorch/pytorch/pull/27477#discussion_r371402079
We would like `include_last=True` to be supported for other reduction types like mean and max as well. The current state causes further code fragmentation in DPER (https://www.internalfb.com/intern/diff/D22794469/).
More details: https://www.internalfb.com/intern/diff/D22794469/?dest_fbid=309597093427021&transaction_id=631457624153457
ghstack-source-id: 108733009
Test Plan:
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
```
```
(base) [jianyuhuang@devbig281.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/test] $ TORCH_SHOW_CPP_STACKTRACES=1 buck test mode/dev-nosan //caffe2/test:
nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu" --print-passing-details
Parsing buck files: finished in 1.2 sec
Building: finished in 5.5 sec (100%) 10130/10130 jobs, 2 updated
Total time: 6.7 sec
More details at https://www.internalfb.com/intern/buck/build/dbdc2063-69d8-45cb-9146-308a9e8505ef
First unknown argument: --print-passing-details.
Falling back to TestPilot classic.
Trace available for this run at /tmp/testpilot.20200728-195414.1422748.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
✓ caffe2/test:nn - test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) 0.162 1/1 (passed)
Test output:
> /data/users/jianyuhuang/fbsource/fbcode/buck-out/dev/gen/caffe2/test/nn#binary,link-tree/torch/_utils_internal.py:103: DeprecationWarning: This is a NOOP in python >= 3.7, its just too dangerous with how we write code at facebook. Instead we patch os.fork and multiprocessing which can raise exceptions if a deadlock would happen.
> threadSafeForkRegisterAtFork()
> /usr/local/fbcode/platform007/lib/python3.7/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__
and __path__
> return f(*args, **kwds)
> test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) ... Couldn't download test skip set, leaving all tests enabled...
> ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.162s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
Summary (total time 5.54s):
PASS: 1
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
Did _not_ run with tpx. See https://fburl.com/tpx for details.
```
Reviewed By: dzhulgakov
Differential Revision: D22801881
fbshipit-source-id: 80a624465727081bb9bf55c28419695a3d79c6e5
Summary:
After being deprecated in 1.5 and throwing a runtime error in 1.6, we can now enable torch.full inferring its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912
Reviewed By: pbelevich
Differential Revision: D22790718
Pulled By: mruberry
fbshipit-source-id: 8d1eb01574b1977f00bc0696974ac38ffdd40d9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42045
This PR changes the save_data() member function of torch::jit::mobile::Module, which was introduced in #41403, to be the non-member function torch::jit::mobile::_save_parameters() (taking a mobile Module as its first argument).
In addition, this PR:
* adds a getter function _ivalue() for the mobile::Module object
* renames torch::jit::mobile::_load_mobile_data() to torch::jit::mobile::_load_parameters()
* refactors the import.h header file into import.h and import_data.h
Test Plan: Imported from OSS
Reviewed By: kwanmacher, iseeyuan
Differential Revision: D22766781
Pulled By: ann-ss
fbshipit-source-id: 5cabae31927187753a958feede5e9a28d71d9e92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42149
Some of these tests were flaky since we could kill the process in some
way without cleaning up the ProcessGroup. This resulted in issues where the
FileStore didn't clean up appropriately, causing other processes in the
group to crash.
Fixed this by explicitly deleting the process_group before we bring a process
down forcibly.
ghstack-source-id: 108629057
Test Plan: waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D22785042
fbshipit-source-id: c31d0f723badbc23b7258e322f75b57e0a1a42cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42143
Replaces the original makeshift error messages in ProcessGroupNCCLTest
with more appropriate ones.
ghstack-source-id: 108711579
Test Plan: Ran the tests on DevGPU
Reviewed By: mrshenli
Differential Revision: D22778505
fbshipit-source-id: 27109874f0b474a74b09f588cf6e7528d2069702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42192
This PR fixes the complicated skipping logic for ProcessGroupNCCLErrors Tests - it correctly logs the reason for skipping tests when GPUs are not available or the NCCL version is too old.
This is part of a broader effort to improve the testing of the ProcessGroup and Collectives tests.
ghstack-source-id: 108620568
Test Plan: Tested on devGPU and devvm. Tests are run correctly on GPU and skipped on CPU as expected.
Reviewed By: mrshenli
Differential Revision: D22782856
fbshipit-source-id: 6071dfdd9743f45e59295e5cee09e89c8eb299c9
Summary: Sometimes the first dim of X in FC is BATCH_OF_FEATURE_MAX instead of BATCH. This caused an issue in f207899183 (the first dim of X is 64 but is set to 1 in inferFC). Change the check from `!= BATCH` to `== UNKNOWN`.
Test Plan: unit test
Reviewed By: yinghai
Differential Revision: D22784691
fbshipit-source-id: eb66ba361d6fe75672b13edbac2fbd269a7e7a00
Summary:
And since CUDA-9.2 is incompatible with VS2019, disable CUDA-9.2 for Windows as well
Fixes #{issue number}
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42144
Reviewed By: pbelevich
Differential Revision: D22794475
Pulled By: malfet
fbshipit-source-id: 24fc980e6fc75240664b9de8a4a63b1153f8d8ee
Summary:
Fixes issue where
- top level fuser's block_ was captured by callback due to [&] capture,
- recursive/nested fusers would compare erroneously to top-level block_ instead of own block_
Closes (https://github.com/pytorch/pytorch/issues/39810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41560
Reviewed By: Krovatkin
Differential Revision: D22583196
Pulled By: wconstab
fbshipit-source-id: 8f543cd9ea00e116cf3e776ab168cdd9fed69632
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42135
Tested the code analyzer with LLVM 9 & 10 and fixed a couple issues:
- Rename local demangle() which is available as public API since LLVM 9;
- Fix falsely associated op registrations due to the `phi` instruction;
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D22795508
Pulled By: ljk53
fbshipit-source-id: 2d47af088acd3312a7ea5fd9361cdccd48940fe6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41171
DistributedSampler allows data to be split evenly across workers in
DDP, but it has always added additional samples in order for the data to be
evenly split in the case that the # of samples is not evenly divisible by the
number of workers. This can cause issues, for example when computing distributed
validation accuracy, where some samples could be counted twice.
This PR adds a drop_last option where the tail of the data is dropped such that
the effective dataset size is still evenly divisible across the workers. This
ensures that DDP can train fine (there is no uneven inputs) and each replica
gets an equal number of data indices.
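A minimal example of the new option (no process group is needed when num_replicas and rank are passed explicitly):
```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))  # 10 samples, 3 replicas -> not evenly divisible
for rank in range(3):
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank,
                                 shuffle=False, drop_last=True)
    print(rank, list(sampler))  # each rank gets 3 indices; the leftover sample is dropped
```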
ghstack-source-id: 108617516
Test Plan: Added unittest
Reviewed By: mrshenli
Differential Revision: D22449974
fbshipit-source-id: e3156b751f5262cc66437b9191818b78aee8ddea
Summary:
Found while trying to get RocM Caffe2 CI green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42168
Reviewed By: seemethere
Differential Revision: D22791879
Pulled By: malfet
fbshipit-source-id: 8f7ef9711bdc5941b2836e4c8943bb95c72ef8af
Summary:
Found while trying to get RocM Caffe2 job green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42169
Reviewed By: seemethere
Differential Revision: D22791896
Pulled By: malfet
fbshipit-source-id: 9df6233876aec5ead056365499bab970aa7e8bdc
Summary:
I encountered a zero division problem when using LBFGS:
File "/home/yshen/anaconda3/lib/python3.7/site-packages/torch/optim/lbfgs.py", line 118, in _strong_wolfe
bracket[1], bracket_f[1], bracket_gtd[1])
File "/home/yshen/anaconda3/lib/python3.7/site-packages/torch/optim/lbfgs.py", line 21, in _cubic_interpolate
d1 = g1 + g2 - 3 * (f1 - f2) / (x1 - x2)
ZeroDivisionError: float division by zero
My solution is to check whether the line-search bracket is too small before calling _cubic_interpolate
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42093
Reviewed By: pbelevich
Differential Revision: D22770667
Pulled By: mrshenli
fbshipit-source-id: f8fdfcbd3fd530235901d255208fef8005bf898c
Summary:
Apply syntax highlighting to the command in `README.md`. This makes `README.md` easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42065
Reviewed By: pbelevich
Differential Revision: D22753418
Pulled By: mrshenli
fbshipit-source-id: ebfa90fdf60478c34bc8a7284d163e0254cfbe3b
Summary:
xref gh-39002 which handled the reading but not the writing of the onnx expect files, and the last comment in that PR which points out `XXX` was suboptimal.
xref [this comment](https://github.com/pytorch/pytorch/pull/37091#discussion_r456460168) which pointed out the problem.
This PR:
- replaces `XXX` with `CURRENT_VERSION` in the stored files
- ensures that updating the results with the `--accept` flag will maintain the change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41910
Reviewed By: pbelevich
Differential Revision: D22758671
Pulled By: ezyang
fbshipit-source-id: 47c345c66740edfc8f0fb9ff358047a41e19b554
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41942
This function:
- permutes all batch dims to the front of the tensors
- aligns all the batch dims to the collective levels of all the tensors
- expands all of the batch dims such that they are present in each of
the result tensors
This function is useful for the next diff up on the stack (which is
implementing a fallback kernel for BatchedTensor). It's also useful in
general for implementing batching rules on operators that take in
multiple batch dimensions at the front of each tensor (but we don't have
too many of those in PyTorch).
Test Plan: - `./build/bin/vmap_test`
Reviewed By: ezyang
Differential Revision: D22764104
Pulled By: zou3519
fbshipit-source-id: d42cc8824a1bcf258687de164b7853af52852f53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41941
If we know that the tensor already has the desired aligned size, we
don't need to put in the effort to align it.
Test Plan: - `./build/bin/vmap_test`, `pytest test/test_vmap.py -v`
Reviewed By: albanD
Differential Revision: D22764101
Pulled By: zou3519
fbshipit-source-id: a2ab7ce7b98d405ae905f7fd98db097210bfad65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41940
See title. I marked methods that don't mutate the VmapPhysicalView as
`const`.
Test Plan: - wait for tests
Reviewed By: albanD
Differential Revision: D22764102
Pulled By: zou3519
fbshipit-source-id: 40f957ad61c85f0e5684357562a541a2712b1f38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42118
We toggle trace on with a certain probability. In the case of 3 inferences with trace on/off/on, we leak the trace from the first inference. Always cleaning up the trace fixes it.
Test Plan:
predictor
I created a tiny repro here: D22786551
With this fix, this issue is gone.
Reviewed By: gcatron
Differential Revision: D22768382
fbshipit-source-id: 9ee0bbcb2bc5f76107dae385759fe578909a683d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42101
1. Add a test fixture `atest` class to store global variables
2. Make the `run_binary_ops_test` function generic: it can handle different dtypes and different numbers of parameters
3. Add tests for `add_kernel`
Test Plan:
Run locally to check that the corresponding code in `BinaryOpsKernel.cpp` is covered.
CI
Reviewed By: malfet
Differential Revision: D22760015
fbshipit-source-id: 95b47732f661124615c0856efa827445dd714125
Summary:
The onnxifi path didn't handle the input/output name rewrite for SSA correctly for the AsyncIf op. Add support for it.
Also fixed a place where we lose the net type while doing the onnxifi transform.
Test Plan: Load 163357582_593 which is a multi feed model that uses AsyncIf. This used to fail with c2 not finding some blobs in workspace. Now it works.
Reviewed By: dhe95
Differential Revision: D21268230
fbshipit-source-id: ce7ec0e952513d0f251df1bfcfb2b0250f51fd94
Summary:
This PR adds a description of `torch.optim.swa_utils` added in https://github.com/pytorch/pytorch/pull/35032 to the docs at `docs/source/optim.rst`. Please let me know what you think!
vincentqb andrewgordonwilson
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41228
Reviewed By: ngimel
Differential Revision: D22609451
Pulled By: vincentqb
fbshipit-source-id: 8dd98102c865ae4a074a601b047072de8cc5a5e3
Summary:
This uses cub for cum* operations, because, unlike thrust, cub is non-synchronizing.
Cub does not support more than `2**31` element tensors out of the box (in fact, due to cub bugs the cutoff point is even smaller)
so to support that I split the tensor into `2**30` element chunks, and modify the first value of the second and subsequent chunks to contain the cumsum result of the previous chunks. Since the modification is done in place on the source tensor, if something goes wrong and we error out before the source tensor is reverted back to its original state, the source tensor will be corrupted, but in most cases errors will invalidate the full CUDA context anyway.
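A conceptual Python sketch of the chunking idea, assuming a made-up chunk size (the real kernel uses cub on `2**30`-element chunks and carries the running total by patching the first element of the next chunk in place):
```python
import torch

def chunked_cumsum(x, chunk=4):
    pieces, carry = [], torch.zeros((), dtype=x.dtype)
    for piece in x.split(chunk):
        c = piece.cumsum(0) + carry  # add the running total from previous chunks
        pieces.append(c)
        carry = c[-1]
    return torch.cat(pieces)

x = torch.arange(10, dtype=torch.float32)
assert torch.equal(chunked_cumsum(x), x.cumsum(0))
```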
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42036
Reviewed By: ajtulloch
Differential Revision: D22749945
Pulled By: ngimel
fbshipit-source-id: 9fc9b54d466df9c8885e79c4f4f8af81e3f224ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42127
This diff renames core_autograd_sources to core_trainer_sources and moves/adds dependencies for the lite trainer in order to build the serializer functionality internally.
ghstack-source-id: 108589416
Test Plan: Manually tested serializer functionality from the internal lite trainer and verified that data is written correctly.
Reviewed By: iseeyuan
Differential Revision: D22738293
fbshipit-source-id: 992beb0c4368b2395f5bd5563fb2bc12ddde39a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42136
Expect was giving weird issues so let's just use netrc since it doesn't
rely on janky expect behavior
Another follow up for: https://github.com/pytorch/pytorch/pull/41964
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: yns88
Differential Revision: D22778940
Pulled By: seemethere
fbshipit-source-id: 1bdf879a5cfbf68a7d2d34b6966c20f95bd0a3b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42037
This is to fix #41951
Test Plan: Imported from OSS
Reviewed By: yf225
Differential Revision: D22764717
Pulled By: glaringlee
fbshipit-source-id: e6da0aeb05a2356f52446e6d5fad391f2cd1cf6f
Summary: we need this op to avoid the splicing of a dense tensor and then use the Mergesinglescaler op
Test Plan: integrated test with dper2
Differential Revision: D22677523
fbshipit-source-id: f4f9a1f06841b0906ec8cbb435482ae0a89e1721
Summary:
This PR fixes an issue in https://github.com/pytorch/pytorch/issues/40967 where duplicate parameters across different parameter groups are not allowed, but duplicates inside the same parameter group are accepted. After this PR, both cases are treated equally and raise `ValueError`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41597
Reviewed By: zou3519
Differential Revision: D22608019
Pulled By: vincentqb
fbshipit-source-id: 6df41dac62b80db042cfefa6e53fb021b49f4399
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39384
Introducing `ComputeUnitFactory`, which is responsible for providing `ComputeUnit`s (shaders).
It caches them, using shader name (glsl file name) + workGroupSize as a cache key, in just a `std::map<string, std::shared_ptr>`.
Macro GLSL_SPV changed to have literal name for cache key as a first argument.
All constructors of ComputeUnit are changed to use `ComputeUnitFactory`
Ownership model:
ComputeUnitFactory also owns `vkPipelineCache`, which is an internal Vulkan cache object ( https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkPipelineCache.html ).
`VContext` (a global object) owns ComputeUnitFactory, which owns the ComputeUnits and vkPipelineCache. Destroying them requires a valid VkDevice, so they must be destructed before `vkDestroyDevice` in `~VContext`. Since class members are only destructed after the destructor body runs, we force destruction of ComputeUnitFactory before `vkDestroyDevice` by calling `unique_ptr<ComputeUnitFactory>.reset()`.
Test Plan: Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D21962430
Pulled By: IvanKobzarev
fbshipit-source-id: effe60538308805f317c11448b31dbcf670487e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42019
According to benchmarks, this makes IValue::toStringRef() 3-4x as fast.
ghstack-source-id: 108451154
Test Plan: unit tests
Reviewed By: ezyang
Differential Revision: D22731354
fbshipit-source-id: 3ca3822ea7310d8593e38b1d3e6014d6d80963db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42034
In this diff, scale and zero point gradient calculations are updated to correctly reflect the actual backpropagation equation (instead of `dScale * dX`, the near-final output should be `dScale * dY`; the same applies to zero point).
Test Plan:
To execute the unit tests for all affected learnable fake quantize modules and kernels, on a devvm, execute the following command:
`buck test //caffe2/test:quantization -- learnable`
To enable the `cuda` tests, execute the following command:
`buck test mode/dev-nosan //caffe2/test:quantization -- learnable`
Reviewed By: jerryzh168
Differential Revision: D22735668
fbshipit-source-id: 45c1e0fd38cbb2d8d5e60be4711e1e989e9743b4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42033
In this diff, the Python `_LearnableFakeQuantize` module is updated where the gradient with respect to the input `x` is actually computed instead of passed through. Argument naming is also updated for better clarity; and unit tests on the `PerTensor` and `PerChannel` operators are added for asserting correctness.
Test Plan:
On a devvm, execute the command:
`buck test //caffe2/test:quantization -- learnable_py_module`
To include `cuda` tests as well, run:
`buck test mode/dev-nosan //caffe2/test:quantization -- learnable_py_module`
Reviewed By: jerryzh168
Differential Revision: D22735580
fbshipit-source-id: 66bea7e9f8cb6422936e653500f917aa597c86de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42032
In this diff, the arguments named `dX` within the C++ kernels are renamed to `dY` for clarity, to avoid confusion since they do not represent the gradient with respect to the input.
Test Plan:
To test all related fake quantize kernel operators, on a devvm, run the command:
`buck test //caffe2/test:quantization -- learnable`
Reviewed By: z-a-f, jerryzh168
Differential Revision: D22735429
fbshipit-source-id: 9d6d967f08b98a720eca39a4d2280ca8109dcdd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42114
Remove settings for the logit test case.
(Note: this ignores all push blocking failures!)
Test Plan: test_op_nnpi_fp16.py test case.
Reviewed By: hyuen
Differential Revision: D22766728
fbshipit-source-id: 2fe8404b103c613524cf1beddf1a0eb9068caf8a
Summary:
Current losses in PyTorch only include a (partial) implementation of Huber loss through `smooth l1` based on Fast RCNN, which essentially uses a delta value of 1. Changing/renaming the [`_smooth_l1_loss()`](3e1859959a/torch/nn/functional.py (L2487)) and refactoring it to include delta enables using the actual function.
Supplementary to this, I have also made functional and criterion versions for anyone who wants to set the delta explicitly, based on the functional `smooth_l1_loss()` and the criterion `Smooth_L1_Loss()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37599
Differential Revision: D21559311
Pulled By: vincentqb
fbshipit-source-id: 34b2a5a237462e119920d6f55ba5ab9b8e086a8c
Summary:
xref gh-38010 and gh-38011.
After this PR, there should be only two warnings:
```
pytorch/docs/source/index.rst:65: WARNING: toctree contains reference to nonexisting \
document 'torchvision/index'
WARNING: autodoc: failed to import class 'tensorboard.writer.SummaryWriter' from module \
'torch.utils'; the following exception was raised:
No module named 'tensorboard'
```
If tensorboard and torchvision are prerequisites to building docs, they should be added to the `requirements.txt`.
As for breaking up quantization into smaller pieces: I split out the list of supported operations and the list of modules to separate documents. I think this makes the page flow better, makes it much "lighter" in terms of page cost, and also removes some warnings since the same class names appear in multiple sub-modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41321
Reviewed By: ngimel
Differential Revision: D22753099
Pulled By: mruberry
fbshipit-source-id: d504787fcf1104a0b6e3d1c12747ec53450841da
Summary:
xref gh-38011.
Fixes two warnings when building documentation by
- using the external link to torchvision
- install tensorboard before building documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41334
Reviewed By: ngimel
Differential Revision: D22753083
Pulled By: mruberry
fbshipit-source-id: 876377e9bd09750437fbfab0378664b85701f827
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41930
As title
ghstack-source-id: 108517079
Test Plan: CI
Reviewed By: jerryzh168
Differential Revision: D22698386
fbshipit-source-id: 4f748c9bae4a0b615aa69c7cc8d8e451e5d26863
Summary:
**BC-Breaking Note**
This PR changes the behavior of the torch.tensor, torch.as_tensor, and sparse constructors. When given a tensor as input and a device is not explicitly specified, these constructors now always infer their device from the tensor. Historically, if the optional dtype kwarg was provided then these constructors would not infer their device from tensor inputs. Additionally, for the sparse ctor a runtime error is now thrown if the indices and values tensors are on different devices and the device kwarg is not specified.
**PR Summary**
This PR's functional change is a single line:
```
auto device = device_opt.has_value() ? *device_opt : (type_inference ? var.device() : at::Device(computeDeviceType(dispatch_key)));
```
=>
```
auto device = device_opt.has_value() ? *device_opt : var.device();
```
in `internal_new_from_data`. This line entangled whether the function was performing type inference with whether it inferred its device from an input tensor, and in practice meant that
```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t, dtype=torch.float64)
```
would return a tensor on the CPU, not the default CUDA device, while
```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t)
```
would return a tensor on the device of `t`!
This behavior is niche and odd, but came up while aocsa was fixing https://github.com/pytorch/pytorch/issues/40648.
An additional side effect of this change is that the indices and values tensors given to a sparse constructor must be on the same device, or the sparse ctor must specify the device kwarg. The tests in test_sparse.py have been updated to reflect this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41984
Reviewed By: ngimel
Differential Revision: D22721426
Pulled By: mruberry
fbshipit-source-id: 909645124837fcdf3d339d7db539367209eccd48
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/41318 pushed to ci-all branch.
Original description:
Closes https://github.com/pytorch/pytorch/issues/24137.
This PR adds support for the torch.bool tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since bool is not supported as a native ncclDataType_t, we add the following logic:
Map at::kBool to ncclUint8
During reduction (allreduce for example), if the operation is SUM, we instead override it to a MAX, to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference since they both function as a bitwise OR.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and
Note that this PR doesn't add support for BAND/BOR/BXOR. That is because these reduction ops currently are not supported by NCCL backend, see https://github.com/pytorch/pytorch/issues/41362
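A small local sketch of the mapping described above (no NCCL involved; it only illustrates why MAX/MIN on uint8 behave as OR/AND for booleans):
```python
import torch

a = torch.tensor([True, False, True]).to(torch.uint8)   # at::kBool -> ncclUint8
b = torch.tensor([False, False, True]).to(torch.uint8)

print(torch.max(a, b).bool())  # tensor([ True, False,  True]) -> sum/max acts as OR
print(torch.min(a, b).bool())  # tensor([False, False,  True]) -> product/min acts as AND
```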
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41959
Reviewed By: mrshenli
Differential Revision: D22719665
Pulled By: rohan-varma
fbshipit-source-id: 8bc4194a8d1268589640242277124f277d2ec9f1
Summary:
build_android.sh should check the PYTHON environment variable before trying to use the default python executable.
Even in that case, try to pick python3 over python2 when available.
Closes https://github.com/pytorch/pytorch/issues/41795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41927
Reviewed By: seemethere
Differential Revision: D22696850
Pulled By: malfet
fbshipit-source-id: be236c2baf54a1cd111e55ee7743cdc93cb6b9d7
Summary:
[2/N] Implement Enum JIT support
Add prim::EnumName and prim::EnumValue and their lowerings to support getting `name` and `value` attribute of Python enums.
Supported:
Enum-typed function arguments
using Enum type and comparing them
Support getting name/value attrs of enums
TODO:
Add Python sugared value for Enum
Support Enum-typed return values
Support enum values of different types in same Enum class
Support serialization and deserialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41965
Reviewed By: eellison
Differential Revision: D22714446
Pulled By: gmagogsfm
fbshipit-source-id: db8c4e26b657e7782dbfc2b58a141add1263f76e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41945
This test previously did a thread sleep before launching the allgather operation, and then waited on the work object. Since the sleep was done before the work object was created, it did not affect the allgather call, and thus, did not test work-level timeouts as intended.
I am removing this test for now. In the future we can add this test back, but would need to somehow inject a `cudaSleep` call before the allgather (so the collective operation itself is delayed). This may require overriding the `ProcessGroupNCCL::collective`, so it's a bit more heavy-weight.
In the meantime, we can remove this test - work-level timeouts are still thoroughly tested with Gloo.
ghstack-source-id: 108370178
Test Plan: Ran ProcessGroupNCCL tests on devGPU
Reviewed By: jiayisuse
Differential Revision: D22702291
fbshipit-source-id: a36ac3d83abfab6351c0476046a2f3b04a80c44d
Summary:
Since NCCL makes calls to shm_open/shm_close it must depend on librt on Linux
This should fix `DSO missing from command line` error on some platforms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41978
Reviewed By: colesbury
Differential Revision: D22721430
Pulled By: malfet
fbshipit-source-id: d2ae08ce9da3979daaae599e677d5e4519b080f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41267
Due to an LLVM bug and some unsupported intrinsics, we could not directly
use intrinsics to implement the aarch32 NEON backend for Vec256.
Instead we resort to inline assembly.
Test Plan:
vec256_test run on android phone.
Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D22482196
fbshipit-source-id: 1c22cf67ec352942c465552031e9329550b27b3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41482
This adds a new tag for use with pipeline parallelism.
Test Plan: CI
Reviewed By: heslami
Differential Revision: D22551487
fbshipit-source-id: 90910f458a9bce68f7ef684773322a49aa24494a
Summary:
Instead of exporting schemas using the current binary being tested, install the nightly and export its schemas to use in a back-compat test run by the current binary being tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41949
Reviewed By: houseroad
Differential Revision: D22731054
Pulled By: bradleyhd
fbshipit-source-id: 68a7e7637b9be2604c0ffcde2a40dd208057ba72
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41964
Since we're not executing this in a docker container, we should go ahead
and install expect explicitly.
This is a follow up PR to #41871
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D22736738
Pulled By: seemethere
fbshipit-source-id: a56e19c1ee13c2f6e2750c2483202c1eea3b558a
Summary:
Currently constant pooling runs before const propagation, which can create more constants that need pooling. This can get in the way of serialization/deserialization stability because each time a user serializes and deserializes a module, runCleanUpPasses is called on it. Doing so multiple times would lead to a different saved module.
This PR moves constant pooling after const propagation, which may slow down const propagation a little bit, but would otherwise side-step aforementioned problem.
test_constant_insertion in test_jit.py is also updated because after fixing the pass ordering, the number of constants is no longer fixed and it is extremely difficult to get the exact number with the current convoluted test structure. So for now, I changed the test to check only that CSE doesn't change the number of "prim::constant" nodes rather than comparing against a known number. Also left a TODO to improve this test.
The ConstantPropagation pass is replaced by ConstantPropagationImmutableTypes because the latter is used in runCleanUpPasses. If not replaced, the former would create new CSE opportunities by folding more constants, which defeats the purpose of the test case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41891
Reviewed By: colesbury
Differential Revision: D22701540
Pulled By: gmagogsfm
fbshipit-source-id: 8e60dbdcc54a93dac111d81b8d88fb39387224f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596
We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](https://github.com/pytorch/pytorch/issues/39272).
1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.
`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
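A hypothetical single-rank usage sketch of the new API (the Python-side names are assumptions inferred from the C++ description above; per point 3, only the NCCL backend supports it at first, so this needs a CUDA device):
```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)  # Work object backed by WorkNCCL
fut = work.get_future()                   # the Future created alongside the Work
fut.wait()                                # completed via the CUDA stream callback
dist.destroy_process_group()
```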
ghstack-source-id: 108409748
Test Plan:
Run old python test/distributed/test_c10d.py.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case of no hook registered. Without the then callback, the future_value in reducer is no longer a PyObject, and this unit test verifies future_value is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.
As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests
OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).
Reviewed By: izdeby
Differential Revision: D22583690
fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
Summary:
Added logic so that if a prehook is passed into the prepare method during quantization, the hook will be added as a prehook to all leaf nodes (and modules specified in the non_leaf_module_list).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41863
Test Plan:
Small demo, made simple module then called prepare with prehook parameter set to the numeric suite logger, printed the results to verify its what we wanted
{F245156246}
Reviewed By: jerryzh168
Differential Revision: D22671288
Pulled By: edmundw314
fbshipit-source-id: ce65a00830ff03360a82c0a075b3b6d8cbc4362e
Summary:
fix https://github.com/pytorch/pytorch/issues/40604
Add a parameter to DataLoader to configure the per-worker prefetch number.
Before this edit, the prefetch process always prefetched 2 * num_workers data items; this commit makes it configurable, e.g. you can specify prefetching 10 * num_workers data items.
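Assuming the new knob is exposed as a `prefetch_factor` argument on `DataLoader` (a per-worker count), usage would look roughly like:
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(1000, 8))
# Each of the 4 workers keeps up to 10 batches in flight (40 total),
# instead of the previously fixed 2 * num_workers.
loader = DataLoader(ds, batch_size=32, num_workers=4, prefetch_factor=10)
```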
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41130
Reviewed By: izdeby
Differential Revision: D22705288
Pulled By: albanD
fbshipit-source-id: 2c483fce409735fef1351eb5aa0b033f8e596561
Summary:
so that testing _min_max on the different devices is easier, and min/max operations have better CUDA test coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41908
Reviewed By: mruberry
Differential Revision: D22697032
Pulled By: ngimel
fbshipit-source-id: a796638fdbed8cda90a23f7ff4ee167f45530914
Summary:
**Summary**
This commit updates the repository's pull request template to remind contributors to tag the issue that their pull request addresses.
**Fixes**
This commit fixes https://github.com/pytorch/pytorch/issues/35319.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41812
Reviewed By: gmagogsfm
Differential Revision: D22667902
Pulled By: SplitInfinity
fbshipit-source-id: cda5ff7cbbbfeb89c589fd0dfd378bf73a59d77b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41963
the error message of dot (CUDA) was copied from dot (CPU); however, both can easily cause confusion
Test Plan: wait for unittests
Reviewed By: ngimel
Differential Revision: D22710822
fbshipit-source-id: 565b51149ff4bee567ef0775e3f8828579565f8a
Summary:
Import `__future__` to make `print(*args)` a syntactically correct statement under Python 2.
Otherwise, if one accidentally invokes setup.py using a Python 2 interpreter, they will be greeted by:
```
File "setup.py", line 229
print(*args)
^
SyntaxError: invalid syntax
```
instead of:
```
Python 2 has reached end-of-life and is no longer supported by PyTorch.
```
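The fix itself is just the future import; roughly (`report` is an illustrative name):
```python
# setup.py (sketch): makes print(*args) parse under Python 2 so the
# friendly end-of-life message can actually be reached.
from __future__ import print_function

def report(*args):
    print(*args)
```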
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41960
Reviewed By: orionr, seemethere
Differential Revision: D22710174
Pulled By: malfet
fbshipit-source-id: ffde3ddd585707ba1d39e57e0c6bc9c4c53f8004
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41612
This change adds preliminary support to quantize the EmbeddingBag operators. We currently support 4-bit and 8-bit quantization+packing of the weights.
To quantize these operators, specify the operator name in the `custom_op_name` field of the NoopObserver. Based on the op name (4bit or 8bit) we call the corresponding quantization functions.
Refer to the testplan for how to invoke the qconfig for the embedding_bag ops.
Future versions of this will support 4-bit and 2-bit qtensors with native support to observe and quantize it.
NB - This version assumes that the weights in the EmbeddingBag Module reside on the same device.
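A hypothetical qconfig sketch based on the description above (the observer wiring and the op-name string are assumptions, not confirmed by this summary):
```python
import torch
from torch.quantization import QConfig, NoopObserver

# Route the EmbeddingBag weights through the 4-bit quant+pack path by naming
# the custom op in the NoopObserver, as described above.
embedding_bag_4bit_qconfig = QConfig(
    activation=NoopObserver.with_args(dtype=torch.float,
                                      custom_op_name="embedding_bag_4bit"),
    weight=NoopObserver.with_args(dtype=torch.float,
                                  custom_op_name="embedding_bag_4bit"),
)
```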
Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag
Imported from OSS
Reviewed By: vkuzo, jerryzh168
Differential Revision: D22609342
fbshipit-source-id: 23e33f44a451c26719e6e283e87fbf09b584c0e6
Summary:
I copied a pruned model after deleting the derived tensors. In order to be able to reparameterize the model, we should check for the existence of the tensors here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41913
Reviewed By: izdeby
Differential Revision: D22703248
Pulled By: mrshenli
fbshipit-source-id: f5274d2c634a4c9a038100d8a6e837f132eabd34
Summary:
As explained in https://github.com/pytorch/pytorch/issues/41922, using `if(NOT ${var})` is usually wrong and can lead to issues like https://github.com/pytorch/pytorch/issues/41922 where the condition is wrongly evaluated to FALSE instead of TRUE. Instead, the unevaluated variable name should be used in all cases; see the CMake documentation for details.
This fixes the `NOT ${var}` cases by using a simple regexp replacement. It seems `pybind11_PREFER_third_party` is the only variable really prone to causing an issue, as all others are set. However, due to CMake evaluating unquoted strings in `if` conditions as variable names, I recommend never using unquoted `${var}` in an if condition. A similar regexp-based replacement could be done on the whole codebase, but as that makes a lot of changes I didn't include it now. Also, `if(${var})` will likely lead to a parser error if `var` is unset, instead of a wrong result.
Fixes https://github.com/pytorch/pytorch/issues/41922
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41924
Reviewed By: seemethere
Differential Revision: D22700229
Pulled By: mrshenli
fbshipit-source-id: e2b3466039e4312887543c2e988270547a91c439
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41479
Previously we were re-running CSE every time we recursed into a new block, which in turn created a new Alias Db for the whole graph. This was O(# Nodes * # Blocks).
For graphs which don't have any autodiff opportunities, such as Densenet, create_autodiff_subgraphs is now linear in number of nodes. For Densenet this pass was measured at ~.1 seconds.
This pass is still non-linear for models which actually do create autodiff subgraphs, because in the
```
bool any_changed = true;
while (any_changed) {
  AliasDb aliasDb(graph_);
  any_changed = false;
  for (auto it = workblock.end()->reverseIterator();
       it != workblock.begin()->reverseIterator();) {
    bool changed;
    std::tie(it, changed) = scanNode(*it, aliasDb);
    any_changed |= changed;
  }
}
```
loop we recreate the AliasDb (which is O(N)) every time we merge something and scan node returns. I will make that linear in next PR in the stack.
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision: D22600606
Pulled By: eellison
fbshipit-source-id: b08abfde2df474f168104c5b477352362e0b7b16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41437
[copied from commented code]
The IR has many nodes which can never be reordered around, such as a prim::Bailout. If a node N is surrounded by two nodes which cannot be reordered, A and B, then a differentiable subgraph that is created from N can only contain nodes from [A, B]. The nodes from A to B represent one work block for the subgraph slicer to work on. By creating these up front, we avoid retraversing the whole graph block any time scanNode returns, and we can also avoid attempting to create differentiable subgraphs in work blocks that do not contain a minimum number of differentiable nodes.
This improved compilation time of e of densenet (the model with the slowest compilation time we're tracking) from 56s -> 28s, and for mobilenet from 8s -> 6s.
Test Plan: Imported from OSS
Reviewed By: Krovatkin, ZolotukhinM
Differential Revision: D22600607
Pulled By: eellison
fbshipit-source-id: e5ab6ed87bf6820b4e22c86eabafd9d17bf7cedc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41436
Constants are not executed as instructions, so we should ignore them when counting subgraph size, just as we ignore them when counting block size for loop unrolling.
Test Plan: Imported from OSS
Reviewed By: Krovatkin, ZolotukhinM
Differential Revision: D22600608
Pulled By: eellison
fbshipit-source-id: 9770b21c936144a3d6a1df89cf3be5911095187e
Summary:
Per comment in run_test.py, every test module must have a __main__ entrypoint:
60e2baf5e0/test/run_test.py (L237-L238)
Also disable test_wait_all on Windows, as it fails with an uncaught exception:
```
test_wait_all (__main__.TestFuture) ... Traceback (most recent call last):
File "run_test.py", line 744, in <module>
main()
File "run_test.py", line 733, in main
raise RuntimeError(err)
RuntimeError: test_futures failed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41826
Reviewed By: seemethere, izdeby
Differential Revision: D22654899
Pulled By: malfet
fbshipit-source-id: ab7fdd7adce3f32c53034762ae37cf35ce08cafc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41318
Closes https://github.com/pytorch/pytorch/issues/24137.
This PR adds support for the `torch.bool` tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since `bool` is not supported as a native `ncclDataType_t`, we add the following logic:
1) Map `at::kBool` to `ncclUint8`
2) During reduction (allreduce for example), if the operation is SUM, we instead override it to a MAX, to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference since they both function as a bitwise OR.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and
Tests are added to ensure that the reductions work as expected.
ghstack-source-id: 108315417
Test Plan: Added unittests
Reviewed By: mrshenli
Differential Revision: D22496604
fbshipit-source-id: a1a15381ec41dc59923591885d40d966886ff556
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41269
The ultimate goal is to move things that are not gated with `if (compute_requires_grad(...))`
or `if (grad_fn)` out from VariableType so that VariableType kernels can be enabled/disabled
based upon `GradMode`. Then we can merge `AutoNonVariableTypeMode` and `NoGradGuard`.
We've moved profiling / tracing logic out from VariableType. One remaining thing that's
not gated with the if-statement is the `increment_version` call.
However, the `gen_variable_type.py` does use bits from `derivatives.yaml` to determine whether
to emit the `increment_version` call. If an output is never going to be differentiable (not based
upon runtime property of the variable but based upon static property, e.g. it's integral type)
then it would never emit the increment_version call.
Hypothetically, increment_version for a tensor can be orthogonal to its differentiability.
This PR is to make the change and test its impact. Making this logical simplification would
allow us to move this out from VariableType to aten codegen.
ghstack-source-id: 108318746
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D22471643
fbshipit-source-id: 3e3a442c7fd851641eb4a9c4f024d1f5438acdb8
Summary:
I'd like to amend the docstring introduced in https://github.com/pytorch/pytorch/issues/41564. It's not rendering correctly on the web, and this should fix it.
cc albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41835
Reviewed By: izdeby
Differential Revision: D22672368
Pulled By: albanD
fbshipit-source-id: f0b03c2b2a4c79b790d54f7c8f2ae28ef9d76a75
Summary:
Move the timing utils to `torch.utils._benchmark`. I couldn't figure out how to get setuptools to pick it up and put it under `torch` unless it is in the `torch` directory. (And I think it has to be for `setup.py develop` anyway.)
I also modified the record function benchmark since `Timer` and `Compare` should always be available now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41506
Reviewed By: ngimel
Differential Revision: D22601460
Pulled By: robieta
fbshipit-source-id: 9cea7ff1dcb0bb6922c15b99dd64833d9631c37b
Summary:
Fixes https://github.com/pytorch/pytorch/issues/19227
This PR adds a regression test for ONNX exports where a module has a sequential that references an Embedding layer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32598
Reviewed By: izdeby
Differential Revision: D22672790
Pulled By: ezyang
fbshipit-source-id: c88beb29a36b07378c28b0e4546efe887fcbc3be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41895
### Summary
The iOS binary for 1.6.0 has been uploaded to AWS. This PR bumps up the version for cocoapods.
### Test Plan
- Check CI
Test Plan: Imported from OSS
Reviewed By: husthyc
Differential Revision: D22683787
Pulled By: xta0
fbshipit-source-id: bb95b670a7945d823d55e9c65b357765753f295a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41892
Currently the input of batch_norm is considered dynamically quantizable, but it shouldn't be; this PR fixes that.
Test Plan:
internal models
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D22681423
fbshipit-source-id: 7f428751de0c4af0a811b9c952e1d01afda42d85
Summary:
They are already owned by `jenkins` user after the build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41884
Reviewed By: orionr
Differential Revision: D22682441
Pulled By: malfet
fbshipit-source-id: daf99532d300d30a5de591ad03af4597e145fdfc
Summary:
This pull request enables the following tests from test_torch, previously skipped on ROCm:
test_pow_-2_cuda_float32/float64
test_sum_noncontig_cuda_float64
test_conv_transposed_large
The first two tests experienced precision issues on earlier ROCm versions, whereas the conv_transposed test was hitting a bug in MIOpen which is fixed with the version shipping with ROCm 3.5.
ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41611
Reviewed By: xw285cornell
Differential Revision: D22672690
Pulled By: ezyang
fbshipit-source-id: 5585387c048f301a483c4c0566eb9665555ef874
Summary:
Separates out the docs build from the push and limits when the push
actually happens.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41871
Reviewed By: yns88
Differential Revision: D22673716
Pulled By: seemethere
fbshipit-source-id: fff8b35ba8465dc15832214c4c9ef03ce12faa48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41859
A value can be used multiple times in the same node, so we don't really need to assert that uses of dequantize == 1.
Test Plan: Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D22673525
fbshipit-source-id: 2c4a770e0ddee722ca54e68d310c395e7f418b3b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41687
Specifically, this makes a new library (lazy), which can be used from both core
and workspace.
This allows workspace.CreateNet to trigger lazy loading of dyndep dependencies.
Test Plan: Added a unit test specifically for workspace.CreateNet
Reviewed By: dzhulgakov
Differential Revision: D22441877
fbshipit-source-id: 3a9d1af9962585d08ea2566c9c85bec7377d39f2
Summary:
Add a pass that fuses Conv and BatchNormalization nodes into a single Conv node.
This pass is only applied in inference mode (training is None or TrainingMode.Eval).
Since this pass needs access to param_dict, it is written outside the peephole file where these kinds of passes (fusing multiple nodes into one) are usually placed.
This PR also adds a wrapper skipIfNoEmbed to skip the debug_embed_params test:
The pass that fuses Conv and Batchnorm changes the params of the resnet model, so the parameters of the onnx and pytorch models won't match. Since the parameters don't match, the debug_embed_params test for test_resnet will fail, and that is expected; therefore the debug_embed_params test for test_resnet should be skipped.
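For reference, the folding math such a pass relies on in eval mode is roughly the following (a sketch, not the pass's actual code; it assumes the conv has a bias):
```python
import torch

def fold_bn_into_conv(conv_w, conv_b, bn_mean, bn_var, bn_eps, bn_gamma, bn_beta):
    # BatchNorm in eval mode is an affine transform with fixed statistics,
    # so it can be absorbed into the convolution's weight and bias.
    scale = bn_gamma / torch.sqrt(bn_var + bn_eps)
    fused_w = conv_w * scale.reshape(-1, 1, 1, 1)
    fused_b = (conv_b - bn_mean) * scale + bn_beta
    return fused_w, fused_b
```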
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40547
Reviewed By: gchanan
Differential Revision: D22631687
Pulled By: bzinodev
fbshipit-source-id: fe45812400398a32541e797f727fd8697eb6d8c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41820
Pull Request resolved: https://github.com/pytorch/glow/pull/4721
In order to support an int8 quantized tensor as an input to OnnxifiOp, we need to
- Add support to recognize and extract shape meta from the int8 tensor at the input of OnnxifiOp
- Make a copy of the input data and shift it by 128 in Glow if the input data is a uint8 quantized tensor, to get the correct result, because Glow uses int8 to represent the quantized data regardless.
- Propagate correct quantization parameters through shape info in C2.
This diff implements the above.
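A conceptual sketch of the uint8 -> int8 shift (illustration only): shifting both the data and the zero point by 128 keeps the dequantized values unchanged.
```python
import torch

q_u = torch.tensor([0, 128, 255], dtype=torch.uint8)   # uint8 quantized input
scale, zp_u = 0.1, 128

q_s = (q_u.to(torch.int16) - 128).to(torch.int8)        # data shifted into int8
zp_s = zp_u - 128                                       # zero point shifted too

deq_u = scale * (q_u.to(torch.float) - zp_u)
deq_s = scale * (q_s.to(torch.float) - zp_s)
assert torch.allclose(deq_u, deq_s)                     # same real values
```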
Test Plan:
```
buck test caffe2/caffe2/contrib/fakelowp/test:test_int8_quantnnpi
```
Reviewed By: jackm321
Differential Revision: D22650584
fbshipit-source-id: 5e867f7ec7ce98bb066ec4128ceb7cad321b3392
Summary:
The function torch.cross is a bit confusing, in particular the defaulting of the dim argument.
The default `dim` has been documented as -1, but it is actually `None`. This increases the confusion in two possible ways, depending on how carefully you read the rest. I also add a warning to the final sentence.
This partially addresses https://github.com/pytorch/pytorch/issues/39310.
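A minimal illustration of the confusion (shapes chosen for the example):
```python
import torch

a = torch.randn(3, 4)
b = torch.randn(3, 4)

# With dim unspecified, cross uses the first dimension of size 3 (dim 0 here),
# not dim=-1 as the old docstring suggested.
print(torch.cross(a, b).shape)   # torch.Size([3, 4]), computed along dim 0
# torch.cross(a, b, dim=-1)      # would raise: the last dimension has size 4, not 3
```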
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41850
Reviewed By: izdeby
Differential Revision: D22664625
Pulled By: albanD
fbshipit-source-id: b8669e026fd01de9e4ec16da1414b9edfaa76bdd
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056
A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538
Reviewed By: zou3519
Differential Revision: D22608376
Pulled By: ezyang
fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
Summary:
Fix a bug in SplitWithTail and SplitWithMask where loop_options such as Cuda block/thread bindings are overwritten by the split. This PR fixes this bug by propagating the loop options to the outer loop, which for axis bindings should be equivalent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40035
Reviewed By: ZolotukhinM
Differential Revision: D22080263
Pulled By: nickgg
fbshipit-source-id: b8a9583fd90f69319fc4bb4db644e91f6ffa8e67
Summary:
closes gh-37584. ~I think I need to do more to generate an image, but the `.circleci/README.md` is vague in the details. The first commit reflows and updates that document a bit, I will continue to update it as the PR progresses :)~ Dropped updating `.circleci/README.md`, will do that in a separate PR once this is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38796
Reviewed By: gchanan
Differential Revision: D22627522
Pulled By: ezyang
fbshipit-source-id: 99d5c19e942f15b9fc10f0de425790474a4242ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39984
Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.
Reviewed By: jiayisuse
Differential Revision: D20781624
fbshipit-source-id: 109436583ff69a3fea089703d32cfc5a75f973e0
Summary:
With the transition to hipclang, the HIP runtime library name was changed. A symlink was added to ease the transition, but it is going to be removed. Conditionally set the library name based on the HIP compiler used. Patch the gloo submodule as part of the build_amd.py script until its associated fix is available.
CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41813
Reviewed By: zhangguanheng66
Differential Revision: D22660077
Pulled By: xw285cornell
fbshipit-source-id: c538129268d9947535b34523201f655b13c9e0a3
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 139c6f2292
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41814
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D22648844
fbshipit-source-id: 4cfa8d83585407f870ea2bdee74e1c1f371082eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41828
This reverts commit fe66bdb498efe912d8b9c437a14efa4295c04fdd.
This also makes a change to THTensorEvenMoreMath because sumall was removed; see THTensor_wrap.
Test Plan: Imported from OSS
Reviewed By: orionr
Differential Revision: D22657473
Pulled By: malfet
fbshipit-source-id: 95a806cedf1a3f4df91e6a21de1678252b117489
Summary:
The goal is to implement cross layer equalization as described in section 4.1 in this paper: https://arxiv.org/pdf/1906.04721.pdf
Given two adjacent submodules in a trained model, A and B, quantization might hurt one of the submodules more than the other. The paper poses the idea that a loss in accuracy from quantizing can be due to a difference in the channel ranges between the two submodules (the output channel range of A can be small, while the input channel range of B can be large). To minimize this source of error, we want to scale the tensors of A and B s.t. their channel ranges are equal (them being equal means no difference in ranges, which minimizes this source of error).
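A minimal sketch of the per-channel scaling from section 4.1 of the paper (names, shapes, and the use of max-abs as the range are assumptions for illustration, not this PR's API):
```python
import torch

def cross_layer_equalize(w1, b1, w2):
    # w1: [out, in] weight of A, b1: [out] bias of A,
    # w2: [out2, out] weight of B whose input channels match A's outputs.
    r1 = w1.abs().max(dim=1).values   # range of each output channel of A
    r2 = w2.abs().max(dim=0).values   # range of each input channel of B
    s = (1.0 / r2) * torch.sqrt(r1 * r2)
    # Scale A down and B up per channel so the two ranges match.
    return w1 / s.unsqueeze(1), b1 / s, w2 * s.unsqueeze(0)
```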
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41685
Test Plan: Imported from OSS
Reviewed By: z-a-f
Differential Revision: D22630219
Pulled By: edmundw314
fbshipit-source-id: ccc91ba12c10b652d7275222da8b85455b8a7cd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41558
The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.
Fixes https://github.com/pytorch/pytorch/issues/41474
ghstack-source-id: 108231453
Test Plan: Verified in https://github.com/pytorch/pytorch/issues/41474.
Reviewed By: fmassa
Differential Revision: D22582779
fbshipit-source-id: 63e34d8a020c4af996ef079cfb7041b2474e27c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41825
Add flag to gate D21374246 (e7a09b4d17) to mitigate mobile size regression.
ghstack-source-id: 108212047
Test Plan: CI
Reviewed By: linbinyu
Differential Revision: D22650708
fbshipit-source-id: ac9318af824ac31f519b7d5b4fe72df892d8d3f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41576
Previously we assumed CallMethod only happens on module instances, but it turns out this is not true; this PR fixes the issue.
Test Plan: Imported from OSS
Reviewed By: z-a-f
Differential Revision: D22592789
fbshipit-source-id: 48217626d9ea8e82536f00a296b8f9a471582ebe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41809
Add new unit tests for Operator Kernels.
Explicitly annotate the function type in tests because it can't be inferred.
Test Plan: CI
Reviewed By: malfet
Differential Revision: D22647221
fbshipit-source-id: ef2f0e8c847841e90aa26d028753f23c8c53d6b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41570
For min/max based quantization observers, calculating min and max of a tensor
takes most of the runtime. Since the calculation of min and max is done
on the same tensor, we can speed this up by only reading the tensor
once, and reducing with two outputs.
One question I had is whether we should put this into the quantization
namespace, since the use case is pretty specific.
This PR implements the easier CPU path to get an initial validation.
There is some needed additional work in future PRs, which durumu will
take a look at:
* CUDA kernel and tests
* making this work per channel
* benchmarking on observer
* benchmarking impact on QAT overhead
Test Plan:
```
python test/test_torch.py TestTorch.test_min_and_max
```
quick bench (not representative of real world use case):
https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca
```
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.0390) tensor(-5.4485) tensor([-5.4485, 5.0390])
min and max separate 11.90243935585022
min and max combined 6.353186368942261
% decrease 0.466228209277153
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.5586) tensor(-5.3983) tensor([-5.3983, 5.5586])
min and max separate 3.468616485595703
min and max combined 1.8227086067199707
% decrease 0.4745142294372342
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.2146) tensor(-5.2858) tensor([-5.2858, 5.2146])
min and max separate 1.5707778930664062
min and max combined 0.8645427227020264
% decrease 0.4496085496757899
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D22589349
fbshipit-source-id: c2e3f1b8b5c75a23372eb6e4c885f842904528ed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41815
**All are minor changes to enable better simulations.**
The constructors of MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, and MovingAveragePerChannelMinMaxObserver are augmented so they can utilize the dynamic quantization range support in the _ObserverBase class.
In addition, minor adjustments are made to the enable_static_observation function that allow the observer to update quantization parameters without fake quantizing the output (for constructing a baseline).
Test Plan:
To ensure this modification is still backward compatible with past usages, numerics are verified by running the quantization unit test suite, which contains various observer tests. The following command executes the test suite, which also verifies the observer numerics:
```
buck test //caffe2/test:quantization -- observer
```
Reviewed By: z-a-f
Differential Revision: D22649128
fbshipit-source-id: 32393b706f9b69579dc2f644fb4859924d1f3773
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41806
Generally a good practice not to have tests spew output.
Test Plan:
`build/bin/test_tensorexpr`
Imported from OSS
Reviewed By: zheng-xq
Differential Revision: D22646833
fbshipit-source-id: 444e883307d058fe77e7550d436fa61b7d91a701
Summary:
Auto fuse the output loops of outer Rfactors, so it is in a more convenient format for binding GPU axes.
An example:
```
Tensor* c = Reduce("sum", {}, Sum(), b, {{m, "m"}, {n, "n"}, {k, "k"}});
LoopNest loop({c});
std::vector<For*> loops = loop.getLoopStmtsFor(c);
auto v = loops.at(0)->var();
loop.rfactor(c->body(), v);
```
Before:
```
{
Allocate(tmp_buf, float, {m});
sum[0] = 0.f;
for (int m_1 = 0; m_1 < m; m_1++) {
tmp_buf[m_1] = 0.f;
}
for (int m_1 = 0; m_1 < m; m_1++) {
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m_1] = (tmp_buf[m_1]) + (b[((n_1 * m_1) * k_1 + k) + k_1 * n]);
}
}
}
for (int m_1 = 0; m_1 < m; m_1++) {
sum[0] = (sum[0]) + (tmp_buf[m_1]);
}
Free(tmp_buf);
}
```
After:
```
{
sum[0] = 0.f;
for (int m = 0; m < m_1; m++) {
Allocate(tmp_buf, float, {m_1});
tmp_buf[m] = 0.f;
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m] = (tmp_buf[m]) + (b[((n_1 * m) * k_1 + k) + k_1 * n]);
}
}
sum[0] = (sum[0]) + (tmp_buf[m]);
Free(tmp_buf);
}
}
```
The existing Rfactor tests cover this case, although I did rename a few for clarity. This change broke the LLVMRFactorVectorizedReduction test because it now does what its intending to (vectorize a loop with a reduction in it) rather than nothing, and since that doesn't work it correctly fails. I've disabled it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40050
Reviewed By: ZolotukhinM
Differential Revision: D22605639
Pulled By: nickgg
fbshipit-source-id: e359be53ea62d9106901cfbbc42d55d0e300e8e0
Summary: Enforce duplicate op name check on mobile
Test Plan: run full/lite predictor
Reviewed By: iseeyuan
Differential Revision: D22639758
fbshipit-source-id: 2993c4bc1b14c833b273183f4f343ffad62121b3
Summary:
Skipping the test test_streams as it is flaky on rocm.
cc: jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41697
Reviewed By: zhangguanheng66
Differential Revision: D22644600
Pulled By: malfet
fbshipit-source-id: b1b16d496e58a91c44c40d640851fd62a5d7393d
Summary:
This PR implements a feature extension discussed in https://github.com/pytorch/pytorch/issues/41516.
I followed this other PR https://github.com/pytorch/pytorch/issues/22245 to add this other module. While I was at it, I also added the `extra_repr()` method to `Flatten`, which was missing.
I see there are no unit tests for these modules. Should I add those too? If so, where is the best place to put them?
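Assuming the new module is `nn.Unflatten` (the counterpart to `Flatten` from the linked PR), a quick usage sketch:
```python
import torch
import torch.nn as nn

m = nn.Sequential(
    nn.Flatten(),               # [N, 3, 4, 4] -> [N, 48]
    nn.Linear(48, 48),
    nn.Unflatten(1, (3, 4, 4)), # [N, 48] -> [N, 3, 4, 4]
)
print(m(torch.randn(2, 3, 4, 4)).shape)  # torch.Size([2, 3, 4, 4])
```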
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41564
Reviewed By: gchanan
Differential Revision: D22636766
Pulled By: albanD
fbshipit-source-id: f9efdefd3ffe7d9af9482087625344af8f990943
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41535
A generalized fake quantization module is built to support lower-bit fake quantization with back propagation on the scale and zero point. The module supports both per tensor and per channel fake quantization.
Test Plan:
Please see diff D22337313 for a related experiment performed on the fake quantizer module.
The `_LearnableFakeQuantize` module supports the following use cases:
- Per Tensor Fake Quantization or Per Channel Fake Quantization
- Static Estimation from Observers or Quantization Parameter Learning through Back Propagation
By default, the module assumes per tensor affine fake quantization. To switch to per channel, during initialization, declare `channel_size` with the appropriate length. To toggle between utilizing static estimation and parameter learning with back propagation, you can invoke the call `enable_param_learning` or `enable_static_estimate`. For more information on the flags that support these operations, please see the doc string of the `_LearnableFakeQuantize` module.
The `_LearnableFakeQuantize` module relies on 2 operators for its forward and backward paths: `_LearnableFakeQuantizePerTensorOp` and `_LearnableFakeQuantizePerChannelOp`. The backpropagation routine is developed based on the following literature:
- Learned Step Size Quantization: https://openreview.net/pdf?id=rkgO66VKDS
- Trained Quantization Thresholds: https://arxiv.org/pdf/1903.08066.pdf
Reviewed By: z-a-f
Differential Revision: D22573645
fbshipit-source-id: cfd9ece8a959ae31c00d9beb1acf9dfed71a7ea1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41693
Add non zero offset test cases for Quantize and Dequantize Ops.
Test Plan: Added new test case test_int8_non_zero_offset_quantize part of the test_int8_ops_nnpi.py test file.
Reviewed By: hyuen
Differential Revision: D22633796
fbshipit-source-id: be17ee7a0caa6e9bc7b175af539be2e6625ad47a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41376
torch::jit::mobile::Module does not currently support accessing parameters via their attribute names, but torch::jit::Module does. This diff adds equivalent functionality to mobile::Module.
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D22609142
Pulled By: ann-ss
fbshipit-source-id: 1a5272ff336f99a3c0bb6194c6a6384754f47846
Summary:
## TLDR
Support using a NaN default value for missing dense features in RawInputProcessor for DPER2, in preparation for subsequent support of null-flag features in compute meta. For train_eval this is already supported in DPER3, and we do not plan to support it in DPER2 train_eval.
Differential Revision: D22439142
fbshipit-source-id: 99ae9755bd41a5d5f43bf5a9a2819d64f3883005
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41503
Fix for https://github.com/pytorch/pytorch/issues/41192
We can map fill_ and zero_ to their functional equivalents full_like and zeros_like
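The equivalence the mapping relies on, at the tensor level:
```python
import torch

x = torch.randn(3)
# In-place fill_/zero_ and their functional counterparts produce the same
# values, which is what lets the pass rewrite the former into the latter.
assert torch.equal(x.clone().fill_(2.0), torch.full_like(x, 2.0))
assert torch.equal(x.clone().zero_(), torch.zeros_like(x))
```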
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D22629269
Pulled By: eellison
fbshipit-source-id: f1c62684dc55682c0b3845022e0461ec77d07179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41689
It's annoying not to know which jobs are actually related to docker
builds so let's just add the prefix.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D22631578
Pulled By: seemethere
fbshipit-source-id: ac0cdd983ccc3bebcc360ba479b378d8f0eaa9c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41618
More LayerNorm Vectorization in calcMeanStd function.
Test Plan: test covered in test_layernorm_nnpi_fp16.py
Reviewed By: hyuen
Differential Revision: D22606585
fbshipit-source-id: be773e62f0fc479dbc2d6735f60c2e98441916e9
Summary:
A minor spell check!
I have gone through a dozen .md files to fix typos.
zou3519 take a look!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41599
Reviewed By: ezyang
Differential Revision: D22601629
Pulled By: zou3519
fbshipit-source-id: 68d8f77ad18edc1e77874f778b7dadee04b393ef
Summary:
The test loops over `upper` but does not use it, effectively running the same test twice, which increases test times for no gain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41583
Reviewed By: soumith, seemethere, izdeby
Differential Revision: D22598475
Pulled By: zou3519
fbshipit-source-id: d100f20143293a116ff3ba08b0f4eaf0cc5a8099
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41113
In this diff, the `ObserverBase` class is augmented with 2 additional optional arguments qmin and qmax. Correspondingly the calculation of qmin and qmax and the related quantization parameters are modified to accommodate this additional flexibility should the number of bits for quantization be lower than 8 (the default value).
Additional logic in the base class `_calculate_qparams` function has also been modified to provide support for dynamic quantization range.
Test Plan:
To ensure this modification is still backward compatible with past usages, numerics are verified by running the quantization unit test suite, which contains various observer tests. The following command executes the test suite, which also verifies the observer numerics:
`buck test //caffe2/test:quantization -- observer`
This modified observer script can be tested within the experiments for lower bit fake quantization. Please see the following diffs for reference.
- Single Fake Quantizer: D22337447
- Single Conv Layer: D22338532
Reviewed By: z-a-f
Differential Revision: D22427134
fbshipit-source-id: f405e633289322078b0f4a417f54b684adff2549
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41621
Per title. In some situations, the deviceGuard constructor in mul_kernel_cuda segfaults, so construct the deviceGuard conditionally, only when the first argument is a scalar.
This does not root-cause why the deviceGuard constructor segfaults, so the issue might come back.
Test Plan: pytorch oss CI
Reviewed By: jianyuh
Differential Revision: D22616460
fbshipit-source-id: b91bbe55c6eb0bbe80b8d6a61c41f09288752658
Summary:
* Add EnumType and AnyEnumType as first-class jit type
* Add Enum-typed IValue
* Enhanced aten::eq to support Enum
Supported:
Enum-typed function arguments
using Enum type and comparing them
TODO:
Add Python sugared value for Enum
Support getting name/value attrs of enums
Support Enum-typed return values
Support enum values of different types in same Enum class
Support serialization and deserialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41390
Reviewed By: eellison
Differential Revision: D22524388
Pulled By: gmagogsfm
fbshipit-source-id: 1627154a64e752d8457cd53270f3d14aea4b1150
Summary:
- Removes outdated language like "BoolTensor"
- Consistently labels keyword arguments, like out
- Uses a more natural string to describe their return type
- A few bonus fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41626
Reviewed By: ngimel
Differential Revision: D22617322
Pulled By: mruberry
fbshipit-source-id: 03cc3562b78a07ed30bd1dc7936d7a4f4e31f01d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37587
Lifting RecordFunction up into the dispatcher code
Test Plan: Imported from OSS
Differential Revision: D21374246
fbshipit-source-id: 19f9c1719e6fd3990e451c5bbd771121e91128f7
Summary:
https://github.com/pytorch/pytorch/issues/38349
mruberry
Not entirely sure if all the changes are necessary in how functions are added to PyTorch.
Should it throw an error when called with a non-complex tensor? NumPy allows non-complex arrays in its imag() function, which is used in its isreal() function, but PyTorch's imag() throws an error for non-complex arrays.
Where does assertONNX() get its expected output to compare to?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41298
Reviewed By: ngimel
Differential Revision: D22610500
Pulled By: mruberry
fbshipit-source-id: 817d61f8b1c3670788b81690636bd41335788439
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41578
A new op aten::gcd(Tensor...) was added while the duplicated op name check was disabled. It's not a prim op, but it has the same name as the prim op aten::gcd(int, int).
It will be safer to enforce that all prim ops have an overload name, even if there is no duplicated name right now. People may add tensor ops without overload names in the future.
This diff added the overload name for all ops defined using "DEFINE_INT_OP".
```
aten::__and__
aten::__or__
aten::__xor__
aten::__lshift__
aten::__rshift__
aten::__round_to_zero_floordiv
aten::gcd
```
Test Plan: run full JIT predictor
Reviewed By: iseeyuan
Differential Revision: D22593689
fbshipit-source-id: b3335d356a774d33450a09d0a43ff947197f9b8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41606
The previous diff (D22220798 (59294fbbb9) and D22220797) was recently reverted (D22492356 (28291d3cf8), D22492355) because of a bug associated with the op AsyncIf. The AsyncIf op has net_defs as args and the SSA rewriting didn't take that into account. It has a special path for the op If, but not for AsyncIf. Several changes I made to fix the bug:
1) Add op AsyncIf to the special path for If op in SSA rewriting
2) clear inputs/outputs of the netdefs that are args in If/AsyncIf ops because they're no longer valid
3) revert renamed inputs/outputs in the arg netdefs that are in the external_outputs in the parent netdef
2) and 3) are existing bugs in the `SsaRewrite` function that were just never exposed before.
The algorithm for `RemoveOpsByType` is the same as in my previous diff D22220798 (59294fbbb9). The only new changes in this diff are in `onnx::SsaRewrite` and a few newly added unit tests.
(Note: this ignores all push blocking failures!)
Reviewed By: yinghai
Differential Revision: D22588652
fbshipit-source-id: ebb68ecd1662ea2bae14d4be8f61a75cd8b7e3e6
Summary:
Leave undefined tensors / None returned from custom backward functions as undefined/None instead of creating a tensor full of zeros. This change improves performance in some cases.
**This is BC-Breaking:** Custom backward functions that return None will now see it potentially being propagated all the way up to AccumulateGrad nodes. The potential impact is that the .grad field of leaf tensors, as well as the result of autograd.grad, may be undefined/None where it used to be a tensor full of zeros. Also, autograd.grad may raise an error; if so, consider using allow_unused=True ([see doc](https://pytorch.org/docs/stable/autograd.html?highlight=autograd%20grad#torch.autograd.grad)) if it applies to your case.
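A small sketch of the behavior change from the user's side (a hypothetical custom function; the comments describe the post-change behavior as stated above):
```python
import torch
from torch.autograd import Function

class OnlyFirstGrad(Function):
    @staticmethod
    def forward(ctx, a, b):
        return a * 2 + b

    @staticmethod
    def backward(ctx, grad_out):
        # Returning None for `b`: after this change it may propagate as an
        # undefined gradient instead of being materialized as zeros.
        return grad_out * 2, None

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
out = OnlyFirstGrad.apply(a, b).sum()
ga, gb = torch.autograd.grad(out, (a, b), allow_unused=True)
print(ga)   # defined gradient for a
print(gb)   # may now be None rather than a tensor of zeros
```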
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41490
Reviewed By: albanD
Differential Revision: D22578241
Pulled By: heitorschueroff
fbshipit-source-id: f4966f4cb520069294f8c5c1691eeea799cc0abe
Summary:
lcm was missing an abs. This adds it plus extends the test for NumPy compliance. Also includes a few doc fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41552
Reviewed By: ngimel
Differential Revision: D22580997
Pulled By: mruberry
fbshipit-source-id: 5ce1db56f88df4355427e1b682fcf8877458ff4e
Summary:
Previously we did not link against amdhip64 (roughly equivalent to cudart). Apparently, the recent RTLD_GLOBAL fixes prevent the extensions from finding the symbols needed for launching kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41257
Reviewed By: zou3519
Differential Revision: D22573288
Pulled By: ezyang
fbshipit-source-id: 89f9329b2097df26785e2f67e236d60984d40fdd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40474
These constants are unnecessary since there is an enum, and we can add
the size at the end of the enum and it will be equal to the list size. I
believe that this is the typical pattern used to represent enum sizes.
ghstack-source-id: 107969012
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D22147754
fbshipit-source-id: 7064a897a07f9104da5953c2f87b58179df8ea84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41577
* Remove skipping test
* Use fma_avx_emulation
* Increase test examples to 100
(Note: this ignores all push blocking failures!)
Test Plan: Tests are covered in test_sls_8bit_nnpi.py
Reviewed By: hyuen
Differential Revision: D22585742
fbshipit-source-id: e1f62f47eb10b402b11893ffca7a6786e31daa79
Summary:
Changes in the ROCm runtime have improved hipEventRecord. The events no longer take ~4 usec to execute on the gpu stream, instead they appear instantaneous. If you record two events, with no other activity in between, then they will have the same timestamp and the elapsed duration will be 0.
The profiler uses hip/cuda event pairs to infer gpu execution times. It wraps functions whether they send work to the gpu or not. Functions that send no gpu work will show as having zero duration. Also they will show as running at the same time as neighboring functions. On a trace, all those functions combine into a 'call stack' that can be tens of functions tall (when indeed they should be sequential).
This patch suppresses recording the zero duration 'kernel' events, leaving only the CPU execution part. This means functions that do not use the GPU do not get an entry for how long they were using the GPU, which seems reasonable. This fixes the 'stacking' on traces. It also improves the signal to noise of the GPU trace beyond what was available previously.
This patch will not effect CUDA or legacy ROCm as those are not able to 'execute' eventRecord markers instantaneously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41540
Reviewed By: zou3519
Differential Revision: D22597207
Pulled By: albanD
fbshipit-source-id: 5e89de2b6d53888db4f9dbcb91a94478cde2f525
Summary:
Before, the inverse for division by a scalar was calculated in the precision of the non-scalar operand, which can lead to underflow:
```
>>> x = torch.tensor([3388.]).half().to(0)
>>> scale = 524288.0
>>> x.div(scale)
tensor([0.], device='cuda:0', dtype=torch.float16)
>>> x.mul(1. / scale)
tensor([0.0065], device='cuda:0', dtype=torch.float16)
```
This PR makes results of multiplication by inverse and division the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41446
Reviewed By: ezyang
Differential Revision: D22542872
Pulled By: ngimel
fbshipit-source-id: b60e3244809573299c2c3030a006487a117606e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41507
These fields have always been a part of tensor types; this change just makes them serializable through IR dumps.
Test Plan: Imported from OSS
Reviewed By: Krovatkin, ngimel
Differential Revision: D22563661
Pulled By: ZolotukhinM
fbshipit-source-id: f01aaa130b7e0005bf1ff21f65827fc24755b360
Summary:
Implementing the quantile operator, similar to [numpy.quantile](https://numpy.org/devdocs/reference/generated/numpy.quantile.html).
For this implementation I'm reducing it to existing torch operators to get a free CUDA implementation. It would be more efficient to implement a multiple-quickselect algorithm instead of sorting, but this can be addressed in a future PR.
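A quick usage sketch mirroring numpy.quantile's linear interpolation:
```python
import torch

x = torch.arange(10, dtype=torch.float32)
print(torch.quantile(x, 0.5))                              # tensor(4.5000)
print(torch.quantile(x, torch.tensor([0.25, 0.50, 0.75]))) # tensor([2.2500, 4.5000, 6.7500])
```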
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39417
Reviewed By: mruberry
Differential Revision: D22525217
Pulled By: heitorschueroff
fbshipit-source-id: 27a8bb23feee24fab7f8c228119d19edbb6cea33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41575
Fixes https://github.com/pytorch/pytorch/issues/34294
This updates the C++ argument parser to correctly handle `TensorList` operands. I've also included a number of updates to the testing infrastructure, this is because we're now doing a much more careful job of testing the signatures of aten kernels, using the type information about the arguments as read in from `Declarations.yaml`. The changes to the tests are required because we're now only checking for `__torch_function__` attributes on `Tensor`, `Optional[Tensor]` and elements of `TensorList` operands, whereas before we were checking for `__torch_function__` on all operands, so the relatively simplistic approach the tests were using before -- assuming all positional arguments might be tensors -- doesn't work anymore. I now think that checking for `__torch_function__` on all operands was a mistake in the original design.
The updates to the signatures of the `lambda` functions are to handle this new, more stringent checking of signatures.
I also added override support for `torch.nn.functional.threshold` and `torch.nn.functional.layer_norm`, which did not yet have python-level support.
Benchmarks are still WIP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34725
Reviewed By: mruberry
Differential Revision: D22357738
Pulled By: ezyang
fbshipit-source-id: 0e7f4a58517867b2e3f193a0a8390e2ed294e1f3
Summary:
Assert in OptionalType::create for valid TypePtr to catch all uses, as well as in python resolver to propagate slightly more helpful error message.
Closes https://github.com/pytorch/pytorch/issues/40713.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41509
Reviewed By: suo
Differential Revision: D22563710
Pulled By: wconstab
fbshipit-source-id: ee6314b1694a55c1ba7c8251260ea120be148b17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41464
If the input is int8 rowwise quantized, currently we cannot lower it to Glow. And previously, we had some errors when running with inbatch broadcast. The main issue is that the Tile op doesn't support the uint8_t type, which is very easily added here. However, this results in the non-ideal situation that we would leave Tile -> Fused8BitRowwiseQuantizedToFloat on the host side, which probably hurts the memory bandwidth a lot. Even if we later add support for Fused8BitRowwiseQuantizedToFloat in Glow, it's still not ideal because we would be doing redundant compute on identical columns. So the solution here is to swap the order of Fused8BitRowwiseQuantizedToFloat and Tile to make it Tile -> Fused8BitRowwiseQuantizedToFloat. In this way, it resolves the error we saw immediately. In the short term, we can still run Tile on the card. And in the longer term, things run faster on the card.
The optimization is a heuristic. If in the net, there isn't such pattern, inbatch broadcast will work as it was before.
(Note: this ignores all push blocking failures!)
Test Plan:
```
buck test caffe2/caffe2/opt/custom:in_batch_broadcast_test
```
Reviewed By: benjibc
Differential Revision: D22544162
fbshipit-source-id: b6dd36a5925a9c8103b80f034e7730a7a085a6ff
Summary:
Delete "nogpu" job since both "AVX" and "AVX2" jobs already act like one
Fix naming problem when NO_AVX_NO_AVX2 job and NO_AVX2 jobs were semantically identical, due to the following logic in test.sh:
```
if [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX-* ]]; then
export ATEN_CPU_CAPABILITY=default
elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX2-* ]]; then
export ATEN_CPU_CAPABILITY=avx
fi
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41565
Reviewed By: seemethere
Differential Revision: D22584743
Pulled By: malfet
fbshipit-source-id: 783cce60f35947b5d1e8b93901db36371ef78243
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41343
Currently caffe2.InitOpLibrary does the dll import unilaterally. Instead, if we make a lazy version and use it, then many pieces of code which do not need the caffe2 operators get a lot faster.
On a real test, the import time went from 140s to 68s.
This also cleans up the algorithm slightly (although it makes a very minimal
difference), by parsing the list of operators once, rather than every time a
new operator is added, since we defer the RefreshCall until after we've
imported all the operators.
The key way we maintain safety is that as soon as someone does an operation
which requires an operator (or could), we force importing of all available
operators.
Future work could include trying to identify which code is needed for which
operator and only import the needed ones. There may also be wins available by
playing with dlmopen (which opens within a namespace), or seeing if the dl
flags have an impact (I tried this and didn't see an impact, but dlmopen may
make it better).
Note that this was previously landed and reverted. The issue was that if an import failed and raised an exception, the specific library would not be removed from the lazy imports. This caused our tests which had libraries that failed to poison all other tests that ran after them. This has been fixed and a unit test has been added for this case (to help make it obvious what failed).
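A minimal Python sketch of the lazy pattern described above (function names such as `lazy_init_op_library` are hypothetical, not the actual caffe2 API):
```
import ctypes

_pending_libs = []
_ops_loaded = False

def lazy_init_op_library(path):
    # Cheap: just remember the library; do not dlopen yet.
    _pending_libs.append(path)

def _force_import_all():
    global _ops_loaded
    for path in list(_pending_libs):
        # Remove first so a failing library cannot poison later calls
        # (the bug that caused the earlier revert of this change).
        _pending_libs.remove(path)
        ctypes.CDLL(path)
    _ops_loaded = True

def create_operator(op_type, *args, **kwargs):
    # Anything that needs the operator registry forces the deferred imports.
    if not _ops_loaded:
        _force_import_all()
    ...  # proceed with the normal operator lookup
```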
Test Plan:
I added a new test, lazy_dyndep_test.py (copied from all_compare_test.py).
I'm a little concerned that I don't see any explicit tests for dyndep, but this
should provide decent coverage.
I've added a specific test to handle the poisoning issues mentioned above, which caused the previous version to get reverted.
Differential Revision: D22506369
fbshipit-source-id: 7395df4778e8eb0220630c570360b99a7d60eb83
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41405
Test Plan:
**Imported from GitHub: all checks have passed**
{F244195355}
**The Intern Builds & Tests have 127 successes, 5 no-signals, and 1 failure. Double-checking the failed test log file, the failure is due to result differences:**
- AssertionError: 0.435608434677124 != 0.4356083869934082
- AssertionError: 0.4393022060394287 != 0.4393021583557129
- AssertionError: 0.44707541465759276 != 0.44707536697387695
These are all very small numerical errors (within 0.0000001).
Reviewed By: malfet
Differential Revision: D22531486
Pulled By: threekindoms
fbshipit-source-id: 21543ec76bb9b502885b5146c8ba5ede719be9ff
Summary:
Add a Conf.is_test_stage() method to avoid duplicating `stage in ['test', 'test1', 'test2']` checks throughout the code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41553
Test Plan: Make sure that in modified config.yml ROCM tests jobs are assigned `pytorch/amd-gpu` resource class
Reviewed By: yns88
Differential Revision: D22580471
Pulled By: malfet
fbshipit-source-id: 514555f0c0ac94c807bf837ba209560055335587
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41515
add test in atest.cpp to cover logical_and_kernel, logical_or_kernel and logical_nor_kernel in Aten/native/cpu/BinaryOpsKernel.cpp
https://pxl.cl/1drmV
Test Plan: CI
Reviewed By: malfet
Differential Revision: D22565235
fbshipit-source-id: 7ad9fd8420d7fdd23fd9a703c75da212f72bde2c
Summary:
A small PR fixing some formatting in lcm, gcd, and the serialization note. Adds a note to lcm and gcd explaining behavior that is not always defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41526
Reviewed By: ngimel
Differential Revision: D22569341
Pulled By: mruberry
fbshipit-source-id: 5f5ff98c0831f65e82b991ef444a5cee8e3c8b5a
Summary:
Fix https://github.com/pytorch/pytorch/issues/32530
I used the next() function to generate samples one at a time. To support replacement=False, I added a variable called "sample_list" to RandomSampler that holds a random permutation.
cc SsnL
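A condensed sketch of the generator-style sampler described above (not the exact PR code):
```
import torch
from torch.utils.data import Sampler

class LazyRandomSampler(Sampler):
    def __init__(self, data_source, replacement=False):
        self.data_source = data_source
        self.replacement = replacement

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            # With replacement, each sample is an independent draw.
            for _ in range(n):
                yield int(torch.randint(high=n, size=(1,)))
        else:
            # Without replacement, draw lazily from a pre-shuffled permutation.
            sample_list = torch.randperm(n).tolist()
            yield from sample_list

    def __len__(self):
        return len(self.data_source)
```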
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40026
Reviewed By: zhangguanheng66
Differential Revision: D22519869
Pulled By: ezyang
fbshipit-source-id: be65850025864d659a713b3bc461b25d6d0048a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41424
Adding alltoall to Gloo process group
Test Plan:
buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest
Verified on TSC as well D22141532
Reviewed By: osalpekar
Differential Revision: D22451929
fbshipit-source-id: 695c4655c894c85229b16097fa63352ed04523ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41218
This test doesn't assert anything and was accidentally committed as
part of a larger diff a few months ago.
ghstack-source-id: 107882848
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D22469852
fbshipit-source-id: 0baa23da56b08200e16cf66df514566223dd9b15
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41217
Fixes this flaky test. Due to the possibility of callback
finishCreatingOwnerRRef running after request_callback has processed and
created the owner RRef, we could actually end up with 0 owners on the node,
since the callback removes from the owners_ map. In this case, shutdown is fine
since there are no owners. On the other hand, if the callback runs first, there
will be 1 owner which we will delete in shutdown when we detect it has no
forks. So either way, shutdown works fine and we don't need to enforce there to
be 1 owner.
ghstack-source-id: 107883497
Test Plan: Ran the test 500 times with TSAN.
Reviewed By: ezyang
Differential Revision: D22469806
fbshipit-source-id: 02290d6d5922f91a9e2d5ede21d1cf1c4598cb46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41265
This PR adds tests for the Async Work wait-level timeouts that were added in the previous PR
ghstack-source-id: 107835732
Test Plan: New tests are in this diff - Running on local machine and Sandcastle
Reviewed By: jiayisuse
Differential Revision: D22470084
fbshipit-source-id: 5552e384d384962e359c5f665e6572df03b6aa63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40948
Add work-level timeouts to ProcessGroupGloo. This uses the timeout support in `waitSend` and `waitRecv` functions from Gloo's `unbound_buffer` construct.
ghstack-source-id: 107835738
Test Plan: Tests are in the last PR in this stack
Reviewed By: jiayisuse
Differential Revision: D22173763
fbshipit-source-id: e0493231a23033464708ee2bc0e295d2b087a1c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40947
This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error.
ghstack-source-id: 107835734
Test Plan: This diff added tests - checking CI/Sandcastle for correctness. These are NCCL tests so they require at least 2 GPUs to run.
Reviewed By: jiayisuse
Differential Revision: D22173101
fbshipit-source-id: 8595e4b67662cef781b20ced0befdcc53d157c39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40946
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.
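Roughly, the user-facing shape of the API added by this stack (a sketch; it assumes a process group has already been initialized, and the values are illustrative):
```
import datetime
import torch
import torch.distributed as dist

# assumes e.g. dist.init_process_group("nccl", rank=rank, world_size=world_size)
t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)
# Override the process-group default with a per-wait timeout; a timed-out
# NCCL op aborts the communicator and raises.
work.wait(timeout=datetime.timedelta(milliseconds=250))
```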
ghstack-source-id: 107835739
Test Plan: Test added to `ProcessGroupNCCLTest` in the next PR in this stack.
Reviewed By: jiayisuse
Differential Revision: D22127898
fbshipit-source-id: 543964855ac5b41e464b2df4bb6c211ef053e73b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40944
This stack adds Work-level timeout for blocking wait.
This PR just changes the API to accept a default wait arg for the wait function in each ProcessGroup backend. The ProcessGroup superclass correctly waits for the given timeout by changing the CV wait to wait_for.
Closes: https://github.com/pytorch/pytorch/issues/37571
ghstack-source-id: 107835735
Test Plan: Tests in 4th PR in this stack
Reviewed By: jiayisuse
Differential Revision: D22107135
fbshipit-source-id: b38c07cb5e79e6c86c205e580336e7918ed96501
Summary:
The test was always running on the CPU. This actually caused it to throw an error on non-MKL builds, since the CUDA test (which ran on the CPU) tried to execute but the test requires MKL (a requirement only checked for the CPU variant of the test).
Fixes https://github.com/pytorch/pytorch/issues/41402.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41523
Reviewed By: ngimel
Differential Revision: D22569344
Pulled By: mruberry
fbshipit-source-id: e9908c0ed4b5e7b18cc7608879c6213fbf787da2
Summary:
This test function is confusing since our `assertEqual` behavior allows for tolerance to be specified, and this is a redundant mechanism.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514
Reviewed By: ngimel
Differential Revision: D22569348
Pulled By: mruberry
fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 73ea1f5828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40332
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: gchanan, yns88
Differential Revision: D22150737
fbshipit-source-id: fe7e6787adef9e2fedee5d1a0a1e57bc4760b88c
Summary:
Update the API to access grad in cpp to avoid unexpected thread safety issues.
In particular, with the current API, a check like `t.grad().defined()` is not thread safe.
- This introduces `t.mutable_grad()` that should be used when getting a mutable version of the saved gradient. This function is **not** thread safe.
- The `Tensor& grad()` API is now removed. We could not do a deprecation cycle, as most of our call sites use non-const Tensors, which would pick the non-const overload. This would lead to most calls hitting the warning, which would be too verbose for all the users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40887
Reviewed By: ezyang
Differential Revision: D22343932
Pulled By: albanD
fbshipit-source-id: d5eb909bb743bc20caaf2098196e18ca4110c5d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41505
fix the dequantization to match the fixes from quantization
Test Plan:
the test is not conclusive, since it only compares emulation against a reference collected from Amy's run;
running an evaluation workflow at the moment
Reviewed By: venkatacrc
Differential Revision: D22558092
fbshipit-source-id: 3ff00ea15eac76007e194659c3b4949f07ff02a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41494
revert back to the changes from amylittleyang to make quantization work
Test Plan:
ran against a dump from ctr_instagram, and verified that:
- nnpi and fakelowp match bitwise
- nnpi differs by at most 1 vs fbgemm, most likely due to the type of rounding
Reviewed By: venkatacrc
Differential Revision: D22555276
fbshipit-source-id: 7074521d181f15ef6270985bb71c4b44d25d1c30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41476
deleted this test from the default run; re-adding it in its own file to make it
more explicit
Test Plan: ran the test
Reviewed By: yinghai
Differential Revision: D22550217
fbshipit-source-id: 758e279b2bab3b23452a3d0ce75fb366f7afb7be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41461
capacity is misleading, and we have many wrong uses internally. Let's rename it to nbytes to avoid the confusion in the future. Ultimately, we could remove this parameter if possible.
So far I haven't seen any case where this capacity is necessary.
Test Plan: oss ci
Differential Revision: D22544189
fbshipit-source-id: f310627f2ab8f4ebb294e0dd5eabc380926991eb
Summary:
Declaring GLOG_ constants in the google namespace causes a conflict in C++ projects that use GLOG and link with LibPyTorch compiled without GLOG.
For example, see https://github.com/facebookresearch/ReAgent/issues/288
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41504
Reviewed By: kaiwenw
Differential Revision: D22564308
Pulled By: malfet
fbshipit-source-id: 2167bd2c6124bd14a67cc0a1360521d3c375e3c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41430
To avoid duplication at compile time, modularize the common autograd files used by both mobile and full-jit.
ghstack-source-id: 107742889
Test Plan: CI
Reviewed By: kwanmacher
Differential Revision: D22531358
fbshipit-source-id: 554f10be89b7ed59c9bde13387a0e1b08000c116
Summary:
The contiguity preprocessing was mistakenly removed in
cd48fb503088af2c00884f1619db571fffbcdafa . It causes erroneous output
when the output tensor is not contiguous. Here we restore this
preprocessing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41286
Reviewed By: zou3519
Differential Revision: D22550822
Pulled By: ezyang
fbshipit-source-id: ebad4e2ba83d2d808e3f958d4adc9a5513a95bec
Summary:
Doc update intended to clarify and expand our current serialization behavior, including explaining the difference between torch.save/torch.load, torch.nn.Module.state_dict/torch.nn.Module.load_state_dict, and torch.jit.save/torch.jit.load. Also explains when historic serialized TorchScript behavior is preserved and our recommendation for preserving behavior (using the same PyTorch version to consume a model as produced it).
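For reference, the three paths the note distinguishes look roughly like this (a minimal sketch; filenames are placeholders):
```
import torch

model = torch.nn.Linear(2, 2)

torch.save(model, "model.pt")                   # pickles the whole module
torch.save(model.state_dict(), "weights.pt")    # saves only parameters/buffers
model.load_state_dict(torch.load("weights.pt"))

scripted = torch.jit.script(model)
torch.jit.save(scripted, "scripted.pt")         # self-contained TorchScript archive
loaded = torch.jit.load("scripted.pt")
```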
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41395
Reviewed By: ngimel
Differential Revision: D22560538
Pulled By: mruberry
fbshipit-source-id: dbc2f1bb92ab61ff2eca4888febc21f7dda76ba1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41429
This diff contains the benchmark test to evaluate the speed of executing the learnable fake quantization operator, both in the forward path and the backward path, with respect to both per tensor and per channel usages.
Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`
Benchmark Results (Locally on CPU):
Each sample has dimensions **3x256x256**; Each batch has 16 samples (`N=16`)
- Per Tensor Forward: 0.023688 sec/sample
- Per Tensor Backward: 0.165926 sec/sample
- Per Channel Forward: 0.040432 sec/sample
- Per Channel Backward: 0.173528 sec/sample
Reviewed By: vkuzo
Differential Revision: D22535252
fbshipit-source-id: e8e953ff2de2107c6f2dde4c8d5627bdea67ef7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893
Adding an end to end test for running a simple training loop in C++
for the distributed RPC framework.
The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the Future. Enabling LSAN with python multiprocessing is tricky and we
haven't found a solution for this. As a result, adding a C++ test that triggers
most of the critical codepaths would be good for now.
As an example, this unit test would've caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167
Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D21112208
fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
Summary:
This pulls the following merge requests from CMake upstream:
- https://gitlab.kitware.com/cmake/cmake/-/merge_requests/4979
- https://gitlab.kitware.com/cmake/cmake/-/merge_requests/4991
The above two merge requests improve the Ampere build:
- If `TORCH_CUDA_ARCH_LIST` is not set, it can now automatically pick up 8.0 as part of its default value
- If `TORCH_CUDA_ARCH_LIST=Ampere`, it no longer fails with `Unknown CUDA Architecture Name Ampere in CUDA_SELECT_NVCC_ARCH_FLAGS`
Code related to architectures < 3.5 is manually removed because PyTorch no longer supports them.
cc: ngimel ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41133
Reviewed By: malfet
Differential Revision: D22540547
Pulled By: ezyang
fbshipit-source-id: 6e040f4054ef04f18ebb7513497905886a375632
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41262
In this diff, an implementation is provided to support the GPU kernel for the learnable fake quantize per tensor operators.
Test Plan: On a devvm, run `buck test //caffe2/test:quantization -- learnable` to test both the forward and backward for the learnable per tensor fake quantize kernels. The test will test the `cuda` version if a gpu is available.
Reviewed By: vkuzo
Differential Revision: D22478832
fbshipit-source-id: 2731bd8b57bc83416790f6d65ef42d450183873c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41127
In this diff, an implementation is provided to support the GPU kernel for the learnable fake quantize per tensor operators.
Test Plan: On a devvm, run `buck test //caffe2/test:quantization -- learnable` to test both the forward and backward for the learnable per tensor fake quantize kernels. The test will test the `cuda` version if a gpu is available.
Reviewed By: z-a-f
Differential Revision: D22435037
fbshipit-source-id: 515afde13dd224d21fd47fb7cb027ee8d704cbdd
Summary: Adding epsilon input argument to the Logit Op
Test Plan: Added test_logit test case.
Reviewed By: hyuen
Differential Revision: D22537133
fbshipit-source-id: d6f89afd1589fda99f09550a9d1b850cfc0b9ee1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41293
Add new operators that does quantize and packing for 8 bit and 4 bit embedding bag operators.
This is an initial change to help unblock testing. This will be followed by adding graph mode passes to enable quantization of the embedding_bag module
Note to reviewers: Future PRs will replace this op with a separate quantize and pack operator and add support for floating point scale and zero point.
Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBag
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D22506700
fbshipit-source-id: 090cc85a8f56da417e4b7e45818ea987ae97ca8a
Summary:
Add support for including pytorch via an add_subdirectory()
This requires using PROJECT_* instead of CMAKE_* which refer to
the top-most project including pytorch.
TEST=add_subdirectory() into a pytorch checkout and build.
There are still some hardcoded references to TORCH_SRC_DIR; I will
fix them in a follow-on commit. For now you can create a symlink to
<pytorch>/torch/ in your project.
Change-Id: Ic2a8aec3b08f64e2c23d9e79db83f14a0a896abc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41387
Reviewed By: zhangguanheng66
Differential Revision: D22539944
Pulled By: ezyang
fbshipit-source-id: b7e9631021938255f0a6ea897a7abb061759093d
Summary:
The test asserts that the stream is "ready" but doesn't wait for the
event to be "executed" which makes it fail on some platforms where the
`query` call occurs "soon enough".
Fixes https://github.com/pytorch/pytorch/issues/38807
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41398
Reviewed By: zhangguanheng66
Differential Revision: D22540012
Pulled By: ezyang
fbshipit-source-id: 6f56d951e48133ce4f6a9a54534298b7d2877c80
Summary:
Related to https://github.com/pytorch/pytorch/issues/41368
These benchmarks support CUDA already so there is no reason for it not to be in the benchmark config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41438
Reviewed By: zhangguanheng66
Differential Revision: D22540756
Pulled By: ezyang
fbshipit-source-id: 621eceff37377c1ab06ff7483b39fc00dc34bd46
Summary: Adding shape inference for SparseToDense. The proposed shape inference implementation only works when data_to_infer_dim is given; otherwise the SparseToDense output dimension depends on the max value of the input tensor
Test Plan:
buck test //caffe2/caffe2/python:sparse_to_dense_test
buck test //caffe2/caffe2/python:hypothesis_test -- test_sparse_to_dense
Dper3 Changes:
f204594813
buck test dper3/dper3_models/ads_ranking/model_impl/sparse_nn/tests:sparse_nn_lib_test
Reviewed By: zhongyx12, ChunliF
Differential Revision: D22479511
fbshipit-source-id: 8983a9baea8853deec53ad6f795c874c3fb93de0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41452
The model exported from online training workflow with int8 quantization contains FCs with 4 inputs. The extra input is the quant_param blob. This diff is to adjust the bound_shape_inferencer to get shape info for the quant_param input.
Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```
Reviewed By: anurag16
Differential Revision: D22543215
fbshipit-source-id: 0977fca06630e279d47292e6b44f3d8180a767a5
Summary:
1. Support SparseAdagradFusedWithSparseLengthsMeanGradient and RowWiseSparseAdagradFusedWithSparseLengthsMeanGradient on CPU and GPU
2. Add the dedup implementation of fused RowWiseAdagrad op on GPUs for mean pooling
Reviewed By: xianjiec
Differential Revision: D22165603
fbshipit-source-id: 743fa55ed5893c34bc6406ddfbbbb347b88091d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41414
This diff exports replaceAllUsesAfterNodeWith to the Python API.
Test Plan: Tested locally. Please let me know if there is a set of unit tests to be passed outside of the default ones triggered by Sandcastle.
Reviewed By: soumith
Differential Revision: D22523211
fbshipit-source-id: 3f075bafa6208ada462abc57d495c15179a6e53d
Summary:
remove the layernorm templates and make them float, since that's the only variant;
minor fixes in logging and testing
Test Plan: ran the test
Reviewed By: venkatacrc
Differential Revision: D22527359
fbshipit-source-id: d6eec362a6e88e1c12fddf820ae629ede13fb2b8
Summary:
BCELoss currently uses different broadcasting semantics than numpy. Since previous versions of PyTorch have thrown a warning in these cases telling the user that input sizes should match, and since the CUDA and CPU results differ when sizes do not match, it makes sense to upgrade the size mismatch warning to an error.
We can consider supporting numpy broadcasting semantics in BCELoss in the future if needed.
Closes https://github.com/pytorch/pytorch/issues/40023
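After this change, a size mismatch that previously only warned now raises (sketch):
```
import torch

loss = torch.nn.BCELoss()
inp = torch.rand(4, 1)
target = torch.rand(4)          # shape differs from (4, 1)
try:
    loss(inp, target)
except ValueError as e:
    print(e)                    # sizes must match; broadcasting is not applied
```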
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41426
Reviewed By: zou3519
Differential Revision: D22540841
Pulled By: ezyang
fbshipit-source-id: 6c6d94c78fa0ae30ebe385d05a9e3501a42b3652
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41287
Profiler tests that test profiling with builtin functions and `test_callback_simple` test has been broken for a while. This diff fixes that by preferring c10 ops to non-c10 ops in our operation matching logic.
The result of this is that these ops go through the c10 dispatch and thus have profiling enabled. For `test_callback_simple` this results in the effect that we choose `aten::add.Tensor` over `aten::add.Int` which fixes the type issue.
Test Plan:
Ensured that the tests are no longer flaky by running them a bunch
of times.
Reviewed By: vincentqb
Differential Revision: D22489197
fbshipit-source-id: 8452b93e4d45703453f77d968350c0d32f3f63fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41202
This commit fixes dead stores in JIT surfaced by the Quality Analyzer.
Test Plan: Continuous integration.
Reviewed By: jerryzh168
Differential Revision: D22461492
fbshipit-source-id: c587328f952054fb9449848e90b7d28a20aed4af
Summary: The device attribute in the op benchmark can only include 'cpu' or 'cuda'. So adding a check in this diff.
Test Plan: buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --warmup_iterations 1 --iterations 1
Reviewed By: ngimel
Differential Revision: D22538252
fbshipit-source-id: 3e5af72221fc056b8d867321ad22e35a2557b8c3
Summary: Change the device config in qobserver test to a string to honor --device flag.
Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qobserver_test -- --iterations 1 --device cpu
Reviewed By: ngimel
Differential Revision: D22536379
fbshipit-source-id: 8926b2393be1f52f9183f8205959a3ff18e3ed2a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41312
I was hoping that exhale had gotten incremental recompilation
in its latest version, but experimentally this does not seem
to have been the case. Still, I had gotten the whole shebang
to be working on the latest version of these packages, so might
as well land the upgrade. There was one bug in Optional.h that
I had to fix; see the cited bug report.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D22526349
Pulled By: ezyang
fbshipit-source-id: d4169c2f48ebd8dfd8a593cc8cd232224d008ae9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41354
The nightly pipeline has the potential to be flaky and thus the html
pages have the potential not to be updated.
This should actually be done as an automatic lambda job that runs
whenever the S3 bucket updates, but this is an intermediate step in order to
get there.
Closes https://github.com/pytorch/pytorch/issues/40998
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D22530283
Pulled By: seemethere
fbshipit-source-id: 0d80b7751ede83e6dd466690cc0a0ded68f59c5d
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36403
Copy-paste of the issue description:
* Escape hatch: Introduce unsafe_* version of the three functions above that have the current behavior (outputs not tracked as views). The documentation will explain in detail why they are unsafe and when it is safe to use them. (basically, only the outputs OR the input can be modified inplace but not both. Otherwise, you will get wrong gradients).
* Deprecation: Use the CreationMeta on views to track views created by these three ops and throw warning when any of the views is modified inplace saying that this is deprecated and will raise an error soon. For users that really need to modify these views inplace, they should look at the doc of the unsafe_* version to make sure their usecase is valid:
* If it is not, then pytorch is computing wrong gradients for their use case and they should not do inplace anymore.
* If it is, then they can use the unsafe_* version to keep the current behavior.
* Removal: Use the CreationMeta on view to prevent any inplace on these views (like we do for all other views coming from multi-output Nodes). The users will still be able to use the unsafe_ versions if they really need to do this.
Note about BC-breaking:
- This PR changes the behavior of the regular function by making them return proper views now. This is a modification that the user will be able to see.
- We skip all the view logic for these views and so the code should behave the same as before (except the change in the `._is_view()` value).
- Even though the view logic is not performed, we do raise deprecation warnings for the cases where doing these ops would throw an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39299
Differential Revision: D22432885
Pulled By: albanD
fbshipit-source-id: 324aef091b32ce69dd067fe9b13a3f17d85d0f12
Summary:
Fixes the overhead reported by ngimel in https://github.com/pytorch/pytorch/pull/40927#issuecomment-657709646
As it turns out, `Tensor.size(n)` has more overhead than `Tensor.sizes()[n]`. Since addmm does a lot of introspection of the input matrix sizes and strides, this added up to a noticeable (~1 us) constant time overhead.
With this change, a 1x1 matmul takes 2.85 us on my machine compared to 2.90 us on pytorch 1.5.
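A quick way to reproduce the kind of measurement quoted above (numbers will vary by machine; this is just a per-call overhead probe, not the PR's benchmark):
```
import timeit
import torch

a = torch.ones(1, 1)
# Per-call overhead dominates for a 1x1 matmul.
t = timeit.timeit(lambda: torch.addmm(a, a, a), number=100_000)
print(f"{t / 100_000 * 1e6:.2f} us per call")
```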
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41374
Reviewed By: ailzhang
Differential Revision: D22519924
Pulled By: ngimel
fbshipit-source-id: b29504bee7de79ce42e5e50f91523dde42b073b7
Summary:
nccl tests and the parallelize_bmuf_distributed test are failing on rocm3.5.1. Skipping these tests to unblock upgrading the CI to rocm3.5.1
jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41409
Reviewed By: orionr
Differential Revision: D22528928
Pulled By: seemethere
fbshipit-source-id: 928196b7a62a441d391e69f54b278313ecc75d77
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41145
**Summary**
This commit adds out-of-source-tree tests for `to_backend`. These tests check
that a Module can be lowered to a backend, exported, loaded (in both
Python and C++) and executed.
**Fixes**
This commit fixes #40067.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D22510076
Pulled By: SplitInfinity
fbshipit-source-id: f65964ef3092a095740f06636ed5b1eb0884492d
Summary:
Closes https://github.com/pytorch/pytorch/issues/36977
This avoid the division by zero that was causing NaNs to appear in the output. `AvgPooling2d` and `AvgPooling3d` both had this issue on CPU and CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41368
Reviewed By: ailzhang
Differential Revision: D22520013
Pulled By: ezyang
fbshipit-source-id: 3ece7829f858f5bc17c2c1d905266ac510f11194
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41270
The Smart Keyboard model for Oculus requires operators previously not in the lite interpreter: aten::exp (for floats), aten::ord, aten::lower, aten::__contains__.str_list, aten::slice.str, aten::strip, aten::split.str, and aten::__getitem__.str.
Test Plan:
Verify smart keyboard model can be used:
Check out next diff in stack and follow test instructions there
Reviewed By: iseeyuan
Differential Revision: D22289812
fbshipit-source-id: df574d5af4d4fafb40f0e209b66a93fe02d83020
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40179
- Pass no-psabi to suppress GCC's warning about "The ABI for passing
parameters with 64-byte alignment has changed in GCC 4.6"
- Fix use of deprecated data() accessor (and minor optimization: hoist
accessor out of loop)
- Undeprecate NetDef.num_workers, no one is serious about fixing these
- Suppress warnings about deprecated pthreadpool types
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22234138
Pulled By: ezyang
fbshipit-source-id: 6a1601b6d7551a7e6487a44ae65b19acdcb7b849
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41330
`torch.cuda.check_error` is annotated as taking an `int` as argument but when running `torch.cuda.check_error(34)` one would get:
```
TypeError: cudaGetErrorString(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch._C._cudart.cudaError) -> str
Invoked with: 34
```
Even if one explicitly casted the argument, running `torch.cuda.check_error(torch._C._cudart.cudaError(34))` would give:
```
AttributeError: 'str' object has no attribute 'decode'
```
This PR fixes both issues (thus allowing `check_error` to be called with an un-casted int) and adds a test.
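Usage sketch after the fix (requires a CUDA build of PyTorch; 34 is the raw error code used in the example above):
```
import torch

try:
    torch.cuda.check_error(34)
except RuntimeError as e:
    print(e)   # a readable cudaError message instead of a TypeError
```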
ghstack-source-id: 107628709
Test Plan: Unit tests
Reviewed By: ezyang
Differential Revision: D22500549
fbshipit-source-id: 9170c1e466dd554d471e928b26eb472a712da9e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41198
https://github.com/pytorch/pytorch/pull/39611 unified signatures of some ops taking TensorOptions arguments by making them optional.
That has FC implications, but only for models written with a PyTorch version after that change (see the explanation in the description of that PR).
However, it also changed the default from `pin_memory=False` to `pin_memory=None`, which actually breaks FC for preexisting models too if they're re-exported with a newer PyTorch,
because we materialize default values when exporting. This is bad.
This PR reverts that particular part of https://github.com/pytorch/pytorch/pull/39611 to revert the FC breakage.
ghstack-source-id: 107475024
Test Plan: waitforsandcastle
Reviewed By: bhosmer
Differential Revision: D22461661
fbshipit-source-id: ba2776267c3bba97439df66ecb50be7c1971d20d
Summary: add logit and swish to this list
Test Plan: f203925461
Reviewed By: amylittleyang
Differential Revision: D22506814
fbshipit-source-id: b449e4ea16354cb76915adb01cf317cffb494733
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41297
GH issue: https://github.com/pytorch/pytorch/issues/40105
Add a helper function to DDP to print out all relevant env vars for debugging
Test Plan:
test through unittest, example output:
---
env:RANK=3
env:LOCAL_RANK=N/A
env:WORLD_SIZE=N/A
env:MASTER_PORT=N/A
env:MASTER_ADDR=N/A
env:CUDA_VISIBLE_DEVICES=N/A
env:GLOO_SOCKET_IFNAME=N/A
env:GLOO_DEVICE_TRANSPORT=N/A
env:NCCL_SOCKET_IFNAME=N/A
env:NCCL_BLOCKING_WAIT=N/A
...
---
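A standalone sketch of what such a helper does (the function name and location here are illustrative, not the actual private helper that landed):
```
import os

RELEVANT_ENV_VARS = [
    "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_PORT", "MASTER_ADDR",
    "CUDA_VISIBLE_DEVICES", "GLOO_SOCKET_IFNAME", "GLOO_DEVICE_TRANSPORT",
    "NCCL_SOCKET_IFNAME", "NCCL_BLOCKING_WAIT",
]

def dump_ddp_relevant_env_vars():
    # Print each variable, falling back to N/A when it is unset.
    for var in RELEVANT_ENV_VARS:
        print(f"env:{var}={os.environ.get(var, 'N/A')}")
```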
Reviewed By: mrshenli
Differential Revision: D22490486
fbshipit-source-id: 5dc7d2a18111e5a5a12a1b724d90eda5d35acd1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41195
`BLAS_F2C` is set in `THGeneral.h`.
`sdot` is redefined with a double return type in the case that `BLAS_F2C` is set and `BLAS_USE_CBLAS_DOT` is not.
Test Plan: CircleCI green, ovrsource green
Reviewed By: malfet
Differential Revision: D22460253
fbshipit-source-id: 75f17b3e47da0ed33fcadc2843a57ad616f27fb5
Summary:
1. When doing convert(), preserve the module's **pre and post forward** hooks
2. When doing fusion, preserve only the module's **pre forward** hooks (because after fusion the output is no longer the same)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37233
Differential Revision: D22425141
Pulled By: jerryzh168
fbshipit-source-id: e69b81821d507dcd110d2ff3594ba94b9593c8da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41215
To unblock int8 model productization on accelerators, we need the shape and type info for all the blobs after int8 quantization. This diff added shape inference functions for int8 quantization related ops.
Test Plan:
```
buck test caffe2/caffe2/quantization/server:int8_gen_quant_params_test
buck test caffe2/caffe2/quantization/server:fully_connected_dnnlowp_op_test
```
Reviewed By: hx89
Differential Revision: D22467487
fbshipit-source-id: 8298abb0df3457fcb15df81f423f557c1a11f530
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37176
The non-deprecated user-facing interface to these ops (F.interpolate)
has a good interface: output size and scale are both specified as
a scalar or list, and exactly one must be present. These aten ops
have an older, clunkier interface where output size is required and
scales are specified as separate optional scalars per dimension.
This change adds new overloads to the aten ops that match the interface
of interpolate. The plan is to eventually remove the old overloads,
resulting in roughly net-zero code added. I also believe it is possible
to push this interface down further, eliminating multiple optional<double>
arguments, and simplifying the implementations.
The rollout plan is to land this, wait for a reasonable interval for
forwards-compatibility (maybe 1 week?), land the change that updates
interpolate to call these overloads, wait for a reasonable interval
for backwards-compatibility (maybe 6 months?), then remove the old
overloads.
This diff does not add the `.out` variants of the ops because they
are not currently accessible through any user-facing API.
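For context, this is the user-facing interface the new overloads are meant to mirror:
```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
# Exactly one of `size` or `scale_factor` may be given.
F.interpolate(x, size=(16, 16), mode="nearest")
F.interpolate(x, scale_factor=2.0, mode="nearest")
```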
ghstack-source-id: 106938113
Test Plan:
test_nn covers these ops fairly well, so that should prevent this diff
from breaking anything on its own.
test_nn on the next diff in the stack actually uses these new overloads,
so that should validate that they are actually correct.
Differential Revision: D21209989
fbshipit-source-id: 2b74d230401f071364eb05e138cdaa55279cfe91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37175
ghstack-source-id: 106938114
Test Plan: Upcoming diffs use this for upsampling.
Differential Revision: D21209994
fbshipit-source-id: 1a71c07e45e28772a2bbe450b68280dcc0fe2def
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41313
This diff backs out the backout diff. The failure was due to C++ `or`
not being supported by MSVC; it is now replaced with `||`.
Original commit changeset: fc7f3f8c968d
Test Plan: Existing unit tests, check github CI.
Reviewed By: malfet
Differential Revision: D22494777
fbshipit-source-id: 3271288919dc3a6bfb82508ab9d021edc910ae45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41268
We also want nvidia runtime packages to get installed when the
BUILD_ENVIRONMENT also includes "*cu*"
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22505885
Pulled By: seemethere
fbshipit-source-id: 4d8e70ed8aed9c6fd1828bc13cf7d5b0f8f50a0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41305
added a warning message when layernorm under/overflows, which is what
nnpi does, reducing the frequency of the logging to once every 1000 occurrences
Test Plan: compilation
Reviewed By: yinghai
Differential Revision: D22492726
fbshipit-source-id: 9343beeae6e65bf3846c6b3d2edd2a08dac85ed6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41315
We should pass the number of indices rather than the embedding size in the SparseAdagrad fused PyTorch operator
Reviewed By: jianyuh
Differential Revision: D22495422
fbshipit-source-id: ec5d3a5c9547fcd8f95106d912b71888217a5af0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41037
This diff contains the core implementation for the fake quantizer per channel kernel that supports back propagation on the scale and zero point.
Test Plan:
On a devvm, use:
- `buck test //caffe2/test:quantization -- learnable_forward_per_channel`
- `buck test //caffe2/test:quantization -- learnable_backward_per_channel`
Reviewed By: z-a-f
Differential Revision: D22395665
fbshipit-source-id: 280c2405d04adfeda9fb9cfc94d89e8d868e0d41
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41029
This diff contains the core implementation for the fake quantizer per tensor kernel that supports back propagation on the scale and zero point.
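In plain autograd terms the operator looks roughly like this (illustrative only; the actual kernels implement dedicated backward formulas for scale and zero_point rather than relying on autograd through round):
```
import torch

def fake_quantize_per_tensor(x, scale, zero_point, quant_min=-128, quant_max=127):
    # Quantize-dequantize round trip; clamp models the representable range.
    q = torch.clamp(torch.round(x / scale + zero_point), quant_min, quant_max)
    return (q - zero_point) * scale

x = torch.randn(8)
scale = torch.tensor(0.1, requires_grad=True)
zero_point = torch.tensor(0.0, requires_grad=True)

fake_quantize_per_tensor(x, scale, zero_point).sum().backward()
print(scale.grad, zero_point.grad)   # both leaves receive gradients
```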
Test Plan:
On a devvm, use:
- `buck test //caffe2/test:quantization -- learnable_forward_per_tensor`
- `buck test //caffe2/test:quantization -- learnable_backward_per_tensor`
Reviewed By: z-a-f
Differential Revision: D22394145
fbshipit-source-id: f6748b635b86679aa9174a8065e6be5e20a95d81
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41303
the error came from:
I0710 18:02:48.025024 1780875 NNPIOptions.cpp:49] [NNPI_LOG][D] [KS] convert_base_kernel_ivp.cpp(524): Output Scale 108240.101562 is out of valid range +-(Min 0.000061 Max 65504.000000)!!!
It seems the weights we are using are too small, thus generating scaling factors out of the range of fp16 (>65k). I am tentatively increasing this factor to a higher value (10x bigger) to avoid this.
Also increased max_examples to 100
Test Plan: ran this test
Reviewed By: yinghai
Differential Revision: D22492481
fbshipit-source-id: c0f9e59b0e70895ab787868ef1d87e6e80106554
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41299
When using `cub::DeviceRadixSort::SortPairs` (https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html), the `end_bit` argument, or the most-significant bit index (exclusive) needed for key comparison, should be passed with `int(log2(float(num_rows)) + 1)` instead of `int(log2(float(num_indice)) + 1)`. This is because all the values in indices array are guaranteed to be less than num_rows (hash_size), not num_indices. Thanks ngimel for pointing this point and thanks malfet for quickly fixing the log2() compilation issues.
Note:
An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
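The bit-width computation in question, spelled out (the num_rows value is illustrative):
```
import math

num_rows = 1000                                   # hash_size: all index values are < num_rows
end_bit = int(math.log2(float(num_rows))) + 1     # 10 bits are enough for values 0..999
print(end_bit)
```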
Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```
Reviewed By: malfet
Differential Revision: D22491662
fbshipit-source-id: 4fdabe86244c948af6244f9bd91712844bf1dec1
Summary:
It doesn't do a good job of checking BLAS library capabilities, so hardcode the undef of BLAS_F2C
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41285
Differential Revision: D22489781
Pulled By: malfet
fbshipit-source-id: 13a14f31e08d7f9ded49731e4fd23663bac75cd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41146
**Summary**
This commit adds support for using `Modules` that have been lowered as
submodules in `ScriptModules`.
**Test Plan**
This commit adds execution and save/load tests to test_backends.py for
backend-lowered submodules.
**Fixes**
This commit fixes #40069.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D22459543
Pulled By: SplitInfinity
fbshipit-source-id: 02e0c0ccdce26c671ade30a34aca3e99bcdc5ba7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41272
Implicit casting from int to float is resulting in a vec256_test build failure
internally. This diff fixes that.
Test Plan: Build vec256_test for android and run it on android phone.
Reviewed By: ljk53, paulshaoyuqiao
Differential Revision: D22484635
fbshipit-source-id: ebb9fc2eccb8261ab01d8266150fc3b05166f1e7
Summary:
# Description
The goal is to reduce the size of the docker image. I checked a few things:
* Docker layer overlaps
* Removing .git folder
* Removing intermediate build artifacts (*.o and *.a)
The only one that gave a satisfying result was the 3rd approach, removing *.o and *.a files. The final image went from 10 GB to 9.7 GB.
I used Dive (https://github.com/wagoodman/dive) to inspect the Docker image manually.
# Test:
* Check the image size was reduced
* No test failures in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41207
Test Plan:
* Check the image size was reduced
* No test failures in CI
Differential Revision: D22465221
Pulled By: ssylvain
fbshipit-source-id: 48754259729401e3c08447b0fa0630ca7217cb98
Summary:
Resubmit #40927
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678
`addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported so was just dead code.
After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354 which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching so thought I'd wait for comment before migrating this into that style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927
Reviewed By: ezyang
Differential Revision: D22468490
Pulled By: ngimel
fbshipit-source-id: f8a22be3216f67629420939455e31a88af20201d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40875
This op uses the given num_bins and a spacing strategy to automatically bin and compute the histogram of given matrices.
Test Plan: Unit tests.
Reviewed By: neha26shah
Differential Revision: D22329069
fbshipit-source-id: 28406b94e284d52d875f73662fc82f93dbc00064
Summary:
The shape is passed to _reshape_to_tensor as a Constant, so the shape of the input cannot be inferred when the model is exported with dynamic axes set. Instead of a Constant, pass the output of a Shape-Slice-Concat subgraph to compute the shape for the Reshape node in the _reshape_to_tensor function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40418
Reviewed By: hl475
Differential Revision: D22480127
Pulled By: houseroad
fbshipit-source-id: 11853adb6e6914936871db1476916699141de435
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41214
Same as D21032976, add check for duplicated op name in JIT
Test Plan:
run full JIT predictor
also
buck test pytorch-playground
Reviewed By: smessmer
Differential Revision: D22467871
fbshipit-source-id: 9b7a40a217e6c63cca44cad54f9f657b8b207a45
Summary:
some people have been confused by `retain_graph` in the snippet; they thought it was an additional requirement imposed by amp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41203
Differential Revision: D22463700
Pulled By: ngimel
fbshipit-source-id: e6fc8871be2bf0ecc1794b1c6f5ea99af922bf7e
Summary:
Per title. `lgamma` produces a different result for `-inf` compared to scipy, so the comparison there is skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41225
Differential Revision: D22473346
Pulled By: ngimel
fbshipit-source-id: e4ebda1b10e2a061bd4cef38d1d7b5bf0f581790
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40252
As title says.
Test Plan:
python test/test_mobile_optimizer.py
Imported from OSS
Differential Revision: D22126825
fbshipit-source-id: a1880587ba8db9dee0fa450bc463734e4a8693d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39631
Background:
Currently, we cannot send ScriptModule over RPC as an argument.
Otherwise, it would hit the following error:
> _pickle.PickleError: ScriptModules cannot be deepcopied using
> copy.deepcopy or saved using torch.save. Mixed serialization of
> script and non-script modules is not supported. For purely
> script modules use my_script_module.save(<filename>) instead.
Failed attempt:
I tried to install `torch.jit.ScriptModule` into RPC's
dispatch table, but it does not work as the dispatch table only
matches exact types and using the base type `torch.jit.ScriptModule`
does not work for derived types.
Current solution:
The current solution exposes `_enable_jit_rref_pickle` and
`_disable_jit_rref_pickle` APIs to toggle the `allowJitRRefPickle`
flag. See `test_pickle_script_module_with_rref` as an example.
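Roughly how the toggle is meant to be used (a sketch; the exact module path of `_enable_jit_rref_pickle`/`_disable_jit_rref_pickle` is an assumption here):
```
import torch.distributed.rpc as rpc

# Assumption: the toggles are exposed at this path; adjust to where they
# actually land in your version.
rpc._enable_jit_rref_pickle()
try:
    pass  # e.g. torch.save(...) of an RRef that holds a ScriptModule
finally:
    rpc._disable_jit_rref_pickle()
```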
Test Plan: Imported from OSS
Differential Revision: D21920870
Pulled By: mrshenli
fbshipit-source-id: 4d58afce5d0b4b81249b383c173488820b1a47d6
Summary:
A unique op test failure in caffe2 blocks upgrading CI to rocm3.5.1. Skipping the test to unblock; will re-enable after root-causing and fixing the issue.
jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41219
Differential Revision: D22471452
Pulled By: xw285cornell
fbshipit-source-id: 9e503c8b37c0a4b92632f77b2f8a90281a9889c3
Summary:
the current quantization rounding function uses fbgemm, which
defaults to round-to-nearest. The current hardware implementation uses round
flush to infinity. Adding an option to switch the rounding mode.
Test Plan: ran against test_fc_int8
Reviewed By: venkatacrc
Differential Revision: D22452306
fbshipit-source-id: d2a1fbfc695612fe07caaf84f52669643507cc9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39343
Building on top of the previous PR that adds the fused add_relu op, this PR adds
a JIT pass that transforms the input graph to find all fusable instances of add
+ relu and fuses them.
Test Plan:
python test/test_jit.py TestJit.test_add_relu_fusion
Imported from OSS
Differential Revision: D21822396
fbshipit-source-id: 12c7e8db54c6d70a2402b32cc06c7e305ffbb1be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39342
Many networks such as resnet have adds followed by relu. This op is the
first step in enabling this fused implementation.
Once we have the fused add_relu op, a JIT pass will be written to
replace add + relu patterns with add_relu.
Test Plan:
python test/test_nn.py TestAddRelu
Imported from OSS
Differential Revision: D21822397
fbshipit-source-id: 03df83a3e46ddb48a90c5a6f755227a7e361a0e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39341
This PR introduces a NEON backend for the vec256 class for the float datatype.
For now only aarch64 is enabled, due to a few issues with enabling it on
aarch32.
Test Plan:
vec256_test
Imported from OSS
Differential Revision: D21822399
fbshipit-source-id: 3851c4336d93d1c359c85b38cf19904f82bc7b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40059
This benchmark is added specifically for mobile, to see whether the compiler is
autovectorizing and thus whether we gain any advantage from the NEON backend
for vec256 for the add op.
Test Plan:
CI
Imported from OSS
Differential Revision: D22055146
fbshipit-source-id: 43ba6c4ae57c6f05d84887c2750ce21ae1b0f0b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41200
In short, we messed up. The SHM and CMA backends of TensorPipe are Linux-specific and thus they are guarded by an #ifdef in the agent's code. Due to a mishap with CMake (due to the fact that TensorPipe has two CMake files, one for PyTorch and a "standalone" one), we were not correctly propagating some flags and these #ifdefs were always false. This means that these two backends have always been disabled and have thus never been covered by our OSS CI. It would be irresponsible to enable them now in v1.6, so instead we remove any mention of them from the docs.
Note that this is perhaps not as bad as it sounds. These two backends were providing higher performance (latency) when the two endpoints were on the same machine. However, I suspect that most RPC users will only do transfers across machines, for which SHM and CMA wouldn't have played any role.
ghstack-source-id: 107458630
Test Plan: Docs only
Differential Revision: D22462158
fbshipit-source-id: 0d72fea11bcaab6d662184bbe7270529772a5e9b
Summary:
Fixes gh-39007
We replaced actual content with links to generated content in many places to break the documentation into manageable chunks. This caused references like
```
https://pytorch.org/docs/stable/torch.html#torch.flip
```
to become
```
https://pytorch.org/docs/master/generated/torch.flip.html#torch.flip
```
The textual content that was located at the old reference was replaced with a link to the new reference. This PR adds a `<p id="xxx"/p>` reference next to the link, so that the older references from outside tutorials and forums still work: they will bring the user to the link that they can then follow through to see the actual content.
The way this is done is to monkeypatch the sphinx writer method that produces the link. It is ugly but practical, and in my mind not worse than adding javascript to do the same thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39086
Differential Revision: D22462421
Pulled By: jlin27
fbshipit-source-id: b8f913b38c56ebb857c5a07bded6509890900647
Summary:
Add `torch._C._cuda_getArchFlags()` that returns the list of architectures `torch_cuda` was compiled with
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that return the architecture list and gencode flags PyTorch was compiled with
Print a warning if some of the GPUs are not compatible with any of the CUBINs
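Usage sketch of the new helpers (requires a CUDA build; the printed values depend on how PyTorch was built):
```
import torch

print(torch.cuda.get_arch_list())       # e.g. ['sm_60', 'sm_70', 'sm_75', ...]
print(torch.cuda.get_gencode_flags())   # the corresponding -gencode flags
```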
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173
Differential Revision: D22459998
Pulled By: malfet
fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41103
add a CLANG_CODE_COVERAGE option to CMakeLists. If the option is ON, add the compile flags needed for code coverage.
Test Plan:
Cloned the pytorch source code locally, made these changes, and built it with `CLANG_CODE_COVERAGE ON` and `BUILD_TESTS ON`. Ran a manual test and attached the code coverage report.
{F243609020}
Reviewed By: malfet
Differential Revision: D22422513
fbshipit-source-id: 27a31395c31b5b5f4b72523954722771d8f61080
Summary:
Based on discussion with jlucier (https://github.com/pytorch/pytorch/pull/38925#issuecomment-655859195) . `batch_size` change isn't made because data loader only has the notion of `batch_sampler`, not batch size. If `batch_size` dependent sharding is needed, users can still access it from their own code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41175
Differential Revision: D22456525
Pulled By: zou3519
fbshipit-source-id: 5281fcf14807f219de06e32107d5fe7d5b6a8623
Summary:
**Summary**
This commit fixes the JIT triage workflow based on testing done in my
own fork.
**Test Plan**
This commit has been tested against my own fork. This commit is
currently at the tip of my master branch, and if you open an issue in my
fork and label it JIT, it will be added to the Triage Review project in
that fork under the Needs triage column.
*Old issue that is labelled JIT later*
<img width="700" alt="Captura de Pantalla 2020-07-08 a la(s) 6 59 42 p m" src="https://user-images.githubusercontent.com/4392003/86988551-5b805100-c14d-11ea-9de3-072916211f24.png">
*New issue that is opened with the JIT label*
<img width="725" alt="Captura de Pantalla 2020-07-08 a la(s) 6 59 17 p m" src="https://user-images.githubusercontent.com/4392003/86988560-60dd9b80-c14d-11ea-94f0-fac01a0d239b.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41170
Differential Revision: D22460584
Pulled By: SplitInfinity
fbshipit-source-id: 278483cebbaf3b35e5bdde2a541513835b644464
Summary:
This avoids a (currently only) warning of cmake:
```
The dependency target "nccl_external" of target "gloo_cuda" does not exist.
Call Stack (most recent call first):
CMakeLists.txt:411 (include)
```
This will be a real problem once Policy CMP0046 is set which will make this warning be an error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41180
Differential Revision: D22460623
Pulled By: malfet
fbshipit-source-id: 0222b12b435e5e2fdf2bc85752f95abba1e3d4d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40806
When the input is empty, the operator will crash on "runtime error: division by zero". This has been causing Inference platform server crashes.
Example crash logs:
{P134526683}
Test Plan:
Unit test
See reproducing steps in the Test Plan of D22300135
Reviewed By: houseroad
Differential Revision: D22302089
fbshipit-source-id: aaa5391fddc86483b0f3aba3efa7518e54913635
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41096
The spark spot model had some issues in tensor conversion, see P134598596. It happens when we convert an undefined c10 tensor to caffe2 tensor.
This diff added a null check.
Test Plan: spark spot model runs without problem
Reviewed By: smessmer
Differential Revision: D22330705
fbshipit-source-id: dfe0f29a48019b6611cad3fd8f2ae49e8db5427e
Summary:
When we return to Python from C++ in PyTorch and have both warnings and an error, we have the problem of what to do when the warnings throw, because we can only raise one error.
Previously, if we had an error, we punted all warnings to the C++ warning handler, which would write them to stderr (i.e. system fd 2) or pass them on to glog.
This has drawbacks if an error happened:
- Warnings are not handled through Python even if they don't raise,
- warnings are always printed with no way to suppress this,
- the printing bypasses sys.stderr, so Python modules wanting to
modify this don't work (with the prominent example being Jupyter).
This patch does the following instead:
- Set the warning using standard Python extension mechanisms,
- if Python decides that this warning is an error and we have a
PyTorch error, we print the warning through Python and clear
the error state (from the warning).
This resolves the three drawbacks discussed above, in particular it fixes https://github.com/pytorch/pytorch/issues/37240 .
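A hedged illustration of the user-visible effect: warnings raised from C++ now flow through Python's `warnings` machinery, so standard filters and capture work (`torch.range` is used below only because it is known to emit a deprecation warning):
```python
import warnings
import torch

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    torch.range(0, 3)  # deprecated op that emits a UserWarning

# The warning is delivered through the warnings module rather than raw stderr,
# so Jupyter capture and sys.stderr redirection behave as expected.
for w in caught:
    print(w.category.__name__, w.message)
```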
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41116
Differential Revision: D22456393
Pulled By: albanD
fbshipit-source-id: c3376735723b092efe67319321a8a993402985c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39488
Currently caffe2.InitOpLibrary does the dll import unilaterally. Instead, if we make a lazy version and use it, then many pieces of code which do not need the caffe2 operators get a lot faster.
On a real test, the import time went from 140s to 68s.
This also cleans up the algorithm slightly (although it makes a very minimal
difference), by parsing the list of operators once, rather than every time a
new operator is added, since we defer the RefreshCall until after we've
imported all the operators.
The key way we maintain safety, is that as soon as someone does an operation
which requires a operator (or could), we force importing of all available
operators.
Future work could include trying to identify which code is needed for which
operator and only import the needed ones. There may also be wins available by
playing with dlmopen (which opens within a namespace), or seeing if the dl
flags have an impact (I tried this and didn't see an impact, but dlmopen may
make it better).
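A rough sketch of the lazy-registration idea described above (the class and method names here are illustrative, not the actual caffe2 internals):
```python
import threading

class LazyOpLibraryLoader:
    """Remember requested operator libraries and defer the actual dlopen."""

    def __init__(self, eager_load_fn):
        self._eager_load_fn = eager_load_fn  # e.g. the original InitOpLibrary
        self._pending = []
        self._lock = threading.Lock()

    def init_op_library(self, path):
        # Cheap: record the path only; nothing is imported yet.
        self._pending.append(path)

    def ensure_loaded(self):
        # Called the first time any operator is actually needed.
        with self._lock:
            pending, self._pending = self._pending, []
        for path in pending:
            self._eager_load_fn(path)
        # Refresh the operator registry once, after all imports,
        # rather than once per library.
```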
Test Plan:
I added a new test a lazy_dyndep_test.py (copied from all_compare_test.py).
I'm a little concerned that I don't see any explicit tests for dyndep, but this
should provide decent coverage.
Differential Revision: D21870844
fbshipit-source-id: 3f65fedb65bb48663670349cee5e1d3e22d560ed
Summary:
This is a duplicate of https://github.com/pytorch/pytorch/pull/38362
"This PR completes Interpolate's deprecation process for recomputing the scales values, by updating the default value of the parameter recompute_scale_factor as planned for pytorch 1.6.0.
The warning message is also updated accordingly."
I'm recreating this PR as the previous one is not being updated.
cc gchanan
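For users affected by the default change, a small hedged example of passing the flag explicitly so the behavior does not depend on the default (sketch only):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
# With recompute_scale_factor=False the given scale_factor is used directly
# in the interpolation instead of being recomputed from the output size.
y = F.interpolate(x, scale_factor=2.0, mode="bilinear",
                  align_corners=False, recompute_scale_factor=False)
```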
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39453
Reviewed By: hl475
Differential Revision: D21955284
Pulled By: houseroad
fbshipit-source-id: 911585d39273a9f8de30d47e88f57562216968d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40718
Currently, all constants except tensors must be inlined during serialization; tensors are stored in the constant table. This patch generalizes this capability to any IValue. This is particularly useful for non-ASCII string literals that cannot be inlined.
Test Plan: Imported from OSS
Differential Revision: D22298169
Pulled By: bzinodev
fbshipit-source-id: 88cc59af9cc45e426ca8002175593b9e431f4bac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40326
Adds a helper function `addCallbackWithTLSState` to torch/csrc/utils/future.h, which is used internally by both the RPC framework and the JIT future. Uses this helper function to avoid having to pass in TLS state where it is needed for RPC and `record_function_ops.cpp`. For example, the following:
```
at::ThreadLocalState tls_state;
fut->addCallback([tls_state = std::move(tls_state)]() {
  at::ThreadLocalStateGuard g(tls_state);
  some_cb_that_requires_tls_state();
});
```
becomes
```
fut->addCallbackWithTLSState(some_cb_that_requires_tls_state);
```
ghstack-source-id: 107383961
Test Plan: RPC Tests and added a test in test_misc.cpp
Differential Revision: D22147634
fbshipit-source-id: 46c02337b90ee58ca5a0861e932413c40d06ed4c
Summary:
Noticed this while trying to script one of the models, which happened to have numpy values as constants. The missing numpy prefix in the error message was quite confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41024
Differential Revision: D22426399
Pulled By: dzhulgakov
fbshipit-source-id: 06158b75355fac6871e4861f82fc637c2420e370
Summary:
This PR contains the following updates:
1. MIOpen 3D pooling enabled in Caffe2.
2. Refactored the MIOpen pooling code in caffe2.
3. Enabled unit test cases for 3D pooling.
CC: ezyang jeffdaily ashishfarmer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38260
Differential Revision: D21524754
Pulled By: xw285cornell
fbshipit-source-id: ddfe09dc585cd61e42eee22eff8348d326fd0c3b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41136
Running this within CI seems impossible since this script exits out
after one failed test, so let's just add an option that CI can use to
power through these errors.
Should not affect current functionality.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22441694
Pulled By: seemethere
fbshipit-source-id: 7f152fea15af9d47a964062ad43830818de5a109
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41056
**Summary**
This commit adds a new GitHub workflow that automatically adds a card to
the "Need triage" section of the project board for tracking JIT triage
for each new issue that is opened and labelled "jit".
**Test Plan**
???
Test Plan: Imported from OSS
Differential Revision: D22444262
Pulled By: SplitInfinity
fbshipit-source-id: 4e7d384822bffb978468c303322f3e2c04062644
Summary:
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678
`addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported so was just dead code.
After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354, which revealed there was already an established place for CPU BLAS routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching, so I thought I'd wait for comment before migrating this into that style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927
Differential Revision: D22418756
Pulled By: ezyang
fbshipit-source-id: 44e7bb5964263d73ae8cc6adc5f6d4e966476ae6
Summary:
This should be in its own file...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41137
Reviewed By: jamesr66a
Differential Revision: D22437922
Pulled By: eellison
fbshipit-source-id: 1b62dde1a4ebac673b5c60aea4f398f734d62501
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39611
A few ops have been taking non-optional ScalarType, Device and Layout. That isn't supported by the hacky wrapper that makes those
kernels work with the c10 operator library. This PR unifies the signatures and makes those ops c10-full.
ghstack-source-id: 107330186
Test Plan: waitforsandcastle
Differential Revision: D21915788
fbshipit-source-id: 39f0e114f2766a3b27b80f93f2c1a95fa23c78d4
Summary:
I noticed this very unusual use of atomics in `at::native::DispatchStub`. The comment asserts that `choose_cpu_impl()` will always return the same value on every thread, yet for some reason it uses a CAS loop to exchange the value instead of a simple store? That makes no sense considering it doesn't even read the exchanged value.
This replaces the CAS loop with a simple store and also improves the non-initializing case to a single atomic load instead of two.
For reference, the `compare_exchange` was added in https://github.com/pytorch/pytorch/issues/32148 and the while loop added in https://github.com/pytorch/pytorch/issues/35794.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40930
Differential Revision: D22438224
Pulled By: ezyang
fbshipit-source-id: d56028ce18c8c5dbabdf366379a0b6aaa41aa391
These jobs didn't really fulfill the purpose they once had, since the Travis Python versions were basically locked to 3.7.
Going to go ahead and remove them, along with their Docker jobs, since we don't actively need them anymore.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
ghstack-source-id: cdfc4fc2ae15a0c86d322cc706d383d6bc189fbc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41134
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36439
A proposal of versioning in bytecode, as suggested by dzhulgakov in the internal post: https://fb.workplace.com/groups/pytorch.mobile.work/permalink/590192431851054/
kProducedBytecodeVersion is added. If the model version is not the same as the number in the code, an error will be thrown.
The updated bytecode would look like the example below. It's a tuple of elements, where the first element is the version number.
```
(3,
('__torch__.m.forward',
(('instructions',
(('STOREN', 1, 2),
('DROPR', 1, 0),
('MOVE', 2, 0),
('OP', 0, 0),
('RET', 0, 0))),
('operators', (('aten::Int', 'Tensor'),)),
('constants', ()),
('types', ()),
('register_size', 2))))
```
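A toy sketch of the version gate this introduces (the real check happens in the C++ loader; this just mirrors the logic on a tuple like the one above, and the constant's value is assumed for illustration):
```python
kProducedBytecodeVersion = 3  # assumed value, for illustration only

def check_bytecode_version(bytecode):
    model_version = bytecode[0]
    if model_version != kProducedBytecodeVersion:
        raise RuntimeError(
            f"Bytecode version mismatch: model has {model_version}, "
            f"this runtime produces/consumes {kProducedBytecodeVersion}")
```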
Test Plan: Imported from OSS
Differential Revision: D22433532
Pulled By: iseeyuan
fbshipit-source-id: 6d62e4abe679cf91a8e18793268ad8c1d94ce746
Summary:
Per title. This is not currently used in the PyTorch codebase, but it is a legitimate use case, and we have extensions that want to do this and are forced to roll their own atomic implementations for non-standard types. Whether an atomic op returns the old value or not should not affect performance; the compiler is able to generate correct code depending on whether the return value is used. https://godbolt.org/z/DBU_UW.
Atomic operations for non-standard integer types (1, 2 and 8 byte widths) are left as is, with void return.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41028
Differential Revision: D22425008
Pulled By: ngimel
fbshipit-source-id: ca064edb768a6b290041a599e5b50620bdab7168
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41047.
Some CPU kernel implementations don't call `cast_outputs()`, so when CPU temporaries were created to hold their outputs they weren't copied back to the out parameters correctly. Instead of fixing that issue, for simplicity this PR disables the behavior. The corresponding test in test_type_promotion.py is expanded with more operations to verify that unary ops can no longer have out arguments with different dtypes than their inputs (except in special cases like torch.abs which maps complex inputs to float outputs and torch.deg2rad which is secretly torch.mul).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41097
Differential Revision: D22422352
Pulled By: mruberry
fbshipit-source-id: 8e61d34ef1c9608790b35cf035302fd226fd9421
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40902
See the bottom of this stack for context.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22360210
Pulled By: suo
fbshipit-source-id: 4275127173a36982ce9ad357aa344435b98e1faf
Summary:
There's a regression in MIOpen in ROCm3.5 that results in failure of autocast tests. Skipping the tests for now and will re-enable once the fixes are in MIOpen.
ezyang jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41088
Differential Revision: D22419823
Pulled By: xw285cornell
fbshipit-source-id: 347fb9a03368172fe0b263d14d27ee0c3efbf4f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41077
This PR is a refactor that moves error messages into their callsites in
`_vmap_internals.py`. Furthermore, because a little bird told me we've
dropped python 3.5 support, this PR adopts f-string syntax to clean up
the string replace logic. Together these changes make the error messages
read better IMO.
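A before/after flavor of the f-string cleanup (the message text is illustrative, not the exact wording in `_vmap_internals.py`):
```python
fn_name, idx, value = "my_fn", 1, None

# before: positional str.format, values far from where they appear
old = 'vmap({0})(<inputs>): in_dims[{1}] is {2}, expected an int'.format(fn_name, idx, value)

# after: f-string, values read inline
new = f'vmap({fn_name})(<inputs>): in_dims[{idx}] is {value}, expected an int'

assert old == new
```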
Test Plan:
- `python test/test_vmap.py -v`. There exists tests that invoke each of the
error messages.
Differential Revision: D22420473
Pulled By: zou3519
fbshipit-source-id: cfd46b2141ac96f0a62864928a95f8eaa3052f4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41092
added overload name for some full JIT operators and removed some duplicated op registrations
Test Plan:
apply D21032976, then buck run fbsource//xplat/caffe2/fb/pytorch_predictor:predictor
make sure there's no runtime error in operator registration
Reviewed By: iseeyuan
Differential Revision: D22419922
fbshipit-source-id: f651898e75b5bdb8dc03fc00b136689536c51707
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41020
Only support quantization of list append for List[Tensor]
Test Plan: Imported from OSS
Differential Revision: D22420698
fbshipit-source-id: 179677892037e136d90d16230a301620c3111063
Summary: Export logit op to pt for better preproc perf
Test Plan:
unit test
Also tested with model re-generation
Reviewed By: houseroad
Differential Revision: D22324611
fbshipit-source-id: 86accb6b4528e5c818d2c3f8c67926f279d158d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41094
mimic nnpi's quantization operations
removed redundant int8 test
Test Plan: ran FC with sizes up to 5, running bigger sizes
Reviewed By: venkatacrc
Differential Revision: D22420537
fbshipit-source-id: 91211c8a6e4d3d3bec2617b758913b44aa44b1b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40842
**Summary**
This commit adds out-of-source-tree tests for `to_backend`. These tests check
that a Module can be lowered to a backend, exported, loaded (in both
Python and C++) and executed.
**Fixes**
This commit fixes #40067.
Test Plan: Imported from OSS
Differential Revision: D22418731
Pulled By: SplitInfinity
fbshipit-source-id: 621ba4efc1b121fa76c9c7ca377792ac7440d250
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40841
**Summary**
This commit adds support for using `Modules` that have been lowered as
submodules in `ScriptModules`.
**Test Plan**
This commit adds execution and save/load tests to test_backends.py for
backend-lowered submodules.
**Fixes**
This commit fixes #40069.
Test Plan: Imported from OSS
Differential Revision: D22418716
Pulled By: SplitInfinity
fbshipit-source-id: d2b2c6d5d2cf3042a620b3bde7d494f1abe28dc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40840
**Summary**
This commit moves the TestBackend used for the JIT backend
extension to the tests directory. It was temporarily placed
in the source directory while figuring out some details of
the user experience for this feature.
**Test Plan**
`python test/test_jit.py TestBackends`
**Fixes**
This commit fixes #40067.
Test Plan: Imported from OSS
Differential Revision: D22418682
Pulled By: SplitInfinity
fbshipit-source-id: 9356af1341ec4d552a41c2a8929b327bc8b56057
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40839
**Summary**
This commit splits the to_backend API properly into
`libtorch` and `libtorch_python`. The backend interface and all
of the code needed to run a graph on a backend is in
libtorch, and all of the code related to creating a Python binding
for the lowering process is in `libtorch_python`.
**Test Plan**
`python test/test_jit.py TestBackends`
**Fixes**
This commit fixes #40072.
Test Plan: Imported from OSS
Differential Revision: D22418664
Pulled By: SplitInfinity
fbshipit-source-id: b96e0c34ab84e45dff0df68b8409ded57a55ab25
Summary:
[index_put](https://pytorch.org/docs/master/tensors.html#torch.Tensor.index_put) requires the src and dst tensors to be the same dtype, so IMO it belongs on the promote list when autocast is active (the output should be the widest dtype among the input dtypes).
I also put some other registrations in alphabetical order.
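A hedged sketch of the promote behavior (assumes a CUDA device; the shapes and dtypes are illustrative):
```python
import torch

if torch.cuda.is_available():
    dst = torch.zeros(8, device="cuda")                     # float32
    idx = torch.tensor([0, 1, 2], device="cuda")
    vals = torch.randn(3, device="cuda", dtype=torch.half)  # float16
    with torch.cuda.amp.autocast():
        # index_put_ is on the promote list: inputs are widened to the
        # widest participating dtype (float32 here), so the mixed-dtype
        # assignment no longer fails under autocast.
        dst.index_put_((idx,), vals)
    assert dst.dtype == torch.float32
```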
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41035
Differential Revision: D22418305
Pulled By: ngimel
fbshipit-source-id: b467cb16ac6c2ba1f9e43531f69a144b17f00b87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41100
Unit test case for the Int8FC to cover quantization scale errors.
Test Plan: test_int8_ops_nnpi.py test case test_int8_small_input.
Reviewed By: hyuen
Differential Revision: D22422353
fbshipit-source-id: b1c1baadc32751cd7e98e0beca8f0c314d9e5f10
Summary:
Spotted a broken link, and while I was at it, fixed a few little language and formatting nits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41066
Reviewed By: mruberry
Differential Revision: D22415371
Pulled By: dongreenberg
fbshipit-source-id: 7d11c13235b28a01886063c11a4c5ccb333c0c02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41085
add the first LUT implementation of swish
Test Plan:
Compared against swish lowered as x*sigmoid(x); had to
increase the error threshold, but it looks generally right.
Reviewed By: venkatacrc
Differential Revision: D22418117
fbshipit-source-id: c75fa496aa7a5356ddc87f1d61650f432e389457
Summary:
Previously it used the default arch set which may or may not coincide with the user's.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40170
Differential Revision: D22400866
Pulled By: xw285cornell
fbshipit-source-id: 222ba684782024fa68f37bf7d4fdab9a2389bdea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37174
ghstack-source-id: 106938112
Test Plan: Upcoming diffs use this for upsampling.
Differential Revision: D21210002
fbshipit-source-id: d6a55ab6420c05a92873a569221b613149aa0daa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37173
This function is only called in one place, so inline it. This eliminates
boilerplate related to overloads and allows for further simplification
of shared logic in later diffs.
All shared local variables have the same names (from closed_over_args),
and no local variables accidentally collide.
ghstack-source-id: 106938108
Test Plan: Existing tests for interpolate.
Differential Revision: D21209995
fbshipit-source-id: acfadf31936296b2aac0833f704764669194b06f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37172
This improves readability by keeping cases with similar behavior close
together. It should also have a very tiny positive impact on perf.
ghstack-source-id: 106938109
Test Plan: Existing tests for interpolate.
Differential Revision: D21209996
fbshipit-source-id: c813e56aa6ba7370b89a2784fcb62cc146005258
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37171
Every one of these branches returns or raises, so there's no need for elif.
This makes it a little easier to reorder and move conditions.
ghstack-source-id: 106938110
Test Plan: Existing test for interpolate.
Differential Revision: D21209992
fbshipit-source-id: 5c517e61ced91464b713f7ccf53349b05e27461c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41060
Exposes a const ref to opaque_handle and makes copy_tensor_metadata a
protected method. This helps in reusing code in subclasses of OpaqueTensorImpl.
Test Plan: waitforbuildbot
Reviewed By: dzhulgakov
Differential Revision: D22406602
fbshipit-source-id: e3b8338099f257da7f6bbff679f1fdb71e5f335a
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common device support. torch.cuda.comm is kept as-is for backward compatibility.
- Provide common APIs to arbitrary device types without changing existing CUDA APIs in torch.cuda space.
- Replace the torch.cuda calls in DataParellel/DistributedDataParallel with the new APIs.
Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454
Differential Revision: D22051557
Pulled By: mrshenli
fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
Summary:
Solves most of gh-38011 in the framework of solving gh-32703.
These should only be formatting fixes; I did not try to fix grammar and syntax.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41068
Differential Revision: D22411919
Pulled By: zou3519
fbshipit-source-id: 25780316b6da2cfb4028ea8a6f649bb18b746440
Summary:
We need an easy way to quickly, visually grep binary sizes from builds
and then have a way to test out those binaries quickly.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41074
Differential Revision: D22415667
Pulled By: seemethere
fbshipit-source-id: 86386e5390dce6aae26e952a47f9e2a2221d30b5
Summary:
This is to work around an issue in hipclang with templated kernel name arguments to hipLaunchKernelGGL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41022
Differential Revision: D22404183
Pulled By: ngimel
fbshipit-source-id: 63135ccb9e087f4c8e8663ed383979f7e2c1ba06
Summary:
Currently embedding_bag's CPU kernel queries whether weight.requires_grad() is true. This violates layering of AutoGrad and Op Kernels, causing issues in third-party backends like XLA. See this [issue](https://github.com/pytorch/xla/issues/2215) for more details.
This PR hoists the query of weight.requires_grad() to the Python layer, and splits embedding_bag into two separate ops, each corresponding to weight.requires_grad() == true and false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40557
Reviewed By: ailzhang
Differential Revision: D22327476
Pulled By: gmagogsfm
fbshipit-source-id: c815b3690d676a43098e12164517c5debec90fdc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40837
As ninja has accurate dependency tracking, if there is nothing to do,
then we will very quickly noop. But this is important for correctness:
if a change was made to a header that is not listed explicitly in
the distutils Extension, then distutils will come to the wrong
conclusion about whether or not recompilation is needed (but Ninja
will work it out.)
This caused https://github.com/pytorch/vision/issues/2367
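For context, a standard (hedged) extension setup where this matters; when ninja is available, `BuildExtension` builds with it, and the rebuild decision comes from ninja's header dependency scanning rather than the distutils source list:
```python
# setup.py for a toy extension; my_ext.cpp may #include headers that are not
# listed here, and ninja will still notice when those headers change.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ext",
    ext_modules=[CppExtension("my_ext", ["my_ext.cpp"])],
    cmdclass={"build_ext": BuildExtension},
)
```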
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D22340930
Pulled By: ezyang
fbshipit-source-id: 481b74f6e2cc78159d2a74d413751cf7cf16f592
Summary:
Forward-declare `tensorpipe::Message` class in utils.h
Guard TensorPipe specific methods in utils.cpp with `#ifdef USE_TENSORPIPE`
Pass `USE_TENSORPIPE` as private flag to `torch_cpu` library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40846
Differential Revision: D22338864
Pulled By: malfet
fbshipit-source-id: 2ea2aea84527ae7480e353afb55951a068b3b980
Summary:
It's a known gcc-5.4 bug that enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash for the key type.
Should fix regression caused by https://github.com/pytorch/pytorch/pull/40864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41055
Differential Revision: D22405478
Pulled By: malfet
fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
Summary:
In issue https://github.com/pytorch/pytorch/issues/36997 the user encountered a non-meaningful error message when trying to export the model to ONNX. The Pad operator in opset 9 requires the list of paddings to be constant. This PR tries to improve the error message given to the user when this is not the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39651
Reviewed By: hl475
Differential Revision: D21992262
Pulled By: houseroad
fbshipit-source-id: b817111c2a40deba85e4c6cdb874c1713312dba1
Summary:
I ran `make linkcheck` using `sphinx.builders.linkcheck` on the documentation and noticed a few links weren't using HTTPS so I quickly updated them all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40878
Differential Revision: D22404647
Pulled By: ngimel
fbshipit-source-id: 9c9756db59197304023fddc28f252314f6cf4af3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40717
`in_dims` specifies which dimension of the input tensors should be
vmapped over. One can also specify `None` as an `in_dim` for a particular
input to indicate that we do not map over said input.
We implement `in_dims` by creating a BatchedTensor with BatchDim equal
to said `in_dim`. Most of this PR is error checking. `in_dims` must
satisfy the following:
- `in_dim` can be either an int or a Tuple[Optional[int]]. If it is an
int, we use it to mean the `in_dim` for every input.
- If `in_dims` is non-None at some index `idx`, then the input at index
`idx` MUST be a tensor (vmap can only map over tensors).
JAX supports something more generalized: their `in_dims` can match the
structure of the `inputs` to the function (i.e., it is a nested python
data structure matching the data structure of `inputs`, specifying where
in `inputs` the tensors to be mapped are and what their map dims should
be). We don't have that infrastructure yet, so we only support `int` or a
flat tuple for `in_dims`.
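A small sketch of the `in_dims` convention being added, using the prototype entry point (the import path is internal and may differ):
```python
import torch
from torch._vmap_internals import vmap  # prototype/internal API

x = torch.randn(3, 5)   # mapped over dim 0
w = torch.randn(5)      # shared across the batch, not mapped

# in_dims=(0, None): map over dim 0 of x, broadcast w to every slice
out = vmap(torch.mul, in_dims=(0, None))(x, w)
assert out.shape == (3, 5)
```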
Test Plan: - `pytest test/test_vmap.py -v`
Differential Revision: D22397914
Pulled By: zou3519
fbshipit-source-id: 56d2e14be8b6024e4cde2729eff384da305b4ea3
Summary:
The most time-consuming tests in test_nn (taking about half the total time) were gradgradchecks on Conv3d. Reduce their sizes and, most importantly, run gradgradcheck single-threaded, because that cuts the time of the conv3d tests by an order of magnitude and barely affects other tests.
These changes bring test_nn time down from 1200 s to ~550 s on my machine.
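The single-threaded trick is roughly the following (a sketch; the layer and input sizes are placeholders, not the actual test_nn values):
```python
import torch
from torch.autograd import gradgradcheck

conv = torch.nn.Conv3d(2, 2, kernel_size=2).double()
x = torch.randn(1, 2, 4, 4, 4, dtype=torch.double, requires_grad=True)

prev = torch.get_num_threads()
torch.set_num_threads(1)  # single-threaded gradgradcheck is much faster for tiny conv3d
try:
    assert gradgradcheck(lambda inp: conv(inp), (x,))
finally:
    torch.set_num_threads(prev)
```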
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40999
Differential Revision: D22396896
Pulled By: ngimel
fbshipit-source-id: 3b247caceb65d64be54499de1a55de377fdf9506
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40939
Previously, when we would do shape analysis by running the op with representative inputs, we would always set the grad property to false. This led to a wrong static analysis when we would create differentiable subgraphs, and propagate shapes without also propagating requires_grad, and then uninline them.
Test Plan: Imported from OSS
Differential Revision: D22394676
Pulled By: eellison
fbshipit-source-id: 254e6e9f964b40d160befe0e125abe1b7aa2bd5e
Summary:
Original commit changeset: 46c59d849fa8
The original commit is breaking DPER3 release pipeline with the following failures:
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344413239&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202599639 failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: feature_preproc/feature_sparse_to_dense/default_float_value
```
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344855973&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202629391 failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: tum_preproc/inductive/feature_sparse_to_dense/default_float_value
```
Related UBN tasks: T69529846, T68986110
Test Plan: Build a DPER3 package on top of this commit, and check that DPER3 release test `model_deliverability_test` is passing.
Differential Revision: D22396317
fbshipit-source-id: 92d5b30cc146c005d6159a8d5bfe8973e2c546dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40856
Add a new activation function - Mish: A Self Regularized Non-Monotonic Neural Activation Function https://arxiv.org/abs/1908.08681
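For reference, a PyTorch-side sketch of the activation as defined in the paper, mish(x) = x * tanh(softplus(x)) (the Caffe2 operator added here is a separate implementation):
```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish: x * tanh(ln(1 + exp(x)))
    return x * torch.tanh(F.softplus(x))

print(mish(torch.tensor([-1.0, 0.0, 1.0])))
```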
Test Plan:
buck test //caffe2/caffe2/python/operator_test:elementwise_ops_test -- 'test_mish'
{F242275183}
Differential Revision: D22158035
fbshipit-source-id: 459c1dd0ac5b515913fc09b5f4cd13dcf095af31
Summary:
This trick should have no effect on performance, but it reduces the size of kernels using the template by 10%.
For example, BinaryMulDivKernel.cu.o compiled by the CUDA 10.1 toolchain for sm_75 was 4.2 MB before the change and 3.8 MB after.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40992
Differential Revision: D22398733
Pulled By: malfet
fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
Summary:
Closes https://github.com/pytorch/pytorch/issues/40560
This adds the equation for the weighted mean to `CrossEntropyLoss`'s docs and the `reduction` argument for `CrossEntropyLoss` and `NLLLoss` no longer describes a non-weighted mean of the outputs.
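For reference, a sketch of the weighted 'mean' reduction being documented, where the per-sample loss l_n already carries the class weight w_{y_n}:
```
\ell = \frac{\sum_{n=1}^{N} l_n}{\sum_{n=1}^{N} w_{y_n}},
\qquad
l_n = -\, w_{y_n} \log \frac{\exp(x_{n, y_n})}{\sum_{c=1}^{C} \exp(x_{n, c})}
```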
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40991
Differential Revision: D22395805
Pulled By: ezyang
fbshipit-source-id: a623b6dd2aab17220fe0bf706bd9b62d6ba531fd
Summary:
Have basic reduction fusion working, and have improved the code generator to approach the performance of eager-mode reductions. Coming soon will be pointwise-reduction fusions in a way that should prevent the possibility of hitting regressions. Also working on performant softmax kernels in the code generator, which may be our next fusion target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40864
Reviewed By: ngimel
Differential Revision: D22392877
Pulled By: soumith
fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41004
Tracing has been moved into separate files. Now we can disable it by not compiling the source files for xplat mobile build.
ghstack-source-id: 107158627
Test Plan: CI + build size bot
Reviewed By: iseeyuan
Differential Revision: D22372615
fbshipit-source-id: bf2e2249e401295ff63020a292df119b188fb966
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40903
This PR continues the work of #38467 - decoupling Autograd and Trace for manually registered ops.
ghstack-source-id: 107158638
Test Plan: CI
Differential Revision: D22354804
fbshipit-source-id: f5ea45ade2850296c62707a2a4449d7d67a9f5b5
Summary:
Unbind, which has a special backward with cat, is arguably better than multiple selects, whose backward creates and adds a bunch of tensors as big as `self`.
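A minimal illustration of the rewrite this enables (shapes are placeholders):
```python
import torch

x = torch.randn(3, 4, requires_grad=True)

# before: several selects; each select's backward materializes a gradient
# tensor as big as x
a, b, c = x[0], x[1], x[2]

# after: a single unbind, whose backward is essentially one cat/stack
a, b, c = torch.unbind(x, dim=0)
```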
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40884
Reviewed By: pbelevich
Differential Revision: D22363376
Pulled By: zou3519
fbshipit-source-id: 0911cdbb36f9a35d1b95f315d0a2f412424e056d
Summary:
fix https://github.com/pytorch/pytorch/issues/40227
Removed the sorting operation in the ModuleDict class and updated the docstring.
Also removed a sort operation in the corresponding unit test, which would otherwise cause the test to fail.
BC note: from Python 3.6 onward, a plain dict preserves the insertion order of its keys.
Example:
For a Python 3.6+ user initializing a ModuleDict instance from a plain Python dict:
{
"b": torch.nn.MaxPool2d(3),
"a": torch.nn.MaxPool2d(3)
}
the resulting ModuleDict preserves that order:
ModuleDict(
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
For a Python 3.5 user with the same input, the resulting ModuleDict could instead be:
ModuleDict(
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40905
Differential Revision: D22357480
Pulled By: albanD
fbshipit-source-id: 0e2502769647bb64f404978243ca1ebe5346d573
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40807
We pack a lot of logic into `jit/__init__.py`, making it unclear to
developers and users which parts of our API are public vs. internal. This
is one in a series of PRs intended to pull implementation out into
separate files, and leave `__init__.py` as a place to register the
public API.
This PR moves all the tracing-related stuff out, and fixes other spots up
as necessary. Followups will move other core APIs out.
The desired end-state is that we conform to the relevant rules in [PEP 8](https://www.python.org/dev/peps/pep-0008/#public-and-internal-interfaces). In particular:
- Internal implementation goes in modules prefixed by `_`.
- `__init__.py` exposes a public API from these private modules, and nothing more.
- We set `__all__` appropriately to declare our public API.
- All use of JIT-internal functionality outside the JIT is removed (in particular, ONNX is relying on a number of internal APIs). Since they will need to be imported explicitly, it will be easier to catch new uses of internal APIs in review.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22320645
Pulled By: suo
fbshipit-source-id: 0720ea9976240e09837d76695207e89afcc58270
Summary:
Needed maintenance step to avoid running out of disk space on ROCm testers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40917
Differential Revision: D22385844
Pulled By: malfet
fbshipit-source-id: b6dc9ba888a2e34c311e9bf3c8b7b98fa1ec5435
Summary:
Define static script implementations of `__len__` and `__contains__` on any subclass derived from a type such as ModuleList, Sequential, or ModuleDict. Implement `__getitem__` for classes derived from ModuleDict.
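A hedged sketch of what this enables in TorchScript (the subclass name and layer sizes are made up for illustration):
```python
import torch
import torch.nn as nn

class Blocks(nn.ModuleList):  # user-defined subclass of ModuleList
    pass

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = Blocks([nn.Linear(4, 4), nn.ReLU()])

    def forward(self, x):
        # len() on a ModuleList subclass now has a static script implementation
        if len(self.blocks) > 0:
            for block in self.blocks:
                x = block(x)
        return x

scripted = torch.jit.script(Net())
print(scripted(torch.randn(2, 4)).shape)
```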
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40789
Reviewed By: eellison
Differential Revision: D22325159
Pulled By: wconstab
fbshipit-source-id: fc1562c29640fe800e13b5a1dd48e595c2c7239b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40931
Fix docstrings for dynamic quantized Linear/LSTM and associated classes
ghstack-source-id: 107064446
Test Plan: Docs show up correctly
Differential Revision: D22360787
fbshipit-source-id: 8e357e081dc59ee42fd7f12ea5079ce5d0cc9df2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40652
Resolves https://github.com/pytorch/pytorch/issues/40304, but looking for
feedback on whether there is a better approach for this.
In order to profile `rpc_async` calls made within a torchscript function, we
add the profiling logic to `rpcTorchscript` which is the point where the RPC is
dispatched and is called by the jit `rpc_async` operator. We take a somewhat
similar approach to how this is done in the python API. If profiling is
enabled, we call `record_function_enter` which creates a `RecordFunction`
object and runs its starting callbacks. Then, we schedule end callbacks for
this `RecordFunction` to be run when the jit future completes.
One caveat is that `rpcTorchscript` can also be called by rpc_async from a
non-JIT function, in which case the profiling logic lives in Python. We add a
check to ensure that we don't double profile in this case.
ghstack-source-id: 107109485
Test Plan: Added relevant unittests.
Differential Revision: D22270608
fbshipit-source-id: 9f62d1a2a27f9e05772d0bfba47842229f0c24e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40913
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
Once we start merging multiple test suites in a single file (which will happen in the next diffs in the stack), the OSX tests on CircleCI start failing due to "too many open files". This indicates a file descriptor leak. I then managed to repro it on Linux too by lowering the limit on open file descriptors (`ulimit -n 500`). Each test method that unittest runs is run on a new instance of the Testcase class. With our multiprocessing wrappers, this instance contains a list of child processes. Even after these processes are terminated, it appears they still hold some open file descriptor (for example a pipe to communicate with the subprocess). It also appears unittest is keeping these Testcase instances alive until the entire suite completes, which I suspect is what leads to this "leak" of file descriptors. Based on that guess, in this diff I am resetting the list of subprocesses during shutdown, and this seems to fix the problem.
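The fix amounts to something like the following (a sketch with simplified names, not the exact multiprocessing test-case code):
```python
import unittest

class MultiProcessTestCase(unittest.TestCase):
    def setUp(self):
        self.processes = []  # child process handles; each keeps pipes/fds open

    def tearDown(self):
        for p in self.processes:
            p.terminate()
            p.join()
        # unittest keeps TestCase instances alive until the whole run finishes,
        # so drop the references here to release file descriptors early.
        self.processes = []
```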
ghstack-source-id: 107045908
Test Plan: Sandcastle and CircleCI
Differential Revision: D22356784
fbshipit-source-id: c93bb9db60fde72cae0b0c735a50c17e427580a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40815
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This prepares the stack by aligning the `ddp_under_dist_autograd` test to the other ones, so that later changes will be more consistent and thus easier to follow. It does so by moving the `skipIf` decorators and the `setUp` methods from the base test suite to the entry point scripts.
ghstack-source-id: 107045911
Test Plan: Sandcastle and CircleCI
Differential Revision: D22287535
fbshipit-source-id: ab0c9eb774b21d81e0ebd3078df958dbb4bfa0c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40814
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This prepares the stack by simplifying the TensorPipe fixture. A comment says that the TensorPipe fixture cannot subclass the generic fixture class as that would lead to a diamond class hierarchy which Python doesn't support (whereas in fact it does), and therefore it copies over two properties that are defined on the generic fixture. However, each class that uses the TensorPipe fixture also inherits from the generic fixture, so there's no need to redefine those properties. And, in fact, by not redefining it we save ourselves some trouble when the TensorPipe fixture would end up overriding another override.
ghstack-source-id: 107045914
Test Plan: Sandcastle and CircleCI
Differential Revision: D22287533
fbshipit-source-id: 254c38b36ba51c9d852562b166027abacbbd60ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40860
It turns out that the `@_skip_if_tensorpipe_agent` decorator was written in such a way that it accidentally caused the test to become a no-op (and thus always succeed) for all agents. What this means is that all tests wrapped by that decorator were never ever being run, for any agent.
My understanding of the root cause is that the following code:
```
@_skip_if_tensorpipe_agent
def test_foo(self):
self.assertEqual(2 + 2, 4)
```
ended up behaving somewhat like this:
```
def test_foo(self):
def original_test_func(self):
self.assertEqual(2 + 2, 4)
return unittest.skipIf(self.agent == "TENSORPIPE")(original_test_func)
```
which means that the test body of the decorated method was not actually calling the original test method.
This issue probably came from the `@_skip_if_tensorpipe_agent` being copy-pasted from `requires_process_group_agent` (which, however, is not a decorator but rather a decorator *factory*). An unfortunate naming (calling `decorator` what was in fact the wrapped method) then hindered readability and hid the issue.
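For contrast, a correctly behaving skip decorator might look like the following (a sketch; the attribute name checked is assumed, not the exact code that landed):
```python
import functools
import unittest

def skip_if_tensorpipe_agent(old_test_method):
    @functools.wraps(old_test_method)
    def wrapper(self, *args, **kwargs):
        if self.rpc_backend_name == "TENSORPIPE":  # assumed fixture attribute
            raise unittest.SkipTest("not supported with the TensorPipe agent")
        # crucially, actually run the wrapped test body
        return old_test_method(self, *args, **kwargs)
    return wrapper
```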
Note that a couple of tests had become legitimately broken in the meantime and no one had noticed. The breakages have been introduced in #39909 (a.k.a., D22011868 (145df306ae)).
ghstack-source-id: 107045916
Test Plan: Discovered this as part of my refactoring, in D22332611. After fixing the decorator two tests started breaking (for real reasons). After fixing them all is passing.
Differential Revision: D22332611
fbshipit-source-id: f88ca5574675fdb3cd09a9f6da12bf1e25203a14
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40865
1. Applied a filter for the module types
2. Removed the assumption that the conv and bn are immediate children of the parent module
Test Plan:
python test/test_quantization.py TestQuantizeJitPasses
Imported from OSS
Differential Revision: D22338074
fbshipit-source-id: 64739a5e56c0a74249a1dbc2c8454b88ec32aa9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40743
`aten::append` modifies its input in place and the output is ignored. These ops are not
supported right now, so we'll need to first make `aten::append` non-in-place
by changing
```
ignored = aten::append(list, x)
```
to
```
x_list = aten::ListConstruct(x)
result = aten::add(list, x_list)
```
and then quantize the aten::add instead.
Test Plan:
TestQuantizeJitOps.test_general_shape_ops
Imported from OSS
Differential Revision: D22302151
fbshipit-source-id: 931000388e7501e9dd17bec2fad8a96b71a5efc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40199
Mobile custom selective build has already been covered by `test/mobile/custom_build/build.sh`.
It builds a CLI binary with the host toolchain and runs it on the host machine to
check the correctness of the result.
But that custom build test doesn't cover the android/gradle build part.
And we cannot use it to measure and track the in-APK size of custom
build library.
So this PR adds the selective build test coverage for android NDK build.
Also integrate with the CI to upload the custom build size to scuba.
TODO:
Ideally it should build android/test_app and measure the in-APK size.
But the test_app hasn't been covered by any CI yet and is currently
broken, so build & measure AAR instead (which can be inaccurate as we
plan to pack C++ header files into AAR soon).
Sample result: https://fburl.com/scuba/pytorch_binary_size/skxwb1gh
```
+---------------------+-------------+-------------------+-----------+----------+
| build_mode | arch | lib | Build Num | Size |
+---------------------+-------------+-------------------+-----------+----------+
| custom-build-single | armeabi-v7a | libpytorch_jni.so | 5901579 | 3.68 MiB |
| prebuild | armeabi-v7a | libpytorch_jni.so | 5901014 | 6.23 MiB |
| prebuild | x86_64 | libpytorch_jni.so | 5901014 | 7.67 MiB |
+---------------------+-------------+-------------------+-----------+----------+
```
Test Plan: Imported from OSS
Differential Revision: D22111115
Pulled By: ljk53
fbshipit-source-id: 11d24efbc49a85f851ecd0e481d14123f405b3a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40188
Create a custom command for this task to avoid copy/paste for new build jobs.
Test Plan: Imported from OSS
Differential Revision: D22111114
Pulled By: ljk53
fbshipit-source-id: a7d4d6bbd61ba6b6cbaa137ec7f884736957dc39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40661
Add ser-de to support int8 quantization during online training
Test Plan:
```
buck test caffe2/caffe2/fb/fbgemm:int8_serializer_test
```
Reviewed By: hx89
Differential Revision: D22273292
fbshipit-source-id: 3b1e9c820243acf41044270afce72a262ef92bd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40646
Support double, float, at::Half.
Avoid creating output result on CPU.
Both tensors must be on GPU.
Reviewed By: ngimel
Differential Revision: D22258840
fbshipit-source-id: 95f4747477f09b40b1d682cd1f76e4c2ba28c452
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40920
Pytorch depends on this from both C and C++ source files, so unify linking so it's fully fixed.
Test Plan: Build it on Windows
Reviewed By: dreiss, supriyar
Differential Revision: D22348247
fbshipit-source-id: 2933b4804f4725ab1742914656fa367527f8f7e1
Summary: Namespacing the symbol, since it clashes with "the real thing" otherwise.
Test Plan: Sandcastle + build it on windows
Reviewed By: dreiss
Differential Revision: D22348240
fbshipit-source-id: f9c9a7abc97626ba327605cb4749fc5c38a24d35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40925
The normalization operator does not handle empty tensors correctly. This is a fix.
Test Plan: unit tests
Differential Revision: D22330340
fbshipit-source-id: 0bccf925bb768ebb997ed0c88130c5556308087f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40872
Shape hints, as the name suggests, are hints. We should use the real shapes from the workspace for the weights.
Reviewed By: ChunliF
Differential Revision: D22337680
fbshipit-source-id: e7a6101fb613ccb332c3e34b1c2cb8c6c47ce79b
Summary:
## TLDR
Support using NaN default value for missing dense features in RawInputProcessor for DPER2. In preparation for subsequent support for null flag features in compute meta. For train_eval this is already supported in DPER3 and we do not plan to support this in DPER2 train eval.
## Overview
Intern project plan to support adding dense flags for missing feature values instead of replacing with zero.
## Project plan :
https://docs.google.com/document/d/1OsPUTjpJycwxWLCue3Tnb1mx0uDC_2KKWvC1Rwpo2NI/edit?usp=sharing
## Code paths:
See https://fb.quip.com/eFXUA0tbDmNw for the call stack for all affected code paths.
Test Plan:
## fblearner flow test
1. `flow-cli clone f197867430 --run-as-secure-group ads_personalization_systems --force-build` to build an ephemeral package and start a fblearner flow run (may fail)
2. Clone the new run and change the secure_group to `XXXX` and entitlement to `default` in the UI
3. Adds explicit_null_min_coverage flag
4. Optionally reduce `max_examples` since we only test pass/fail instead of quality.
5. Submit the run to test the change
Example:
f198538878
## compare output coverages to daiquery runs
1. Randomly select null flag features from compute meta workflow output
2. Look up the feature id in feature metadata using feature name
3. Check against a daiquery sample of coverage to see if the coverage falls within guidelines.
https://www.internalfb.com/intern/daiquery/workspace/275342740223489/192619942076136/
## Sampled features:
GFF_C66_ADS_USER_SUM_84_PAGE_TYPE_RATIO_EVENT_LIKE_IMPRESSION: 15694257
- original feature compute meta coverage: 0.999992
- daiquery feature coverage (10k rows): 0.69588
- null flag compute meta coverage: 0.293409
GFF_R1303_ADS_USER_SUM_7_PAGE_TYPE_COUNTER_CONVERSION: 16051183
- original feature compute meta coverage: 0.949868
- daiquery feature coverage: 0.82241
- null flag compute meta coverage: 0.151687
## Unit tests:
`buck test fblearner/flow/projects/dper/tests/workflows:ads_test`
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6192449504303863/
Differential Revision: D22026450
fbshipit-source-id: 46c59d849fa89253f14dc2b035c4c677cd6e3a4c
Summary:
Move Storage class from __init__.pyi.in to types.py and make it a protocol, since this is not a real class
Expose `PyTorchFileReader` and `PyTorchFileWriter` native classes
Ignore function attributes, as there is not yet a good way to type-annotate those; see https://github.com/python/mypy/issues/2087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40862
Differential Revision: D22344743
Pulled By: malfet
fbshipit-source-id: 95cdb6f980ee79383960f306223e170c63df3232
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40885
`TracingState::setValue` associates a concrete IValue in the traced
program with a `Value*` symbolic. Previously, the logic for how
GenericDicts worked was special-cased to only work for very simple cases
and silently drop other cases.
This PR generalizes the logic to reflect the same behavior as using
dictionaries on input: whenever we encounter a dictionary in the system,
we completely "burn in" all the keys into the graph, and then
recursively call `setValue` on the associated value.
This has the effect of requiring that any dictionary structure you are
creating in a traced program be of fixed structure, similar to how any
dictionary used as input must be static as well.
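As a rough illustration of the "burn in" behavior described above (this is not the test added in this PR; the `strict=False` flag for dict outputs is an assumption for illustration):
```python
import torch

def make_stats(x):
    # The key set of this dict is fixed ("burned in") at trace time; only the
    # tensor values stay symbolic in the traced graph.
    return {"double": x * 2, "square": x * x}

example = torch.randn(3)
traced = torch.jit.trace(make_stats, (example,), strict=False)

out = traced(torch.randn(3))
print(sorted(out.keys()))  # always ['double', 'square'], whatever the input is
```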
Test Plan: Imported from OSS
Differential Revision: D22342490
Pulled By: suo
fbshipit-source-id: 93e610a4895d61d9b8b19c8d2aa4e6d57777eaf6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40649
The original implementation of RemoveOpsByType is pretty buggy and does not remove all instances of the ops that should be removed. It's also quite complicated and hard to modify. I reimplemented it by first converting the graph to its SSA form. The algorithm is quite simple once the graph is in SSA form. It's very similar to constant propagation with a few modifications. The hardest part is dealing with the case of removing an op whose output is also an output of the predict net, because that output has to be preserved.
(Note: this ignores all push blocking failures!)
Reviewed By: yinghai, dzhulgakov
Differential Revision: D22220798
fbshipit-source-id: faf6ed5242f1e2f310125d964738c608c6c55c94
Summary:
This PR introduces a warning when the user tries to export a model to ONNX in training-amenable mode while constant folding is turned on. We want to warn against unintentional use, because constant folding may fold parameters that are intended to remain trainable in the exported model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40546
Reviewed By: hl475
Differential Revision: D22310917
Pulled By: houseroad
fbshipit-source-id: ba83b8e63af7c458b5ecca8ff2ee1c77e2064f90
Summary:
Right now it is used to check whether `math.remainder` exists, which is the case for both Python-3.7 and 3.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40868
Differential Revision: D22343454
Pulled By: malfet
fbshipit-source-id: 6b6d4869705b64c4b952309120f92c04ac7e39fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38257
It seems we're doing a runtime type check for custom classes on each operator call if the operator has custom class arguments.
Operators without custom class arguments are unaffected, but this is a problem for operators that do take them,
for example operators taking an at::native::xnnpack::Conv2dOpContext argument.
The long term solution would be to move those checks to op registration time instead of doing them at call time,
but as an intermediate fix, we can at least make the check fast by
- Using ska::flat_hash_map instead of std::unordered_map
- Using std::type_index instead of std::string (i.e. avoid calling std::hash on a std::string)
ghstack-source-id: 106805209
Test Plan: waitforsandcastle
Reviewed By: ezyang
Differential Revision: D21507226
fbshipit-source-id: bd120d5574734be843c197673ea4222599fee7cb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40876
clang format reducer.cpp
ghstack-source-id: 106980050
Test Plan: unit test
Differential Revision: D22321422
fbshipit-source-id: 54afdff206504c7bbdf2e408928cc32068e15cdc
Summary:
Moving this to a file that can be sourced by downstream pytorch/xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40828
Reviewed By: malfet
Differential Revision: D22339513
Pulled By: ailzhang
fbshipit-source-id: c43b18fa2b7e1e8bb6810a6a43bb7dccd4756238
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40585
This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL
destructor. This should prevent pending NCCL communicators from blocking other CUDA ops.
ghstack-source-id: 106988073
Test Plan: Sandcastle/ OSS CI
Differential Revision: D22244873
fbshipit-source-id: 4b4fe65e1bd875a50151870f8120498193d7535e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40709
Cast to kHalf and back to kFloat before the linear operator to mimic FP16 quant support
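A minimal sketch of the numerics being mimicked (illustrative only; whether the activation is also rounded through half precision is an assumption here, and `F.linear` is just a stand-in for the linear operator):
```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
w = torch.randn(16, 8)
b = torch.randn(16)

# Emulate fp16 dynamic quantization of Linear by rounding the weight through
# half precision and then running the matmul in float32.
y_ref = F.linear(x, w.half().float(), b)
y_fp32 = F.linear(x, w, b)
print((y_ref - y_fp32).abs().max())  # small, on the order of fp16 rounding error
```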
Test Plan:
python test/test_quantization.py test_convert_dynamic_fp16
Imported from OSS
Differential Revision: D22335977
fbshipit-source-id: f964128ec733469672a1ed4cb0d757d0a6c22c3a
Summary:
If a virtual function is implemented in a header file, its implementation will be included as a weak symbol in every shared library that includes this header, along with all of its dependencies.
This was one of the reasons why the size of libcaffe2_module_test_dynamic.so was 500Kb (the AddRelatedBlobInfo implementation pulled a quarter of libprotobuf.a with it).
The combination of this and https://github.com/pytorch/pytorch/issues/40845 reduces the size of `libcaffe2_module_test_dynamic.so` from 500Kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40844
Differential Revision: D22334725
Pulled By: malfet
fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
Summary:
… file
This prevents the implementations of those functions (as lambdas) from being embedded as weak symbols in every shared library that includes this header.
The combination of this and https://github.com/pytorch/pytorch/pull/40844 reduces the size of `libcaffe2_module_test_dynamic.so` from 500Kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40845
Differential Revision: D22334779
Pulled By: malfet
fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40830
Fixes #40725
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22323886
Pulled By: ezyang
fbshipit-source-id: b8a61496923d9f086d4c201024748505ba783238
Summary:
Apologies if this seems trivial, but I'd like to fix these on my way through reading some of the source code. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40692
Differential Revision: D22284651
Pulled By: mrshenli
fbshipit-source-id: 4259d1808aa4d15a02cfd486cfb44dd75fdc58f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758
Currently we flip a coin for each sampled callback each time
we run RecordFunction. This PR is an attempt to skip most of the coin
flips (for the low-probability observers) while keeping the distribution
close to the original one.
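A minimal Python sketch of the general idea (not the actual RecordFunction code): instead of a Bernoulli(p) draw on every call, draw once how many calls to skip before the next hit, which preserves the expected hit rate of p while doing far fewer RNG calls.
```python
import math
import random

class SampledCallback:
    """Fire with probability p per call, drawing the RNG only once per run of misses."""

    def __init__(self, p: float):
        assert 0.0 < p <= 1.0
        self.p = p
        self.skip = self._draw_skips()

    def _draw_skips(self) -> int:
        # Number of misses before the next hit ~ Geometric(p).
        if self.p == 1.0:
            return 0
        u = 1.0 - random.random()  # u in (0, 1], avoids log(0)
        return int(math.log(u) / math.log(1.0 - self.p))

    def should_run(self) -> bool:
        if self.skip > 0:
            self.skip -= 1
            return False
        self.skip = self._draw_skips()
        return True

cb = SampledCallback(p=0.001)
hits = sum(cb.should_run() for _ in range(1_000_000))
print(hits)  # ~1000 on average
```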
Test Plan:
CI and record_function_benchmark
```
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 30108 us.
Time per iteration (1x1): 1496.78 us.
Time per iteration (16x16): 2142.46 us.
Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 19051 us.
Time per iteration (1x1): 1581.89 us.
Time per iteration (16x16): 2195.67 us.
Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18715 us.
Time per iteration (1x1): 1566.11 us.
Time per iteration (16x16): 2131.17 us.
Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18814 us.
Time per iteration (1x1): 1536.2 us.
Time per iteration (16x16): 1985.82 us.
Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18278 us.
Time per iteration (1x1): 1526.32 us.
Time per iteration (16x16): 2093.77 us.
Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18545 us.
Time per iteration (1x1): 1524.65 us.
Time per iteration (16x16): 2080 us.
Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048
```
Reviewed By: dzhulgakov
Differential Revision: D22320879
Pulled By: ilia-cher
fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40792
Fixes https://github.com/pytorch/pytorch/issues/40529.
One followup should be to produce a better error message when a new
dictionary has different keys than the traced input. Right now it
presents as a fairly opaque `KeyError`.
Test Plan: Imported from OSS
Differential Revision: D22311731
Pulled By: suo
fbshipit-source-id: c9fbe0b54cf69daed2f11a191d988568521a3932
Summary:
This directory is opted-in to clang-format but is not format-clean. This blocks continuous formatting from being enabled on fbcode, and causes hassle for other codemods that leave inconsistent formatting. This diff runs clang-format, which is widely used and considered safe.
If you are unhappy with the formatting of a particular block, please *accept this diff* and then in a stacked commit undo the change and wrap that code in `// clang-format off` and `// clang-format on`, or `/* clang-format off */` and `/* clang-format on */`.
drop-conflicts
Test Plan: sandcastleit
Reviewed By: jerryzh168
Differential Revision: D22311706
fbshipit-source-id: 1ca59a82e96156a4a5dfad70ba3e64d44c5e762a
Summary:
Allow np.memmap objects to be processed by default_collate
np.memmap objects have the same behavior as numpy arrays; the only difference is that they are stored in a binary file on disk. However, the default_collate function used by the PyTorch DataLoader only accepts np.ndarray and rejects np.memmap via type checking. This commit allows np.memmap objects to be processed by default_collate, so users can use large on-disk arrays with the PyTorch DataLoader.
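A minimal sketch of the use case this enables (the file name, shape, and dataset wrapper are illustrative):
```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

# Create a small on-disk array; in practice this would be a large existing file.
arr = np.memmap("data.bin", dtype=np.float32, mode="w+", shape=(100, 16))
arr[:] = np.random.rand(100, 16)

class MemmapDataset(Dataset):
    def __init__(self, mm):
        self.mm = mm

    def __len__(self):
        return len(self.mm)

    def __getitem__(self, idx):
        # Returns an np.memmap view, which default_collate now accepts.
        return self.mm[idx]

loader = DataLoader(MemmapDataset(arr), batch_size=8)  # uses default_collate
batch = next(iter(loader))
print(type(batch), batch.shape)  # <class 'torch.Tensor'> torch.Size([8, 16])
```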
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39847
Reviewed By: ezyang
Differential Revision: D22284650
Pulled By: zou3519
fbshipit-source-id: 003e3208a2afd1afc2e4640df14b3446201e00b4
Summary: Use the newly added counter op in sparse adagrad
Reviewed By: chocjy, ellie-wen
Differential Revision: D19221100
fbshipit-source-id: d939d83e3b5b3179f57194be2e8864d0fbbee2c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40727
This unit test doesn't need a process group, so we avoid
initializing one.
#Closes: https://github.com/pytorch/pytorch/issues/40292
ghstack-source-id: 106817362
Test Plan: waitforbuildbot
Differential Revision: D22295131
fbshipit-source-id: 5a60e91e4beeb61cc204d24c564106d0215090a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40788
Avoid repeating the same `:gencode[foo/bar]` over and over again
Test Plan: CI
Reviewed By: EscapeZero
Differential Revision: D22271151
fbshipit-source-id: f8db57db4ee0948bcca0c8945fdf30380ba81cae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40802
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22317343
Pulled By: ezyang
fbshipit-source-id: 8a982dd93a28d102dfd63163cd44704e899922e0
Summary:
This is the prototype for the modular utils that we've been discussing. It is admittedly a large PR, but a good fraction of that is documentation and examples. I've trimmed a bit on the edges since we last discussed this design (for instance Timer is no longer Fuzzer aware), but it's mostly the same.
In addition to the library and hermetic examples, I've included `examples.end_to_end` which tests https://github.com/pytorch/pytorch/pull/38061 over a variety of shapes, dtypes, degrees of broadcasting, and layouts. (CC crcrpar) I only did CPU as I'm not set up on a GPU machine yet. [Results from my devserver](https://gist.github.com/robieta/d1a8e1980556dc3f4f021c9f7c3738e2)
Key takeaways:
1) For contiguous Tensors, larger dtypes (fp32 and fp64) and lots of reuse of the mask due to broadcasting, improvements are significant. (Presumably due to better vectorization?)
2) There is an extra ~1.5 us overhead, which dominates small kernels.
3) Cases with lower write intensity (int8, lower mask fraction, etc) or non-contiguous seem to suffer.
Hopefully this demonstrates the proof-of-concept for how this tooling can be used to tune kernels and assess PRs. Looking forward to thoughts and feedback.
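A rough sketch of how the timing utility can be used (at the time of this PR it lives under a private module; the `torch.utils.benchmark` path shown here is the later public location, so treat the import path as an assumption):
```python
import torch
import torch.utils.benchmark as benchmark  # path assumption, see note above

x = torch.randn(1024, 1024)
mask = torch.rand(1024, 1024) > 0.5

timer = benchmark.Timer(
    stmt="x.masked_fill(mask, 0.)",
    globals={"x": x, "mask": mask},
    label="masked_fill",
    sub_label="1024x1024 float32",
)
print(timer.blocked_autorange(min_run_time=0.2))
```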
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38338
Differential Revision: D21551048
Pulled By: robieta
fbshipit-source-id: 6c50e5439a04eac98b8a2355ef731852ba0500db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40513
This PR makes the following changes:
1. Complex printing now uses print formatting for the real and imaginary values separately, and they are joined at the end.
2. Change 1 naturally fixes the printing of complex tensors with sci_mode=True
```
>>> torch.tensor(float('inf')+float('inf')*1j)
tensor(nan+infj)
>>> torch.randn(2000, dtype=torch.cfloat)
tensor([ 0.3015-0.2502j, -1.1102+1.2218j, -0.6324+0.0640j, ...,
-1.0200-0.2302j, 0.6511-0.1889j, -0.1069+0.1702j])
>>> torch.tensor([1e-3, 3+4j, 1e-5j, 1e-2+3j, 5+1e-6j])
tensor([1.0000e-03+0.0000e+00j, 3.0000e+00+4.0000e+00j, 0.0000e+00+1.0000e-05j,
1.0000e-02+3.0000e+00j, 5.0000e+00+1.0000e-06j])
>>> torch.randn(3, dtype=torch.cfloat)
tensor([ 1.0992-0.4459j, 1.1073+0.1202j, -0.2177-0.6342j])
>>> x = torch.tensor([1e2, 1e-2])
>>> torch.set_printoptions(sci_mode=False)
>>> x
tensor([ 100.0000, 0.0100])
>>> x = torch.tensor([1e2, 1e-2j])
>>> x
tensor([100.+0.0000j, 0.+0.0100j])
```
Test Plan: Imported from OSS
Differential Revision: D22309294
Pulled By: anjali411
fbshipit-source-id: 20edf9e28063725aeff39f3a246a2d7f348ff1e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40753
The reference says that this op always returns a 1-D tensor, even if
the input and the mask are 0-D.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22300354
Pulled By: ZolotukhinM
fbshipit-source-id: f6952989c8facf87d73d00505bf6d41573eff2d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40576
`out_dims` specifies where in the output tensors the vmapped dimension
should appear. We implement this by simply creating a view with the
batch dimension moved to the desired position.
`out_dims` must either:
- be int (use the same value for all outputs)
- be Tuple[int] (so the user specifies one out_dim per output).
(See the vmap docstring for what we advertise out_dims to do).
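A minimal sketch of the semantics (not the real implementation): the batched result carries its batch dimension at position 0, and `out_dims` just moves that dimension to the requested position with a view.
```python
import torch

def apply_out_dim(batched_result: torch.Tensor, out_dim: int) -> torch.Tensor:
    # Batch dimension is at 0 internally; expose it wherever the user asked.
    return torch.movedim(batched_result, 0, out_dim)

x = torch.randn(5, 3)        # a batch of 5 vectors of size 3
per_example = x * 2          # stand-in for the vmapped computation, batch dim at 0
print(apply_out_dim(per_example, 0).shape)  # torch.Size([5, 3])
print(apply_out_dim(per_example, 1).shape)  # torch.Size([3, 5])
```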
I also renamed `TestVmap` to `TestVmapAPI` to make it clearer that we
are testing the API here and not specific operators (which will go into
their own test class).
Test Plan: - `pytest test/test_vmap.py -v`
Differential Revision: D22288086
Pulled By: zou3519
fbshipit-source-id: c8666cb1a0e22c54473d8045477e14c2089167cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40575
This provides some more context for the next ~2 PRs that will implement
the `out_dims` and `in_dims` functionality. I will probably add more to
it later (things I think we should add: examples (maybe in a dedicated
docs page), specific examples of things vmap cannot handle).
Test Plan:
- Code reading for now. When we are ready to add vmap to master documentation,
I'll build the docs and fix any formatting problems.
Differential Revision: D22288085
Pulled By: zou3519
fbshipit-source-id: 6e28d7bd524242395160c20270159b4b121d6789
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40728
There are two reasons the test is failing:
1) division by 0
2) the result is bigger than the fp16 max
For 1), make the divisor some safe number like 1e-3.
For 2), when a combination of random numbers produces a value bigger than ~65e3, clip it.
Multiplication is fine because the range of the random numbers is 0-100, so the result is 0-10000.
Test Plan: ran the test_div test
Reviewed By: hl475
Differential Revision: D22295934
fbshipit-source-id: 173f3f2187137d6c1c4d4a505411a27f1c059f1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39763
This is an ask from fluent. For performance reasons, they need a way to get read access to the std::string inside of a torch::List<std::string> without having to copy that string.
Instead of special casing std::string, we decided to give access to the underlying value. The API now looks like:
```cpp
torch::List<std::string> list = ...;
const std::string& str = list[2].toIValueRef().toStringRef();
```
ghstack-source-id: 106806840
Test Plan: unit tests
Reviewed By: ezyang
Differential Revision: D21966183
fbshipit-source-id: 8b80b0244d10215c36b524d1d80844832cf8b69a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40720
Add support for populating domain_discrete field in TensorBoard add_hparams API
Test Plan: Unit test test_hparams_domain_discrete
Reviewed By: edward-io
Differential Revision: D22291347
fbshipit-source-id: 78db9f62661c9fe36cd08d563db0e7021c01428d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034
c10 takes a Stack* in boxed functions while JIT took Stack&.
c10 doesn't return anything while JIT returns an int which is always zero.
This changes JIT to follow the c10 behavior.
ghstack-source-id: 106834069
Test Plan: unit tests
Differential Revision: D20567950
fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40247
This CI job was bypassed on PRs because most of it is already
covered by the mobile-custom-build-dynamic job that runs on every PR.
However, it can still fail independently because it builds and analyzes
a small test project, e.g. if people forget to update the registration API
used in the test project.
So this PR changes it to only build and analyze the test project and to run
the job on every PR.
Test Plan: Imported from OSS
Differential Revision: D22126044
Pulled By: ljk53
fbshipit-source-id: 6699a200208a65b249bd3a4e43ad72bc07388ce3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40588
Two bugs were preventing this from working. One was a divide by zero
when multithreading was enabled, fixed similarly to the fix for static
quantized linear in the previous commit. The other was computation of
min and max to determine qparams. FBGEMM uses [0, 0] for [min, max] of an
empty input, so we do the same.
Test Plan: Added a unit test.
Differential Revision: D22264415
Pulled By: dreiss
fbshipit-source-id: 6ca9cf48107dd998ef4834e5540279a8826bc754
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40587
Previously, this was causing divide-by-zero only in the multithreaded
empty-batch case, while calculating tiling parameters for the threads.
In my opinion, the bug here is using a value that is allowed to be zero
(batch size) for an argument that should not be zero (tile size), so I
fixed the bug by bailing out right before the call to
pthreadpool_compute_4d_tiled.
Test Plan: TestQuantizedOps.test_empty_batch
Differential Revision: D22264414
Pulled By: dreiss
fbshipit-source-id: 9446d5231ff65ef19003686f3989e62f04cf18c9
Summary:
This PR implements gh-33389.
As a result of this PR, users can now specify various reduction modes for scatter operations. Currently, `add`, `subtract`, `multiply` and `divide` have been implemented, and adding new ones is not hard.
While we now allow dynamic runtime selection of reduction modes, the performance is the same as was the case for the `scatter_add_` method in the master branch. Proof can be seen in the graph below, which compares `scatter_add_` in the master branch (blue) and `scatter_(reduce="add")` from this PR (orange).

The script used for benchmarking is as follows:
``` python
import os
import sys
import torch
import time
import numpy
from IPython import get_ipython
Ms=256
Ns=512
dim = 0
top_power = 2
ipython = get_ipython()
plot_name = os.path.basename(__file__)
branch = sys.argv[1]
fname = open(plot_name + ".csv", "a+")
for pM in range(top_power):
    M = Ms * (2 ** pM)
    for pN in range(top_power):
        N = Ns * (2 ** pN)
        input_one = torch.rand(M, N)
        index = torch.tensor(numpy.random.randint(0, M, (M, N)))
        res = torch.randn(M, N)
        test_case = f"{M}x{N}"
        print(test_case)
        tobj = ipython.magic("timeit -o res.scatter_(dim, index, input_one, reduce=\"add\")")
        fname.write(f"{test_case},{branch},{tobj.average},{tobj.stdev}\n")
fname.close()
```
Additionally, one can see that various reduction modes take almost the same time to execute:
```
op: add
70.6 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.1 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: subtract
71 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.4 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: multiply
70.9 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.4 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: divide
164 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.3 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Script:
``` python
import torch
import time
import numpy
from IPython import get_ipython
ipython = get_ipython()
nrows = 3000
ncols = 10000
dims = [nrows, ncols]
res = torch.randint(5, 10, dims)
idx1 = torch.randint(dims[0], (1, dims[1])).long()
src1 = torch.randint(5, 10, (1, dims[1]))
idx2 = torch.randint(dims[1], (dims[0], 1)).long()
src2 = torch.randint(5, 10, (dims[0], 1))
for op in ["add", "subtract", "multiply", "divide"]:
print(f"op: {op}")
ipython.magic("timeit res.scatter_(0, idx1, src1, reduce=op)")
ipython.magic("timeit res.scatter_(1, idx2, src2, reduce=op)")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36447
Differential Revision: D22272631
Pulled By: ngimel
fbshipit-source-id: 3cdb46510f9bb0e135a5c03d6d4aa5de9402ee90
Summary:
Set PYTORCH_ROCM_ARCH to `gfx900;gfx906` if the `CIRCLECI` environment variable is defined.
Add ROCm build and test jobs and schedule them on the `xlarge` and `amd-gpu` resource classes, respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39760
Differential Revision: D22290335
Pulled By: malfet
fbshipit-source-id: 7462f97b262abcacac3e515086ac6236a45626d2
Summary:
By default the freeze_module pass, invoked from optimize_for_mobile,
preserves only the forward method. There is an option to specify a list of
methods that can be preserved during freeze_module. This PR exposes that
option through the optimize_for_mobile pass.
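A minimal sketch of the resulting usage (the module itself is illustrative; the relevant piece is the `preserved_methods` argument exposed here):
```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

class MyModule(torch.nn.Module):
    def forward(self, x):
        return x + 1

    @torch.jit.export
    def version(self) -> int:
        return 1

scripted = torch.jit.script(MyModule())
# Without preserved_methods only `forward` survives freezing; with it, extra
# methods are kept on the optimized module.
optimized = optimize_for_mobile(scripted, preserved_methods=["version"])
print(optimized.version())  # 1
```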
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40629
Test Plan: python test/test_mobile_optimizer.py
Reviewed By: dreiss
Differential Revision: D22260972
Pulled By: kimishpatel
fbshipit-source-id: 452c653269da8bb865acfb58da2d28c23c66e326
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40673
As title. We planned to have the lite interpreter and full JIT co-exist in the short term. To avoid duplicated symbols and operator registrations when loading dynamic libraries, we put the common files in a separate component.
The original source file list names are preserved.
ghstack-source-id: 106757184
Test Plan: CI
Reviewed By: kwanmacher
Differential Revision: D22276185
fbshipit-source-id: 328a8ba9c3d88437da0d30c6e6791087d0df5e2e
Summary: Add support for populating domain_discrete field in TensorBoard add_hparams API
Test Plan: Unit test test_hparams_domain_discrete
Reviewed By: edward-io
Differential Revision: D22227939
fbshipit-source-id: d2f0cd8e5632cbcc578466ff3cd587ee74f847af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40660
Support a custom run_name, since using a timestamp as the run_name can be confusing.
Test Plan:
hp = {"lr": 0.1, "bool_var": True, "string_var": "hi"}
mt = {"accuracy": 0.1}
writer.add_hparams(hp, mt, run_name="run1")
writer.flush()
Reviewed By: edward-io
Differential Revision: D22157749
fbshipit-source-id: 3d4974381e3be3298f3e4c40e3d4bf20e49dfb07
Summary: This reverts a change that was made to fix a range-loop-analysis warning.
Test Plan: CI
Reviewed By: nlutsenko
Differential Revision: D22274461
fbshipit-source-id: dedc3fcaa6e32259460380163758d6c9c9b73211
Summary:
Also modernize the test script itself by using `mypy.api.run` rather than `subprocess.call`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40656
Differential Revision: D22274421
Pulled By: malfet
fbshipit-source-id: 59232d4d37ee01cda56375b84ac1476d16686bfe
Summary:
Remove `skipIfRocm` from most jit tests and enable `RUN_CUDA_HALF` tests for ROCm.
These changes passed more than three rounds of CI testing against the ROCm CI.
CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40447
Differential Revision: D22190711
Pulled By: xw285cornell
fbshipit-source-id: bac44825a2675d247b3abe2ec2f80420a95348a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40624
Previously we didn't clone the schema, so the default schema was used; this was
causing issues for some models
Test Plan: Imported from OSS
Differential Revision: D22259519
fbshipit-source-id: e2a393a54cb18f55da0c7152a74ddc22079ac350
Summary:
1. In reducer.cpp, we have a new boolean `find_unused_param_` and its value is set in `Reducer::prepare_for_backward`.
If `!find_unused_param_`, then it avoids `allreduce(local_used_maps_dev_)`.
2. Solves issue [38942](https://github.com/pytorch/pytorch/issues/38942).
3. Fixes incorrect checks of `find_unused_parameters_`, such as using `outputs.empty()` or `unused_parameters_.empty()` as proxies.
ghstack-source-id: 106693089
Test Plan:
1. Run `test/distributed/test_c10d.py` and make sure all tests pass.
2. A new test case `test_find_unused_parameters_when_unused_parameters_empty` is included. Old `reducer.cpp` was failing in that unit test because it was checking `find_unused_parameters_` by `unused_parameters_.empty()`. Current `reducer.cpp` passes this unit test.
3. Two test cases were failing `test_forward_backward_unused_parameters` and `test_forward_backward_optimizer` , because `find_unused_parameter_` of their `reducer` object was not set properly. I fixed that as well.
Imported from OSS
**Output of version 14:**
```
................s.....s...............................................test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
tensor = torch.full([100, 100], self.rank)
.test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at ../aten/src/ATen/native/TensorFactories.cpp:364.)
self.assertEqual(torch.full([10, 10], self.world_size), tensor)
.....s...............................
----------------------------------------------------------------------
Ran 108 tests in 214.210s
OK (skipped=3)
```
Differential Revision: D22176231
fbshipit-source-id: b5d15f034e13a0915a474737779cc5aa8e068836
Summary:
## TLDR
Support using a NaN default value for missing dense features in RawInputProcessor for *DPER2*, in preparation for subsequent support of null flag features in *compute meta*. For train_eval this is already supported in DPER3, and we do not plan to support it in DPER2 train_eval.
## Overview
Intern project plan to support adding dense flags for missing feature values instead of replacing with zero.
Project plan :
https://docs.google.com/document/d/1OsPUTjpJycwxWLCue3Tnb1mx0uDC_2KKWvC1Rwpo2NI/edit?usp=sharing
## Code paths:
See https://fb.quip.com/eFXUA0tbDmNw for the call stack for all affected code paths.
Test Plan:
# A. DPER3 blob value inspection
## 1. Build local bento kernel in fbcode folder
`buck build mode/dev-nosan //bento/kernels:bento_kernel_ads_ranking`
## 2. Use kernel `ads_ranking (local)` to print dense feature blob values
n280239
## 2.1 Try `default_dense_value = "0.0"` (default)
```
preproc_6/feature_preproc_6/dper_feature_processor_7/raw_input_proc_7/float_feature_sparse_to_dense_7/float_features [[0. ]
[0. ]
[0. ]
[0. ]
[0. ]
[0. ]
[0. ]
[1. ]
[1.7857143]
[1.7777778]
[1. ]
[0. ]
[0.5625 ]
[0. ]
[0. ]
[0.8 ]
[0. ]
[1. ]
[0.56 ]
[0. ]]
```
## 2.2 Try `default_dense_value = "123"`
```
preproc_2/feature_preproc_2/dper_feature_processor_3/raw_input_proc_3/float_feature_sparse_to_dense_3/float_features [[123. ]
[123. ]
[123. ]
[123. ]
[123. ]
[123. ]
[123. ]
[ 1. ]
[ 1.7857143]
[ 1.7777778]
[ 1. ]
[123. ]
[ 0.5625 ]
[123. ]
[123. ]
[ 0.8 ]
[123. ]
[ 1. ]
[ 0.56 ]
[123. ]]
```
## 2.3 Try `default_dense_value = float("nan")`
```
RuntimeError: [enforce fail at enforce_finite_op.h:40] std::isfinite(input_data[i]). Index 0 is not finite (e.g., NaN, Inf): -nan (Error from operator:
input: "unary_4/logistic_regression_loss_4/average_loss_4/average_loss" name: "" type: "EnforceFinite" device_option { random_seed: 54 })
```
which is expected due to nan input.
# B. Unit test
`buck test fblearner/flow/projects/dper/tests/preprocs:raw_feature_extractor_test`
https://www.internalfb.com/intern/testinfra/testconsole/testrun/5348024586274923/
{F241336814}
Differential Revision: D21961595
fbshipit-source-id: 3dcb153b3c7f42f391584f5e7f52f3d9c76de31f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39459
Update to this PR: this code isn't going to fully solve https://github.com/pytorch/pytorch/issues/37010. The changes required for 37010 are more than this PR initially planned. Instead, this PR switches the op registration of RNG-related tests to use the new API (similar to what was done in #36925)
Test Plan:
1) unit tests
Imported from OSS
Reviewed By: ezyang
Differential Revision: D22264889
fbshipit-source-id: 82488ac6e3b762a756818434e22c2a0f9cb9dd47
Summary:
This replicates the pattern of other "do for luck" commands.
Prep change to add ROCm to CircleCI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40631
Differential Revision: D22261707
Pulled By: malfet
fbshipit-source-id: 3dadfa434deab866a8800715f3197e84169cf43e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37055
Sometimes it's okay to bundle a large example input tensor with a model.
Add a utility function to make it easy for users to do that *on purpose*.
Test Plan: Unit test.
Differential Revision: D22264239
Pulled By: dreiss
fbshipit-source-id: 05c6422be1aa926cca850f994ff1ae83c0399119
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36764
This allows bundling inputs that are large uniform buffers in
channels-last memory format.
Test Plan: Unit test.
Differential Revision: D21142660
Pulled By: dreiss
fbshipit-source-id: 31bbea6586d07c1fd0bcad4cb36ed2b8bb88a7e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40611
This commit removes a dead store in `transformWith` of exit_transforms.cpp.
Test Plan: Continuous integration.
Reviewed By: suo
Differential Revision: D22254136
fbshipit-source-id: f68c4625f7be8ae29b3500303211b2299ce5d6f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40614
This update pulls in a one-liner fix, which sets the TCP_NODELAY option on the TCP sockets of the UV transport. This leads to exceptional performance gains in terms of latency, with about a 25x improvement in one simple benchmark. This thus resolves a regression that TensorPipe had compared to the ProcessGroup agent and, in fact, ends up beating it by 2x.
The benchmark I ran is this, with the two endpoints pinned to different cores of the same machine:
```
@torch.jit.script
def remote_fn(t: int):
    return t

@torch.jit.script
def local_fn():
    for _ in range(1_000_000):
        fut = rpc.rpc_async("rhs", remote_fn, (42,))
        fut.wait()
```
And the average round-trip time (one iteration) is:
- TensorPipe with SHM: 97.2 us
- TensorPipe with UV _after the fix_: 205us
- Gloo: 440us
- TensorPipe with UV _before the fix_: 5ms
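For context, TCP_NODELAY disables Nagle's algorithm so that small RPC messages go out immediately instead of being buffered; a minimal sketch of the socket option itself (plain Python sockets, not TensorPipe's code):
```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: flush small writes right away instead of waiting
# to coalesce them, which is what was hurting per-message RPC latency here.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))  # non-zero => enabled
```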
Test Plan: Ran PyTorch RPC test suite
Differential Revision: D22255393
fbshipit-source-id: 3f6825d03317d10313704c05a9280b3043920507
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40584
Also patch [this github issue](https://github.com/pytorch/pytorch/issues/33124)
involving an illegal assembly instruction in 8x8-dq-aarch64-neon.S.
Test Plan:
Build binaries, copy to shaker, run executables. Also run all
existing caffe tests.
Reviewed By: kimishpatel
Differential Revision: D22240670
fbshipit-source-id: 51960266ce58699fe6830bcf75632b92a122f638
Summary:
Switch windows CPU testers from `windows.xlarge` to `windows.medium` class.
Remove VS 14.16 CUDA build
Only do smoke force-on-cpu tests using VS2019+CUDA10.1 config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40592
Differential Revision: D22259351
Pulled By: malfet
fbshipit-source-id: f934ff774dfc7d47f12c3da836ca314c12d92208
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/38555
I did an audit of `native_functions.yaml` and found several functions in addition to `reshape` which were not reporting that they could alias:
```
@torch.jit.script
def foo(t: torch.Tensor):
    new_value = torch.tensor(1, dtype=t.dtype, device=t.device)
    t.flatten()[0] = new_value
    t.reshape(-1)[1] = new_value
    t.view_as(t)[2] = new_value
    t.expand_as(t)[3] = new_value
    t.reshape_as(t)[4] = new_value
    t.contiguous()[5] = new_value
    t.detach()[6] = new_value
    return t
```
Currently none of the values are assigned after dead code elimination; after this PR, all are. (And the JIT output matches that of eager.)
I don't think this needs to be unit tested; presumably the generic machinery already is and this just brings these ops under the same umbrella.
**BC-breaking note**: This updates the native operator schema and the aliasing rules for autograd. JIT passes will no longer incorrectly optimize mutations on graphs containing these ops, and inplace ops on the result of `flatten` will now properly be tracked in Autograd and the proper backward graph will be created.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39794
Differential Revision: D22008358
Pulled By: robieta
fbshipit-source-id: 9d3ff536e58543211e08254a75c6110f2a3b4992
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40512
Fixes https://github.com/pytorch/pytorch/issues/32454
The heart of this diff is changing this:
```
inline const KernelFunction& Dispatcher::dispatch_(const DispatchTable& dispatchTable, DispatchKey dispatchKey) const {
  const KernelFunction* backendKernel = dispatchTable.lookup(dispatchKey);
  if (nullptr != backendKernel) {
    return *backendKernel;
  }
  const auto& backendFallbackKernel = backendFallbackKernels_[dispatchKey];
  if (backendFallbackKernel.isValid()) {
    return backendFallbackKernel;
  }
  const KernelFunction* catchallKernel = dispatchTable.lookupCatchallKernel();
  if (C10_LIKELY(nullptr != catchallKernel)) {
    return *catchallKernel;
  }
  reportError(dispatchTable, dispatchKey);
}
```
to this:
```
const KernelFunction& OperatorEntry::lookup(DispatchKey k) const {
  const auto& kernel = dispatchTable_[static_cast<uint8_t>(k)];
  if (C10_UNLIKELY(!kernel.isValid())) {
    reportError(k);
  }
  return kernel;
}
```
The difference is that instead of checking a bunch of places to find the
right kernel to use for an operator, all of the operators are
precomputed into dispatchTable_ itself (so you don't have to consult
anything else at runtime.) OperatorEntry::computeDispatchTableEntry
contains that computation (which is exactly the same as it was before.)
By doing this, we are able to substantially simplify many runtime
components of dispatch.
The diff is fairly large, as there are also some refactors interspersed
with the substantive change:
- I deleted the DispatchTable abstraction, folding it directly into
OperatorEntry. It might make sense to have some sort of DispatchTable
abstraction (if only to let you do operator[] on DispatchKey without
having to cast it to integers first), but I killed DispatchTable to
avoid having to design a new abstraction; the old abstraction wasn't
appropriate for the new algorithm.
- I renamed OperatorEntry::KernelEntry to AnnotatedKernel, and use it
to store backend fallbacks as well as regular kernel registrations
(this improves error messages when you incorrectly register a backend
fallback twice).
- I moved schema_ and debug_ into an AnnotatedSchema type, to make the
invariant clearer that these are set together, or not at all.
- I moved catch-all kernels out of kernels_ into its own property
(undoing a refactor I did before). The main reason I did this was
because our intended future state is to not have a single catch-all,
but rather possibly multiple catch-alls which fill-in different
portions of the dispatch table. This may change some more in
the future: if we allow registrations for multiple types of
catch alls, we will need a NEW data type (representing bundles
of dispatch keys) which can represent this case, or perhaps
overload DispatchKey to also record these types.
The key changes for precomputation:
- OperatorEntry::updateDispatchTable_ is now updated to fill in the
entry at a DispatchKey, considering both kernels (what it did
before) as well as catch-all and backend fallback. There is also
OperatorEntry::updateDispatchTableFull_ which will update the
entire dispatch table (which is necessary when someone sets a
catch-all kernel). OperatorEntry::computeDispatchTableEntry
holds the canonical algorithm specifying how we decide what
function will handle a dispatch key for the operator.
- Because dispatch table entry computation requires knowledge of
what backend fallbacks are (which is recorded in Dispatcher,
not OperatorEntry), several functions on OperatorEntry now
take Dispatcher as an argument so they can query this information.
- I modified the manual boxing wrapper invariant: previously, kernels
stored in kernels_ did NOT have manual boxing wrappers and this
was maintained by DispatchTable. Now, we just ALWAYS maintain
manual boxing wrappers for all KernelFunctions we store.
- DispatchKeyExtractor is greatly simplified: we only need to maintain
a single per-operator bitmask of what entries are fallthrough
(we don't need the global bitmask anymore).
- Introduced a new debugging 'dumpComputedTable' method, which prints
out the computed dispatch table, and how we computed it to be some way.
This was helpful for debugging cases when the dispatch table and
the canonical metadata were not in sync.
Things that I didn't do but would be worth doing at some point:
- I really wanted to get rid of the C10_UNLIKELY branch for
whether or not the KernelFunction is valid, but it looks like
I cannot easily do this while maintaining good error messages.
In principle, I could always populate a KernelFunction which
errors, but the KernelFunction needs to know what the dispatch
key that is missing is (this is not passed in from the
calling convention). Actually, it might be possible to do
something with functors, but I didn't do it here.
- If we are going to get serious about catchalls for subsets of
operators, we will need to design a new API for them. This diff
is agnostic to this question; we don't change public API at all.
- Precomputation opens up the possibility of subsuming DispatchStub
by querying CPU capability when filling in the dispatch table.
This is not implemented yet. (There is also a mild blocker here,
which is that DispatchStub is also used to share TensorIterator
configuration, and this cannot be directly supported by the
regular Dispatcher.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22236352
Pulled By: ezyang
fbshipit-source-id: d6d90f267078451816b1899afc3f79737b4e128c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40469
- The old testing interface C._dispatch_import was based off the old
c10::import variation, which meant the API lined up in a strange
way with the actual torch/library.h. This diff reduces the
differences by letting you program the Library constructor directly.
- Using this newfound flexibility, we add a test for backend fallbacks
from Python; specifically testing that we disallow registering a
backend fallback twice.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22236351
Pulled By: ezyang
fbshipit-source-id: f8365e3033e9410c7e6eaf9f78aa32e1f7d55833
Summary:
Remove `-std=c++14` flag from `utils.cmake`, since PyTorch C++ API can be invoked by any compiler compliant with C++14 standard or later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40510
Differential Revision: D22253313
Pulled By: malfet
fbshipit-source-id: ff731525868b251c27928fc98b0724080ead9be2
Summary:
1. Modularize some bzl files to break circular buck load
2. Use query-based on instrumentation_tests
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: kwanmacher
Differential Revision: D22188728
fbshipit-source-id: affbabd333c51c8b1549af6602c6bb79fabb7236
Summary:
std::complex is gone. We are now using c10::complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39830
Differential Revision: D22252066
Pulled By: malfet
fbshipit-source-id: cdd5bb03ec66825d82177d609cbcf0738922dba0
Summary:
edit: apparently we hardcode a lot more versions than I would've anticipated.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40519
Differential Revision: D22221280
Pulled By: seemethere
fbshipit-source-id: ba15a910a6755ec08c10f7783ed72b1e06e6b570
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40596
Previously the fusion patterns for {add/mul}_scalar were inconsistent, since the op pattern
produces a non-quantized tensor while the op replacement graph produces a quantized tensor
Test Plan: Imported from OSS
Differential Revision: D22251072
fbshipit-source-id: e16eb92cf6611578cca1ed8ebde961f8d0610137
Summary: Add a Python test file with image files, with the input image being p.jpg. Test the quality difference between the raw image and the decoded image
Test Plan:
Parsing buck files: finished in 1.5 sec
Building: finished in 6.4 sec (100%) 10241/10241 jobs, 2 updated
Total time: 8.0 sec
More details at https://www.internalfb.com/intern/buck/build/387cb1c1-2902-4f90-ae9f-83fb6d473487
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 93e6ef88-ec68-41cb-9de7-7868a14e6d65
Trace available for this run at /tmp/tpx-20200623-055836.283269/trace.log
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/4222124679431330
✓ ListingSuccess: caffe2/test:test_bundled_images - main (18.865)
✓ Pass: caffe2/test:test_bundled_images - test_single_tensors (test_bundled_images.TestBundledInputs) (18.060)
✓ Pass: caffe2/test:test_bundled_images - main (18.060)
Summary
Pass: 2
ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4222124679431330
Reviewed By: dreiss
Differential Revision: D22046611
fbshipit-source-id: fabc604269a5a4d8a37135ce776200da2794a252
Summary:
Related to https://github.com/pytorch/pytorch/issues/40397
Inspired by ezyang's comment at https://github.com/pytorch/pytorch/issues/40397#issuecomment-648233001, this PR attempts to leverage `__all__` to explicitly export private functions from `_VariableFunctions.pyi` in order to make `mypy` aware of them after:
```
if False:
    from torch._C._VariableFunctions import *
```
The generation of the `__all__` template variable excludes some items from `unsorted_function_hints`, as those without hints end up not being explicitly included in the `.pyi` file: I erred on the side of caution and opted for keeping `__all__` consistent with the definitions inside the file. Additionally, added some pretty-printing to avoid an extremely long line.
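A minimal sketch of the mechanism (the stub contents below are illustrative, not the generated file): mypy only re-exports names from a `from module import *` that appear in that module's `__all__`, so private (underscore-prefixed) functions must be listed there explicitly to become visible.
```python
# Illustrative excerpt of a generated _VariableFunctions.pyi-style stub.
from torch import Tensor

__all__ = [
    "add",
    "_some_private_fn",  # underscore names are exported only if listed here
]

def add(input: Tensor, other: Tensor) -> Tensor: ...
def _some_private_fn(input: Tensor) -> Tensor: ...
```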
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40499
Differential Revision: D22240716
Pulled By: ezyang
fbshipit-source-id: 77718752577a82b1e8715e666a8a2118a9d3a1cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40528
Previously, an assignment like `self.foo : List[int] = []` would ignore
the type hint.
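A minimal sketch of the case this fixes (the module contents are illustrative):
```python
import torch
from typing import List

class Tracker(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Previously this List[int] annotation was silently dropped when
        # scripting; now the attribute is typed as List[int] as written.
        self.history: List[int] = []

    def forward(self, x: int) -> List[int]:
        self.history.append(x)
        return self.history

m = torch.jit.script(Tracker())
print(m(3))  # [3]
```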
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22222927
Pulled By: suo
fbshipit-source-id: b0af19b87c6fbe0670d06b55f2002a783d00549d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40076
Pull Request resolved: https://github.com/pytorch/glow/pull/4606
[PyPer][quant] Add quantized embedding operators to OSS.
This is the first step in supporting Graph Mode Quantization for EmbeddingBag.
At a high level, the next steps would be
a) Implementation of Embedding prepack/unpack operators,
b) Implementation of torch.nn.quantized.dynamic.EmbeddingBag Module,
c) Implementation of torch.nn.quantized.EmbeddingBag Module,
d) Implementation (modification) of IR passes to support graph quantization of EmbeddingBag module.
More in-depth details regarding each step will be in the follow-up diffs. Consider this an initial diff that moves operators to the respective places required for us to proceed.
Test Plan: ```buck test mode/no-gpu caffe2/test:quantization -- --stress-runs 100 test_embedding_bag```
Reviewed By: supriyar
Differential Revision: D21949828
fbshipit-source-id: cad5ed0a855db7583bddb1d93e2da398c128024a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40554
Extract a sublist of `libtorch_python_cuda_sources` named `libtorch_python_cuda_core_sources`, and use it to replace the list with the same content in `CMakeLists.txt`.
This change keeps CMakeLists.txt and the Bazel build consistent.
Test Plan: CI
Reviewed By: malfet
Differential Revision: D22223207
fbshipit-source-id: 2bde3c42a0b2d60d689581561075df4ef52ab694
Summary:
This code path is used to read tensor bodies, so we need it to respect
alignment and padding requirements.
Test Plan: Ran an internal test that was failing.
Reviewed By: zdevito
Differential Revision: D22225622
fbshipit-source-id: f2126727f96616366850642045ab9704f3885824
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40518
I overlooked this in the initial vmap frontend api PR. Right now we
want to restrict vmap to taking in functions that only return Tensors.
A function that only return tensors can look like one of the following:
```
def fn1(x):
    ...
    return y

def fn2(x):
    ...
    return y, z
```
fn1 returns a Tensor, while fn2 returns a tuple of Tensors. So we add a
check that the output of the function passed to vmap returns either a
single tensor or a tuple of tensors.
NB: These checks allow passing a function that returns a tuple with a
single-element tensor from vmap. That seems OK to me.
Test Plan: - `python test/test_vmap.py -v`
Differential Revision: D22216166
Pulled By: zou3519
fbshipit-source-id: a92215e9c26f6138db6b10ba81ab0c2c2c030929
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40517
This is necessary for implementing the vmap frontend API's out_dims
functionality.
Test Plan:
- `./build/bin/vmap_test`. The vmap Python API can't accept inputs that
aren't Tensors right now. There are workarounds for that (use a
lambda) but that doesn't look too nice. In the future we'll test all
batching rules in Python.
Differential Revision: D22216168
Pulled By: zou3519
fbshipit-source-id: b6ef552f116fddc433e242c1594059b9d2fe1ce4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40455
These don't need to be implemented right now but are useful later down
the line. I thought I would use these in implementing vmap's `out_dims`
functionality, but it turns out they weren't necessary. Since the code
exists and is useful anyways, I am leaving this PR here.
Test Plan:
- `./build/bin/vmap_test`. We could test this using the vmap frontend API,
but there is the catch that vmap cannot directly take integers right
now (all inputs passed to vmap must be Tensors at the moment). It's
possible to hack around that by declaring lambdas that take in a single
tensor argument, but those don't look nice.
Differential Revision: D22216167
Pulled By: zou3519
fbshipit-source-id: 1a010f5d7784845cca19339d37d6467f5b987c32
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40549
Currently we don't check whether %weight_t is produced by `aten::t`; this will fuse some `matmul`/`addmm` calls that are
not 2d into `aten::linear`, which is incorrect
Test Plan: Imported from OSS
Differential Revision: D22225921
fbshipit-source-id: 9723e82fdbac6d8e1a7ade22f3a9791321ab12b6
Summary: `-Wrange-loop-analysis` is turned on by default for clang 10 (see https://reviews.llvm.org/D73834). This fixes a warning that's found with that.
Test Plan: Build with clang 10 and check there are no `range-loop-analysis` warnings.
Reviewed By: yinghai
Differential Revision: D22207072
fbshipit-source-id: 858ba8a36c653071eab961cb891ce945faf0fa87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40504
Make the CUDA mem leak test not flaky
Test Plan: python test/test_profiler.py
Differential Revision: D22215527
Pulled By: ilia-cher
fbshipit-source-id: 5f1051896342ac50cd3a21ea86ce7487b5f82a19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40533
These ops are required by the demucs denoiser model
Test Plan: build
Reviewed By: kaustubh-kp, linbinyu
Differential Revision: D22216217
fbshipit-source-id: f300ac246fe3a7a6566a70bb89858770af68a90c
Summary:
These were changes that had to be made in the `release/1.6` branch in order to get backups to work.
They should be brought to the master branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40515
Differential Revision: D22221308
Pulled By: seemethere
fbshipit-source-id: 24e2a0196a8e775fe324a383c8f0c681118b741b
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38716, fixes https://github.com/pytorch/pytorch/issues/37234
This algorithm does the summation along a single axis with multiple "levels" of accumulator, each of which is designed to hold the sum of an order of magnitude more values than the previous.
e.g. if there are 2^16 elements, the first level will hold the sum of 2^4 elements, and so on in increasing powers of 2: 2^4, 2^8, 2^12 and finally 2^16.
This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.
WIP to write a vectorized version.
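A minimal Python sketch of the multi-level accumulation described above (illustrative only; the level width and dtype here are arbitrary, and the real implementation is vectorized C++):
```python
import numpy as np

def leveled_sum(values, level_width=16):
    """Sum float32 values with one accumulator per "level"; each level carries
    into the next after level_width partial sums, so any single addition only
    combines operands of comparable magnitude."""
    sums = [np.float32(0.0)]
    counts = [0]
    for v in values:
        sums[0] += np.float32(v)
        counts[0] += 1
        lvl = 0
        while counts[lvl] == level_width:   # carry into the next level up
            if lvl + 1 == len(sums):
                sums.append(np.float32(0.0))
                counts.append(0)
            sums[lvl + 1] += sums[lvl]
            counts[lvl + 1] += 1
            sums[lvl] = np.float32(0.0)
            counts[lvl] = 0
            lvl += 1
    total = np.float32(0.0)
    for s in sums:                          # fold leftovers, smallest level first
        total += s
    return total

vals = np.full(2**16, 0.1, dtype=np.float32)
print(leveled_sum(vals))                    # close to the float64 reference below
print(np.float32(vals.sum(dtype=np.float64)))
```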
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39516
Reviewed By: ezyang
Differential Revision: D22106251
Pulled By: ngimel
fbshipit-source-id: b56de4773292439dbda62b91f44ff37715850ae9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40495
As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.
1) Rank 0 would go into DDP construction, hold GIL and wait for broadcast in
DDP construction.
2) Rank 3 is a little slower and performs an RRef fetch call before the DDP
construction.
3) The RRef fetch call is done on Rank 0 and tries to acquire GIL.
4) We now have a deadlock since Rank 0 is waiting for Rank 3 to enter the
collective and Rank 3 is waiting for Rank 0 to release GIL.
ghstack-source-id: 106534442
Test Plan:
1) Ran ddp_under_dist_autograd 500 times.
2) waitforbuildbot
Differential Revision: D22205180
fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40525
Move `USE_CUDNN` define under `USE_CUDA` guard, add `cuda/shared/cudnn.cpp` to filelist if either USE_ROCM or USE_CUDNN is set.
This is a prep change for PyTorch CUDA src filelist unification change.
Test Plan: CI
Differential Revision: D22214899
fbshipit-source-id: b71b32fc603783b41cdef0e7fab2cc9cbe750a4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40424
dictConstruct should preserve the order of its inputs
Test Plan: Imported from OSS
Differential Revision: D22202690
Pulled By: wanchaol
fbshipit-source-id: c313b531b7fa49e6f3486396d61bfc5d6400cd01
Summary:
https://github.com/pytorch/pytorch/issues/24697
VitalyFedyunin
glaringlee
Test script:
```Python
import timeit
setup_ones = """
import torch
a = torch.ones(({n}, {n}), dtype={dtype})
b = torch.ones(({n}, {n}), dtype={dtype})
"""
for n, t in [(1000, 10000), (2000, 10000)]:
    for dtype in ('torch.bool', 'torch.int', 'torch.long', 'torch.bfloat16', 'torch.float', 'torch.double'):
    #for dtype in ('torch.bool', 'torch.int', 'torch.long', 'torch.float', 'torch.double'):
        print('torch.ones(({n}, {n})) equal for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
        print(timeit.timeit(stmt='torch.equal(a, b)', setup=setup_ones.format(n=n, dtype=dtype), number=t))
setup_rand = """
import torch
a = torch.rand(({n}, {n}), dtype={dtype})
b = a.clone()
"""
for n, t in [(1000, 10000), (2000, 10000)]:
for dtype in ('torch.float', 'torch.double'):
print('torch.rand(({n}, {n})) for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
print(timeit.timeit(stmt='torch.equal(a, b)', setup=setup_rand.format(n=n, dtype=dtype), number=t))
setup_non_contiguous = """
import torch
a = torch.rand(({n}, {n}), dtype={dtype})
a2 = a[:, 500:]
a3 = a2.clone()
torch.equal(a2, a3)
"""
for n, t in [(1000, 10000), (2000, 10000)]:
for dtype in ('torch.float', 'torch.double'):
print('non_contiguous torch.rand(({n}, {n})) for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
print(timeit.timeit(stmt='torch.equal(a2, a3)', setup=setup_non_contiguous.format(n=n, dtype=dtype), number=t))
setup_not_equal = """
import torch
a = torch.rand(({n}, {n}), dtype={dtype})
b = torch.rand(({n}, {n}), dtype={dtype})
torch.equal(a, b)
"""
for n, t in [(1000, 10000), (2000, 10000)]:
for dtype in ('torch.float', 'torch.double'):
print('not equal torch.rand(({n}, {n})) for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
print(timeit.timeit(stmt='torch.equal(a, b)', setup=setup_not_equal.format(n=n, dtype=dtype), number=t))
```
TH
```
torch.ones((1000, 1000)) equal for 10000 times torch.bool
1.8391206220258027
torch.ones((1000, 1000)) equal for 10000 times torch.int
1.8877864250680432
torch.ones((1000, 1000)) equal for 10000 times torch.long
1.938108820002526
torch.ones((1000, 1000)) equal for 10000 times torch.bfloat16
3.184849138953723
torch.ones((1000, 1000)) equal for 10000 times torch.float
1.8825413499725983
torch.ones((1000, 1000)) equal for 10000 times torch.double
2.7266416549682617
torch.ones((2000, 2000)) equal for 10000 times torch.bool
7.227149627986364
torch.ones((2000, 2000)) equal for 10000 times torch.int
7.76215292501729
torch.ones((2000, 2000)) equal for 10000 times torch.long
9.631909006042406
torch.ones((2000, 2000)) equal for 10000 times torch.bfloat16
8.097328286035918
torch.ones((2000, 2000)) equal for 10000 times torch.float
5.5739822529722005
torch.ones((2000, 2000)) equal for 10000 times torch.double
8.444009944912978
torch.rand((1000, 1000)) for 10000 times torch.float
1.168096570065245
torch.rand((1000, 1000)) for 10000 times torch.double
1.6577326939441264
torch.rand((2000, 2000)) for 10000 times torch.float
5.49395391496364
torch.rand((2000, 2000)) for 10000 times torch.double
8.507486199960113
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.float
6.074504268006422
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.double
6.1426916810451075
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.float
37.501055537955835
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.double
44.6880351039581
not equal torch.rand((1000, 1000)) for 10000 times torch.float
0.029356416082009673
not equal torch.rand((1000, 1000)) for 10000 times torch.double
0.025421109050512314
not equal torch.rand((2000, 2000)) for 10000 times torch.float
0.026333761983551085
not equal torch.rand((2000, 2000)) for 10000 times torch.double
0.02748022007290274
```
ATen
```
torch.ones((1000, 1000)) equal for 10000 times torch.bool
0.7961567062884569
torch.ones((1000, 1000)) equal for 10000 times torch.int
0.49172434909269214
torch.ones((1000, 1000)) equal for 10000 times torch.long
0.9459248608909547
torch.ones((1000, 1000)) equal for 10000 times torch.bfloat16
2.0877483217045665
torch.ones((1000, 1000)) equal for 10000 times torch.float
0.606857153121382
torch.ones((1000, 1000)) equal for 10000 times torch.double
1.1388208279386163
torch.ones((2000, 2000)) equal for 10000 times torch.bool
2.0329296849668026
torch.ones((2000, 2000)) equal for 10000 times torch.int
3.534358019940555
torch.ones((2000, 2000)) equal for 10000 times torch.long
8.19841272290796
torch.ones((2000, 2000)) equal for 10000 times torch.bfloat16
6.595649406313896
torch.ones((2000, 2000)) equal for 10000 times torch.float
4.193911510054022
torch.ones((2000, 2000)) equal for 10000 times torch.double
7.931309659034014
torch.rand((1000, 1000)) for 10000 times torch.float
0.8877940969541669
torch.rand((1000, 1000)) for 10000 times torch.double
1.4142901846207678
torch.rand((2000, 2000)) for 10000 times torch.float
4.010025603231043
torch.rand((2000, 2000)) for 10000 times torch.double
8.126411964651197
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.float
0.602473056409508
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.double
0.6784545010887086
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.float
3.0991827426478267
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.double
5.719010795000941
not equal torch.rand((1000, 1000)) for 10000 times torch.float
0.046060710679739714
not equal torch.rand((1000, 1000)) for 10000 times torch.double
0.036034489050507545
not equal torch.rand((2000, 2000)) for 10000 times torch.float
0.03686975734308362
not equal torch.rand((2000, 2000)) for 10000 times torch.double
0.04189508780837059
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33286
Differential Revision: D22211962
Pulled By: glaringlee
fbshipit-source-id: a5c48f328432c1996f28e19bc75cb495fb689f6b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40405
This adds a finishAndThrow function that completes the work object,
sets an exception if one is provided by the user, and throws an exception (if
it is already set or passed by the caller). This is now done by grabbing the
lock just once and simplifies the wait functions in ProcessGroupGloo.
ghstack-source-id: 106516114
Test Plan: CI
Differential Revision: D22174890
fbshipit-source-id: ea74702216c4328187c8d193bf39e1fea43847f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40404
Adds docs to the finish function in ProcessGroup::Work. It's better to have some documentation around these functions since we have some PRs with API changes/optimizations for these work-level functions here and in the subclasses.
ghstack-source-id: 106381736
Test Plan: CI (Docs change only)
Differential Revision: D22174891
fbshipit-source-id: 7901ea3b35caf6f69f37178ca574104d3412de28
Summary:
PyTorch should stop polluting the global namespace with symbols such as `ERROR`, `WARNING`, and `INFO`.
Since `logging_is_not_google_glog.h` is a C++ header, define the severity levels in a namespace and add a `GLOG_` prefix to match the unshortened glog severity levels.
Change the `LOG` and `LOG_IF` macros to use the prefixed, namespaced severity levels.
Closes https://github.com/pytorch/pytorch/issues/40083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40491
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D22210925
Pulled By: malfet
fbshipit-source-id: 0ec1181a53baa8bca2f526f245e398582304aeab
Summary:
BC NOTE:
This change makes it so modules saved with torch.jit.save in PyTorch 1.6 can be loaded by previous versions of PyTorch unless they use torch.div or (soon) torch.full. It also lets tensors saved using torch.save be loaded by previous versions. So this is the opposite of BC-breaking, but I'm using that label to highlight this issue since we don't have a "BC-improving" label.
PR NOTE:
When an operator's semantics change in PyTorch we want to do two things:
1) Preserve the semantics of older serialized Torchscript programs that use the operator
2) Ensure the new semantics are respected
Historically, this meant writing a Versioned Symbol that would remap older versions of the operator into current PyTorch code (1), and bumping the produced file format version (2). Unfortunately, bumping the produced file format version is a nuclear option for ensuring semantics are respected, since it also prevents older versions of PyTorch from loading anything (even tensors!) from newer versions.
Dynamic versioning addresses the nuclear consequences of bumping the produced file format version by only bumping it when necessary. That is, when an operator with changed semantics is detected in the serialized Torchscript. This will prevent Torchscript programs that use the changed operator from loading on earlier versions of PyTorch, as desired, but will have no impact on programs that don't use the changed operator.
Note that this change only applies when using torch.jit.save and torch.jit.load. torch.save pickles the given object using pickle (by default), which serializes a function by reference to its Python definition directly.
No new tests for this behavior are added since the existing tests for versioned division in test_save_load already validate that models with div are loaded correctly at version 4.
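A toy sketch of the decision rule described above (not the actual serializer code; the operator-to-version mapping here is hypothetical, except that versioned div models are noted to load at version 4):
```Python
BASE_VERSION = 3
# Hypothetical mapping from changed operators to the minimum file format
# version that encodes their new semantics.
VERSION_BUMPS = {"aten::div": 4, "aten::full": 5}

def produced_file_format_version(ops_used):
    # Bump only when a changed operator actually appears in the program,
    # so unaffected programs keep the lower, widely loadable version.
    bumps = [VERSION_BUMPS[op] for op in ops_used if op in VERSION_BUMPS]
    return max([BASE_VERSION] + bumps)

print(produced_file_format_version({"aten::add"}))               # 3
print(produced_file_format_version({"aten::add", "aten::div"}))  # 4
```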
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40279
Reviewed By: dzhulgakov
Differential Revision: D22168291
Pulled By: mruberry
fbshipit-source-id: e71d6380e727e25123c7eedf6d80e5d7f1fe9f95
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40115
Closes https://github.com/pytorch/pytorch/issues/37790
Closes https://github.com/pytorch/pytorch/issues/37944
A user may wish to run DDP's forward + backward step under a non-default CUDA stream, such as one created with `torch.cuda.Stream()` and entered via `with torch.cuda.stream(stream)`. In this case, the user is responsible for synchronizing events on this stream with other streams used in the program (per the documentation at https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics), but currently DDP has a bug which causes DDP under non-default streams to fail.
If a user does the following:
```
model = DDP(...)
loss = model(input).sum()
loss.backward()
grad = model.module.weight.grad
average = grad.clone()
dist.all_reduce(average)  # all_reduce works in place, so reduce a copy to compare with grad
```
There is a chance that `average` and `grad` will not be equal. This is because the CUDA kernels corresponding to the `all_reduce` call may run before `loss.backward()`'s kernels are finished. Specifically, in DDP we copy the allreduced gradients back to the model parameter gradients in an autograd engine callback, but this callback runs on the default stream. Note that this can also be fixed by the application synchronizing on the current stream, although this should not be expected, since the application is not using the current stream at all.
This PR fixes the issue by passing the current stream into DDP's callback.
Tested by adding a UT `test_DistributedDataParallel_non_default_stream` that fails without this PR
ghstack-source-id: 106481208
Differential Revision: D22073353
fbshipit-source-id: 70da9b44e5f546ff8b6d8c42022ecc846dff033e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40494
Resubmit the diff because D22124313 (1ec4337b7d) was reverted due to CI test failures.
Added int8_gen_quant_params.cc to CMakeLists.txt to fix the CI failures.
Test Plan: buck test caffe2/caffe2/quantization/server:
Reviewed By: hx89
Differential Revision: D22204244
fbshipit-source-id: a2c8b668f199cc5b0c5894086f554f7c459b1ad7
Summary:
Currently, even if USE_OPENMP is turned off, ATEN_THREADING can still use OpenMP. This commit fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40146
Reviewed By: ezyang
Differential Revision: D22208758
Pulled By: pbelevich
fbshipit-source-id: 0866c9bb9b3b5b99d586aed176eb0fbe177efa4a
Summary:
Should close https://github.com/pytorch/pytorch/issues/35810.
I decided to keep sparse handling on the Python side for clarity, although it could be moved to the C++ side (into `_amp_non_finite_check_and_unscale_`) without much trouble.
For non-fp16 sparse grads the logic is simple (call `_amp_non_finite_check_and_unscale_` on `grad._values()` instead of `grad` itself). At least I hope it's that easy.
For fp16 sparse grads, it's trickier. Sparse tensors can be uncoalesced. From the [Note](https://pytorch.org/docs/master/sparse.html#torch.sparse.FloatTensor):
> Our sparse tensor format permits uncoalesced sparse tensors, where there may be duplicate coordinates in the indices; in this case, the interpretation is that the value at that index is the sum of all duplicate value entries.
An uncoalesced scaled fp16 grad may have values at duplicate coordinates that are all finite but large, such that adding them to make the coalesced version WOULD cause overflows.** If I checked `_values()` on the uncoalesced version, it might not report overflows, but I think it should.
So, if the grad is sparse, fp16, and uncoalesced, I still call `_amp_non_finite_check_and_unscale_` to unscale `grad._values()` in-place, but I also double-check the coalesced version by calling a second `_amp_non_finite_check_and_unscale_` on `grad.coalesce()._values()`. `coalesce()` is out-of-place, so this call doesn't redundantly affect `grad._values()`, but it does have the power to populate the same `found_inf` tensor. The `is_coalesced()` check and `coalesce()` probably aren't great for performance, but if someone needs a giant embedding table in FP16, they're better than nothing and memorywise, they'll only create a copy of nnz gradient values+indices, which is still way better than changing the whole table to FP32.
An `unscale` variant with liberty to create unscaled grads out-of-place, and replace `param.grad` instead of writing through it, could get away with just one `_amp_non_finite_check_and_unscale_`. It could say `coalesced = grad.coalesced()`, do only the stronger `_amp_non_finite_check_and_unscale_` on `coalesced._values()`, and set `param.grad = coalesced`. I could even avoid replacing `param.grad` itself by going one level deeper and setting `param.grad`'s indices and values to `coalesced`'s, but that seems brittle and still isn't truly "in place".
** you could whiteboard an uncoalesced fp32 grad with the same property, but fp32's range is big enough that I don't think it's realistic.
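A toy illustration of the overflow concern (a plain dense fp16 example, not the sparse path itself): two individually finite fp16 values can sum past fp16's ~65504 maximum, which is exactly what coalescing duplicate coordinates does.
```Python
import torch

v = torch.tensor([60000.0, 60000.0], dtype=torch.float16)
print(torch.isinf(v).any())  # tensor(False): each entry is finite in fp16
print(torch.isinf(v.sum()))  # tensor(True): the coalesced-style sum overflows fp16
```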
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36786
Reviewed By: ezyang
Differential Revision: D22202832
Pulled By: ngimel
fbshipit-source-id: b70961a4b6fc3a4c1882f65e7f34874066435735
Summary:
Partial support for slicing of Sequential containers.
- works around missing Sequential slice functionality
by converting to tuple
- only supports iteration of resulting tuple values,
not direct call() on the sliced sequential
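A hedged sketch of what this enables (the module structure is made up for illustration): iterating over a slice of an `nn.Sequential` inside a scripted module.
```Python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))

    def forward(self, x):
        # Iterate the sliced Sequential; calling the slice directly is not supported.
        for layer in self.layers[1:]:
            x = layer(x)
        return x

scripted = torch.jit.script(Net())
print(scripted(torch.randn(2, 4)).shape)
```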
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40445
Differential Revision: D22192469
Pulled By: wconstab
fbshipit-source-id: 61c85deda2d58f6e3bea2f1fa1d5d5dde568b9b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40172
This PR introduces the initial vmap frontend API. It has the following
limitations that we can resolve in the future:
- the inputs must be a flat list of tensors
- the outputs must be a flat list of tensors
- in_dims = 0 (so we always vmap over dim 0 of input tensors)
- out_dims = 0 (so the returned tensors have their vmap dim appear at
dim 0)
- Coverage limited to operations that have batching rules implemented
(torch.mul, torch.sum, torch.expand).
There are some other semantic limitations (like not being able to handle
mutation, aside from pytorch operations that perform mutation) that will
be documented in the future.
I wanted to introduce the API before adding a slow fallback for the
coverage so that we can test future batching rules (and coverage) via
the python API to avoid verbosity in C++-land.
The way vmap works is that `vmap(func)(inputs)` wraps all Tensor inputs
to be batched in BatchedTensors, sends those into func, and then unwraps
the output BatchedTensors. Operations on BatchedTensors perform the batched
operations that the user is asking for. When performing nested vmaps,
each nested vmap adds a batch dimension upon entry and removes a batch
dimension on exit.
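A hedged example of the frontend described here, restricted to ops listed as having batching rules (torch.mul, torch.sum); the API's location has moved around since, so treat `torch.vmap` as the name at the time of this PR.
```Python
import torch

def scaled_sum(x, y):
    # Only uses ops with batching rules listed above: torch.mul and torch.sum.
    return torch.mul(x, y).sum()

x = torch.randn(5, 3)
y = torch.randn(5, 3)

# vmap over dim 0 of every input; outputs carry the vmapped dim at dim 0.
out = torch.vmap(scaled_sum)(x, y)
print(out.shape)  # torch.Size([5])
```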
Coming up in the near future:
- Support for non-zero in_dims and out_dims
- docstring for vmap
- slow fallback for operators that do not have a batching rule
implemented.
Test Plan: - `pytest test/test_vmap.py -v`
Differential Revision: D22102076
Pulled By: zou3519
fbshipit-source-id: b119f0a8a3a3b1717c92dbbd180dfb1618295563
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40171
It checks that all of the bdims in BatchedTensorImpl are sorted in
order of ascending `level`.
Test Plan: - Check that nothing breaks in `./build/bin/vmap_test`
Differential Revision: D22102077
Pulled By: zou3519
fbshipit-source-id: 094b7abc6c65208437f0f51a0d0083091912decc
Summary:
This is just a minor doc fix:
the `MarginRankingLoss` takes 2 input samples `x1` and `x2`, not just `x`
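For reference, a minimal call matching the corrected signature (values chosen arbitrarily):
```Python
import torch

loss_fn = torch.nn.MarginRankingLoss(margin=0.5)
x1 = torch.randn(3)
x2 = torch.randn(3)
y = torch.tensor([1.0, -1.0, 1.0])  # y indicates which of x1/x2 should rank higher
print(loss_fn(x1, x2, y))
```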
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40285
Reviewed By: ezyang
Differential Revision: D22195069
Pulled By: zou3519
fbshipit-source-id: 909f491c94dca329a37216524f4088e9096e0bc6
Summary: [WIP] Logit Fake16 Op
Test Plan: [WIP] Tests will be enabled in test_op_nnpi_fp16.py file.
Reviewed By: hyuen
Differential Revision: D22109329
fbshipit-source-id: fd73850c3ec61375ff5bbf0ef5460868a874fbf3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40475
As title
ghstack-source-id: 106474870
Test Plan: CI
Differential Revision: D22200640
fbshipit-source-id: 1f4c7bbf54be8c4187c9338fefdf14b501597d98
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40324
1) avoid the use of `item`; 2) bypass im2col for 1x1 conv
Test Plan:
unit test and perf benchmark to show improvement
```
from timeit import Timer

import numpy as np
import torch

num = 50
N = 1
C = 512
H = 4
W = 4
M = 512
kernel_h = 1
kernel_w = 1
stride_h = 1
stride_w = 1
padding_h = 0
padding_w = 0

X_np = np.random.randn(N, C, H, W).astype(np.float32)
W_np = np.random.randn(M, C, kernel_h, kernel_w).astype(np.float32)
X = torch.from_numpy(X_np)

conv2d_pt = torch.nn.Conv2d(
    C, M, (kernel_h, kernel_w), stride=(stride_h, stride_w),
    padding=(padding_h, padding_w), groups=1, bias=True)

class ConvNet(torch.nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv2d = conv2d_pt

    def forward(self, x):
        return self.conv2d(x)

model = ConvNet()

def pt_forward():
    # with torch.autograd.profiler.profile(record_shapes=True) as prof:
    model(X)
    # print(prof.key_averages().table(sort_by="self_cpu_time_total"))

torch._C._set_mkldnn_enabled(False)
t = Timer("pt_forward()", "from __main__ import pt_forward, X")
```
Before the optimization:
pt time = 5.841153813526034
After the optimization:
pt time = 4.513134760782123
Differential Revision: D22149067
fbshipit-source-id: 538d9eea5b729e6c3da79444bde1784bde828876
Summary:
BC-breaking NOTE:
In PyTorch 1.6, bool and integral fill values given to torch.full must set the dtype or out keyword arguments. In prior versions of PyTorch these fill values would produce float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for torch.full has been updated to reflect this.
PR NOTE:
This PR causes torch.full to throw a runtime error when it would have inferred a float dtype by being given a boolean or integer value. A versioned symbol for torch.full is added to preserve the behavior of already serialized Torchscript programs. Existing tests for this behavior being deprecated have been updated to reflect it now being unsupported, and a couple new tests have been added to validate the versioned symbol behavior. The documentation of torch.full has also been updated to reflect this change.
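A short illustration of the new behavior (error message wording not reproduced, since it is not shown here):
```Python
import torch

print(torch.full((2, 2), 1, dtype=torch.long))     # explicit dtype: returns a long tensor
print(torch.full((2, 2), True, dtype=torch.bool))  # explicit dtype: returns a bool tensor

# Without a dtype (or out=), an integral or bool fill value now raises a
# RuntimeError instead of silently producing a float tensor:
# torch.full((2, 2), 1)
```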
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40364
Differential Revision: D22176640
Pulled By: mruberry
fbshipit-source-id: b20158ebbcb4f6bf269d05a688bcf4f6c853a965
Summary:
1. Use LoadLibraryEx if available
2. Print more info on error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40365
Differential Revision: D22194974
Pulled By: malfet
fbshipit-source-id: e8309f39d78fd4681de5aa032288882910dff928
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40461
It turned out `:inherited-members:` (see [doc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass)) is not really usable,
because pybind11 generates a docstring that writes the type of `self` as the parent class, `rpc.PyRRef`.
As a workaround, I am pulling the docstrings of the parent class, `PyRRef`, into the subclass, `RRef`, and doing surgery on the docstring generated by pybind11.
{F241283111}
ghstack-source-id: 106472496
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork
buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_rref_str
buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_return_local_rrefs
buck test mode/dev-nosan //caffe2/torch/fb/distributed/model_parallel/tests:test_elastic_averaging -- 'test_elastic_averaging_center \(caffe2\.torch\.fb\.distributed\.model_parallel\.tests\.test_elastic_averaging\.TestElasticAveragingCenter\)'
P134031188
Differential Revision: D7933834
fbshipit-source-id: c03a8a4c9d98888b64492a8caba1591595bfe247
Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.
Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`
Reviewed By: xcheng16
Differential Revision: D22199952
fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5
Summary:
Currently, torchvision annotates `batched_nms` with `torch.jit.script` so the function gets compiled when it is traced and ONNX will work. Unfortunately, this means we are eagerly compiling batched_nms, which fails if torchvision isn't built with `torchvision.ops.nms`. As a result, torchvision doesn't work on torch hub right now.
`_script_if_tracing` could solve our problem here, but right now it does not correctly interact with recursive compilation. This PR fixes that bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40468
Reviewed By: jamesr66a
Differential Revision: D22195771
Pulled By: eellison
fbshipit-source-id: 83022ca0bab6d389a48a478aec03052c9282d2b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40442
Problem:
Nightly builds do not include libtorch headers, while local builds do.
The reason is that on docker images the path is different from the local path when building with `scripts/build_pytorch_android.sh`.
Solution:
Introduce a gradle property to specify it, and add its specification to the gradle build job and the snapshot publishing job, which run on the same docker image.
Test:
ci-all jobs check: https://github.com/pytorch/pytorch/pull/40443
Checked that the gradle build results in headers inside the aar.
Test Plan: Imported from OSS
Differential Revision: D22190955
Pulled By: IvanKobzarev
fbshipit-source-id: 9379458d8ab024ee991ca205a573c21d649e5f8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37243
*** Why ***
As it stands, we have two thread pool solutions concurrently in use in PyTorch mobile: (1) the open source pthreadpool library under third_party, and (2) Caffe2's implementation of pthreadpool under caffe2/utils/threadpool. Since the primary use-case of the latter has been to act as a drop-in replacement for the third party version so as to enable integration and usage from within NNPACK and QNNPACK, Caffe2's implementation is intentionally written to the exact same interface as the third party version.
The original argument in favor of C2's implementation has been improved performance as a result of using spin locks, as opposed to relinquishing the thread's time slot and putting it to sleep - a less expensive operation up to a point. That seems to have given C2's implementation the upper hand in performance, hence justifying the added maintenance complexity, until the third party version improved in parallel surpassing the efficiency of C2's implementation as I have verified in benchmarks. With that advantage gone, there is no reason to continue using C2's implementation in PyTorch mobile either from the perspective of performance or code hygiene. As a matter of fact, there is considerable performance benefit to be had as a result of using the third party version as it currently stands.
This is a tricky change though, mainly because, in order to avoid potential performance regressions (of which I have witnessed none, but in an abundance of caution), we have decided to continue using C2's internal implementation whenever building for Caffe2. Again, this is mainly to avoid potential performance regressions in production C2 use cases, even if, as far as I can tell, doing so results in reduced performance.
So to summarize, today, and as it currently stands, we are using C2's implementation for (1) NNPACK, (2) PyTorch QNNPACK, and (3) ATen parallel_for on mobile builds, while using the third party version of pthreadpool for XNNPACK as XNNPACK does not provide any build options to link against an external implementation unlike NNPACK and QNNPACK do.
The goal of this PR then, is to unify all usage on mobile to the third party implementation both for improved performance and better code hygiene. This applies to PyTorch's use of NNPACK, QNNPACK, XNNPACK, and mobile's implementation of ATen parallel_for, all getting routed to the
exact same third party implementation in this PR.
Considering that NNPACK, QNNPACK, and XNNPACK are not mobile specific, these benefits carry over to non-mobile builds of PyTorch (but not Caffe2) as well. The implementation of ATen parallel_for on non-mobile builds remains unchanged.
*** How ***
This is where things get tricky.
A good deal of the build system complexity in this PR arises from our desire to maintain C2's implementation intact for C2's use.
pthreadpool is a C library with no concept of namespaces, which means two copies of the library cannot exist in the same binary or symbol collisions will occur, violating ODR. This means that somehow, and based on some condition, we must decide on the choice of a pthreadpool implementation. In practice, this has become more complicated as a result of all the possible combinations that USE_NNPACK, USE_QNNPACK, USE_PYTORCH_QNNPACK, USE_XNNPACK, USE_SYSTEM_XNNPACK, USE_SYSTEM_PTHREADPOOL, and other variables can result in. Having said that, I have done my best in this PR to surgically cut through this complexity in a way that minimizes the side effects, considering the significance of the performance we are leaving on the table; yet, as a result of the combinatorial explosion explained above, I cannot guarantee that every single combination will work as expected on the first try. I am heavily relying on CI to find any issues, as local testing can only go so far.
Having said that, this PR provides a simple non mobile-specific C++ thread pool implementation on top of pthreadpool, namely caffe2::PThreadPool that automatically routes to C2's implementation or the third party version depending on the build configuration. This simplifies the logic at the cost of pushing the complexity to the build scripts. From there on, this thread pool is used in aten parallel_for, and NNPACK and family, again, routing all usage of threading to C2 or third party pthreadpool depending on the build configuration.
When it is all said and done, the layering will look like this:
a) aten::parallel_for, uses
b) caffe2::PThreadPool, which uses
c) pthreadpool C API, which delegates to
c-1) third_party implementation of pthreadpool if that's what the build has requested, and the rabbit hole ends here.
c-2) C2's implementation of pthreadpool if that's what the build has requested, which itself delegates to
c-2-1) caffe2::ThreadPool, and the rabbit hole ends here.
NNPACK, and (PyTorch) QNNPACK directly hook into (c). They never go through (b).
Differential Revision: D21232894
Test Plan: Imported from OSS
Reviewed By: dreiss
Pulled By: AshkanAliabadi
fbshipit-source-id: 8b3de86247fbc3a327e811983e082f9d40081354
Summary:
This file should have been renamed as `complex.h`, but unfortunately, it was named as `complex_type.h` due to a name clash with FBCode. Is this still the case and is it easy to resolve the name clash? Maybe related to the comment at https://github.com/pytorch/pytorch/pull/39834#issuecomment-642950012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39885
Differential Revision: D22018575
Pulled By: ezyang
fbshipit-source-id: e237ccedbe2b30c31aca028a5b4c8c063087a30f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40184
Whenever requires_tensor is True, it is also the case that abstract
is true. Thus, it is not necessary to specify requires_tensor.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22187353
Pulled By: ezyang
fbshipit-source-id: d665bb69cffe491bd989495020e1ae32340aa9da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40182
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22187354
Pulled By: ezyang
fbshipit-source-id: 875a6a7837981b60830bd7b1c35d2a3802ed7dd7
Summary:
**Summary**
This commit adds an instance method `_reconstruct` that permits users
to reconstruct a `ScriptModule` from a given C++ `Module` instance.
**Testing**
This commit adds a unit test for `_reconstruct`.
**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/33912.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39979
Differential Revision: D22172323
Pulled By: SplitInfinity
fbshipit-source-id: 9aa6551c422a5a324b822a09cd8d7c660f99ca5c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40379
The current sum operator doesn't support Long; hence, modify the code.
Test Plan: Write a test case
Reviewed By: jspark1105, yinghai
Differential Revision: D21917365
fbshipit-source-id: b37d2c100c70d17d2f89c309e40360ddfab584ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39492
This PR adds use_c10_dispatcher: full to ops taking TensorOptions. To allow this, since the c10 operator library doesn't know about TensorOptions, we need to register the operator kernels as optional<ScalarType>, optional<Device>, optional<Layout>, optional<bool> instead, and also call them this way.
Changes:
Add use_c10_dispatcher: full to those ops
Write hacky_wrapper_for_legacy_signatures which takes an old-style kernel (i.e. one written to take TensorOptions) and creates a wrapper kernel for it that takes the scattered optional<ScalarType>, optional<Device>, optional<Layout>, optional<bool> instead.
Change codegen so that all op registrations are wrapped into hacky_wrapper_for_legacy_signatures. This is added to all ops but is a no-op if the op doesn't take TensorOptions. This allows us in the future to just change a kernel signature from TensorOptions to the scattered version and have it work without having to touch codegen.
Change codegen so that the frontend calls those operators with expanded arguments instead of with a TensorOptions object. This is required because now the kernels are written in this way.
This PR does not remove TensorOptions special cases from codegen, but instead it separates kernels from the codegen/frontend issues. After this, kernels can be worked on separately without having to touch codegen and codegen can be worked on without having to touch kernels.
Codegen diff: P133121032
ghstack-source-id: 106426630
Test Plan: waitforsandcastle
Differential Revision: D21581908
fbshipit-source-id: 6d4a9f526fd70fae40581bf26f3ccf794ce6a89e
Summary:
This is a faster and more idiomatic way of using `itertools.chain`. Instead of computing all the items in the iterable and storing them in memory, they are computed one-by-one and never stored as a huge list. This can save on both runtime and memory space.
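Since the diff itself is not shown here, a generic illustration of the pattern: passing generator expressions to `itertools.chain` instead of pre-built lists.
```Python
import itertools

group_a = range(3)
group_b = range(3, 6)

# Eager: both list comprehensions are fully materialized before chaining.
eager = itertools.chain([x * 2 for x in group_a], [x * 2 for x in group_b])

# Lazy: items are computed one at a time as the chain is consumed.
lazy = itertools.chain((x * 2 for x in group_a), (x * 2 for x in group_b))

print(list(eager) == list(lazy))  # True; only memory/runtime behavior differs
```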
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40156
Reviewed By: ezyang
Differential Revision: D22189038
Pulled By: vincentqb
fbshipit-source-id: 160b2c27f442686821a6ea541e1f48f4a846c186
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40367
If the tensor has no storage, then do not inline it as a constant. This
situation arises when Mkldnn tensors are used.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22158240
Pulled By: bzinodev
fbshipit-source-id: 8d2879044f2429004983a1242d837367b75a9f2a
@@ -178,8 +178,7 @@ CircleCI creates a final yaml file by inlining every <<* segment, so if we were
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor and spin up their own docker. Why this nonsense? It's because we run nvidia-docker for our GPU tests; any code that calls into the CUDA runtime needs to be run on nvidia-docker. To run nvidia-docker you need to install some nvidia packages on the host machine and then call docker with the '--runtime=nvidia' argument. CircleCI doesn't support this, so we have to do it ourselves.
* This is not just a mere inconvenience. **This blocks all of our linux tests from using more than 2 cores.** There is nothing we can do about it but wait for a fix on CircleCI's side. Right now, we only run some smoke tests (some simple imports) on the binaries, but this also affects non-binary test jobs.
* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
@@ -419,8 +418,6 @@ You can build Linux binaries locally easily using docker.
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you're building a CUDA binary then use `nvidia-docker run` instead, see below.
#
# If you know how, add ccache as a volume too and speed up everything
docker run \
-v your/pytorch/repo:/pytorch \
@@ -444,9 +441,7 @@ export DESIRED_CUDA=cpu
**Building CUDA binaries on docker**
To build a CUDA binary you need to use `nvidia-docker run` instead of just `docker run` (or you can manually pass `--runtime=nvidia`). This adds some needed libraries and things to build CUDA stuff.
You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it’s gonna take a long time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.