Summary:
This currently re-implements the dataloader for stateful datasets. Outstanding work:
- Refactor DataLoader and DataLoader2 to have common base classes and only differ in specific pieces of logic,
- Figure out how to not duplicate the `MapDataset` logic for stateful vs. non-stateful
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15096
Differential Revision: D13522043
Pulled By: goldsborough
fbshipit-source-id: 08e461ca51783047f11facc4d27dfa2e4f1e4c2a
Summary:
…done once
This allows a no-op build to work correctly even when BUILD_CAFFE2_OPS is on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14982
Differential Revision: D13413960
Pulled By: zdevito
fbshipit-source-id: 6e5412a8c375af8a47c76f548cdd31cff15f3853
Summary:
This is broken out of https://github.com/pytorch/pytorch/pull/13733/
We want to install the C++ tests so that they are ultimately runnable from that location for Caffe2 tests run from PyTorch builds.
cc pjh5 yf225 anderspapitto
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15000
Reviewed By: pjh5
Differential Revision: D13416253
Pulled By: orionr
fbshipit-source-id: 51280be0a22557a742f90c9f303c58c35cbd4a38
Summary:
1. Changes the prints along the 'rebuild' pathway to respect the '-q' flag of setup.py
A clean rebuild now only prints:
[zdevito@devgpu172.prn2 /data/users/zdevito/pytorch] python setup.py -q rebuild develop
[0/1] Install the project...
-- Install configuration: "RelWithDebInfo"
ninja: no work to do.
ninja: no work to do.
ninja: no work to do.
ninja: no work to do.
ninja: no work to do.
ninja: no work to do.
2. Deletes apparently dead calls to `generate_code`. Now that CMake builds these files,
it appears that it is getting called twice and the second version is never used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14972
Reviewed By: soumith
Differential Revision: D13396330
Pulled By: zdevito
fbshipit-source-id: 83c45143bbc6a6d2c1cfee929291ec059f2b5dc3
Summary:
This has 4 changes
1) propagate USE_SYSTEM_NCCL. Previously it was ignored and cmake always did a FindPackage
2) respect SCCACHE_DISABLE in our caffe2 sccache wrapper for circleci
3) use SCCACHE_DISABLE when building nccl, because it triggers the same bug as when using CCACHE (already tracked in https://github.com/pytorch/pytorch/issues/13362). This was hidden because we weren't respecting USE_SYSTEM_NCCL, and were never building nccl ourselves in CI
4) In one particular CI configuration (caffe2, cuda 8, cudnn 7), force USE_SYSTEM_NCCL=1. Building the bundled nccl triggers a bug in nvlink. I've done some investigation, but this looks like a tricky, preexisting bug, so rather than hold up this diff I'm tracking it separately in https://github.com/pytorch/pytorch/issues/14486
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14195
Differential Revision: D13237502
Pulled By: anderspapitto
fbshipit-source-id: 1100ac1269c7cd39e2e0b3ba12a56a3ce8977c55
Summary:
export - print a method with python_print
import - import a method with import_method
We want to ensure:
export(g) == export(import(export(g)))
That is, after exporting/importing once, the graph will stay exactly
the same. This is less strict than g == import(export(g)), which would
require us to maintain a lot more information about the structure of the
IR and about the names of debug symbols.
This PR addresses this with the following fixes:
* print out double-precision numbers with high enough precision such
that they always parse in the same way (a sketch follows these lists)
* when creating loop-carried dependencies, sort them
by variable name, ensuring a consistent order
* parse nan correctly
* DCE: remove unused outputs of if statements, and loop-carried dependencies
in loops that are dead both after the loop and inside the body of the
loop.
* Do not set uniqueName for variables whose names are _[0-9]+, these
are probably rare in user code, and we need a way to communicate
that we do not care about a variable name when re-parsing the graph.
Otherwise temporary variable names will jitter around.
* Expand the definition of a constant in printing code to None,
and family.
* Allow re-treeing to work as long as the only thing in its way is a
constant node. These do not have side effects but are sometimes
inserted in a different order when tracing compared to how we print them.
* Print all constant nodes out first in the order in which they are used
(or, if they are inlined, ensure they get assigned a CONSTANT.cX number
in a consistent order). Clean up tuples (this is done in the compiler,
but not in the tracer, leading to some tuple indexing jitter if not
done).
* use strtod_l, not std::stod which can throw exceptions
Other:
* Add REL_WITH_DEB_INFO to setup.py. It already existed for the
cmake files. Threading it into setup.py allows us to turn on
debug symbols with optimization everywhere.
* enable round trip testing for all generated graphs. This only adds
~6 seconds to total build time but tests printing for every graph.
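A minimal sketch of the precision point and the strtod_l/std::stod point above (illustrative standard C++ only, not the actual printer code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <limits>

// Printing a double with max_digits10 significant digits guarantees that
// parsing the text reproduces the exact same bits, which is what keeps
// export(import(export(g))) stable for floating-point constants.
int main() {
  double v = 0.1;
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.*g",
                std::numeric_limits<double>::max_digits10, v);
  // strtod reports failure via errno/end pointer instead of throwing,
  // unlike std::stod (the locale-aware strtod_l variant behaves the same way).
  double parsed = std::strtod(buf, nullptr);
  return parsed == v ? 0 : 1;  // exits 0: the value round-trips exactly
}
```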
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14064
Differential Revision: D13094637
Pulled By: zdevito
fbshipit-source-id: 0a1c6912194d965f15d6b0c6cf838ccc551f161d
Summary:
This is the next minimal step towards moving _C into cmake. For now,
leave _C in setup.py, but reduce it to an empty stub file. All of its
sources are now part of the new torch-python cmake target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12742
Reviewed By: soumith
Differential Revision: D13089691
Pulled By: anderspapitto
fbshipit-source-id: 1c746fda33cfebb26e02a7f0781fefa8b0d86385
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13838
According to Sebastian, the detail convention is specifically for header-private
functionality. That's not what c10/detail is; it contains general, library-private headers
which may be used in multiple places within PyTorch. Rename it to impl to avoid
the confusion in nomenclature.
Reviewed By: smessmer
Differential Revision: D13024368
fbshipit-source-id: 050f2632d83a69e3ae53ded88e8f938c5d61f0ef
Summary:
The python lib path on Windows was set to an incorrect path. This fixes it to be consistent with Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13848
Differential Revision: D13030945
Pulled By: soumith
fbshipit-source-id: 7fb9013ffe66cff98018aea25fdb5cda03cbceb1
Summary:
1) Use the hip-thrust version of Thrust as opposed to the GH master. (ROCm 267)
2) CentOS 7.5 docker (ROCm 279)
* Always install the libraries at docker creation for ubuntu.
* Add Dockerfile for CentOS ROCm
* Enable the centos build
* Source devtoolset in bashrc
* Set locales correctly depending on whether we are on Ubuntu or CentOS
* Install a newer cmake for CentOS
* Checkout thrust as there is no package for CentOS yet.
PyTorch/Caffe2 on ROCm passed tests: https://github.com/ROCmSoftwarePlatform/pytorch/pull/280
For attention: bddppq ezyang
Docker rebuild for Ubuntu is not urgent (getting rid of the Thrust checkout and package install is mainly cosmetic). If a docker for CentOS 7.5 is wanted, a rebuild is necessary. I tested the PyTorch build in the CentOS docker. PyTorch unit tests mostly pass; however, a test in test_jit causes a Python recursion error that seems to be due to the Python 2 on CentOS (we have never seen this on Ubuntu), hence please do not enable unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12899
Differential Revision: D13029424
Pulled By: bddppq
fbshipit-source-id: 1ca8f4337ec6a603f2742fc81046d5b8f8717c76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342
This PR introduces a few new concepts:
- DeviceGuardImplInterface, and implementations for CPU and CUDA, which
provide a generic interface for interfacing with device and stream state,
without requiring a direct dependency on the code in question.
- InlineDeviceGuard, a general template for generating both specialized
and dynamically dispatched device guard implementations. Dynamic
dispatch is done by specializing it on a VirtualGuardImpl.
- Provide a device-independent DeviceGuard class, which can be used even
from CPU code. It uses the aforementioned dynamic dispatch.
- CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch
but can only be used from CUDA.
- StreamGuard, which is the same as above, but for streams rather than
devices.
- Optional variants of all the aforementioned guards, which are a no-op if
no device/stream is specified
- CUDAMultiStreamGuard, specifically for the case when we want to set
a device on every guard.
There are some subtle semantic changes, which have been thoroughly documented
in the class definition. A brief usage sketch follows the list of BC-breaking changes below.
BC-breaking changes:
- Move constructor/assignment have been removed from all device guard
implementations.
- In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write
'reset_device', because if you switch devices/device types, the stream/device on the
previous device is unset. This is different from previous behavior.
- CUDAGuard no longer handles streams, or multiple streams. Use CUDAStreamGuard
or CUDAMultiStreamGuard as appropriate for your use case.
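For orientation, a rough usage sketch of the guards described above (the c10 header paths and exact spellings are assumptions and may differ at this revision):

```cpp
#include <c10/core/Device.h>
#include <c10/core/DeviceGuard.h>
#include <c10/cuda/CUDAGuard.h>

void sketch() {
  // Device-independent guard: usable even from CPU code, via dynamic dispatch.
  c10::DeviceGuard guard(c10::Device(c10::DeviceType::CUDA, 1));

  // Per the BC-breaking note above, switching devices goes through
  // reset_device rather than set_device.
  guard.reset_device(c10::Device(c10::DeviceType::CUDA, 0));

  // CUDA-specialized guard: statically dispatched, usable only from CUDA code.
  c10::cuda::CUDAGuard cuda_guard(/*device_index=*/0);

  // Optional variant: a no-op until a device is actually supplied.
  c10::OptionalDeviceGuard maybe_guard;
  maybe_guard.reset_device(c10::Device(c10::DeviceType::CUDA, 1));
}  // Original devices are restored when the guards go out of scope.
```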
Reviewed By: dzhulgakov
Differential Revision: D12849620
fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e
Summary:
We now have submodules that have submodules
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13769
Reviewed By: soumith
Differential Revision: D13000203
Pulled By: SsnL
fbshipit-source-id: 63c0c19c6c9d25ae3bf255a2421a82ca68278866
Summary:
MKLDNN is not supported on ppc64le, so change USE_MKLDNN to OFF for ppc64le.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13759
Differential Revision: D12993121
Pulled By: soumith
fbshipit-source-id: 539d5cfcff2c03b59fa71e10b52fac333a64c381
Summary:
- fixes weights-contiguous requirement for THCUNN Convolutions
- Add tests that conv backward pass works for non-contiguous weights (sketched below)
- fix RNN tests / error messages to be consistent and pass
- relax weight grad precision for fp16 for a particular test
- fix regression of CMAKE_PREFIX_PATH not passing through
- add missing skipIfNoLapack annotations where needed
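A rough illustration of the non-contiguous-weights case (a sketch using the C++ frontend, not the actual test code added here):

```cpp
#include <torch/torch.h>

// Backward through a convolution whose weight tensor is non-contiguous.
// Run on CUDA to exercise the THCUNN path; guarded so the sketch also
// builds and runs CPU-only.
int main() {
  auto device = torch::cuda::is_available() ? torch::kCUDA : torch::kCPU;
  auto input = torch::randn({2, 4, 8, 8},
                            torch::device(device).requires_grad(true));
  auto weight = torch::randn({3, 3, 4, 8}, torch::device(device))
                    .permute({3, 2, 0, 1})  // shape {8, 4, 3, 3}, non-contiguous
                    .requires_grad_();
  auto out = torch::conv2d(input, weight);
  out.sum().backward();  // should succeed without calling .contiguous() on weight
  return 0;
}
```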
Differential Revision: D12918456
Pulled By: soumith
fbshipit-source-id: 8642d36bffcc6f2957800d6afa1e10bef2a91d05
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13217
Caffe2 proto headers are not included in pytorch package data (https://github.com/pytorch/pytorch/blob/master/setup.py#L1180). However, they are required for building custom Caffe2 ops living outside PyTorch/Caffe2 repo (e.g. custom Detectron ops).
Reviewed By: pjh5
Differential Revision: D12815881
fbshipit-source-id: 4d1aaa6a69a2193247586e85e4244fbbdb3e8192
Summary:
libcaffe2.so depends on libgloo.a for the ops in caffe2/contrib/gloo.
Symbols in libgloo.a that are not used are ignored and don't end up in
libcaffe2.so. libc10d.a depends on the caffe2 target, which in turn
depends on the gloo target, and it expects all libgloo.a symbols to be
part of libcaffe2.so. Symbols from libgloo.a that are not used in
libcaffe2.so remain undefined in libc10d.a.
To fix this, we link to libgloo.a when linking _C.so, such that any
gloo symbols in libc10d.a are resolved when linking _C.so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13462
Differential Revision: D12892830
Pulled By: pietern
fbshipit-source-id: 7560b3899b62f76081b394498480e513a84cefab
Summary:
Always build NCCL from within the main cmake build, rather than via a separate invocation in build_pytorch_libs.sh. Use the existing caffe2 codepaths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13150
Differential Revision: D12815674
Pulled By: anderspapitto
fbshipit-source-id: a710b6f242d159b9816911a25ee2c4b8c3f855aa
Summary:
We want to move _C into the same cmake invocation that builds
libcaffe2 and libtorch. However, _C depends on THD and c10d, which in
turn depend on libcaffe2. That means that we can't move _C into that
cmake file unless we do these two first. This change does so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12775
Differential Revision: D10457374
Pulled By: anderspapitto
fbshipit-source-id: 2c1aa3b8a418a73d2112e93c7da53a2e70cf7bba
Summary:
rather than pass a list through a text file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12951
Differential Revision: D10528309
Pulled By: anderspapitto
fbshipit-source-id: d94befcd61b6304815859694b623046f256462df
Summary:
This is used to patch our cmake cuda scripts - should be in the installation script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13013
Reviewed By: ir413
Differential Revision: D10519104
Pulled By: Yangqing
fbshipit-source-id: 542049224ea41068f32d4c0f6399c7e8b684f764
Summary:
This PR implements a DataLoader API for the C++ frontend.
The components present in this API largely match the Python API. It consists of:
- `Dataset`s: Conceptually a function from a set of indices to a batch of examples;
- `Transform`s: A functional transformation of a dataset. A `Map<D, T>` for Dataset `D` and transform `T` is itself a dataset;
- `Sampler`s: Specify a strategy for generating indices for a new batch;
- A `DataLoader`, with the ability to automatically parallelize fetching of samples across multiple worker threads;
Note that collation functions fall naturally out of the `Map<Dataset, Transform>` abstraction.
Things that are missing right now that maybe should be added:
- Memory pinning for CUDA tensors
The API was designed to be generalizable to almost any kind of dataset, transform or sampling strategy, while providing a convenient API out of the box. To achieve this, it is quite heavily templatized on various possible input types.
There are many parts to this PR! Right now, I would like feedback on:
- Your impression of the general usability of the API;
- Your impression of which parts seem too complex or overthought;
- The implementation of the parallelization aspects of the DataLoader. I've followed the Python implementation in some matters, but also differ in others. I think my implementation is a little cleaner and decouples components slightly better than the Python dataloader.
I haven't added too many comments yet, as this is fresh out of the oven. Let me know if anything is unclear from the code itself.
There also aren't any tests yet. I will write a comprehensive test suite once we agree on the API and implementation.
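To make the shape of the API concrete, a rough usage sketch (the toy TensorDataset and names such as torch::data::datasets::Dataset, Example, transforms::Stack, DataLoaderOptions and make_data_loader are taken from the current C++ frontend and may not match this PR exactly):

```cpp
#include <torch/torch.h>

// A toy dataset: conceptually a function from an index to an Example<data, target>.
struct TensorDataset : torch::data::datasets::Dataset<TensorDataset> {
  torch::Tensor data, targets;
  TensorDataset(torch::Tensor d, torch::Tensor t)
      : data(std::move(d)), targets(std::move(t)) {}
  torch::data::Example<> get(size_t index) override {
    return {data[index], targets[index]};
  }
  torch::optional<size_t> size() const override {
    return data.size(0);
  }
};

int main() {
  // Map<Dataset, Transform> is itself a dataset; Stack<> plays the role of the
  // collation function, stacking individual Examples into one batched Example.
  auto dataset =
      TensorDataset(torch::randn({100, 3}), torch::randint(2, {100}))
          .map(torch::data::transforms::Stack<>());

  // The DataLoader fetches batches, optionally across multiple worker threads.
  auto loader = torch::data::make_data_loader(
      std::move(dataset),
      torch::data::DataLoaderOptions().batch_size(16).workers(2));

  for (auto& batch : *loader) {
    (void)batch;  // batch.data is a [16, 3] tensor, batch.target holds the 16 labels
  }
}
```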
apaszke ezyang pietern
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11918
Reviewed By: ezyang
Differential Revision: D9998881
Pulled By: goldsborough
fbshipit-source-id: 22cf357b63692bea42ddb1cc2abc71dae5030aea
Summary:
There will be a link error when Caffe2 doesn't use its protobuf under third_party. PyTorch will always link that protobuf, even though it doesn't use the protobuf directly. We can remove it from the list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12451
Differential Revision: D10262676
Pulled By: ezyang
fbshipit-source-id: c2ff3fdf757fc21ed689e7f663c082064b1a0bca
Summary:
There is still some work to be done:
- Move logging and unify AT_WARN with LOG(ERROR).
- A few header files are still being plumbed through, need cleaning.
- caffe2::EnforceNotMet aliasing is not done yet.
- need to unify the macros. See c10/util/Exception.h
This is mainly a codemod and not causing functional changes. If you find your job failing and trace back to this diff, usually it can be fixed by the following approaches:
(1) add //caffe2/c10:c10 to your dependency (or transitive dependency).
(2) change objects such as at::Error, at::Optional to the c10 namespace (a minimal sketch follows this list).
(3) change functions to the c10 namespace. Especially, caffe2::MakeString is not overridden by the unified c10::str function. Nothing else changes.
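For example, fix (2) at a call site might look like the following minimal sketch (the header path shown is an assumption for this revision):

```cpp
#include <c10/util/Optional.h>  // at::Optional now lives here as c10::optional

// Before the codemod a call site might have spelled this type at::Optional;
// after it, the same type is referred to via the c10 namespace.
c10::optional<int> parse_flag(bool present, int value) {
  if (!present) {
    return c10::nullopt;
  }
  return value;
}
```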
Please kindly consider not reverting this diff - it involves multiple rounds of rebasing and the fix is usually simple. Contact jiayq@ or AI Platform Dev for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12354
Reviewed By: orionr
Differential Revision: D10238910
Pulled By: Yangqing
fbshipit-source-id: 7794d5bf2797ab0ca6ebaccaa2f7ebbd50ff8f32
Summary:
Properly set the cmake python_library and include_dirs hints, so that systems with multiple versions of Python can still find the correct libraries and header files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12569
Differential Revision: D10359910
Pulled By: soumith
fbshipit-source-id: 2238dcbed7aac8a818c9435e6bba46cda5f81cad
Summary:
- Removed the old nccl file
- Make open-source NCCL a submodule
- CMake to make NCCL itself
NCCL2 is now in the default build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12359
Reviewed By: orionr, yns88
Differential Revision: D10219665
Pulled By: teng-li
fbshipit-source-id: 134ff47057512ba617b48bf390c1c816fff3f881
Summary:
Previously, we were only enabling Flush-To-Zero (FTZ) and
Denormals-Are-Zero (DAZ) when compiling with SSE3 enabled. After
Christian's patch (https://github.com/pytorch/pytorch/pull/12109), we
won't be compiling core files with SSE3 or SSE4 enabled, to better
support older AMD processors.
This moves the FTZ and DAZ code behind a runtime CPU check in
preparation for that change.
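A minimal sketch of what the runtime CPU check could look like (illustrative only, using GCC/Clang's cpuid.h and the SSE intrinsic headers, not the actual patch):

```cpp
#include <cpuid.h>       // __get_cpuid, bit_SSE3 (GCC/Clang on x86)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE

// Decide at runtime whether the CPU supports SSE3, instead of deciding at
// compile time via __SSE3__ (which core files will no longer be built with).
static bool cpu_supports_sse3() {
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
    return false;
  }
  return (ecx & bit_SSE3) != 0;
}

// Enable Flush-To-Zero and Denormals-Are-Zero only when the check passes, so
// binaries built without SSE3 codegen still run correctly on older processors.
void enable_flush_denormal() {
  if (!cpu_supports_sse3()) {
    return;
  }
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```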
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12386
Differential Revision: D10222237
Pulled By: colesbury
fbshipit-source-id: 7ffe32561ab965e1e5f9eb6e679602bbf4775538
Summary:
- Removed the old nccl file
- Make open-source NCCL a submodule
- CMake to make NCCL itself
NCCL2 is now in the default build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12312
Differential Revision: D10190845
Pulled By: teng-li
fbshipit-source-id: 08d42253b774149a66919d194f88b34628c39bae