There are small typos in:
- caffe2/python/recurrent.py
- test/distributed/test_c10d_nccl.py
- test/test_fx.py
- torch/csrc/jit/runtime/autodiff.cpp
- torchgen/gen.py
Fixes:
- Should read `propagation` rather than `propogation`.
- Should read `multiplied` rather than `multuplied`.
- Should read `eliminate` rather than `elminate`.
- Should read `dispatcher` rather than `disaptcher`.
Semi-automated pull request generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81435
Approved by: https://github.com/ngimel
When autodiff is constructing the Gradient object, it looks at the
forward graph and records all the outputs that requires_grad into
df_input_vjps. Then at runtime, graph_executor.cpp will detach the
tensors before running the autodiff forward graph, and then add
requires_grad back onto the outputs if they need requires_grad.
Before, the require_grad check was done by just checking
`output->requires_grad()`. But at the point when autodiff is called by
profiling executor, the profiled information is still in the profile
nodes, not on values. So requires_grad would not be set on the output
values, and requires_grad() would default to True on all tensors. As a
result more output tensors than expected would require_grad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78392
Approved by: https://github.com/eellison
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345
FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
ghstack-source-id: 140044165
Test Plan:
CI
perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.
Reviewed By: hlu1
Differential Revision: D31027361
fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
Summary:
Syncing nvfuser code base from devel branch, Listing a few of our development since last sync:
- Extends support to normalization and reduction kernels.
- Multiple kernel launch for single `CudaFusionGroup`. Hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalar into compile time constants, which are required by the codegen. (e.g. reduction axes).
To keep this PR simple and relatively review-free. We stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.
internal updates are files located in:
1. updates in nvfuser codegen `torch/csrc/jit/coddgen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`
updates affecting integration:
1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745
Reviewed By: saketh-are
Differential Revision: D30752939
Pulled By: malfet
fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
import json
with open("build/compile_commands.json") as f:
data = json.load(f)
files = [os.path.relpath(node['file']) for node in data]
for idx, fname in enumerate(files):
if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
return files
def run_clang_tidy(fname):
check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
changes = check_output(["git", "ls-files", "-m"])
if len(changes) == 0:
return
check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])
def main():
git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
compiled_files = get_compiled_files_list()
for idx, fname in enumerate(git_files):
if fname not in compiled_files:
continue
if fname.startswith("caffe2/contrib/aten/"):
continue
print(f"[{idx}/{len(git_files)}] Processing {fname}")
run_clang_tidy(fname)
if __name__ == "__main__":
main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.
Fixes https://github.com/pytorch/pytorch/issues/49299
I still need to look into a handful of failing tests, but maybe it can be a discussion basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433
Reviewed By: ngimel
Differential Revision: D25681374
Pulled By: Krovatkin
fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43634
Because differentiable graphs detach the gradients of input Tensors, creating and inlining differentiable graphs changes the requires_grad property of tensors in the graph. In the legacy executor, this was not a problem as the Fuser would simply ignore the gradient property because it would be invariant that the LegacyExecutor only passed tensors with grad = False. This is not the case with the profiler, as the Fuser does it's own guarding.
Updating the type also helps with other typechecks, e.g. the ones specializing the backward, and with debugging the graph.
Other possibilities considered were:
- Fuser/Specialize AutogradZero always guards against requires_grad=False regardless of the profiled type
- Re-profile forward execution of differentiable graph
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23358803
Pulled By: eellison
fbshipit-source-id: b106998accd5d0f718527bc00177de9af5bad5fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43632
Specialize the backward graph by guarding on the undefinedness of the input tensors. The graph will look like:
```
ty1, ty2, succesful_checks = prim::TypeCheck(...)
if (succesful_checks)
-> optimized graph
else:
-> fallback graph
```
Specializing on the undefinedness of tensors allows us to clean up the
```
if any_defined(inputs):
outputs = <original_computation>
else:
outputs = autograd zero tensors
```
blocks that make up the backward graph, so that we can fuse the original_computation nodes together.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23358808
Pulled By: eellison
fbshipit-source-id: f5bb28f78a4a3082ecc688a8fe0345a8a098c091
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115
This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.
Testing:
Ran the script, CI.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D20568523
Pulled By: SplitInfinity
fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b