Commit Graph

70323 Commits

26740f853e Remove unnecessary use of ctx.resolve_tools. (#120493)
In this case, it's simpler to use ctx.actions.run(executable = ...), which already ensures that the runfiles associated with the executable are present.

(It's also possible to use ctx.actions.run_shell(tools = ...) with a custom command line, but it's unclear to me that indirecting through the shell is needed here.)
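For illustration, a minimal Starlark sketch (written in Starlark's Python syntax; rule and attribute names are made up, not taken from the PR) of the simplification described above:

```
def _impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".out")

    # Before: manually resolving the tool's runfiles for run_shell
    # inputs, manifests = ctx.resolve_tools(tools = [ctx.attr._tool])
    # ctx.actions.run_shell(
    #     outputs = [out],
    #     tools = inputs,
    #     input_manifests = manifests,
    #     command = "{} > {}".format(ctx.executable._tool.path, out.path),
    # )

    # After: run(executable = ...) already makes the executable's runfiles available
    ctx.actions.run(
        outputs = [out],
        executable = ctx.executable._tool,
        arguments = [out.path],
    )
    return [DefaultInfo(files = depset([out]))]

my_rule = rule(
    implementation = _impl,
    attrs = {"_tool": attr.label(executable = True, cfg = "exec", default = "//tools:gen")},
)
```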

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120493
Approved by: https://github.com/ezyang
2024-03-07 22:33:17 +00:00
d14d62b7aa [dynamo] add more refleak tests (#120657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120657
Approved by: https://github.com/jansel
2024-03-07 22:25:43 +00:00
6490441d8f Remove dead get_shape_groups (#120813)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120813
Approved by: https://github.com/albanD
2024-03-07 22:20:30 +00:00
18d574a07a [Inductor] Use indices for constants in triton_meta (#121427)
@bertmaher pointed out that constants are passed with their indices, not their names. Looking at the triton source, this appears to be true: 392370b303/python/triton/runtime/jit.py (L381-L385)
I'm guessing both indices and names work here, but let's be consistent.
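For illustration, a rough sketch (field values are made up) of what the change means for the generated kernel metadata:

```
# illustrative only -- constants keyed by argument index rather than by name
triton_meta = {
    "signature": {0: "*fp32", 1: "*fp32", 2: "i32"},
    "constants": {3: 1024},  # previously something like {"XBLOCK": 1024}
}
```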

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121427
Approved by: https://github.com/aakhundov
2024-03-07 21:59:43 +00:00
f61192b014 Fix for Wait kernel lowering in inductor not accepting MultiOutputs from non-collective calls (#121428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121428
Approved by: https://github.com/yifuwang
2024-03-07 21:29:25 +00:00
76f1461892 [export] Serialize union fields with single entry dict. (#121263) (#121337)
Summary:

remove "$type" and "$value" fields, instead only serialize as {type: value} for union fields directly.

bypass-github-export-checks

Test Plan: CI

Differential Revision: D54600943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121337
Approved by: https://github.com/tugsbayasgalan
2024-03-07 21:24:28 +00:00
4c58f2b675 [PyTorch] Use uint32_t for ProcessedNode::num_outputs (#121335)
We already use uint32_t for indexing, and the notion of a single graph node with more than four billion outputs stretches credulity.

Differential Revision: [D54598821](https://our.internmc.facebook.com/intern/diff/D54598821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121335
Approved by: https://github.com/Skylion007
2024-03-07 21:15:05 +00:00
ea8f6e2e54 Subclass view fake-ification via reified ViewFuncs (#118405)
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
    * subclass -> dense
    * dense -> subclass
    * subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available

Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
2024-03-07 19:56:16 +00:00
63ec5cd158 TD Heuristic for tests mentioned in PR body, less verbose TD printing (#120621)
Moves tests that are mentioned in the PR body or commit message to the front.  Also attempts to find any issues/PRs mentioned in the PR body and searches those too (e.g., if you link a disable issue and that issue contains the test file that it was failing on).

looking for: dynamo/test_export_mutations

Also removes some printed information in TD
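A rough sketch (not the actual TD code) of pulling test and issue mentions out of a PR body:

```
import re

pr_body = "This fixes dynamo/test_export_mutations and relates to #120621"
mentioned_tests = re.findall(r"[\w/]*test_\w+", pr_body)   # ['dynamo/test_export_mutations']
mentioned_issues = re.findall(r"#(\d+)", pr_body)          # ['120621']
```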

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120621
Approved by: https://github.com/osalpekar
2024-03-07 19:36:11 +00:00
c7a65f58b0 [CI] Script to fetch creds from current AWS session (#121426)
Some implementations, like OpenDAL, do not work with AWS IMDSv2; this script bridges the gap and enables more recent `sccache` releases (which switched from simple-s3 to OpenDAL) to work in the current CI system.

When launched it prints something like:
```
export AWS_ACCESS_KEY_ID=XXXXX
export AWS_SECRET_ACCESS_KEY=YYYY
export AWS_SESSION_TOKEN=ZZZZ
```
which can be `eval`ed, and then sccache can use those credentials.
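A hedged sketch of how such a script could work (this is not the actual CI script; it assumes boto3 can see the instance's session):

```
import boto3

creds = boto3.Session().get_credentials().get_frozen_credentials()
print(f"export AWS_ACCESS_KEY_ID={creds.access_key}")
print(f"export AWS_SECRET_ACCESS_KEY={creds.secret_key}")
if creds.token:
    print(f"export AWS_SESSION_TOKEN={creds.token}")
```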

Validated in https://github.com/pytorch/pytorch/pull/121323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121426
Approved by: https://github.com/Skylion007
2024-03-07 19:25:54 +00:00
2b1661c7a0 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit 05c256849b464deee16ccd70152fd54071c6c79c.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D54617701 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1984214079))
2024-03-07 18:53:51 +00:00
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
83d095c213 [BE] Remove unnecessary requires_cuda in common_optimizers.py (#121249)
@mlazos had already added the needed decorator on the test itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121249
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/albanD
ghstack dependencies: #121183
2024-03-07 17:57:02 +00:00
53bdae736d Add capturable single tensor Adamax (#121183)
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler, and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to the for-loop implementation.
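A minimal usage sketch (assuming a CUDA device and the new `capturable` flag on Adamax):

```
import torch

params = [torch.randn(4, device="cuda", requires_grad=True)]
opt = torch.optim.Adamax(params, lr=1e-2, capturable=True, foreach=False)

# populate grads and take one eager step so optimizer state lives on the device
params[0].grad = torch.randn_like(params[0])
opt.step()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    opt.step()  # capturable=True keeps the step graph-safe (no host-side state reads)
g.replay()
```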

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
2024-03-07 17:57:02 +00:00
af88425cdc Forward fix lint after 121202 (#121425)
Forward fix after #121202, where the lintrunner job failed due to being unable to checkout the pytorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121425
Approved by: https://github.com/ezyang, https://github.com/aakhundov, https://github.com/malfet
2024-03-07 17:20:13 +00:00
suo
c3c15eb9a6 [export] update docs to not export raw functions (#121272)
as title

Differential Revision: [D54555101](https://our.internmc.facebook.com/intern/diff/D54555101/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121272
Approved by: https://github.com/zhxchen17
2024-03-07 17:18:07 +00:00
862b99b571 Revert "[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)"
This reverts commit 3239f86a3df133b5977d988324639e0de7af8749.

Reverted https://github.com/pytorch/pytorch/pull/120925 on behalf of https://github.com/malfet due to Breaks internal tests, likely due to the increased memory requirements ([comment](https://github.com/pytorch/pytorch/pull/120925#issuecomment-1983875400))
2024-03-07 16:16:07 +00:00
eea37c6db4 [profiler] record nccl version in distributed info (#121044)
Summary: Add an NCCL version field to the distributed info if the backend is NCCL

Differential Revision: D54432888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121044
Approved by: https://github.com/aaronenyeshi
2024-03-07 15:56:02 +00:00
cyy
3aa512cd72 [Clang-tidy header][23/N] Enable clang-tidy coverage on aten/src/ATen/*.{cpp,h} (#121380)
This PR finishes the work begun in https://github.com/pytorch/pytorch/pull/120763 by enabling clang-tidy on aten/src/ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121380
Approved by: https://github.com/Skylion007
2024-03-07 15:11:07 +00:00
9a45001905 [dynamo] relax missing symbols runtime assert (#121339)
Differential Revision: [D54603361](https://our.internmc.facebook.com/intern/diff/D54603361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121339
Approved by: https://github.com/ezyang
2024-03-07 14:53:38 +00:00
0339f1ca82 [Inductor] Allocate another shard for testing cpp-wrapper JIT (#121310)
Summary: The ABI-compatible mode for cpp wrapper has not been turned on by default, so test it separately. Expect to add more tests to this shard.

Differential Revision: [D54617287](https://our.internmc.facebook.com/intern/diff/D54617287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121310
Approved by: https://github.com/chenyang78
ghstack dependencies: #121309
2024-03-07 14:24:21 +00:00
7e598c0053 [Inductor] Enable ABI-compatible mode for cpp-wrapper JIT (#121309)
Differential Revision: [D54617284](https://our.internmc.facebook.com/intern/diff/D54617284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121309
Approved by: https://github.com/chenyang78
2024-03-07 14:22:06 +00:00
57fc35a3af [ Inductor ] Shape padding honors output stride preservation (#120797)
This fix makes sure that shape padding honors inductor's 'keep_output_strides' setting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120797
Approved by: https://github.com/eellison
2024-03-07 13:52:29 +00:00
cyy
4305c64fea Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use const std::optional<Generator>& for the underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-07 09:52:21 +00:00
1ce5049692 [inductor] fix the layout problem for nll_loss2d_backward (#121173)
Fixes https://github.com/pytorch/pytorch/issues/120759 .

The CUDA implementation of nll_loss2d_backward.default requires that the 'self' tensor be contiguous. This implicit assumption may be broken by layout optimizations. The fix here is to add the constraint when explicitly defining the fallback for the op.

Not sure if we can improve the CUDA kernel to relax the constraint, though.
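A rough repro sketch of the kind of graph affected (assuming a CUDA device; this is an illustration, not the test from the PR):

```
import torch
import torch.nn.functional as F

def f(logits, target):
    return F.nll_loss(F.log_softmax(logits, dim=1), target)

logits = torch.randn(2, 3, 8, 8, device="cuda", requires_grad=True)
target = torch.randint(0, 3, (2, 8, 8), device="cuda")

# the backward of the compiled function hits nll_loss2d_backward, whose CUDA
# kernel expects a contiguous 'self'
torch.compile(f)(logits, target).backward()
```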

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121173
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-07 09:05:07 +00:00
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00
e8e3049f57 [FSDP2] Relaxed check for parent mesh (#121360)
Mixing 1D and 2D `DTensor`s in the same sharded state dict should be okay, so we can remove the check that a parameter for FSDP to shard must be a `DTensor` if passing a child mesh to FSDP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121360
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #120351, #121328
2024-03-07 08:09:25 +00:00
db36d21f5c Add SDPA pattern for HuggingFace models BF16 (#121202)
### Description

- Add pattern for bf16 input type with fp32 attention mask. (Example model: ElectraForCausalLM)
- Add pattern with batch_size=1 to avoid some clones in graph. (Example model: text-classification+prajjwal1-bert-tiny)
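For reference, a rough sketch (not taken from the PR) of the decomposed-attention shape the new pattern targets: bf16 query/key/value with an fp32 additive mask:

```
import torch

def decomposed_attn(q, k, v, mask):
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores + mask                          # fp32 mask added to bf16 scores
    probs = torch.softmax(scores, dim=-1).to(q.dtype)
    return probs @ v

q = k = v = torch.randn(1, 12, 128, 64, dtype=torch.bfloat16)
mask = torch.zeros(1, 1, 128, 128, dtype=torch.float32)
out = torch.compile(decomposed_attn)(q, k, v, mask)
```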

### Newly matched models
Dtype: bf16, machine: SPR

#### Dynamo HuggingFace models

- ElectraForCausalLM (speedup=2.09x)
- ElectraForQuestionAnswering (speedup=4.22x)
- AlbertForQuestionAnswering (speedup=1.36x)
- AlbertForMaskedLM (speedup=1.39x)

#### OOB HuggingFace models

- multiple-choice+google-electra-base-discriminator
- text-classification+prajjwal1-bert-tiny
- text-classification+prajjwal1-bert-mini
- text-classification+google-electra-base-generator
- text-classification+bert-large-cased
- casual-language-modeling+xlm-roberta-base
- text-classification+roberta-base
- text-classification+xlm-roberta-base
- text-classification+albert-base-v2
- token-classification+google-electra-base-generator
- masked-language-modeling+bert-base-cased

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121202
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-07 07:40:00 +00:00
953c6c37cb Wrap remote cache creation with a try-catch (#121340)
Summary: In production I am seeing errors like "AttributeError: module 'triton.runtime' has no attribute 'fb_memcache'", likely due to some package skew. Until this is resolved, let's wrap this code in a try-catch.
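A minimal sketch of the defensive pattern (the attribute name comes from the error above; the fallback is illustrative):

```
try:
    import triton
    remote_cache = triton.runtime.fb_memcache  # may be missing depending on the triton package
except (ImportError, AttributeError):
    remote_cache = None  # fall back to no remote cache instead of crashing
```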

Test Plan: CI

Differential Revision: D54604339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121340
Approved by: https://github.com/aakhundov
2024-03-07 07:05:49 +00:00
291ce86a6c Modify StorageImplCreateHelper (#118459)
I want to use tensor.untyped_storage()[a:b] for the ``PrivateUse1`` backend but it fails. The code goes into ``THPStorage_get``:
bb6eba189f/torch/csrc/Storage.cpp (L525-L540)

Here ``torch`` will create a new ``c10::StorageImpl`` but does not take the ``PrivateUse1`` backend into account.
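For context, a small sketch of the storage slicing that exercises this path (shown on CPU for illustration; the issue is about ``PrivateUse1`` devices):

```
import torch

t = torch.arange(16, dtype=torch.uint8)
s = t.untyped_storage()
sub = s[4:8]          # THPStorage_get builds a new StorageImpl covering bytes 4..8
print(sub.nbytes())   # 4
```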

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118459
Approved by: https://github.com/albanD
2024-03-07 06:26:55 +00:00
f848e9c646 [Quant][Inductor] Fix q/dq per channel lowering with 64-bit qparams (#120984)
Fixes #120869

Fix lowering of `quantize_per_channel` and `dequantize_per_channel` with float64 scale and int64 zero point.
The generated code is incorrect without explicit type conversion. Add type conversions in the lowering pass, i.e., float64 (double) -> float32 and int64 -> int32.
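Illustratively, the kind of casts the lowering now inserts for 64-bit qparams:

```
import torch

scales = torch.tensor([0.10, 0.20], dtype=torch.float64)
zero_points = torch.tensor([0, 128], dtype=torch.int64)
scales_f32 = scales.to(torch.float32)            # float64 (double) -> float32
zero_points_i32 = zero_points.to(torch.int32)    # int64 -> int32
```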

**Test plan**
python test/inductor/test_cpu_repro.py -k test_per_channel_fake_quant_module_uint8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120984
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-07 06:23:52 +00:00
4f9d4e1ab0 [DTensor][XLA] refactor DTensor _xla API (#113214)
In response to the change pytorch/xla#5776 and #92909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113214
Approved by: https://github.com/wanchaol
2024-03-07 06:18:05 +00:00
cyy
c723514ef4 [CUDACachingAllocator] Simplify update_stat and avoid casts (#120964)
update_stat in CUDACachingAllocator.cpp was split into increase and decrease functions in this PR to simplify the implementation and avoid type casts throughout the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120964
Approved by: https://github.com/albanD
2024-03-07 05:55:38 +00:00
55232c4e1c Make CausalBias a torch.Tensor subclass again (#121358)
# Summary
This was removed in #116071 in order to enable compile support, and re-adding it still seems to work with compile.
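A small usage sketch (assuming the `torch.nn.attention.bias` helpers):

```
import torch
import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right

q = torch.randn(2, 8, 16, 64)
k = v = torch.randn(2, 8, 32, 64)
bias = causal_lower_right(16, 32)  # a CausalBias, now a torch.Tensor subclass again
out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
```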
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121358
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-03-07 05:20:47 +00:00
df2ad1fecc [dtensor][debug] have visualize_sharding correctly print for sub-mesh DTensor (#121216)
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (global rank), which means calling `visualize_sharding` will never print anything when the dtensor object's mesh doesn't include rank 0 (i.e., a sub-mesh). This PR has `visualize_sharding` always print on the rank whose mesh coordinate is (0, 0, ..., 0) instead of the rank whose global rank is 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
2024-03-07 04:50:15 +00:00
77873f6fe5 [dtensor][1/N] add torchrec even row-wise sharding example (#120260)
**Summary**
Our goal is to demonstrate DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor`, and theoretically `DTensor` can replace it with minor changes.

This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor. Note that this PR only covers the even sharding case.

**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
2024-03-07 04:50:15 +00:00
9cc0f23e5c [dtensor][debug] allow visualize_sharding to print header (#121179)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121179
Approved by: https://github.com/wanchaol
2024-03-07 04:50:06 +00:00
a2854ae904 Bugfix consume_prefix_in_state_dict_if_present function to keep the order of the state_dict (#117464)
This PR keeps the same key order as the original state_dict, as the issue creator proposed. It also fixes a bug concerning how ``_metadata`` is handled (see below), and makes other small changes to properly remove the prefix when it is present.

In the original code, ``_metadata`` was handled as a ``key``.

```
    # also strip the prefix in metadata if any.
    if "_metadata" in state_dict:
```

This is not the case; ``_metadata`` is actually an ``attribute``. Hence, the previous condition is changed to:

```
    # also strip the prefix in metadata if any.
    if hasattr(state_dict, "_metadata"):
```

This PR also includes the necessary test.
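A quick usage sketch of the function being fixed:

```
import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

model = torch.nn.Linear(2, 2)
ddp_style_sd = {f"module.{k}": v for k, v in model.state_dict().items()}

consume_prefix_in_state_dict_if_present(ddp_style_sd, prefix="module.")
model.load_state_dict(ddp_style_sd)  # keys match again, and key order is preserved
```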

Fixes #106942

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
2024-03-07 04:00:49 +00:00
edd80f87b8 Prevent infinite recursion within Tensor.__repr__ (#120206)
`Tensor.__repr__` calls functions which can perform logging, which ends up logging `self` (via `__repr__`), causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch, log the actual parameters (and use `id` to log the tensor itself).

The change to torch/testing/_internal/common_utils.py came up during testing - in some ways of running the test, `parts` was `('test', 'test_testing.py')`, so `i` was 0 and we were doing a join on `()`, which caused an error.

Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
2024-03-07 02:24:45 +00:00
eb4d87f237 graph break on sparse tensors constructions (#120458)
Fix some tests in https://github.com/pytorch/pytorch/issues/119780
sparse_bsc_tensor is not supported
https://github.com/pytorch/pytorch/pull/117907

Also more about the issue here.
https://docs.google.com/document/d/1EIb4qG88-SjVFn5TloLERliYdxIu2hwYoAA8skjOVfo/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120458
Approved by: https://github.com/ezyang
2024-03-07 02:17:41 +00:00
1a28ebffb3 [TP] Introduce Sequence Parallel Style for LayerNorm/RMSNorm/Dropout (#121295)
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicated TP API to easily shard those layers,
instead of users manually using DTensors.

I call this SequenceParallel, which might bring some confusion since we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe getting the name right is the first priority, rather than
worrying about reusing the old name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
2024-03-07 02:04:59 +00:00
967dd31621 [cuDNN] Cleanup cuDNN < 8.1 ifdefs (#120862)
Follow-up of #95722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120862
Approved by: https://github.com/Skylion007
2024-03-07 01:46:25 +00:00
b9087f8571 [profiler] Add execution_trace_observer as an optional argument to profiler (#119912)
# Update Profiler API to collect Execution Traces

## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        execution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```

See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.

## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads.  It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of ML Commons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki)

Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks, and helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)]

## Why correlate Execution Trace with PyTorch/Kineto Trace

Execution Traces and Kineto traces provide different types of information, and it is valuable to combine them. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps will be inaccurate, as several operations (NCCL, embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that there are two code paths.

## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable Execution Traces to be collected simultaneously; see the TLDR section.

# Testing
Updated the unit test for collecting kineto and Execution Trace together.
- Check that the collected ET has the right range of events.
- Compare two sets of IDs - record function IDs in the ET and external IDs in Kineto. We check whether these have a constant difference.

```
pytest test/profiler/test_profiler.py  -k test_execution_trace_with_kineto -rP

Running 1 items in this shard

test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-03-07 01:30:26 +00:00
eb1145436a [DCP] Adds main in format utils (#120128)
Adds main in format utils. Usage:

`python -m torch.distributed.checkpoint.format_utils dcp_to_torch dcp_dir torch_file.pt`

or

`python -m torch.distributed.checkpoint.format_utils torch_to_dcp torch_file.pt dcp_dir`

Differential Revision: [D53791355](https://our.internmc.facebook.com/intern/diff/D53791355/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120128
Approved by: https://github.com/fegin, https://github.com/wz337
2024-03-07 01:18:17 +00:00
cyy
5cc511f72f Use c10::irange and fix other index types in ForeachReduceOp.cu (#121123)
This PR follows the suggestions in #121066 and changes most loops to c10::irange.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121123
Approved by: https://github.com/soulitzer
2024-03-07 00:11:27 +00:00
c268ce4a6d Make ATen-cpu cuda/rocm agnostic (#121082)
Summary: This specific ROCm logic will make the aten-cpu code diverge between ROCm and CUDA. This is not good because we won't be able to share aten-cpu.so between ROCm and CUDA. More specifically, it will prevent us from building aten-hip by default, which requires us to set up ROCm-specific rules, an extra burden for our build system.

Test Plan: sandcastle + oss ci

Differential Revision: D54453492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
2024-03-06 23:51:40 +00:00
e50ded03a6 Use type check for also is_not (#113859)
Handle `is_not` for:

9647a251cb/torch/_dynamo/variables/builtin.py (L1314-L1317)

I noticed https://github.com/pytorch/pytorch/issues/111713 exists; I think there's no harm in landing this first.
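A tiny sketch of the kind of check this covers (illustrative):

```
import torch

def f(x, y):
    if type(x) is not type(y):   # `is not` on types is now handled like `is`
        return x - 1
    return x + y

print(torch.compile(f, fullgraph=True)(torch.ones(2), torch.ones(2)))
```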

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113859
Approved by: https://github.com/Skylion007
2024-03-06 23:12:42 +00:00
a88356f45c [dtensor] make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294)
add_.Tensor and div_.Scalar should support linearity so that we can delay
computing the partial results.

This fixes the additional collective in the layernorm layer that we have seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121294
Approved by: https://github.com/tianyu-l
2024-03-06 22:52:18 +00:00
2f064d895c Switch TORCH_TRACE to accept a directory by default (#121331)
A directory is better because it works smoothly with distributed
runs; otherwise you'd need to modify torchrun to set up distinct
log names for each file.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D54597814](https://our.internmc.facebook.com/intern/diff/D54597814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121331
Approved by: https://github.com/albanD
2024-03-06 22:46:18 +00:00
372f192050 [DTensor] Initialized RNG tracker if needed (#121328)
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).

```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
2024-03-06 22:21:44 +00:00