Commit Graph

25291 Commits

Author SHA1 Message Date
d14d62b7aa [dynamo] add more refleak tests (#120657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120657
Approved by: https://github.com/jansel
2024-03-07 22:25:43 +00:00
ea8f6e2e54 Subclass view fake-ification via reified ViewFuncs (#118405)
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
    * subclass -> dense
    * dense -> subclass
    * subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available

Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
2024-03-07 19:56:16 +00:00
2b1661c7a0 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit 05c256849b464deee16ccd70152fd54071c6c79c.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D54617701 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1984214079))
2024-03-07 18:53:51 +00:00
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
53bdae736d Add capturable single tensor Adamax (#121183)
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
2024-03-07 17:57:02 +00:00
0339f1ca82 [Inductor] Allocate another shard for testing cpp-wrapper JIT (#121310)
Summary: The ABI-compatible mode for the cpp wrapper has not been turned on by default, so test it separately. Expect more tests to be added to this shard.

Differential Revision: [D54617287](https://our.internmc.facebook.com/intern/diff/D54617287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121310
Approved by: https://github.com/chenyang78
ghstack dependencies: #121309
2024-03-07 14:24:21 +00:00
57fc35a3af [ Inductor ] Shape padding honors output stride preservation (#120797)
This fix makes sure that shape padding honors Inductor's 'keep_output_strides' setting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120797
Approved by: https://github.com/eellison
2024-03-07 13:52:29 +00:00
cyy
4305c64fea Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use const std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-07 09:52:21 +00:00
1ce5049692 [inductor] fix the layout problem for nll_loss2d_backward (#121173)
Fixes https://github.com/pytorch/pytorch/issues/120759 .

The CUDA implementation of nll_loss2d_backward.default requires the 'self' tensor to be contiguous. This implicit assumption may be broken by layout optimizations. The fix here is to add the constraint when we explicitly define the fallback for the op.

Not sure if we can improve the CUDA kernel to relax the constraint though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121173
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-07 09:05:07 +00:00
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00
e8e3049f57 [FSDP2] Relaxed check for parent mesh (#121360)
Mixing 1D and 2D `DTensor`s in the same sharded state dict should be okay, so we can remove the check that a parameter for FSDP to shard must be a `DTensor` if passing a child mesh to FSDP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121360
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #120351, #121328
2024-03-07 08:09:25 +00:00
db36d21f5c Add SDPA pattern for HuggingFace models BF16 (#121202)
### Description

- Add pattern for bf16 input type with fp32 attention mask. (Example model: ElectraForCausalLM)
- Add pattern with batch_size=1 to avoid some clones in graph. (Example model: text-classification+prajjwal1-bert-tiny)

### Newly matched models
Dtype: bf16, machine: SPR

#### Dynamo HuggingFace models

- ElectraForCausalLM (speedup=2.09x)
- ElectraForQuestionAnswering (speedup=4.22x)
- AlbertForQuestionAnswering (speedup=1.36x)
- AlbertForMaskedLM (speedup=1.39x)

#### OOB HuggingFace models

- multiple-choice+google-electra-base-discriminator
- text-classification+prajjwal1-bert-tiny
- text-classification+prajjwal1-bert-mini
- text-classification+google-electra-base-generator
- text-classification+bert-large-cased
- casual-language-modeling+xlm-roberta-base
- text-classification+roberta-base
- text-classification+xlm-roberta-base
- text-classification+albert-base-v2
- token-classification+google-electra-base-generator
- masked-language-modeling+bert-base-cased

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121202
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-07 07:40:00 +00:00
291ce86a6c Modify StorageImplCreateHelper (#118459)
I want to use tensor.untyped_storage()[a:b] with the ``PrivateUse1`` backend, but it fails. The code goes into ``THPStorage_get``:
bb6eba189f/torch/csrc/Storage.cpp (L525-L540)

Here ``torch`` creates a new ``c10::StorageImpl`` without taking the ``PrivateUse1`` backend into account.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118459
Approved by: https://github.com/albanD
2024-03-07 06:26:55 +00:00
f848e9c646 [Quant][Inductor] Fix q/dq per channel lowering with 64-bit qparams (#120984)
Fixes #120869

Fix lowering of `quantize_per_channel` and `dequantize_per_channel` with float64 scale and int64 zero point.
The generated code is incorrect without explicit type conversion. Add type conversion to the lowering pass, i.e., float64 (double) -> float32 and int64 -> int32.
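
To make the quantities concrete, here is a minimal pure-Python sketch of per-channel quantize/dequantize arithmetic along the first axis. The helper names `quantize_per_channel_ref` and `dequantize_per_channel_ref` are hypothetical and this is not the Inductor lowering; it only illustrates that every channel carries its own scale and zero point, which are the values whose 64-bit types the lowering has to convert.

```python
def quantize_per_channel_ref(rows, scales, zero_points, qmin, qmax):
    """Quantize each row (channel) with its own scale and zero point."""
    out = []
    for row, scale, zp in zip(rows, scales, zero_points):
        # q = clamp(round(x / scale) + zero_point, qmin, qmax)
        out.append([min(max(round(x / scale) + zp, qmin), qmax) for x in row])
    return out


def dequantize_per_channel_ref(qrows, scales, zero_points):
    """Invert the mapping: x ~= (q - zero_point) * scale."""
    return [
        [(q - zp) * scale for q in qrow]
        for qrow, scale, zp in zip(qrows, scales, zero_points)
    ]


rows = [[0.5, 1.0], [2.0, -4.0]]
scales = [0.5, 2.0]        # one scale per channel
zero_points = [0, 10]      # one zero point per channel
q = quantize_per_channel_ref(rows, scales, zero_points, 0, 255)
print(q)                                                   # [[1, 2], [11, 8]]
print(dequantize_per_channel_ref(q, scales, zero_points))  # [[0.5, 1.0], [2.0, -4.0]]
```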

**Test plan**
python test/inductor/test_cpu_repro.py -k test_per_channel_fake_quant_module_uint8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120984
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-07 06:23:52 +00:00
a2854ae904 Bugfix consume_prefix_in_state_dict_if_present function to keep the order of the state_dict (#117464)
This PR keeps the keys in the same order as the original state_dict, as the issue creator proposed. It also fixes a bug in how ``_metadata`` is handled (see below) and makes other small changes to properly remove the prefix when it is present.

In the original code, ``_metadata`` was handled as a ``key``.

```
    # also strip the prefix in metadata if any.
    if "_metadata" in state_dict:
```

This is not the case, ``_metadata`` is actually an ``attribute``. Hence, the previous condition is changed to:

```
    # also strip the prefix in metadata if any.
    if hasattr(state_dict, "_metadata"):
```

This PR also includes the necessary test.
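
The two ideas above (keep key order, treat ``_metadata`` as an attribute) can be sketched with the standard library alone. The helper name `strip_prefix_keep_order` is hypothetical and this is not the code from the PR; it is a simplified illustration that assumes no two keys collide after the prefix is removed.

```python
from collections import OrderedDict


def strip_prefix_keep_order(state_dict, prefix):
    """Hypothetical helper: strip `prefix` from keys in place, keeping order."""
    # Pop and re-insert every key in its original order so that renamed
    # entries do not all end up at the back of the mapping.
    for key in list(state_dict.keys()):
        value = state_dict.pop(key)
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        state_dict[new_key] = value

    # `_metadata` is an attribute of the mapping, not one of its keys.
    metadata = getattr(state_dict, "_metadata", None)
    if metadata is not None:
        for key in list(metadata.keys()):
            value = metadata.pop(key)
            new_key = key[len(prefix):] if key.startswith(prefix) else key
            metadata[new_key] = value


sd = OrderedDict([("module.a", 1), ("b", 2), ("module.c", 3)])
sd._metadata = OrderedDict([("module.a", {"version": 1})])
strip_prefix_keep_order(sd, "module.")
print(list(sd.keys()))                               # ['a', 'b', 'c']
print("_metadata" in sd, list(sd._metadata.keys()))  # False ['a']
```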

Fixes #106942

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
2024-03-07 04:00:49 +00:00
eb4d87f237 graph break on sparse tensors constructions (#120458)
Fix some tests in https://github.com/pytorch/pytorch/issues/119780
sparse_bsc_tensor is not supported
https://github.com/pytorch/pytorch/pull/117907

Also more about the issue here.
https://docs.google.com/document/d/1EIb4qG88-SjVFn5TloLERliYdxIu2hwYoAA8skjOVfo/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120458
Approved by: https://github.com/ezyang
2024-03-07 02:17:41 +00:00
1a28ebffb3 [TP] Introduce Sequence Parallel Style for LayerNorm/RMSNorm/Dropout (#121295)
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. Until now we mainly used manual
distribute_module calls when sharding the RMSNorm layer, but I
think we should have a dedicated TP API to easily shard those layers,
instead of having users manually use DTensors.

I call this SequenceParallel, which might cause some confusion given that we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe getting the name right is the first priority, instead of
worrying about the issue of reusing the old name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
2024-03-07 02:04:59 +00:00
967dd31621 [cuDNN] Cleanup cuDNN < 8.1 ifdefs (#120862)
Follow-up of #95722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120862
Approved by: https://github.com/Skylion007
2024-03-07 01:46:25 +00:00
b9087f8571 [profiler] Add execution_trace_observer as an optional argument to profiler (#119912)
# Update Profiler API to collect Execution Traces

## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        execution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```

See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.

## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads.  It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of ML Commons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki)

Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks, and helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)]

## Why correlate Execution Trace with PyTorch/Kineto Trace

Execution Traces and Kineto traces provide different types of information, and combining them is useful. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different times will be inaccurate, as several operations (NCCL, embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that there are two code paths.

## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable execution trace to be collected simultaneously, see TLDR section

# Testing
Updated the unit test for collecting kineto and Execution Trace together.
- Check the collected ET has right range of events.
- Compare two sets of IDs - record func Ids in ET and external IDs in Kineto. We check if these have a constant difference.
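
The "constant difference" check can be sketched in a few lines of plain Python. The function name `has_constant_offset` and the sample ID lists are hypothetical and are not taken from the unit test; the sketch assumes both traces list the same number of events and that sorting aligns corresponding events.

```python
def has_constant_offset(et_ids, kineto_ids):
    """Return True if the sorted ID lists differ by one fixed offset."""
    if not et_ids or len(et_ids) != len(kineto_ids):
        return False
    # Collect the pairwise differences; a single distinct value means
    # the two ID spaces are shifted copies of each other.
    offsets = {k - e for e, k in zip(sorted(et_ids), sorted(kineto_ids))}
    return len(offsets) == 1


# Hypothetical IDs: every Kineto external ID is the ET ID plus 100.
print(has_constant_offset([3, 4, 7], [103, 104, 107]))  # True
print(has_constant_offset([3, 4, 7], [103, 105, 107]))  # False
```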

```
pytest test/profiler/test_profiler.py  -k test_execution_trace_with_kineto -rP

Running 1 items in this shard

test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-03-07 01:30:26 +00:00
a88356f45c [dtensor] make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294)
add_.Tensor and div_.Scalar should support linearity so that we delay the partial
results.

This fixes the additional collective in the layernorm layer that we have seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121294
Approved by: https://github.com/tianyu-l
2024-03-06 22:52:18 +00:00
372f192050 [DTensor] Initialized RNG tracker if needed (#121328)
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).

```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
2024-03-06 22:21:44 +00:00
69cedc16c5 Add padding dimension checks and tests (#121298)
Fixes #121093

Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad3d, torch._C._nn.replication_pad3d
```

To fix, added condition checking to raise a runtime error with a debug message instead, specifying the correct dimensions necessary.
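
As an illustration of this kind of up-front validation, here is a small pure-Python sketch. The function `check_replication_pad` and its error messages are hypothetical and do not reproduce the checks or wording added by the PR; it assumes the usual convention that an N-dimensional replication pad takes 2*N padding values and an input with N+1 (unbatched) or N+2 (batched) dimensions.

```python
def check_replication_pad(input_shape, padding, spatial_dims):
    """Raise RuntimeError on invalid arguments instead of reading out of bounds."""
    expected_pad_len = 2 * spatial_dims
    allowed_ranks = (spatial_dims + 1, spatial_dims + 2)
    if len(padding) != expected_pad_len:
        raise RuntimeError(
            f"padding size is expected to be {expected_pad_len}, "
            f"but got {len(padding)}"
        )
    if len(input_shape) not in allowed_ranks:
        raise RuntimeError(
            f"expected {allowed_ranks[0]}D or {allowed_ranks[1]}D input, "
            f"but got {len(input_shape)}D"
        )


check_replication_pad((4, 3, 8), (1, 1), spatial_dims=1)  # valid: passes silently
try:
    check_replication_pad((4, 3, 8), (1, 1, 1, 1), spatial_dims=1)
except RuntimeError as err:
    print("rejected:", err)
```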
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
2024-03-06 21:55:34 +00:00
d7a5e59647 [dynamo] support group=None when rewriting collectives (#121043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043
Approved by: https://github.com/awgu
2024-03-06 21:37:19 +00:00
e865700f6a [FSDP2] Added initial meta-device init support (#120351)
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.

We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor

We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.

```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
  model = ...
  parallelize_module(model, tp_mesh, ...)
  fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
  assert param.device.type == "meta"

model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
  if hasattr(module, "reset_parameters"):
    module.reset_parameters()
```

This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
2024-03-06 21:18:25 +00:00
76f3663efe Fixed a memory leak when calling from_numpy on a numpy array with an … (#121156)
…unsupported dtype.

Fixes #121138.

The lambda function that DECREFs the object is not called when the dtype conversion function throws. This PR moves the conversion before the INCREF, which prevents the memory leak.
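
The ordering issue can be mimicked in plain Python with a manual counter. Everything here (`Tracked`, `convert_dtype`, `wrap_leaky`, `wrap_fixed`) is a hypothetical stand-in for the C++ code, meant only to show why doing the fallible step before taking the reference avoids the leak.

```python
class Tracked:
    """Stand-in for an object with a manually managed reference count."""

    def __init__(self):
        self.refcount = 0


def convert_dtype(name):
    """Fallible step: raises for dtypes this sketch does not support."""
    if name not in {"float32", "int64"}:
        raise TypeError(f"unsupported dtype: {name}")
    return name


def wrap_leaky(obj, dtype_name):
    obj.refcount += 1                  # "INCREF" first
    dtype = convert_dtype(dtype_name)  # may raise: the release below is never set up
    return dtype, lambda: setattr(obj, "refcount", obj.refcount - 1)


def wrap_fixed(obj, dtype_name):
    dtype = convert_dtype(dtype_name)  # fallible work happens before the "INCREF"
    obj.refcount += 1
    return dtype, lambda: setattr(obj, "refcount", obj.refcount - 1)


for wrap in (wrap_leaky, wrap_fixed):
    obj = Tracked()
    try:
        wrap(obj, "object")
    except TypeError:
        pass
    print(wrap.__name__, "leftover refcount:", obj.refcount)
```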

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121156
Approved by: https://github.com/soulitzer, https://github.com/albanD
2024-03-06 19:37:38 +00:00
360761f7d0 [Torchelastic] Create root log directory by default (#121257)
Summary:
After the refactoring in https://github.com/pytorch/pytorch/pull/120691, the default behavior was unintentionally changed from creating a tempdir for logging to not capturing any logs by the torch Elastic Agent.

Reverting the behavior to:
- making tempdir when log dir is not specified
- allowing non-empty root log dir
    - Note: in case attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294

Differential Revision: D54531851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257
Approved by: https://github.com/d4l3k
2024-03-06 18:50:38 +00:00
418568d2e3 Add Float8 support to onnx exporter (#121281)
Fixes #106877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121281
Approved by: https://github.com/BowenBao, https://github.com/titaiwangms
2024-03-06 18:46:56 +00:00
c5ef4df274 guard on grads being None in compiled optimizers (#121291)
Fixes #115607

We were missing guards when the grads were set to `None`. So if we compiled the optimizer with grads set to their proper value, and then with the grads set to `None` we'd continuously run the `None` version because all of the guards would pass and it would be ordered before the correct version in the cache.
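
A toy model of a guarded code cache shows the failure mode. `GuardedCache` and the entry names are hypothetical and greatly simplified compared to Dynamo's real cache; the sketch assumes that newer entries are checked first, as the description above implies.

```python
class GuardedCache:
    """Toy cache: each entry is (guards, compiled_fn); newest entry is checked first."""

    def __init__(self):
        self.entries = []

    def add(self, guards, fn):
        self.entries.insert(0, (guards, fn))

    def lookup(self, state):
        for guards, fn in self.entries:
            if all(guard(state) for guard in guards):
                return fn
        return None


# Without a guard on the grad, the newer "grad is None" entry matches everything.
unguarded = GuardedCache()
unguarded.add([], "step_with_grad")
unguarded.add([], "step_grad_none")
print(unguarded.lookup({"grad": 1.0}))  # step_grad_none (wrong version)

# With guards on whether the grad is None, the right entry is selected.
guarded = GuardedCache()
guarded.add([lambda s: s["grad"] is not None], "step_with_grad")
guarded.add([lambda s: s["grad"] is None], "step_grad_none")
print(guarded.lookup({"grad": 1.0}))    # step_with_grad
print(guarded.lookup({"grad": None}))   # step_grad_none
```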

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121291
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-03-06 18:33:23 +00:00
c66d68ba51 [PT2] Add tolist() to FunctionalTensor for torch.export (#121242)
Adding tolist() to FunctionalTensor for torch.exporting TorchRec data types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121242
Approved by: https://github.com/ezyang
2024-03-06 18:10:44 +00:00
05c256849b [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- requires user to audit their code and opt-in their custom autograd::Function via autograd::Function::is_traceable and maybe additional compiled_args + apply_with_saved implementation. this was the only way I can think of for soundness
- will throw if we can't hash the saved_data i.e. for any non implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function, yet that has an identical implementation, is called. this case seems extremely unlikely, and the only alternative to hash collision i can think of is compiling with reflection
- tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). if needed, we can lift them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-06 18:01:56 +00:00
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5b7d4aa412a3988d3d91ca4c5679303.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
a427d90411 add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speedup https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`
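
For reference, the relative speedups implied by these figures can be computed directly from the tokens/sec numbers quoted above (the variable names are just for illustration):

```python
# Throughput figures quoted in this commit message (tokens/sec).
baseline_tps = 12.40
woq_int4_tps = {"avx512": 33.79, "avx2": 29.00}

# Speedup is the ratio of quantized throughput to baseline throughput.
speedups = {isa: tps / baseline_tps for isa, tps in woq_int4_tps.items()}
for isa, speedup in speedups.items():
    print(f"WOQ int4 on {isa}: about {speedup:.1f}x the baseline tokens/sec")
```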

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-06 16:25:53 +00:00
54d92f2e37 Add jacrev support in torch.compile (#121146)
Changes are simple. Moved a few entries on trace_rules.py and included tests to compare the graph generated by jacrev

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121146
Approved by: https://github.com/zou3519
2024-03-06 16:05:33 +00:00
49d1fd31cf Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) (#120077)
Description:
- This PR tries to fuse nodes with compatible sizes, for example `node1: (s0, s1, s2)` and `node2: (s0 * s1 * s2)`. On `main` these two nodes can't be fused because their sizes differ. With this PR we can recompute node2's size, body, etc. using node1's indexing constraint and thus are able to fuse the two nodes.
- this should influence only cpu device

Example:
```python
from unittest.mock import patch
import torch
from torch._inductor.graph import GraphLowering
from torch._inductor import config

# Force multiple scheduler nodes creation to fuse them
config.realize_opcount_threshold = 1

@torch.compile(fullgraph=True, dynamic=True)
def fn(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    o1 = x * w1.view(1, 1, 1, -1)
    o2 = x * w2.view(1, 1, 1, -1)
    output = o1 + o2
    return output

in_nodes = []
outputs = []
run_node = GraphLowering.run_node

graph_lowering_obj = None

def run_node_alt(self, n):
    global graph_lowering_obj

    graph_lowering_obj = self
    in_nodes.append(n)
    output = run_node(self, n)
    outputs.append(output)

    return output

x = torch.rand(1, 3, 32, 32)
w1 = torch.randn(32)
w2 = torch.randn(32)

with patch.object(GraphLowering, "run_node", run_node_alt):
    fn(x, w1, w2)

print("graph_lowering_obj.buffers:", graph_lowering_obj.buffers)
print("graph_lowering_obj.scheduler:", graph_lowering_obj.scheduler.nodes)
```

Output on `main`:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1), SchedulerNode(name='buf2')]
```
Output on this PR:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1_buf2)]
```

Context:
While working on https://github.com/pytorch/pytorch/pull/120411, upsampling bicubic decomposition, I saw an extra for-loop in the generated C++ code summing up two buffers. Exploring the cause, it happened because the buffer's number of ops went beyond `config.realize_opcount_threshold`.
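
The reason sizes `(s0, s1, s2)` and `(s0 * s1 * s2,)` are compatible can be checked with plain Python: for a contiguous row-major layout, the nested loops visit exactly the same flat offsets, in the same order, as a single flat loop. The helper `flat_offsets_from_nd` is hypothetical and is not Inductor code; it only illustrates the indexing equivalence that makes re-expressing one node in the other's iteration space possible.

```python
from itertools import product


def flat_offsets_from_nd(sizes):
    """Flat row-major offset for each index tuple, in nested-loop order."""
    # Contiguous strides: the last dimension has stride 1.
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.insert(0, running)
        running *= size
    return [
        sum(i * stride for i, stride in zip(index, strides))
        for index in product(*(range(n) for n in sizes))
    ]


sizes = (3, 4, 5)
nd_order = flat_offsets_from_nd(sizes)
flat_order = list(range(3 * 4 * 5))
print(nd_order == flat_order)  # True
```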

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120077
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/peterbell10
2024-03-06 12:19:45 +00:00
aa0b0944d5 [dynamo] Re-dispatch torch.Tensor.new into torch.Tensor.new_empty method. (#121075)
Fix: https://github.com/pytorch/xla/issues/6009

This PR adds another case to `TensorVariable.method_new` special case, where it
re-dispatches `new` into `new_empty`.

Since we are using fake tensors, the `new` call doesn't actually get to the corresponding
backend (e.g. XLA). So, things like the following might happen:

```python
@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())

    # new_x.device() == "xla"
    # x.device() == "xla:0"

    return new_x + x

a = torch.arange(10)
foo(a.to(xm.xla_device()))
```

Resulting in the following error:

```python
Traceback (most recent call last):
  ...
  File "torch/_dynamo/utils.py", line 1654, in get_fake_value
    ret_val = wrap_fake_exception(
  File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
    return fn()
  File "torch/_dynamo/utils.py", line 1655, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "torch/_dynamo/utils.py", line 1776, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "torch/_dynamo/utils.py", line 1758, in run_node
    return node.target(*args, **kwargs)
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
    return self.wrap_meta_outputs_with_default_device_logic(
  File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
    return tree_map(wrap, r)
  File "torch/utils/_pytree.py", line 900, in tree_map
    return treespec.unflatten(map(func, *flat_args))
  File "torch/utils/_pytree.py", line 736, in unflatten
    leaves = list(leaves)
  File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
    ) = FakeTensor._find_common_device(func, flat_args)
  File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
    merge_devices(arg)
  File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
    raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```

Using `new_empty`, instead, fixes this error because it uses the device from the source
tensor, instead of inferring from the current dispatch key set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel
2024-03-06 11:49:27 +00:00
b6b2d5b00a [dynamo][guards-cpp-refactor] Pass source name for debug ease (#121154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121154
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147
2024-03-06 08:36:45 +00:00
52d89d8491 [dynamo][guards-cpp-refactor] Simplify DictGuardManager by removing KeyValueDictGuardManager (#121147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121147
Approved by: https://github.com/jansel
ghstack dependencies: #121121
2024-03-06 08:36:45 +00:00
0b9bfcf9bb [non-strict export] support tensor attribute without other args (#121176)
Summary: Without args we have a hard time detecting fake modes. This causes a fake mode mismatch error in non-strict (specifically, `aot_export_module`) when the module contains tensor attributes, because we create a fresh fake mode when we cannot detect one. The fix is to pass the same fake mode throughout.

Test Plan: added test

Differential Revision: D54516595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121176
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-03-06 08:10:00 +00:00
099ff51d45 torch check the division by zero in batch_norm_update_stats (#120882)
Fixes #120803

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120882
Approved by: https://github.com/CaoE, https://github.com/malfet
2024-03-06 05:40:21 +00:00
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
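
As a rough illustration of the kernel selection described above, the sketch below models a single backend-agnostic entry point choosing among the four kernels. The function `pick_batch_norm_kernel` and its `cudnn_available` / `miopen_available` flags are hypothetical and exist only for this example; only the kernel names come from the hierarchy above, and the real selection logic lives in ATen and likely depends on more conditions than shown here.

```python
def pick_batch_norm_kernel(
    device: str,
    cudnn_available: bool = False,
    miopen_available: bool = False,
) -> str:
    """Hypothetical model of backend selection; not the actual ATen logic."""
    if device == "cpu":
        return "_batch_norm_cpu"
    if device == "cuda":
        # Assumption: prefer a vendor library kernel whenever one is installed.
        if cudnn_available:
            return "cudnn_batch_norm"
        if miopen_available:
            return "miopen_batch_norm"
        return "_batch_norm_cuda"
    raise ValueError(f"unsupported device: {device}")


# A model exported on CPU and later moved to CUDA would re-select its kernel.
print(pick_batch_norm_kernel("cpu"))
print(pick_batch_norm_kernel("cuda", cudnn_available=True))
```

The point of the sketch is that the graph would record only the single consolidated op, so the kernel choice can be made at run time on the target device rather than being baked in at export time.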

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the two-week FC
window, and the ops used in the old stack are planned to
be removed after the six-month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
dad1b76584 Introduce EphemeralSource for symbols that should be simplified out (#120948)
Context: view fake-ification should handle closed-over state in ViewFuncs for use in view replay by:
* fake-ifying tensors
* symbolicizing SymInts

This avoids invalid specialization during view replay. However, the symbols / tensors created as intermediates in the view chain should not stick around or be guarded on. This PR introduces an `EphemeralSource` intended to be used as a source for this purpose. It has the following properties:
* Considered first to be simplified out in symbol simplification logic
* Errors if guarded on

Differential Revision: [D54561597](https://our.internmc.facebook.com/intern/diff/D54561597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120948
Approved by: https://github.com/ezyang
2024-03-06 02:30:52 +00:00
eqy
8dafc81ba9 [cuBLAS][cuBLASLt] Fix expected failures for int_mm on sm75 (turing) (#121277)
CC @malfet @atalman @ptrblck @tinglvv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121277
Approved by: https://github.com/malfet
2024-03-06 01:51:01 +00:00
4b3903379a Add assign argument to torch.Tensor.module_load (#121158)
Make `torch.__future__.get_swap_module_params_on_conversion() == True` account for the `assign` argument to `nn.Module.load_state_dict`.

As when swapping is disabled via `torch.__future__.set_swap_module_params_on_conversion(False)`, `assign=True` means that we do not incur a `self.copy_(other)` and the properties of `other` will be preserved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121158
Approved by: https://github.com/albanD
ghstack dependencies: #121157
2024-03-06 01:32:06 +00:00
27389e03f0 [easy] Fixed requires_grad preservation for nn.Module.load_state_dict(assign=True) (#121157)
Always preserve requires_grad of param in module. Documentation fixed in PR stacked above.
Also fix the test case to test loading a state_dict generated with `keep_vars=False` (the default).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121157
Approved by: https://github.com/albanD
2024-03-06 01:32:06 +00:00
412c687e2e Fix permuted sum precision issue for lower precision on CPU (#108559)
Fixes #83149
There is a limitation of `TensorIterator` reductions:
The non-permuted input tensor will be coalesced down to a 2-d tensor by `TensorIterator`, whereas the permuted case may become a >2-d operation (for example, two reduced dimensions and one non-reduced dimension).
Since the CPU reduction loop of `TensorIterator` only operates on two dimensions at a time, the intermediate sums are truncated to lower precision.
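
To make the precision loss concrete, here is a minimal pure-Python sketch. It is not the `TensorIterator` code and not the fix from this PR; it only simulates half precision with the standard library's `struct` module (format `"e"`) and compares rounding the running sum after every step against accumulating in a wider type and rounding once. The helper names are made up for this example.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sum_rounding_every_step(values):
    """Keep the running sum in half precision (rounded after each add)."""
    total = 0.0
    for v in values:
        total = to_half(total + to_half(v))
    return total


def sum_with_wide_accumulator(values):
    """Accumulate in double precision and round to half only at the end."""
    total = 0.0
    for v in values:
        total += to_half(v)
    return to_half(total)


ones = [1.0] * 4096
low = sum_rounding_every_step(ones)
wide = sum_with_wide_accumulator(ones)
print(low, wide)  # the half-precision accumulator stalls at 2048.0
```

Half precision has an 11-bit significand, so once the running sum reaches 2048 adding 1.0 no longer changes it. Any reduction that writes intermediate sums back in the low-precision dtype can hit the same effect.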

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108559
Approved by: https://github.com/mingfeima, https://github.com/peterbell10
2024-03-06 01:01:35 +00:00
34e3f6f3c9 fix segfault in torch.native_channel_shuffle when input is empty (#121199)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

fix https://github.com/pytorch/pytorch/issues/121092

`torch.channel_shuffle` handles empty inputs correctly, but `torch.native_channel_shuffle` bypassed the `numel == 0` check, which caused a division by zero in the underlying kernel.
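
The failure mode can be sketched in pure Python. The function below is a hypothetical stand-in for a channel shuffle kernel over a flattened NCHW buffer, not the actual ATen implementation: it infers the batch size with an integer division, which fails when any dimension is zero unless the empty case is handled first.

```python
def channel_shuffle_flat(flat, channels, height, width, groups):
    """Shuffle channels of a flattened NCHW buffer; batch size is inferred."""
    # Guard first: without it, the division below raises ZeroDivisionError
    # whenever channels, height, or width is 0.
    if len(flat) == 0:
        return []
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    image_size = height * width
    batch = len(flat) // (channels * image_size)
    per_group = channels // groups
    out = [None] * len(flat)
    for n in range(batch):
        for c in range(channels):
            # View channels as (groups, per_group) and transpose the two axes.
            g, j = divmod(c, per_group)
            dst_c = j * groups + g
            src = (n * channels + c) * image_size
            dst = (n * channels + dst_c) * image_size
            out[dst:dst + image_size] = flat[src:src + image_size]
    return out


print(channel_shuffle_flat([0, 1, 2, 3], channels=4, height=1, width=1, groups=2))
print(channel_shuffle_flat([], channels=4, height=0, width=0, groups=2))
```

Placing the emptiness check before any arithmetic on the shape is the design choice that matters here; the shuffle itself is unchanged for non-empty inputs.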

* __->__ #121199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121199
Approved by: https://github.com/malfet
2024-03-06 00:46:36 +00:00
cac36e232e [PyTorch] Split StaticModule out of test_static_runtime (#121028)
I want to use StaticModule in another (internal) test, so splitting it out.

Differential Revision: [D54384817](https://our.internmc.facebook.com/intern/diff/D54384817/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121028
Approved by: https://github.com/suo
2024-03-05 23:14:07 +00:00
b3a9d677a3 [ez] Add super() calls in test_custom_ops (#121239)
Some disable issues are getting spammed
Check that test_impl_invalid_devices gets skipped by the disable issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121239
Approved by: https://github.com/zou3519
2024-03-05 21:16:06 +00:00
34a28f01dd [Autograd] Improve error for leaf tensors as out argument to fallback (#121089)
Closes #120988

Currently operators that hit the autograd fallback call `check_inplace`
on all mutated inputs, including out arguments. This leads to a slightly
confusing error message:
```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

Compared to functions that don't fall back, which raise
```
RuntimeError: add(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.
```

This changes the error message to make clear the issue is with the out argument,
but does not tighten the check to outright ban out arguments that require grad.
Instead, I use the same checks from `check_inplace`, which allow non-leaf tensors
that require grad to pass without error.
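
The rule described above can be modeled in a few lines of plain Python. This is only a sketch: `TensorInfo` and `check_out_argument` are hypothetical names, the error text is illustrative rather than the message this PR emits, and the real check is implemented in the C++ autograd fallback.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    """Hypothetical stand-in for the two tensor properties the check reads."""
    requires_grad: bool
    is_leaf: bool


def check_out_argument(op_name: str, out: TensorInfo) -> None:
    """Reject only leaf tensors that require grad, as described above."""
    if out.requires_grad and out.is_leaf:
        # Illustrative wording; it names the out argument explicitly.
        raise RuntimeError(
            f"{op_name}(): a leaf tensor that requires grad "
            "was passed as an out= argument"
        )


# Non-leaf tensors that require grad pass, as do tensors without grad.
check_out_argument("add", TensorInfo(requires_grad=True, is_leaf=False))
check_out_argument("add", TensorInfo(requires_grad=False, is_leaf=True))
```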
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121089
Approved by: https://github.com/lezcano, https://github.com/soulitzer
ghstack dependencies: #121142
2024-03-05 21:13:27 +00:00
eae9751e82 Fix linalg_eigvals invalid use of composite dispatch key (#121142)
`linalg_eigvals_out` calls into a dispatch stub, so it only supports CPU and CUDA
strided tensors, but it incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing in a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
2024-03-05 21:13:27 +00:00