Compare commits

...

1376 Commits

Author SHA1 Message Date
3dfc029d19 disable 2024-07-15 16:54:30 -07:00
54a932b0ac Support for expandable segments with cuda graph trees (#128068)
This PR adds support for using expandable segments with private memory pools, which should unblock using them with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool because checkpoint saving/restoring does not mesh well with how we keep track of unmapped blocks.

The PR itself is pretty short; most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial because each memory pool functions as a single segment of memory with a contiguous range of addresses that can grow and shrink as needed, avoiding the fragmentation that comes from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it is split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned to CUDA when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating memory-usage stats, moving blocks between active and free pools, and returning memory to CUDA.
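A minimal sketch of how this combination is exercised (assuming a CUDA build; `fn` and `args` are placeholders, and the warm-up/capture pattern below follows the standard `torch.cuda.graph` usage rather than this PR's internals):

```python
import os
# Opt the caching allocator into expandable segments before CUDA is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

def capture(fn, *args):
    # Warm up on a side stream, then capture into the graph's private memory
    # pool, which after this change can itself be backed by expandable segments.
    static_args = [a.clone() for a in args]
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        fn(*static_args)
    torch.cuda.current_stream().wait_stream(s)
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = fn(*static_args)
    return g, static_args, static_out
```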

With CUDA Graph Trees and private memory pools, we need the ability to take a checkpoint of the memory allocator's state after each graph capture, and to reapply that state before capturing a new graph after replaying a captured graph. This way, the new capture sees the allocator state as it was after the previous replay and can reuse empty blocks and allocate new ones.

As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed, and allocations can only grow from additional graph captures; freeing memory would result in invalid memory addresses and would break cuda graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap this memory back to CUDA; it persists across graph captures/replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in each segment to capture its state, which is then saved until it needs to be reapplied. For expandable blocks, the last block in every segment will be an unallocated, unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying a checkpoint works by freeing all allocated blocks and merging them into a single block per segment; then, for each segment, we manually split and allocate all blocks from the checkpoint and free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need some modifications so that we do not split unmapped blocks and do not manually map and then free unmapped blocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/eqy, https://github.com/eellison
2024-07-15 23:23:23 +00:00
006020ff6e Fix the cudagraph capture of SDPA (#130712)
Summary: The scalar tensor is on CPU by default, which fails the CUDA graph capture. To fix the issue, we put the scalar tensor on the GPU.
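A hedged illustration of the pattern being fixed (the name `scale` and the value are placeholders, not the actual SDPA code):

```python
import torch

# A scalar tensor created without an explicit device lives on the CPU; touching
# it from inside a CUDA graph capture fails the capture.
scale = torch.tensor(0.125)                  # CPU by default
scale = torch.tensor(0.125, device="cuda")   # keep the scalar on the GPU instead
```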

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator -- --exact 'gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator - gen_ai.llm_inference.fb.tests.test_llama2_multimodal_generator.TestGenerator: test_multimodal_decode_gen2'

Differential Revision: D59740639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130712
Approved by: https://github.com/Skylion007, https://github.com/chenyang78
2024-07-15 23:05:48 +00:00
50ef099ad0 Learn a heuristic to decide whether to pad before mm (#128643)
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating code from the regression tree.

The heuristics have been learned on artificial/random data collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` script can then be used to learn a heuristic and generate code from it.

The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set.

The heuristic can return "unsure", which means it is not sure which choice is best; in that case, autotuning will happen.

On A100, only tensors where each dimension is >= 512 are considered. For smaller tensors, the heuristics I learned returned "unsure" too often.

The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.

The heuristic is learned as a regression tree that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100, if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
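A rough sketch of this selection procedure (an assumption for illustration using scikit-learn; AutoHeuristic generates the tree to code rather than shipping a model, and the feature/target encoding below is made up):

```python
import itertools
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def learn_pad_mm_heuristic(X_tr, y_tr, X_val, y_val):
    # y holds the log speedup of "pad" over "no pad"; higher predictions mean
    # padding looks better. Pick the tree with the most correct val predictions.
    best_correct, best_tree = -1, None
    for max_depth, min_leaf in itertools.product([3, 4, 5, 6], [1, 5, 10, 20]):
        tree = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_leaf)
        tree.fit(X_tr, y_tr)
        correct = int(np.sum(np.sign(tree.predict(X_val)) == np.sign(y_val)))
        if correct > best_correct:
            best_correct, best_tree = correct, tree
    return best_tree

def decide(tree, feats, threshold=1.702530):
    # Return "pad", "no_pad", or "unsure" (which falls back to autotuning).
    score = float(tree.predict(np.asarray([feats]))[0])   # predicted log speedup
    if abs(score) < np.log(threshold):
        return "unsure"
    return "pad" if score > 0 else "no_pad"
```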

A100
```
       max_depth  min_samples_leaf dataset  correct  wrong  unsure  total  max_wrong_speedup  gman_wrong_speedup  threshold
15         5.0                10     train     2730      4    3023   5757           1.372220            1.193873   1.702530
16         5.0                10       val      878      0    1042   1920                NaN                 NaN   1.702530
17         5.0                10      test      925      2     993   1920           1.741708            1.354954   1.702530
18         5.0                10  hf-train       14      0      22     36                NaN                 NaN   1.702530
19         5.0                10    hf-inf        7      0       1      8                NaN                 NaN   1.702530
```

The numbers for huggingface only include tensors where each dim is >= 512. If all tensors had been included, there would have been the following number of matmuls where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10

## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").

I ran huggingface locally, running each model 5 times and taking the median speedup and compilation latency.
Results on huggingface training
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.19 (+/- 0.00)  1.19 (+/- 0.00)         -0.00              40.33 (+/- 1.13)             40.95 (+/- 0.78)                     -0.62                     1.52                   3                         2
  BartForConditionalGeneration   1.53 (+/- 0.06)  1.47 (+/- 0.05)          0.06              81.93 (+/- 5.20)             82.23 (+/- 1.92)                     -0.30                     0.36                   3                         1
    BlenderbotSmallForCausalLM   1.86 (+/- 0.04)  1.86 (+/- 0.00)          0.00              36.76 (+/- 0.49)             37.62 (+/- 1.33)                     -0.87                     2.31                   3                         2
                     CamemBert   2.36 (+/- 0.01)  2.35 (+/- 0.01)          0.01              97.60 (+/- 1.91)             98.69 (+/- 1.35)                     -1.09                     1.11                   2                         1
                   DistillGPT2   2.57 (+/- 0.01)  2.57 (+/- 0.01)          0.00              57.33 (+/- 0.77)             58.26 (+/- 1.41)                     -0.93                     1.59                   3                         2
             PLBartForCausalLM   2.07 (+/- 0.01)  2.06 (+/- 0.01)          0.01              32.54 (+/- 0.83)             34.65 (+/- 0.71)                     -2.11                     6.10                   3                         2
PLBartForConditionalGeneration   1.87 (+/- 0.00)  1.88 (+/- 0.00)         -0.01              58.45 (+/- 1.24)             58.95 (+/- 1.92)                     -0.50                     0.85                   3                         1
            RobertaForCausalLM   2.39 (+/- 0.01)  2.40 (+/- 0.01)         -0.01              97.38 (+/- 1.52)             97.69 (+/- 1.18)                     -0.31                     0.32                   2                         1
              TrOCRForCausalLM   1.70 (+/- 0.00)  1.70 (+/- 0.00)         -0.00              44.79 (+/- 1.33)             45.25 (+/- 1.08)                     -0.46                     1.01                   3                         2

Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```

Results on huggingface inference
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.11 (+/- 0.00)  1.11 (+/- 0.00)          0.00              19.02 (+/- 0.28)             19.40 (+/- 0.35)                     -0.38                     1.95                   3                         2
  BartForConditionalGeneration   1.26 (+/- 0.01)  1.23 (+/- 0.03)          0.03              36.84 (+/- 0.40)             36.55 (+/- 0.75)                      0.30                    -0.81                   3                         1
    BlenderbotSmallForCausalLM   1.87 (+/- 0.02)  1.87 (+/- 0.01)          0.00              17.53 (+/- 0.31)             18.03 (+/- 0.43)                     -0.49                     2.74                   3                         2
                   DistillGPT2   2.50 (+/- 0.02)  2.50 (+/- 0.01)          0.00              16.16 (+/- 0.29)             16.40 (+/- 0.18)                     -0.24                     1.46                   3                         2
             PLBartForCausalLM   1.93 (+/- 0.01)  1.94 (+/- 0.01)         -0.00              15.30 (+/- 0.22)             16.01 (+/- 0.71)                     -0.71                     4.43                   3                         2
PLBartForConditionalGeneration   1.98 (+/- 0.01)  1.98 (+/- 0.01)          0.00              25.90 (+/- 0.32)             26.58 (+/- 0.62)                     -0.67                     2.53                   3                         1
              TrOCRForCausalLM   1.61 (+/- 0.00)  1.62 (+/- 0.00)         -0.01              21.38 (+/- 0.37)             21.85 (+/- 0.16)                     -0.47                     2.16                   3                         2

Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```

For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-07-15 23:04:06 +00:00
9a5204dc2d [inductor] Remove "spawn" as an option for parallel compile method (#130746)
Summary: Looks like "spawn" is broken. Since we have "subprocess", I don't think we need it anymore, so just remove it as an option.

Test Plan: Verified that we get: `AssertionError: Invalid start method: spawn`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130746
Approved by: https://github.com/Skylion007
2024-07-15 22:55:54 +00:00
3f031b96c6 [Fix] Correctly identifying arguments for sub-blocks with renaming logic during TorchScript to ExportedProgram conversion (#128386)
#### Issue
Fix two issues related to inputs lifting when there are sub-blocks.
* Some inputs may appear in nested sub-blocks, which requires a recursive search to identify which arguments need to be lifted / passed into the top-level block (see the sketch below).
* Some inputs to a sub-block are intermediate results, meaning their names are just numbers. This causes issues during code generation (i.e., invalid argument names), so we rename those to valid names.
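A simplified sketch of the lifting idea (illustrative only, over a TorchScript-like block/node structure; not the converter's actual code):

```python
# Values consumed inside (possibly nested) sub-blocks but produced outside of
# them must be lifted to arguments of the top-level block; purely numeric names
# of intermediate values are renamed so generated code has valid identifiers.
def collect_inputs_to_lift(block, produced, to_lift):
    for node in block.nodes():
        for inp in node.inputs():
            if inp not in produced:
                to_lift.add(inp)
        for sub_block in node.blocks():            # recurse into If/Loop bodies
            collect_inputs_to_lift(sub_block, produced, to_lift)
        produced.update(node.outputs())

def sanitize_name(name: str) -> str:
    return f"arg_{name}" if name.isdigit() else name
```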

#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param`
* `test/export/test_converter.py -s -k test_hidden_input_name`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128386
Approved by: https://github.com/angelayi
2024-07-15 22:48:13 +00:00
b893aa71ca Rename generate_numeric_debug_handle to numeric_debugger (#130590)
Summary:
att

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130590
Approved by: https://github.com/dulinriley, https://github.com/tarun292
2024-07-15 22:42:27 +00:00
535016967a Enable UFMT on all of torch/sparse (#130545)
Partially addresses #123062
Ran lintrunner on:
- torch/sparse

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130545
Approved by: https://github.com/ezyang
2024-07-15 22:35:52 +00:00
7d4f50de19 dynamo add support for defaultdict(set) (#130745)
Fixes #130554
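A hedged repro-style example of what now compiles (the function body is an illustration, not taken from the linked issue):

```python
import torch
from collections import defaultdict

@torch.compile
def f(x):
    d = defaultdict(set)     # set as a default_factory was previously unsupported
    d["a"].add(1)
    d["a"].add(2)
    return x + len(d["a"])

print(f(torch.ones(3)))      # tensor([3., 3., 3.])
```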

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130745
Approved by: https://github.com/Skylion007
2024-07-15 22:23:33 +00:00
3928ca2ab6 [dynamo] update call map to allow multiple input parameters (#130748)
Fixes https://github.com/pytorch/pytorch/issues/128072.

Commandeering https://github.com/pytorch/pytorch/pull/128282 since the issue is now hi pri.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130748
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-07-15 22:16:49 +00:00
eqy
6f32dc0c7b Don't pass error message as places in assertGreaterAlmostEqual (#130648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130648
Approved by: https://github.com/awgu
2024-07-15 22:14:49 +00:00
dff9d68f18 Revert "Fix names conflict when lifting (#129817)"
This reverts commit 53cf46b8c602f8512d49a5c30bca7fcf5411e25c.

Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to Failing inductor/test_flex_attention.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27478084137 74da2a467f Sorry for the churn, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2229519886))
2024-07-15 22:08:45 +00:00
78799e82b0 Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)"
This reverts commit 1bc390c5f5ac065c156f55f4eceed267ecc67b41.

Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/jithunnair-amd due to test test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946 1bc390c5f5. Test was introduced by fa5f572748 which is before the merge base ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2229508737))
2024-07-15 21:59:46 +00:00
db3a641b71 Implement operator for micro-pipelined all-gather -> _scaled_mm (#129289)
This PR implements `torch.ops.symm_mem.fused_all_gather_scaled_matmul`. It's similar to `torch.ops.symm_mem.fused_all_gather_matmul`, except that it takes scales and calls ` _scaled_mm`.

[Profiling Trace vs. Baseline](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp0gmg1f2_) (FB internal only)

Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129289
Approved by: https://github.com/Chillee, https://github.com/weifengpy, https://github.com/drisspg
2024-07-15 21:48:35 +00:00
77fb5b0e23 [c10d] a new Pytorch API (split_group) to create a process group (#130507)
This is the implementation following the RFC: https://github.com/pytorch/pytorch/issues/130407

ncclCommSplit
Summary:
In the current PyTorch/c10d, the new_group API is used to create a new
process group from the default pg. When device_id is specified in
init_process_group and nccl is used as the backend, the new_group call
will use ncclCommSplit to create the nccl communicators to save
communicator resources. It has a few drawbacks:

Redundant calls
Suppose the default group has 256 ranks and we need 32 child PGs,
each with 8 ranks. In this case, each rank needs to call new_group and
ncclCommSplit 32 times because of how the new_group API is implemented
and the collective requirement of ncclCommSplit. For a specific global
rank, 31 of those ncclCommSplit calls would be no-color splits and only
1 would be a colored split. With the proposed split_group API, only 1
call of split_group/ncclCommSplit is needed per rank in the above
example case.

new_group can only split from default_pg
Ideally, a new pg should be able to be split from any pg

With the new split_group API, users can create new PGs using
ncclCommSplit with fewer calls and initialize the PG eagerly.
This is also useful in the cases of creating many P2P communicators.
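A rough sketch of the intended usage difference for the 256-rank example above (the `split_group` call and its signature follow the RFC description and are assumptions; the exact API may differ):

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))

groups = [list(range(i * 8, (i + 1) * 8)) for i in range(32)]  # 32 child PGs of 8 ranks

# Old way: every rank participates in all 32 new_group calls, and with a bound
# device_id each call triggers an ncclCommSplit (31 of them no-color splits).
child_pgs = [dist.new_group(ranks) for ranks in groups]

# Proposed way: a single eager split per rank (assumed signature).
my_pg = dist.split_group(split_ranks=groups)
```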
Test Plan:
New UTs:
e.g., python test/distributed/test_c10d_nccl.py -k
test_comm_split_group_larger_scale
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130507
Approved by: https://github.com/wconstab
2024-07-15 21:26:43 +00:00
ac3e2cb64a [BE] Delete unused -rg.yml workflow (#130759)
As well as `_linux-test-label.yml`, as the ARC experiment is dead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130759
Approved by: https://github.com/ZainRizvi
2024-07-15 20:41:59 +00:00
ee6f0ab190 [DeviceMesh][Reland] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495) (#130685)
Summary:
As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved and the loaded DTensor.

This is a hot fix to only consider the real thread_id in the DeviceMesh hash under the threaded backend, and to set it to None in all other cases.

As a follow-up, we need to look at the following test failures to better root-cause DeviceMesh failures related to MTPG when thread_id is not included as part of the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

We add an additional is_initialized() check since APF has a test that mocks the backend without a pg being initialized; the check is needed to avoid that test failing. In real use cases, a pg should be initialized before the get_backend() check. Not sure if we want to add this specifically for the test, but we are adding it temporarily to unblock APF conveyor runs.

Test Plan:
```
[irisz@devgpu051.cln3 /data/users/irisz/fbsource/fbcode (38e4a0a3b)]$ buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'
```

Reviewed By: gag1jain

Differential Revision: D59725924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130685
Approved by: https://github.com/gag1jain
2024-07-15 20:05:26 +00:00
27322355de Added some more documentation to block mask creation (#130649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130649
Approved by: https://github.com/drisspg
ghstack dependencies: #130626
2024-07-15 19:48:42 +00:00
0e79e1f958 [NJT+SDPA]Fix flash_attention output when batch_size=1 and seq_len=1 (#130652)
Fixes issue #130196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130652
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jbschlosser
2024-07-15 19:44:04 +00:00
074a5c0c9b Revert "[BE] bump optree version to 0.12.1 (#130139)"
This reverts commit 8fcb156e8b5697a8f292db6db2a1803c5f4ce2d7.

Reverted https://github.com/pytorch/pytorch/pull/130139 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_codegen_dynamic_shapes.py and test_sympy_utils.py 8fcb156e8b ([comment](https://github.com/pytorch/pytorch/pull/130139#issuecomment-2229248447))
2024-07-15 19:42:11 +00:00
f1456c74a0 Fix mkl-static issue for Windows. (#130697)
Background:
We found the pytorch Windows release/2.4 performance regression: https://github.com/pytorch/pytorch/issues/130619

After some debugging, I found that the PyTorch Windows static mkl build options are wrong:
<img width="1049" alt="image" src="https://github.com/user-attachments/assets/38692142-bfca-4c98-8092-6e105c82bb13">
1. The thread lib is wrong.
2. The `openmp` lib and config are missing.
> Debug history: https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226782504 and https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226418611

This PR fixes the `mkl-static` build options issue.
<img width="863" alt="image" src="https://github.com/user-attachments/assets/834f6cee-7e6d-4d74-b2bc-8a270f05e429">

Reference:
<img width="482" alt="image" src="https://github.com/user-attachments/assets/8184dadb-f230-4062-a49f-51df1d7285f5">

https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html#gs.c6izlg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130697
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-07-15 19:28:11 +00:00
a7cfe40c9b [dtensor] Improve from_local API with run_check (#130289)
as titled, this PR:
1. Switch `run_check` to default to False and add extra docs/comments
   about the correctness guarantee. Since I observed that many calls
   forget to pass run_check=False, we simply switch to not performing the
   metadata check and make our documentation explicit (see the usage
   sketch below).
2. Implement metadata check by picking up the changes from https://github.com/pytorch/pytorch/pull/115229
3. Improve the from_local documentation
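A brief usage sketch of the new default (module paths as of this change; the `WORLD_SIZE` env setup is just the usual torchrun convention):

```python
import os
import torch
from torch.distributed._tensor import DTensor, Shard
from torch.distributed.device_mesh import init_device_mesh

world_size = int(os.environ["WORLD_SIZE"])
mesh = init_device_mesh("cuda", (world_size,))
local_shard = torch.randn(8, 8, device="cuda")

# run_check now defaults to False: no collective metadata check is performed,
# so the caller is responsible for passing consistent shards across ranks.
dt = DTensor.from_local(local_shard, mesh, [Shard(0)])

# Opt back into the metadata check explicitly.
dt_checked = DTensor.from_local(local_shard, mesh, [Shard(0)], run_check=True)
```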

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130289
Approved by: https://github.com/awgu, https://github.com/wz337
ghstack dependencies: #130286, #130287, #130288
2024-07-15 18:52:55 +00:00
3342f3aa4e [dtensor] simplify sdpa strategies (#130288)
as titled, this PR simplifies both flash and efficient attention op
strategy generation paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130288
Approved by: https://github.com/tianyu-l
ghstack dependencies: #130286, #130287
2024-07-15 18:52:55 +00:00
7d82dc2c23 [dtensor] slice_backward to use op strategy (#130287)
as titled. slice_backward currently forwards the sharding
unconditionally, which is mathematically wrong. This PR switches it to an
op strategy and only allows replication.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130287
Approved by: https://github.com/awgu
ghstack dependencies: #130286
2024-07-15 18:52:49 +00:00
53cf46b8c6 Fix names conflict when lifting (#129817)
## Bug description
When pending args that may need to be lifted [here](58f346c874/torch/_dynamo/output_graph.py (L1866)) have the same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](58f346c874/torch/_dynamo/output_graph.py (L2081)) can end up creating a name ([here](58f346c874/torch/fx/graph.py (L1008))) that overwrites one of the args to lift, thus causing wrong graph output.

## Reproducing
Below is a reproducible example:
```python
import logging
from typing import List

import torch
from functorch.compile import aot_module_simplified, make_boxed_func

@torch.library.custom_op("mylib::somefunc_forward", mutates_args=())
def somefunc_forward(
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    return torch.ones_like(input_)

@somefunc_forward.register_fake
def _(input_, shape, weight):
    return torch.empty_like(input_)

@torch.library.custom_op("mylib::somefunc_backward", mutates_args=())
def somefunc_backward(
    grad_output: torch.Tensor,
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    print(f"backward.{grad_output.shape=}")
    print(f"backward.{input_.shape=}")
    print(f"backward.{weight.shape=}")
    print(f"backward.{shape=}")
    assert list(weight.shape) == shape
    return torch.ones_like(weight)

@somefunc_backward.register_fake
def _(grad_output, input_, weight, shape):
    return torch.empty_like(weight)

def a_func(grad_output, input_, weight_, shape):
    return torch.ones_like(input_.sum() * weight_)

class SomeFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, normalized_shape):
        ctx.normalized_shape = normalized_shape
        input_ = input.contiguous()
        weight_ = weight.contiguous()
        output = somefunc_forward(input_, weight_, ctx.normalized_shape)
        ctx.save_for_backward(input_, weight_)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input_, weight_ = ctx.saved_tensors
        # grad_weight = a_func(grad_output, input_, weight_, ctx.normalized_shape)
        grad_weight = somefunc_backward(
            grad_output.contiguous(),
            input_,
            weight_,
            ctx.normalized_shape,
        )
        return None, grad_weight, None

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(7))

    def forward(self, x):
        return SomeFunc.apply(x, self.weight, [7])

model = MyModel()
torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True)

def aot_print_backend(gm, sample_inputs):
    # Forward compiler capture
    def fw(gm, sample_inputs):
        print(f"----- fw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Backward compiler capture
    def bw(gm, sample_inputs):
        print(f"----- bw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Call AOTAutograd
    gm_forward = aot_module_simplified(
        gm, sample_inputs, fw_compiler=fw, bw_compiler=bw
    )
    return gm_forward

model = torch.compile(
    model,
    backend=aot_print_backend,
    dynamic=False,
)
out = model(torch.rand((128, 4, 7)))
out.mean().backward()
```

I can see logs showing the calls into create_graph_input, like:
```log
V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none)
V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none)
```

And the generated backward graph will look like:
```log
class GraphModule(torch.nn.Module):
    def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"):
        contiguous_1 = contiguous
        contiguous_2 = contiguous_1

        # No stacktrace found for following nodes
        _set_grad_enabled = torch._C._set_grad_enabled(False)

         # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(),
        contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous();  somefunc_forward_default = None

         # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return self._opoverload(*args, **kwargs)
        somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]);  contiguous = contiguous_1 = contiguous_2 = None

        # No stacktrace found for following nodes
        _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
        return (None, somefunc_backward_default)
```

The original `somefunc_backward` takes the inputs `grad_output`, `input_`, `weight`, and `shape`, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are assigned from `contiguous`, which leads to the assertion failure I added in `somefunc_backward`.

## Environment
```log
Collecting environment information...
PyTorch version: 2.5.0a0+git0b7e8df
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.9.19 (main, May  6 2024, 14:39:30)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==2.0.0
[pip3] optree==0.11.0
[pip3] torch==2.5.0a0+git0b7e8df
[pip3] torchgraph==0.0.1
[conda] numpy                     2.0.0                    pypi_0    pypi
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] torch                     2.5.0a0+git0b7e8df           dev_0    <develop>
[conda] torchgraph                0.0.1                     dev_0    <develop>
```

## How to fix?

I put in a naive fix that adds the potential args to lift into `used_names`. This touches private variables; I will fix that if this issue makes sense to you.

@zou3519 @oulgen

Co-authored-by: rzou <zou3519@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817
Approved by: https://github.com/zou3519
2024-07-15 18:49:12 +00:00
b4b64f76e5 Ensure tensors devices match on torch.index_put batch rule impl (#130479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130479
Approved by: https://github.com/zou3519
2024-07-15 18:16:31 +00:00
00d71b3e86 Tweak tolerances for test_vjp_linalg_tensorsolve_cuda_float32 to pass in Windows / debug builds (#130449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130449
Approved by: https://github.com/zou3519, https://github.com/malfet
ghstack dependencies: #128238, #130360
2024-07-15 17:35:34 +00:00
9e161af179 Revert "Increase tolerance for tensorsolve tests (#130620)"
This reverts commit 103b6ccab2bd025dfacc8c8a91f71f3d68e50426.

Reverted https://github.com/pytorch/pytorch/pull/130620 on behalf of https://github.com/clee2000 due to didn't work, test is still failing on this PR and on main, reverting in favor of https://github.com/pytorch/pytorch/pull/130449 instead ([comment](https://github.com/pytorch/pytorch/pull/130620#issuecomment-2229036418))
2024-07-15 17:35:04 +00:00
8fcb156e8b [BE] bump optree version to 0.12.1 (#130139)
0.12.0 Major Updates:

- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support

0.12.1 Updates:

- Fix warning regression during import when launched with strict warning filters

Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
2024-07-15 17:27:07 +00:00
1e897a0ca4 Revert "Fix names conflict when lifting (#129817)"
This reverts commit 74da2a467f166e00316aee82ba24835ca563ed87.

Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to broke dynamo/test_inline_inbuilt_nn_modules.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27461141919 74da2a467f.  Test passed on PR, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2228993570))
2024-07-15 17:09:52 +00:00
0099e15b47 Also put unbacked symbols in symbol_to_node in split_module pass (#130535)
This is not a complete fix, but it is a simple one; the full fix is tracked
in https://github.com/pytorch/pytorch/issues/130534

Internal xref:
https://fb.workplace.com/groups/6829516587176185/posts/7510238679103969/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130535
Approved by: https://github.com/malfet
2024-07-15 16:56:01 +00:00
ca2d424c6e Tighten torch.library.infer_schema input types (#130705)
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
  torch.library.custom_op (which makes it mandatory because it's easy to
  miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema
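A small sketch of the resulting call style (the example function is made up; the schema strings in the comments are what I'd expect rather than verified output):

```python
import torch
from torch.library import infer_schema

def add_one(x: torch.Tensor) -> torch.Tensor:
    return x + 1

# mutates_args must now be passed, and only as a keyword argument.
print(infer_schema(add_one, mutates_args=()))                     # e.g. "(Tensor x) -> Tensor"
# op_name is keyword-only as well.
print(infer_schema(add_one, op_name="add_one", mutates_args=()))  # e.g. "add_one(Tensor x) -> Tensor"
```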

This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
2024-07-15 16:43:57 +00:00
9df4bc6a0d Revert "Constant folding for dynamic shape node (#129686)"
This reverts commit b7d287fbec0a05a3d4c9524006e6bfd1de6a71a0.

Reverted https://github.com/pytorch/pytorch/pull/129686 on behalf of https://github.com/atalman due to Failing internally.  Test: https://github.com/pytorch/ao/blob/main/test/prototype/mx_formats/test_mx_linear.py ([comment](https://github.com/pytorch/pytorch/pull/129686#issuecomment-2228755295))
2024-07-15 15:19:24 +00:00
7cd48df2da Refine the logic of device construction when only device index is given (#129119)
# Motivation
Before this PR, a device constructed from only a device index was of `cuda` type. (It returns the `PrivateUse1` type instead if a `PrivateUse1` backend is registered.)
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
>>> b
tensor([1, 2], device='cuda:0')
```
This works well on a CUDA GPU, but it produces misleading information and an error when running on XPU.
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
With this PR, the logic is refined to use the currently available device type instead.
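A hedged sketch of the refined behavior described above (the `'xpu'` outcome is the intended result on an XPU-only build, stated here as an assumption):

```python
import torch

device = torch.device(0)
# Resolves to the currently available accelerator type instead of hard-coding
# 'cuda' (e.g. 'xpu' on an XPU build).
print(device.type)

b = torch.tensor([1, 2]).to(0)   # no longer trips the CUDA lazy-init assertion on XPU
print(b.device)
```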
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129119
Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang
ghstack dependencies: #129463, #129205, #129363
2024-07-15 14:34:29 +00:00
9cae2160f5 Introduce the concept of Accelerators to PyTorch doc (#129363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129363
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #129463, #129205
2024-07-15 14:24:46 +00:00
74da2a467f Fix names conflict when lifting (#129817)
## Bug description
When pending args that may need to be lifted [here](58f346c874/torch/_dynamo/output_graph.py (L1866)) have the same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](58f346c874/torch/_dynamo/output_graph.py (L2081)) can end up creating a name ([here](58f346c874/torch/fx/graph.py (L1008))) that overwrites one of the args to lift, thus causing wrong graph output.

## Reproducing
Below is a reproducible example:
```python
import logging
from typing import List

import torch
from functorch.compile import aot_module_simplified, make_boxed_func

@torch.library.custom_op("mylib::somefunc_forward", mutates_args=())
def somefunc_forward(
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    return torch.ones_like(input_)

@somefunc_forward.register_fake
def _(input_, shape, weight):
    return torch.empty_like(input_)

@torch.library.custom_op("mylib::somefunc_backward", mutates_args=())
def somefunc_backward(
    grad_output: torch.Tensor,
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    print(f"backward.{grad_output.shape=}")
    print(f"backward.{input_.shape=}")
    print(f"backward.{weight.shape=}")
    print(f"backward.{shape=}")
    assert list(weight.shape) == shape
    return torch.ones_like(weight)

@somefunc_backward.register_fake
def _(grad_output, input_, weight, shape):
    return torch.empty_like(weight)

def a_func(grad_output, input_, weight_, shape):
    return torch.ones_like(input_.sum() * weight_)

class SomeFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, normalized_shape):
        ctx.normalized_shape = normalized_shape
        input_ = input.contiguous()
        weight_ = weight.contiguous()
        output = somefunc_forward(input_, weight_, ctx.normalized_shape)
        ctx.save_for_backward(input_, weight_)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input_, weight_ = ctx.saved_tensors
        # grad_weight = a_func(grad_output, input_, weight_, ctx.normalized_shape)
        grad_weight = somefunc_backward(
            grad_output.contiguous(),
            input_,
            weight_,
            ctx.normalized_shape,
        )
        return None, grad_weight, None

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(7))

    def forward(self, x):
        return SomeFunc.apply(x, self.weight, [7])

model = MyModel()
torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True)

def aot_print_backend(gm, sample_inputs):
    # Forward compiler capture
    def fw(gm, sample_inputs):
        print(f"----- fw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Backward compiler capture
    def bw(gm, sample_inputs):
        print(f"----- bw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Call AOTAutograd
    gm_forward = aot_module_simplified(
        gm, sample_inputs, fw_compiler=fw, bw_compiler=bw
    )
    return gm_forward

model = torch.compile(
    model,
    backend=aot_print_backend,
    dynamic=False,
)
out = model(torch.rand((128, 4, 7)))
out.mean().backward()
```

I can see logs showing the calls into create_graph_input, like:
```log
V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none)
V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none)
```

And the generated backward graph will look like:
```log
class GraphModule(torch.nn.Module):
    def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"):
        contiguous_1 = contiguous
        contiguous_2 = contiguous_1

        # No stacktrace found for following nodes
        _set_grad_enabled = torch._C._set_grad_enabled(False)

         # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(),
        contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous();  somefunc_forward_default = None

         # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return self._opoverload(*args, **kwargs)
        somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]);  contiguous = contiguous_1 = contiguous_2 = None

        # No stacktrace found for following nodes
        _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
        return (None, somefunc_backward_default)
```

The original `somefunc_backward` takes the inputs `grad_output`, `input_`, `weight`, and `shape`, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are assigned from `contiguous`, which leads to the assertion failure I added in `somefunc_backward`.

## Environment
```log
Collecting environment information...
PyTorch version: 2.5.0a0+git0b7e8df
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.9.19 (main, May  6 2024, 14:39:30)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==2.0.0
[pip3] optree==0.11.0
[pip3] torch==2.5.0a0+git0b7e8df
[pip3] torchgraph==0.0.1
[conda] numpy                     2.0.0                    pypi_0    pypi
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] torch                     2.5.0a0+git0b7e8df           dev_0    <develop>
[conda] torchgraph                0.0.1                     dev_0    <develop>
```

## How to fix?

I put in a naive fix that adds the potential args to lift into `used_names`. This touches private variables; I will fix that if this issue makes sense to you.

@zou3519 @oulgen

Co-authored-by: rzou <zou3519@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817
Approved by: https://github.com/zou3519
2024-07-15 13:41:46 +00:00
ee039c0614 [custom_op] triton_op API V0 (#130637)
This is the initial version of an API to create custom operators whose
implementations are backed by triton kernels. While user-defined triton
kernels work out-of-the-box with `torch.compile`, you may wish to
construct a custom operator if you need to compose with other PyTorch
subsystems, like Tensor subclasses or vmap.

I'm hoping to get design feedback on this and ship it so that we can
begin experimenting with customers.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130637
Approved by: https://github.com/albanD
2024-07-15 13:00:54 +00:00
cyy
6beec34b1c [structural binding][9/N] Replace std::tie with structural binding (#130404)
Follows  #130544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130404
Approved by: https://github.com/janeyx99
2024-07-15 10:14:52 +00:00
ac28ae18dc [BE][Ez]: Update pybind11 submodule to v2.13.1 (#129827)
Updates pybind11 submodule to v2.13.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129827
Approved by: https://github.com/XuehaiPan, https://github.com/atalman, https://github.com/albanD
2024-07-15 08:58:56 +00:00
1d983bbb28 [easy][inline-inbuilt-nn-module] Update test output (#130681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130681
Approved by: https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #130654, #130420
2024-07-15 06:19:53 +00:00
1a266def4f [dynamo][unsoundness but very controlled] Skip guards on inbuilt nn module hooks (#130420)
Reduces the guard overhead from 2.1k units to 1k units. Compared to no-inlining (0.4k units), this reduces the slowdown from 5x to 2.5x.

This introduces unsoundness, but only for hooks on inbuilt nn modules (user-defined nn module hooks are fine).

Each builtin nn module adds 4 empty ordered dict checks in the check_fn. This blows up for models with large numbers of builtin nn modules. With this PR, we skip those guards. There is no other easy way I can think of right now to control the guard overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130420
Approved by: https://github.com/jansel
ghstack dependencies: #130654
2024-07-15 06:19:53 +00:00
dc7725cc16 [halide-backend] Random number generation (#130211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130211
Approved by: https://github.com/jansel
2024-07-15 05:03:24 +00:00
1bc390c5f5 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
2024-07-15 04:16:17 +00:00
a3c0bab502 [inductor] [cpp] use non-temporal tile load for A (#129455)
Use the non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.
Verified AMP static shapes and dynamic shapes on a CPU with AMX support; no obvious end-to-end performance boost (and no regression either). We're expecting a performance gain when adding https://github.com/pytorch/pytorch/pull/129348 (also in this ghstack) on top of this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129455
Approved by: https://github.com/jgong5
2024-07-15 04:07:29 +00:00
c547b2e871 Fix python detection in cuda.cmake (#130651)
If the Python package has not been detected previously, detect it here.

This fixes regression introduced by https://github.com/pytorch/pytorch/pull/128801 that results in annoying, but harmless warning reported in https://github.com/pytorch/pytorch/issues/129777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130651
Approved by: https://github.com/Skylion007
2024-07-15 03:45:31 +00:00
c0897919da Revert " [5/N] Change static functions in headers to inline (#130673)"
This reverts commit 4410c44ae6fd8eb36f2358ac76f7d988ca7537c5.

Reverted https://github.com/pytorch/pytorch/pull/130673 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes CUDA build 12.1/12.4 to timeout in trunk, I am not sure what I am looking at yet, so attempt to revert to see if it fixes trunk.  Plz keep in mind that a cancelled job is counted as a failure ([comment](https://github.com/pytorch/pytorch/pull/130673#issuecomment-2227641368))
2024-07-15 03:27:11 +00:00
cyy
28f6ae2718 [9/N] Replace c10::optional with std::optional (#130674)
Follows  #130509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674
Approved by: https://github.com/Skylion007
2024-07-15 00:48:43 +00:00
774ca93fd2 Added zb1p schedule (#130210)
Adds the ZB1P schedule in https://arxiv.org/pdf/2401.10241.

The ZB2P schedule might not be zero bubble when pp_group_size > 4. Proof:

![image](https://github.com/pytorch/pytorch/assets/13212964/fac4a738-c323-47c7-bcaa-c6cdd1cf20d7)

Since ZB2P generates longer schedules in some cases, and we might need a fault-tolerance all-reduce collective at the end of every iteration for llama 4, we are holding off on implementing a fancier ZBV schedule for now unless it would be useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130210
Approved by: https://github.com/H-Huang
2024-07-14 17:32:59 +00:00
cyy
5fe9515d35 [structural binding][8/N] Replace std::tie with structural binding (#130544)
Follows #130216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130544
Approved by: https://github.com/ezyang
2024-07-14 13:23:20 +00:00
81322aee74 [Inductor][CPP] Support more than one LocalBuffer (#129121)
**Summary**
Support more than one local buffer in an outer-loop fused node, and also the case where multiple global buffers share the same local buffer.

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```

**Next Step**

- [✓] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
2024-07-14 11:31:14 +00:00
adaa0fea5a [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)
**Summary**
Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with the BF16 data type is significantly slower than the [ATen Implementation](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)). Comparing the generated code with ATen, the performance bottleneck appears to be related to the use of a [local buffer in ATen](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)).

In the current implementation, Inductor uses an output buffer from the kernel group args to store and load temporary results (such as `exp`), since this buffer corresponds to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (take `exp` as an example) is only used internally within the decomposed `softmax`, it can be replaced with a thread-local buffer, similar to ATen's approach.

In this PR, we introduce the `LocalBuffer` optimization. Following this enhancement, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` demonstrates significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) to test this BF16 `Softmax` case on an 8480 Xeon server shows similar performance between the Inductor CPP backend and the ATen implementation.
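A Python-level illustration of the local-buffer idea (concept only; not the generated C++ or the ATen kernel):

```python
import torch

def softmax_rows(x: torch.Tensor) -> torch.Tensor:
    """Softmax over the last dim of a 2D tensor, using a per-row scratch buffer."""
    out = torch.empty_like(x)
    # A small scratch buffer reused per row (the "local buffer"), instead of a
    # full-size [N, C] intermediate shared through the kernel group args.
    scratch = torch.empty(x.shape[-1], dtype=torch.float32)
    for i in range(x.shape[0]):          # the parallel outer loop in the real kernel
        row = x[i].float()
        torch.exp(row - row.max(), out=scratch)
        out[i] = (scratch / scratch.sum()).to(x.dtype)
    return out
```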

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion
```

**Next Step**

- [ ] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-14 11:28:10 +00:00
dcaa111dc8 support intersection by polyfill (#130672)
Fixes https://github.com/pytorch/pytorch/issues/130557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130672
Approved by: https://github.com/anijain2305
2024-07-14 10:44:26 +00:00
4d7bf72d93 [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206
Approved by: https://github.com/malfet
2024-07-14 08:17:52 +00:00
fa5f572748 [cudagraph] fallback to eager if re-record too many times (#129349)
Summary:
CUDAGraph Trees previously relied on the assumption that static inputs (parameters and buffers) do not change tensor addresses across multiple function invocations. This assumption can be used to reduce the number of tensor copies and improve performance. We also use `check_static_inputs_are_stable()` to check whether this assumption holds at runtime.

While this assumption is true in most cases, we have recently observed a few cases where it does not hold:
- [Inline inbuilt nn modules](https://github.com/pytorch/pytorch/pull/126822): the same function (a nn module) is used in multiple places and different parameters and buffers are passed to this function with different tensor addresses
- Some user code changes tensor addresses of parameters/buffers. See [internal example]( https://www.internalfb.com/mlhub/pipelines/runs/mast/sw-935450288-OfflineTraining_08ba1cf0?job_attempt=1&version=0&env=PRODUCTION)
- Compiled Autograd may also pass parameters/buffers with different tensor addresses across runs.

Previous PR [#126822](https://github.com/pytorch/pytorch/pull/126822) (by @mlazos) allows detecting static tensor address changes during runtime and re-recording a cudagraph if that happens. However, if the same function is re-recorded too many times, it may introduce large overhead and hurt performance. This PR adds `torch._inductor.config.triton.cudagraph_max_recording` (=5) to fall back to eager if a function has been recorded more than `cudagraph_max_recording` times for a specific node in the CUDAGraph Trees.
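A minimal sketch of using the new knob (the config name comes from this description; `mode="reduce-overhead"` is just the standard way to enable CUDAGraph Trees):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()

# Allow at most 3 re-records of the same function per node before falling back
# to eager (the default described above is 5).
torch._inductor.config.triton.cudagraph_max_recording = 3

compiled = torch.compile(model, mode="reduce-overhead")   # uses CUDAGraph Trees
out = compiled(torch.randn(4, 8, device="cuda"))
```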

A summary on how static tensor address changes are handled now:

- For each child node, check the assumption via `check_invariants`. If it holds, execute the node under that assumption.
- If the assumption does not hold for any child node, re-record if the function_id has not been recorded too many times for the current_node.
- If the function_id has been re-recorded too many times, fall back to the eager function and warn.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129349
Approved by: https://github.com/eellison
2024-07-14 04:17:24 +00:00
cyy
4410c44ae6 [5/N] Change static functions in headers to inline (#130673)
Follows #128286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130673
Approved by: https://github.com/ezyang
2024-07-14 03:15:28 +00:00
6f275ae4d0 Add kwinputs to Kineto Traces (#130373)
Summary: On the autograd side of things, we are currently saving the kwinputs but we aren't doing anything with them on the profiler side. This diff enables the use of the kwinputs for both FunctionEvents and Chrome Traces.

Test Plan: Added unit testing for both chrome traces and FunctionEvents. Used RecordFunctionFast to test kwinputs since that test already had kwargs being passed in but not previously tested.

Differential Revision: D59472345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130373
Approved by: https://github.com/davidberard98
2024-07-14 00:40:59 +00:00
f9f85bfc0b [Inductor] FlexAttention supports partial masking (#130415) (#130626)
This is the new version of https://github.com/pytorch/pytorch/pull/130415

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```
Approved by: https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626
Approved by: https://github.com/drisspg, https://github.com/yanboliang
2024-07-14 00:37:26 +00:00
cbb7e26acd [3.13, dynamo] fix jump target offset calculation (#130458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130458
Approved by: https://github.com/jansel
ghstack dependencies: #130383, #130384, #130385
2024-07-13 23:32:06 +00:00
0b5792c0ae [3.13, dynamo] fix NULL ordering in symbolic_convert CALL (#130385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130385
Approved by: https://github.com/jansel
ghstack dependencies: #130383, #130384
2024-07-13 23:32:05 +00:00
87b406d7e5 [3.13, dynamo] codegen TO_BOOL before conditional jump (#130384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130384
Approved by: https://github.com/jansel
ghstack dependencies: #130383
2024-07-13 23:32:02 +00:00
92ac9ee83c [3.13, dynamo] swap null and pop_null in codegen (#130383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130383
Approved by: https://github.com/jansel
2024-07-13 23:31:57 +00:00
97cfc65dbc Back out "[DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)" (#130676)
Summary:
Original commit changeset: 80c2ca639146

Original Phabricator Diff: D59612200

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'

Differential Revision: D59719562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130676
Approved by: https://github.com/xunnanxu
2024-07-13 23:19:22 +00:00
e5de25896f Fixed CUDA randint generation for large ranges. (#126066)
Fixes #125224

For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).

This also affects multiple other random functions, such as `torch.rand` and `torch.randn`.
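A rough sanity check, shown only as a sketch: it requires a CUDA device, and treating 2^40 as "large enough" to hit the affected code path is an assumption.

```python
import torch

torch.manual_seed(0)
x = torch.randint(0, 2**40, (10_000_000,), device="cuda")

# With a correct generator, the number of duplicate draws should be close to
# the birthday estimate n^2 / (2 * 2^40) ~= 45 for n = 1e7; reused Philox
# states inflate this count well beyond that.
dupes = x.numel() - x.unique().numel()
print("duplicate draws:", dupes)
```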
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-07-13 21:42:27 +00:00
1f162a5fce Revert "[Inductor][CPP] Support vectorization of remainder (#129849)"
This reverts commit 5bc18ec0a181fac0994522fefaf664f917d64b86.

Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to fails the compilation of executorch benchmark internally ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2227054413))
2024-07-13 19:28:34 +00:00
8714b7fc69 [dynamo][cpp-guards] Use dict tags to skip guards on immutable dict getitems (#130654)
Reduces the guard overhead from 3.7k units to 2.1k units.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130654
Approved by: https://github.com/jansel
2024-07-13 15:31:10 +00:00
cyy
7c83f5f7d5 [8/N] Replace c10::optional with std::optional (#130509)
Follows #130510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130509
Approved by: https://github.com/ezyang
2024-07-13 13:05:36 +00:00
0effcb70ef Revert "[ONNX] Remove beartype usage (#130484)"
This reverts commit f44739cf42e22a569bd1bdb0c113f8a069c17a41.

Reverted https://github.com/pytorch/pytorch/pull/130484 on behalf of https://github.com/huydhn due to Sorry for reverting your change but those failures show up in trunk after the commit landed f44739cf42, I am reverting it to see if it fix trunk ([comment](https://github.com/pytorch/pytorch/pull/130484#issuecomment-2226812311))
2024-07-13 07:52:59 +00:00
567482973d typing fake_tensor.py (#128041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128041
Approved by: https://github.com/eellison
ghstack dependencies: #129182
2024-07-13 06:07:40 +00:00
1ad0f38a37 Fix IMAs in FlexAttention + autotuning (#130352)
# Summary

Makes the error message better for non-divisible sequence lengths.

Update: this PR was blocked due to two IMAs (illegal memory accesses).
- The first is that when the kv indices end up being an 'arange', i.e. there are non-sparse blocks, we end up loading off of kv_indices + 1.
- The second I don't really have a clear answer for. We were hitting an IMA here:
9f401187c7/torch/_inductor/kernel/flex_attention.py (L846)
I noticed that for our inputs (2048 with q_blocksize = 128) we were again exactly at 16. Something felt fishy. I suspect we launch one extra sparse_q block, but why only during autotuning...

### Repro:
https://gist.github.com/drisspg/f312a66426f3440b7756c6c0cc037f4c
### After this change:
```
========= COMPUTE-SANITIZER
AUTOTUNE flex_attention(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x16, 1x1x16x16)
  triton_flex_attention_0 2.1118 ms 100.0% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_3 2.4306 ms 86.9% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_1 2.5729 ms 82.1% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_4 2.8035 ms 75.3% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_2 2.8837 ms 73.2% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.7225 seconds and 1.5218 seconds precompiling
AUTOTUNE flex_attention_backward(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x2048, 1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x16, 1x1x16x16, 1x1x16, 1x1x16x16)
  triton_flex_attention_backward_30 2.7763 ms 100.0% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_15 3.1404 ms 88.4% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_backward_14 3.2604 ms 85.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_7 3.4176 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_backward_8 3.4182 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=4, num_warps=4
  triton_flex_attention_backward_34 3.4939 ms 79.5% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8
  triton_flex_attention_backward_6 3.6517 ms 76.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_26 3.7000 ms 75.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8
  triton_flex_attention_backward_22 4.0120 ms 69.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_18 4.5052 ms 61.6% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 6.6558 seconds and 6.3567 seconds precompiling
torch.Size([1, 1, 2048, 64])
Test completed successfully!
========= ERROR SUMMARY: 0 errors
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130352
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2024-07-13 05:27:39 +00:00
c03e667276 [Inductor][PatternMatcher] Always prevent match across mutations (#130584)
Preventing match across mutations should always be the safe thing to do. This will be especially important for Traceable FSDP2 because in that case we do have mutation ops (`.set_` and `.resize_(0)`) in the middle of the graph for both joint-graph and post-grad graph, so making sure the pattern matcher passes work well with middle-of-graph mutation ops is important.

Q: Why can't we move these mutation ops to the end of graph, to make pass writing easier?
A: We attempted to do that in https://github.com/pytorch/pytorch/pull/129852, but the custom FX passes (in `torch/_functorch/_aot_autograd/fx_passes.py`) for the re-functionalization is complicated to maintain, and the changes to partitioner (in `torch/_functorch/partitioners.py`) also feels hacky. Hence we want to preserve these mutation ops in the middle of graph to avoid the complexity.

Test commands:
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_uint4x2_mixed_mm`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_serialized_patterns_up_to_date`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130584
Approved by: https://github.com/jansel
2024-07-13 03:39:21 +00:00
3710a79622 Flex Attention HOP: Add support for flex decoding (#129415)
# Flex Decoding
tl;dr This PR adds `flex_decoding` kernel to higher-order-op: `flex_attention` as the backend for multi-head attention decoding.

Higher-order-op `flex_attention` was introduced in (https://github.com/pytorch/pytorch/pull/121845) to accept a user-defined score modification callable (`score_mod`) and, through `torch.compile`, to create an efficient fused flash-attention kernel instantiation. The `flex_attention` kernel is efficient for attention over long queries (>512 tokens). This PR introduces the `flex_decoding` kernel as an alternative backend for the `flex_attention` HOP to handle LLM inference, where short queries (<32 tokens) attend to long key/value sequences.

### Details

LLM decoding iteratively attends each newly generated token (query length = 1) to a long key/value context (up to 132k tokens). The `flex_attention` kernel only parallelizes attention along the query length (M), batch size (B), and number of heads (H) dimensions. LLM decoding lacks enough parallelism in the M dimension to fill up all SMs on modern GPUs.

`flex_decoding` adds parallelization along the key/value sequence length (N). The key/value cache of a single head is split into multiple blocks and the query tokens attend to them in parallel. The results for the same head are then reduced across KV blocks to generate a global output.

## Examples

Consider a Group Query Attention (GQA) decoding case, where a query token with 16 query heads (Hq) attends to 2 kv heads (Hkv). Assume a batch size of 2 (B=2) and a kv cache length of 4096 (N=4096). The attention kernel iteratively attends to the newly generated query token (Mq = 1).

We transform this problem into a Multiheaded Attention (MHA) problem by assuming a query length equal to the number of query heads per kv head, i.e. M=Hq//Hkv.
The inputs to the `flex_attention` HOP are thus a query of shape (B=2, H=Hkv=2, M=Hq//Hkv=8, D=64) and key/value of shape (B=2, H=Hkv=2, N=4096, D=64), which leads to an intermediate attention score matrix of shape (2, 2, 8, 4096) and an output of shape (2, 2, 8, 64).

```Python
import torch
from torch.nn.attention._flex_attention import _flex_attention as flex_attention

torch.manual_seed(0)

# Lets create some input tensors
# query of shape (B, Hkv, Hq//Hkv, D)
# key/value of shape (B, Hkv, N, D)
query = torch.randn(2, 2, 8, 64, device="cuda", dtype=torch.float32)
key = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)
value = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)

# Lets create a new score_modification checkerboard.
def checkerboard(score, batch, head, token_q, token_kv):
    score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
    score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
    return score

# Lets call flex_attention with this new score modification for decoding.
# The flex_attention HOP will choose flex_decoding as its backend since our query length (M) is only 8.
output = flex_attention(query, key, value, score_mod=checkerboard)

compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query, key, value, score_mod=checkerboard)

torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```

## Future Plans
- This PR does not implement a load mask for the score_mod function. This means that if the score_mod function takes a captured buffer along the M dimension, it must be padded to a query length of 16, or to the next power of two of the query length if q_len > 16.
i.e.
```python
q_scale = torch.randn(Hq//Hkv, device="cuda")
q_scale = torch.nn.functional.pad(q_scale, (0, 16 - Hq//Hkv))  # pad captured buffer to length 16
def bias_mod(score, batch, head, q, kv):
    score = score + q_scale[q]  # index the captured buffer with the query-position argument `q`
    return score
```
- The backward path for short queries (<128 tokens) currently does not work because the `flex_attention_backward` kernel lacks mask support and only accepts query lengths that are a multiple of 128.
- Dynamic shapes and max-autotuning are currently not working.
- Add block sparse mask support (#129216 is a draft for flex_attention kernel)
- Add explicit GQA support. (#130076 is a draft for GQA support on flex_attention kernel)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129415
Approved by: https://github.com/Chillee
2024-07-13 00:41:48 +00:00
f44739cf42 [ONNX] Remove beartype usage (#130484)
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:

1. When beartype improved its support for Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype where it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users of torch.onnx who have beartype in their environment.
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python version before using the new typing syntaxes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
2024-07-13 00:08:25 +00:00
a7f54c7f8a [dynamo] add meta fn for aten.kthvalue.default (#130562)
I saw
```
torch._dynamo.exc.Unsupported: unsupported operator: aten.kthvalue.default
```
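For illustration, a small sketch of the op that now traces; this shows the user-visible effect, not the meta function itself:

```python
import torch

@torch.compile
def kth_smallest(x, k):
    # aten.kthvalue.default used to hit the "unsupported operator" error above
    # during tracing; with a meta function registered it compiles.
    return torch.kthvalue(x, k, dim=-1)

values, indices = kth_smallest(torch.randn(4, 16), 3)
print(values.shape, indices.shape)
```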

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130562
Approved by: https://github.com/jingsh, https://github.com/zou3519
2024-07-12 23:48:31 +00:00
634b62f111 typing proxy_tensor.py (#129182)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129182
Approved by: https://github.com/Chillee
2024-07-12 23:17:09 +00:00
ea78b0c177 Revert "Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)"
This reverts commit a17d1e5322229a31f868d98987996a04736933a6.

Reverted https://github.com/pytorch/pytorch/pull/130341 on behalf of https://github.com/izaitsevfb due to internal needs pybind update ([comment](https://github.com/pytorch/pytorch/pull/130341#issuecomment-2226499397))
2024-07-12 23:07:37 +00:00
f422027fce fix torch.linalg.lstsq input check (#130612)
Fixes [#117236 ](https://github.com/pytorch/pytorch/issues/117236)
The failing case does not meet the vector-scenario requirements, and the input check is insufficient (relying solely on `dim_diff` is not enough); consequently, it triggers an internal assertion error.
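A small sketch of the intended behavior (assuming the rejection still surfaces as a RuntimeError):

```python
import torch

A = torch.randn(5, 3)
B = torch.randn(5, 2)
x = torch.linalg.lstsq(A, B).solution  # least-squares solution, shape (3, 2)

# A right-hand side whose dimensions are inconsistent with A should now be
# rejected with a readable error instead of an internal assertion failure.
try:
    torch.linalg.lstsq(A, torch.randn(4))
except RuntimeError as err:
    print("rejected:", err)
```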

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130612
Approved by: https://github.com/lezcano
2024-07-12 23:06:52 +00:00
06ebf87a1e Fix and improve reorder_compute_for_overlap (#130573)
Since the raise_comms and sink_waits passes are also scheduling-based, we can now implement reorder_compute_for_overlap as an optional step in the same pass. Merging them into the same pass greatly simplifies the logic and makes it easier to reason about the synergy between different passes.

- The unit tests are now fixed and re-enabled.
- Verified that the pass produces good scheduling w/ Llama3 70B in torchtitan (the scheduling was sub-optimal before this PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130573
Approved by: https://github.com/Chillee
ghstack dependencies: #129980
2024-07-12 22:25:49 +00:00
619029e892 [easy] Small rendering fix in Tensor.module_load doc (#130489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130489
Approved by: https://github.com/janeyx99
2024-07-12 22:12:53 +00:00
95046c86e3 [HOP] add HOP x torch_dispatch interaction (#130606)
This involved beefing up the Python dispatcher to handle torch_dispatch.
Given a HOP and a torch_dispatch Tensor subclass:
- the HOP will show up in the subclass's `__torch_dispatch__`
- you can also use HOP.py_impl to register a rule for the HOP x
  subclass interaction
- (coming soon) we'll offer a way to openly register the HOP x subclass
  interaction without needing to touch the subclass's
  `__torch_dispatch__` or the HOP's .py_impl (a minimal interception sketch follows below).
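A minimal interception sketch. It uses a `TorchDispatchMode` purely to keep the example short (Tensor subclasses expose the same `__torch_dispatch__` hook); the claim that higher-order ops now surface here is taken from the description above.

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LoggingMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print("dispatched:", func)  # every op (and now HOPs) lands here
        return func(*args, **(kwargs or {}))

with LoggingMode():
    x = torch.randn(3)
    x.sin()
    # Per this PR, higher-order ops (e.g. the ones behind torch.cond or
    # flex_attention) should also show up in __torch_dispatch__, and a
    # dedicated rule can alternatively be registered via HOP.py_impl.
```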

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606
Approved by: https://github.com/ydwu4
2024-07-12 21:51:36 +00:00
f093cd4086 Fix custom ops warning during export (#130623)
Fixes https://github.com/pytorch/pytorch/issues/130588

The problem was we were warning on all custom ops, not just ones marked
as CompositeImplicitAutograd. This PR changes the warning to just warn
on CompositeImplicitAutograd ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130623
Approved by: https://github.com/williamwen42
2024-07-12 21:34:29 +00:00
7c289c2a5c Add torch.serialization.safe_globals context manager (#127939)
Add context manager mentioned in https://github.com/pytorch/pytorch/pull/127808#pullrequestreview-2096298486
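A sketch of the intended usage; `MyConfig` is a hypothetical user class, and the in-memory buffer just keeps the example self-contained:

```python
import io
import torch

class MyConfig:  # hypothetical user-defined class stored in a checkpoint
    def __init__(self, lr):
        self.lr = lr

buf = io.BytesIO()
torch.save(MyConfig(0.1), buf)
buf.seek(0)

# weights_only=True normally refuses to unpickle arbitrary classes; the new
# context manager allowlists MyConfig only for the duration of the block.
with torch.serialization.safe_globals([MyConfig]):
    cfg = torch.load(buf, weights_only=True)
print(cfg.lr)
```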

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127939
Approved by: https://github.com/albanD
2024-07-12 20:38:43 +00:00
f0d7164cb9 Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit 2abc7cc21b8a215f000ac037c316ca178e9ade81.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to breaks meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2226313943))
2024-07-12 20:36:00 +00:00
103b6ccab2 Increase tolerance for tensorsolve tests (#130620)
Fix current failure in periodic trunk https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-focal-cuda11.8-py3.10-gcc9-debug%20%2F%20test%20(default%2C%204%2C%205%2C%20linux.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22functorch%2Ftest_ops.py%3A%3ATestOperatorsCUDA%3A%3Atest_vjp_linalg_tensorsolve_cuda_float32%22%5D

Since it appeared with https://github.com/pytorch/pytorch/pull/128238 that only updates random seed for the test, I expect this is just bad luck of the draw. Thus increasing tolerance like we do for other tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130620
Approved by: https://github.com/lezcano, https://github.com/atalman, https://github.com/malfet
2024-07-12 20:08:18 +00:00
af4da0799c [PyTorch] Half: don't disable direct conversion to/from float on mobile (#130465)
As far as I can tell, `FCVT` (https://developer.arm.com/documentation/ddi0602/2024-06/SIMD-FP-Instructions/FCVT--Floating-point-convert-precision--scalar--?lang=en)
is part of the base aarch64 instruction set, so it should work fine on mobile.

Differential Revision: [D59589733](https://our.internmc.facebook.com/intern/diff/D59589733/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130465
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-12 19:46:30 +00:00
d727e2f2d1 add total wall time in calculate_time_spent (#130611)
Fixes #ISSUE_NUMBER

The actual wall time is fwd_entire_frame_time + bwd_inductor_compile. `calculate_time_spent` is accessed internally for monitoring (https://fburl.com/code/iiurj5m6). However, summing the values up loses the fwd/bwd breakdown.

This PR adds a new key, `total_wall_time`, without affecting dynamo counters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130611
Approved by: https://github.com/oulgen, https://github.com/Yuzhen11
2024-07-12 19:32:44 +00:00
eqy
60fc01d0ab [CUDA] Don't double-destroy CUDA graph when debug dump is used (#130401)
Repro from @eellison

Could have sworn we had another PR with this fix floating around somewhere but I couldn't find it...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130401
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-07-12 18:57:07 +00:00
43b98fa521 Add debug repr to SymNode (#129925)
Fixes #129403

Create a separate printing function to debug SymNode, since we can't easily change `__repr__`, which is used by GraphModule.recompile() to create a pythonic version of a graph.

This is my first contribution, please let me know if there is anything I should look into in further detail.

Thank you for your guidance! 🙏 I hope to contribute more in the future!

@aorenste
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129925
Approved by: https://github.com/aorenste
2024-07-12 18:31:23 +00:00
2c4303c1d1 [ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters (#130617)
Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic.

The original code was:
```
if torch.version.hip is not None:
```

Which was incorrectly replaced by:
```
if self.device_props.type != "hip":
```

The port inverted the condition (it should check `== "hip"`), so the ROCm-specific tuning parameters were being skipped on ROCm. Perhaps we need to write some unit tests here in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130617
Approved by: https://github.com/masnesral
2024-07-12 18:29:59 +00:00
741c1710e8 [cond] inlining into one of the branches when pred is a python constant (#130493)
Reland https://github.com/pytorch/pytorch/pull/128709.

When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was to bake True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the inputs are constants.

We additionally change the naming of the cond operator to the default one without overriding its name. This allows better testing on the de-serialized graph.
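A minimal illustration under torch.compile; the function names are illustrative:

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

def f(x, flag: bool):
    # `flag` is a plain Python bool, so per this PR torch.cond specializes
    # (inlines) into the chosen branch and warns that the conditional is not
    # kept dynamic.
    return torch.cond(flag, true_fn, false_fn, (x,))

compiled = torch.compile(f)
print(compiled(torch.randn(3), True))
```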

Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized into one of the branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130493
Approved by: https://github.com/BoyuanFeng
2024-07-12 18:02:09 +00:00
0bf9a091ec [torchbind] add tracing_mode support (#129586)
Sometimes it can be difficult to write a fake class, e.g. when the original implementation uses third-party libraries, or users are certain that the class is safe to trace with the real object.

This PR allows users to specify that intention by implementing a "safe_to_trace_with_real_obj" method on their script class.

Test Plan:
`pytest test/export/test_torchbind.py -k safe`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129586
Approved by: https://github.com/zou3519
2024-07-12 18:01:47 +00:00
c3e77d144e [3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185)
Construct frame localsplus in 3.12+ using our own simplified approach rather than copy-pasting from CPython.

This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.

This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
2024-07-12 17:56:38 +00:00
b0a597fcb4 Fix #121334: graph break on constant method call (#130158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130158
Approved by: https://github.com/lezcano
2024-07-12 17:34:46 +00:00
4865c6425c Add new control plane handler (#129712)
Summary:
Add a new control plane handler to retrieve flight recorder data as
JSON.

Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129712
Approved by: https://github.com/wconstab
2024-07-12 17:32:01 +00:00
55dc82bef9 [EZ] Make test_pytree_inputs actually run tests on CUDA (#130593)
Right now the test only runs on CPU even when `self.device` is set to CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130593
Approved by: https://github.com/angelayi
2024-07-12 17:17:28 +00:00
988ed4d5db [export] clean up allow_complex_guards_as_runtime_asserts flag (#130596)
Summary: removes underscore, cleans up dead code in DimConstraints

Test Plan: existing export tests

Reviewed By: angelayi

Differential Revision: D59612746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130596
Approved by: https://github.com/angelayi
2024-07-12 17:17:11 +00:00
dafef3ff35 [CP] Make CP loss curve on par with TP (#129515)
Summary:
This PR changes two implementations to make the CP (CP8) loss curve on par with TP (TP8).

1. Making key and value contiguous before doing ring attention. It is unclear why this is a requirement as SDPA does not have this requirement.

2. Use the out, grad_out, and softmax_lse passed by autograd to do the backward. This implementation is similar to the one in Transformer Engine. The original implementation reruns the SDPA to get the output and logsumexp and uses those recalculated results to infer the corrected softmax_lse, but that does not give better accuracy or a better loss curve; instead, it converges more slowly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129515
Approved by: https://github.com/d4l3k, https://github.com/wanchaol
ghstack dependencies: #129512, #129514
2024-07-12 16:55:28 +00:00
c35f12c67c [EZ] Add formatting changes to .git-blame-ignore-revs (#130627)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130627
Approved by: https://github.com/izaitsevfb, https://github.com/clee2000
2024-07-12 16:37:46 +00:00
22fd89c904 [TEST][Inductor] Fix scaled_mm call (#130582)
`_scaled_mm` no longer returns `amax` (see #128683)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130582
Approved by: https://github.com/drisspg
2024-07-12 16:25:18 +00:00
34e57025e1 Add unsigned int types to torch/types.h (#130616)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130616
Approved by: https://github.com/NicolasHug, https://github.com/albanD
2024-07-12 16:24:29 +00:00
2b1df24877 Revert "Make hashing a SymInt raise an error again (#130548)"
This reverts commit 3100455b8eeebdfbc3428ff9454579ac50666faf.

Reverted https://github.com/pytorch/pytorch/pull/130548 on behalf of https://github.com/clee2000 due to broke inductor/test_triton_kernels.py https://github.com/pytorch/pytorch/actions/runs/9908970127/job/27377960411 3100455b8e. Not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130548#issuecomment-2225912018))
2024-07-12 16:20:12 +00:00
2a1f22e57f Change BN to eval before QAT Convert phase (#130598)
**Summary**
In the QAT convert phase, we fold BN into conv and DCE the BN node. We should change `torch.ops.aten._native_batch_norm_legit.default` to `torch.ops.aten._native_batch_norm_legit_no_training.default` for a safe DCE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130598
Approved by: https://github.com/jgong5, https://github.com/yushangdi
2024-07-12 16:03:56 +00:00
18418a7dbb [ONNX] Fix torch_onnx patch accuracy bug in benchmark (#130586)
The ONNX-related compilers have a separate accuracy-check route, and this PR makes the torch_onnx compiler use the right measurement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130586
Approved by: https://github.com/justinchuby
2024-07-12 15:47:59 +00:00
e5657024b5 Fix loss_parallel with BF16 logits (#130550)
Fixes #130549

This PR uses the specific dtype for the `grad_input` buffer and fixes the error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130550
Approved by: https://github.com/tianyu-l
2024-07-12 15:47:38 +00:00
ea4b80e6d6 [FX][export] strict DCE pass, check schema for node impurity (#130552)
Fixes the failure in `test/export/test_export_training_ir_to_run_decomp.py` caused by dead code elimination removing a node with side effects.

For background, in export, we may want to export higher-level IRs that are not functional, so we need to check for side effects more carefully.

 A call_function node is impure if it has at least one mutable argument.
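A rough sketch of that condition in terms of operator schemas, assuming the node's target is an OpOverload that carries a `_schema` (the actual pass handles more cases):

```python
import torch

def mutates_an_argument(node: torch.fx.Node) -> bool:
    # Treat a call_function node as impure if its target's schema marks any
    # argument as written to.
    schema = getattr(node.target, "_schema", None)
    if node.op != "call_function" or schema is None:
        return False
    return any(
        a.alias_info is not None and a.alias_info.is_write
        for a in schema.arguments
    )
```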

Fixed the tests below:

test_to_module_with_mutated_buffer_multiple_update_sub_later
test_export_input_mutation_static_shape
test_buffer_util

Another attempt at modifying the original DCE pass was made in PR #130395, but it breaks some other tests, so here we add a flag and use it for export only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130552
Approved by: https://github.com/pianpwk
2024-07-12 15:43:27 +00:00
febadda107 [MPS] Fix torch.[all|any] for 5+D tensors (#130542)
Work around a bug in `reductionAndWithTensor:` that kills the app with the
following assert if a 5+D tensor is passed as an input
```
Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76.
```
by reshaping the tensor to a 2D/3D one before running the reduction.

Refactored common code into `all_any_common_impl_mps` as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue

Enabled `test_reduction_ops_5D` and added a regression test to it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #130541
2024-07-12 15:06:22 +00:00
d443fbc025 [inductor] Cache precompilation functions based on configs (#130350)
Summary: If we attempt to precompile sets of different choices (e.g. Triton vs Cutlass) that have the same key, the cached pool of futures doesn't work, since it only includes the first set of configs.  Add the config's hashes to the key to avoid this problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130350
Approved by: https://github.com/eellison
2024-07-12 14:21:49 +00:00
9c69684af8 [custom_ops] expose torch.library.register_torch_dispatch (#130261)
This is the API for defining the interaction between a torch_dispatch
class and a custom op. Taking API bikeshedding.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261
Approved by: https://github.com/albanD
ghstack dependencies: #130064
2024-07-12 14:13:01 +00:00
ba941769b5 Add API for open registration between operators and subclasses (and modes) (#130064)
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.

I'll make this public in a follow-up PR if we think the approach and API
is good.

Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
2024-07-12 14:13:01 +00:00
ae3ac9cb64 Only test _is_param if doing instance check on Parameter base (#130578)
Fixes https://github.com/pytorch/pytorch/issues/111348

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130578
Approved by: https://github.com/Skylion007
2024-07-12 13:55:13 +00:00
6f54e961ea Add trace_shape_events artifact tracing for ShapeEnv events (#130473)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130473
Approved by: https://github.com/lezcano
2024-07-12 13:50:25 +00:00
3100455b8e Make hashing a SymInt raise an error again (#130548)
See https://github.com/pytorch/pytorch/issues/130547

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130548
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-07-12 13:49:56 +00:00
b75cc70875 [Pipelining] add looped schedules to fsdp/ddp test (#130563)
It feels like an oversight that these were not tested, especially since
the test case already handles multi-schedules specially, yet no
multi-schedules were being tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130563
Approved by: https://github.com/H-Huang
2024-07-12 13:39:47 +00:00
da030e7add Revert "[Inductor] FlexAttention supports partial masking (#130415)"
This reverts commit 207564bab1c4fe42750931765734ee604032fb69.

Reverted https://github.com/pytorch/pytorch/pull/130415 on behalf of https://github.com/janeyx99 due to Windows trunk test_proxy_tensor test failures look relevant  ([comment](https://github.com/pytorch/pytorch/pull/130415#issuecomment-2225575622))
2024-07-12 13:20:18 +00:00
207564bab1 [Inductor] FlexAttention supports partial masking (#130415)
This is the new version of #130235

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130415
Approved by: https://github.com/Chillee
2024-07-12 07:19:28 +00:00
e568c91a7b [CP] Fix the incorrect ring schedule in the fwd and bwd (#129514)
Summary:
1. The argument order for all_to_all_single is "block, output_split_size, input_split_sizes, pg".
2. Uses the correct ring order for the grad_kv.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129514
Approved by: https://github.com/d4l3k, https://github.com/drisspg, https://github.com/wanchaol
ghstack dependencies: #129512
2024-07-12 07:05:36 +00:00
0d8dedb01b [dtensor] Add dtensor to TORCH_LOGS (#129512)
Summary:
Add basic logging for the dtensor dispatcher.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129512
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-07-12 06:50:53 +00:00
b6215f44ef DCP checkpoint_dist_client integration (#130452)
Summary:
Integrate scope tracking with `checkpoint/fb/logging_handlers.py`.

Add a map of uuid -> tracker context manager. When the logging handler sees the following events:
* `start`: create scope_tracker object, call `__enter__`, add to map with uuid
* `end`: retrieve scope_tracker object by uuid, call `__exit__`.
* `exception`: retrieve scope_tracker object by uuid, call `__exit__` with current exception info.

Test Plan:
Tested with a Bento notebook (attached), with a runtime_error in the finish_checkpoint method.

scuba records:
https://fburl.com/scuba/workflow_signpost/ddttgmv2

Differential Revision: D56654417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130452
Approved by: https://github.com/LucasLLC
2024-07-12 06:01:56 +00:00
ff25dfca5a Save quantization_tag in export graph serialization (#127473)
Summary: `quantization_tag` is first-class metadata in quantization flows and is preserved by them. As we'll want to store quantized exported graphs, we also need to preserve this metadata since it's used in later flows. Only JSON-supported metadata will be allowed to be serialized.

Test Plan: Added test case

Differential Revision: D57939282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127473
Approved by: https://github.com/angelayi
2024-07-12 05:06:40 +00:00
b7d287fbec Constant folding for dynamic shape node (#129686)
Extend constant folding to dynamic-shape nodes; only pointwise ops and some restricted ops are supported.

We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding.

Taken over from https://github.com/pytorch/pytorch/pull/128937

joint work with @imzhuhl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
2024-07-12 03:44:29 +00:00
ae0edadea0 [SDPA] Replace masked_fill_ with aten::where (#130281)
Summary:
full context in D59385876

Based on the offline discussion with PT2 folks, we changed the SDPA impl to mitigate the AOTI lowering issue.
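An equivalent-rewrite sketch; the names and shapes are illustrative, and the actual change lives in the SDPA math implementation:

```python
import torch

scores = torch.randn(2, 4, 8, 8)                          # (B, H, Q, K) attention scores
mask = torch.triu(torch.ones(8, 8, dtype=torch.bool), 1)  # True = masked-out position

# before: scores.masked_fill_(mask, float("-inf"))        # in-place masked_fill_
scores = torch.where(mask, torch.full_like(scores, float("-inf")), scores)
attn = scores.softmax(dim=-1)
```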

Test Plan: PYTORCH_TEST_FBCODE=1 buck2 run  mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true caffe2/test/inductor:test_inductor -- -r test_sdpa_inference_mode_aot_compile

Differential Revision: D59495634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130281
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Skylion007, https://github.com/justinchuby
2024-07-12 03:04:31 +00:00
c16e90fe06 The device_suffix in a test_name is "privateuse1" sometimes. (#130091)
When running some test cases on the privateuse1 device, the device_suffix in a test name is sometimes 'privateuse1'.
For example, a test name that should be 'test_Dropout1d_npu' sometimes appears as 'test_Dropout1d_privateuse1'.
When setUpClass() doesn't set it, the device_suffix falls back to "privateuse1".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091
Approved by: https://github.com/zou3519
2024-07-12 02:51:40 +00:00
9ae40c6bc0 Fix and improve raise_comms and sink_waits (#129980)
The tests for `raise_comms` and `sink_waits` passes were not enabled in CI. The passes are now broken due to functional collective v2 and possibly other changes.

Correctness issues:
- The original passes did not take mutation into consideration and may yield semantically different scheduling order. This may be due to the recent changes to how mutations are expressed in Inductor IR (e.g., MutationOutput).

Effectiveness issues:
- The original passes only moved the comm/wait nodes themselves. However, comm nodes can come with prologues (e.g., clone for all_reduce_, split-cat for non-zero dim all-gather). Whenever there are any prologues, the comms won't be raised at all.
- The prologues are often horizontally fused with other pointwise nodes. This can severely delay the scheduling of the comm node.

This PR:
- Make the passes handle mutation correctly.
- Instead of moving individual comm/wait nodes, schedule all nodes using a scored method. This way the comm nodes can be optimally raised even in the presence of prologues.
- The horizontal fusion of prologues often severely delays the scheduling of the comm node. Horizontally fusing this clone can almost never outperform scheduling the comm node earlier. Also, in most cases this clone is eliminated via in-place reuse. Therefore, we tell the scheduler not to fuse it.
- Enable the tests in CI.

Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129980
Approved by: https://github.com/yf225
2024-07-12 01:55:47 +00:00
c6a676add4 [Traceable FSDP2][Inductor] Add GroupedSchedulerNode to contain nodes that must be scheduled together (#128568)
As discussed with @mlazos and @Chillee in the Inductor group chat, we need the concept of `GroupedSchedulerNode` to be able to express nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).

This is particularly important for comm reordering and fine-grained control of peak memory. For Traceable FSDP2, there are two very important requirements:
- At any time, there must be only one AllGather in flight. However, our existing comm reordering pass will naturally raise **all** of the AllGather ops to the beginning of the graph, which will clearly blow up memory usage. Instead, we leverage GroupedSchedulerNode, which provides simple connection points to build the "chaining" on, i.e. we use it to express the schedule `(copyin + AllGather1) -> (AllGather1Wait+copyout) -> (copyin + AllGather2) -> (AllGather2Wait+copyout) ...` by setting up fake deps between the GroupedSchedulerNodes, which is a very clean and easy-to-understand way to express this schedule.
- The "comms" in FSDP2 are not just comms, but a combination of compute and comm. We must prevent other nodes from being scheduled in-between that set of nodes, otherwise we are artificially delaying the release of comm buffer memory which makes the peak memory usage quite bad. This is particularly pronounced for `AllGatherWait+copyout`.

From these two requirements, we derive the behavior of `GroupedSchedulerNode`: it contains nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).

----

Q: Can we leverage `ir.Subgraph`?
A: I looked into the possibility of using `ir.Subgraph` to implement this, but realized that:
1. `ir.Subgraph` requires defining the subgraph in FX IR.
2. There is no guarantee that the Inductor IR nodes that we want to group together will all have a corresponding FX IR node, because some of those Inductor IR nodes can potentially be dynamically generated by a custom pass in the scheduler (e.g. for merging multiple all-gathers into one big all-gather, and later we want to group that big all-gather with some other op). Dynamically generated Inductor IR node doesn't have a corresponding upstream FX IR node.
3. For the above reasons, we can't use the `ir.Subgraph`, and need to define a new (and more lightweight) concept of `GroupedSchedulerNode` to achieve the behavior we need (this PR).

----

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc::test_grouped_scheduler_node`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128568
Approved by: https://github.com/eellison, https://github.com/mlazos
2024-07-12 01:42:38 +00:00
c101c4517a Add python type for list iterators (#130511)
Fixes https://github.com/pytorch/pytorch/issues/117026

Also not sure why this was missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130511
Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/anijain2305
2024-07-12 01:14:18 +00:00
536b5b19b5 Revert "Simplify c10::string_view (#130009)"
This reverts commit 10c7f037fe3271cb3865816c216007ba403f5347.

Reverted https://github.com/pytorch/pytorch/pull/130009 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/130009#issuecomment-2224223526))
2024-07-12 00:46:49 +00:00
7f2436014e add MTIA as valid device type for prof averages (#130340)
Summary: Add MTIA as valid device option for getting profile averages

Test Plan: Tested with auto-trace on MTIA

Differential Revision: D59486392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130340
Approved by: https://github.com/aaronenyeshi
2024-07-12 00:39:01 +00:00
7ce5b5767c Revert "Make c10::string_view an alias of std::string_view (#130417)"
This reverts commit c9551a3f50efc8163d8508a3c2189536528577ac.

Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/izaitsevfb due to depends on #130009 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2224212227))
2024-07-12 00:37:04 +00:00
b5b91b418d [Easy] Update record_function Comment (#130561)
Summary: Users have been confused about why user annotations do not show up on GPU tracks when doing GPU-only tracing. This comment should help users understand that to use this function they need to have CPU activities enabled.

Test Plan: N/A it is just updating a comment

Differential Revision: D59649390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130561
Approved by: https://github.com/aaronenyeshi
2024-07-11 23:51:25 +00:00
18b7633bfb [export] fix kwargs in run_decompositions() for training IR (#130553)
Re-exporting GraphModule expects all inputs to be in args, though not in pytree-flattened format. This avoids failing when we run with a fx.Interpreter subclass in [AOTAutograd tracing](973037be6a/torch/_functorch/_aot_autograd/traced_function_transforms.py (L760-L762)).

Removes 7 test failures for training IR export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130553
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
2024-07-11 22:53:18 +00:00
26c2b92525 [export] make with_effect mark op has_effect to prevent them from DCEed. (#129680)
Before this PR, custom ops that don't return outputs would get eliminated after calling `.module()` because the effect token that keeps the operator alive is removed in the remove_effect_token pass. We want to remove the effect token because we don't want it to be part of the input. However, this causes the DCE calls in remove_effect_token itself and the DCE calls in unlift to remove the custom op from the graph, causing an error in the exported graph.

This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident.

Test Plan:
Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op

Differential Revision: [D59498728](https://our.internmc.facebook.com/intern/diff/D59498728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680
Approved by: https://github.com/angelayi
2024-07-11 22:46:21 +00:00
9c6c0deadc Add eager_compile_backwards_failure to tlparse (#130434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130434
Approved by: https://github.com/albanD
2024-07-11 22:35:33 +00:00
d97d962082 Revert "Add decompositions for copy variants of view ops (#128416)"
This reverts commit 68751799b85aa7f659420801bdbb8451f01ab09a.

Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))
2024-07-11 22:09:23 +00:00
a2f630a9a4 Revert "Decompose expand_copy and permute_copy (#129476)"
This reverts commit 7d4cb2109823f1c4001dff62b461bb9eda07ca17.

Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))
2024-07-11 22:06:15 +00:00
fc872e98f3 Infer prim tags from equivalent aten ones (#130367)
Take the intersection of all the tags of the corresponding aten op overloads. Previously, some of the rng ops not having tags caused issues with constant folding (they should get decomposed, but that's a separate issue).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130367
Approved by: https://github.com/ezyang
2024-07-11 20:53:52 +00:00
726a287271 [export] Expand verifier to be multiple on ExportedProgram (#130364)
Summary: This diff updates the ExportedProgram class in PyTorch to allow for multiple verifiers to be attached to it. This is done by adding a new field to the ExportedProgram schema called "verifiers" which is a list of strings representing the names of the verifiers to be attached to the program. The verifiers are loaded using the "load_verifier" function which is defined in the "torch._export.serde.serialize" module. The "exported_program.dialect" field is also deprecated in favor of the "verifiers" field.

Test Plan: CI

Differential Revision: D59408546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130364
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-07-11 20:34:49 +00:00
5c6edd29ec Turn on splitShare=1 to make the optimization of comm_split effective. (#129929)
Fixes #129865
Currently, new_group will call ncclCommSplit in some cases. In theory, ncclCommSplit brings performance and memory benefits. However, the config parameter of the ncclCommSplit call in PyTorch does not set "splitShare=1", which results in the optimization being turned off and the benefits being lost.
This PR turns on splitShare=1 to make the optimization of comm_split effective.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129929
Approved by: https://github.com/shuqiangzhang
2024-07-11 20:14:58 +00:00
c50b189280 Move trunk windows builds to CUDA-12.1 (#130446)
That should catch build regressions that were previously only detectable during the nightly builds.
Win + CUDA-11.8 builds and tests are still run as part of the periodic workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130446
Approved by: https://github.com/atalman
2024-07-11 19:50:57 +00:00
bc18863713 Corner-case fix for upscale_histogram in the new HistogramObserver (#130316)
Summary: Small fix to the bucketize function that caused a run-time error in some corner cases.

Test Plan: Unit tests

Differential Revision: D59508432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130316
Approved by: https://github.com/jerryzh168
2024-07-11 19:49:21 +00:00
cd9bae30de Allow kwargs in _remove_effect_tokens_pass (#130491)
Summary: Previously, the remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fixes that and adds a test for it.

Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs

Reviewed By: angelayi

Differential Revision: D59603147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491
Approved by: https://github.com/angelayi
2024-07-11 19:03:19 +00:00
578388bed8 Revert "Support for expandable segments with cuda graph trees (#128068)"
This reverts commit fdc83610f272610ce50d1a6f5b6354f2df1baabb.

Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))
2024-07-11 18:58:13 +00:00
1cae60a87e Caching attr_proxy for nn_module attribute to fix guard check failure (#130280)
Fixes https://github.com/pytorch/pytorch/issues/129939

Differential Revision: [D59594605](https://our.internmc.facebook.com/intern/diff/D59594605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130280
Approved by: https://github.com/anijain2305
2024-07-11 18:21:35 +00:00
0a4fe2ff86 [DSD] Use no_grad() to make some operations faster and avoid possible memory leakage (#130355)
Use no_grad() to make some operations faster and avoid possible memory leakage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130355
Approved by: https://github.com/wz337
2024-07-11 18:18:08 +00:00
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).
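A quick way to see both points locally; the timings are machine-dependent and only illustrate the direction of the difference:

```python
import timeit

print("dict():", timeit.timeit("dict()", number=5_000_000))
print("{}    :", timeit.timeit("{}", number=5_000_000))

# The safety issue: a rebound name silently changes what dict() builds,
# while the literal is unaffected.
from collections import OrderedDict
dict = OrderedDict            # shadows the builtin
print(type(dict()))           # <class 'collections.OrderedDict'>
print(type({}))               # <class 'dict'>
```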

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
492de213e2 Revert "Change deprecated warning on dispatch_on_subclass to warn once (#130047)"
This reverts commit f21a21828ac6e16d903ee88f726fdb2278c04782.

Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/albanD due to The failure on the PR are valid, they should not have been ignored ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2223488933))
2024-07-11 17:24:02 +00:00
f21a21828a Change deprecated warning on dispatch_on_subclass to warn once (#130047)
Summary:
Right now the deprecation warning fires for every operator that calls into torch_function. Change it to TORCH_WARN_ONCE instead.

More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/

Test Plan: Sandcastle

Differential Revision: D59338775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047
Approved by: https://github.com/XilunWu
2024-07-11 17:02:26 +00:00
3896ba3260 [DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)
Fixes #ISSUE_NUMBER

As a follow-up to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved and the loaded DTensor.

This is a hotfix to only include the real thread_id in the DeviceMesh hash under the threaded backend, and set it to None in all other cases.

As a follow-up, we need to look at the following test failures to better root-cause DeviceMesh failures related to MTPG when thread_id is not included as part of the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130495
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-07-11 17:02:18 +00:00
72d9135679 increase tensor size to force out of memory exception on the latest generations of GPUs (#130334)
This PR fixes profiler/test_profiler.py::TestProfiler::test_oom_tracing.
The test expects an OOM when allocating a huge tensor, but an MI300X has enough memory to allocate such a tensor.
This PR increases the tensor size by a large margin to force the OutOfMemory exception on MI300X and future GPU generations.
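
As an illustration of the approach (the exact size used by the test may differ), requesting an allocation far beyond any current accelerator's capacity keeps the OOM deterministic:

```python
import torch

if torch.cuda.is_available():
    try:
        # ~1 PiB of int8 elements: comfortably beyond any current GPU's memory
        torch.empty(2 ** 50, dtype=torch.int8, device="cuda")
    except torch.cuda.OutOfMemoryError:
        print("OOM raised as expected")
```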

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130334
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99
2024-07-11 16:59:40 +00:00
9c1ba5ac10 [BE] Cleanup unused vars in MPS (#130541)
And move `using namespace mps` outside of every function, as there is no need
to repeat it.
Use `getTensorsStringKey` instead of the explicit
`getMPSShapeString(getMPSShape(t)) + getMPSDataTypeString(t)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130541
Approved by: https://github.com/Skylion007
2024-07-11 16:48:03 +00:00
68ad3eb722 Do not set hints for mark_unbacked quantities (#130483)
Fixes https://github.com/pytorch/pytorch/issues/130456

When we mark_unbacked a size, we actually DO have a hint for it
(because we have a real input tensor), and previously we were
accidentally putting it into the hint field of SymNode. If the marked
unbacked size is zero or one, this can lead to inconsistency between
hint compute and static evaluation compute under guard size oblivious,
since that's the whole point of size oblivious. The answer is to scrub out
hints on mark_unbacked ints.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130483
Approved by: https://github.com/lezcano
2024-07-11 15:51:00 +00:00
ca023f77bc [CD] Add pytorch xpu wheel build in nightly (#129560)
Add the PyTorch XPU wheel build to nightly, now that the XPU build image enabling PR https://github.com/pytorch/builder/pull/1879 has been merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129560
Approved by: https://github.com/atalman
2024-07-11 15:49:04 +00:00
fb9bc6d74a [custom op] add doc for CustomOpDef.set_kernel_enabled (#130406)
<img width="1067" alt="Screenshot 2024-07-09 at 6 14 55 PM" src="https://github.com/pytorch/pytorch/assets/22356083/941751f8-8e12-43cb-8477-c739476e0096">
<img width="965" alt="Screenshot 2024-07-09 at 6 14 59 PM" src="https://github.com/pytorch/pytorch/assets/22356083/aa9be099-f26c-45a3-8a14-742a2bb7c28b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130406
Approved by: https://github.com/zou3519
2024-07-11 15:47:35 +00:00
5ed72ff5f5 Reduce all tensors to their metadata in AOTAutogradCache; add tests (#128583)
This PR makes it so that all tensors are reduced to their metadata in AOTAutogradCache. Because dynamo always embeds constant tensors into the FXgraph directly, there's no risk of a constant tensor whose values are semantically important being lost here. AOTAutograd itself may take a constant tensor and set it as an attribute on an FXGraph for inductor, but Dynamo never does this.

One other thing that this diff does is add `[pickler.fast](https://docs.python.org/3/library/pickle.html#pickle.Pickler.fast)` to our pickling algorithm for cache key generation. Pickle will often memoize/intern strings when pickling, leading to false cache misses due to inconsistent memoization. Turning on pickler.fast removes this behavior.

Technically `fast` is a "deprecated" feature according to python docs. But it's still supported in py3.8-3.12, and if it ever is removed, the only downside will just be a few more cache misses, so I think it's worth just adding here (and removing later as needed)
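
A small standalone illustration of `pickler.fast` (not the cache's actual code): with memoization disabled, repeated values are serialized by value rather than by memo reference, so the byte stream used for the cache key stays stable.

```python
import io
import pickle

def dumps(obj, fast):
    buf = io.BytesIO()
    p = pickle.Pickler(buf, protocol=pickle.HIGHEST_PROTOCOL)
    p.fast = fast  # disable the pickler's memo when True
    p.dump(obj)
    return buf.getvalue()

payload = ["same-string", "same-string"]
# False: the two streams differ, because the default pickler memoizes the
# repeated string while fast mode writes it out twice.
print(dumps(payload, fast=False) == dumps(payload, fast=True))
```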
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128583
Approved by: https://github.com/oulgen
ghstack dependencies: #128335
2024-07-11 15:39:09 +00:00
be7bf20234 Add JK to enable fx graph cache for amd (#130463)
Test Plan: ad hoc testing

Differential Revision: D59593961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130463
Approved by: https://github.com/nmacchioni, https://github.com/mxz297
2024-07-11 15:28:38 +00:00
6f662e9575 update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA and MPS, which helps decouple the int4 model checkpoint from different ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without re-generating it on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`.

Before this PR, packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. Weight was strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a generated weight on a different ISA or platform, because the compute format differs when loading the weight onto devices.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)
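
A hedged sketch of the new calling convention (sizes, device and `inner_k_tiles` are illustrative, and the int4-per-byte packing itself is not shown):

```python
import torch

n, k, inner_k_tiles = 128, 256, 8
# After this change the op takes a uint8 tensor of shape [n, k // 2],
# i.e. two int4 values packed into each byte, regardless of backend.
weight_uint8 = torch.randint(0, 256, (n, k // 2), dtype=torch.uint8, device="cuda")
packed = torch.ops.aten._convert_weight_to_int4pack(weight_uint8, inner_k_tiles)
```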

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-11 15:26:48 +00:00
cyy
c4a2b6a943 [2/N] Fix NVCC warnings (#130214)
Follows #130191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130214
Approved by: https://github.com/ezyang
2024-07-11 14:46:53 +00:00
a833582dbb [dynamo][tuple] Optimize guard for small tuples - helps conv2d guards (#130400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130400
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #130285, #130368, #130416
2024-07-11 14:13:24 +00:00
f7d7b94017 [dynamo][unspecialized-nn-module] Distinguish between user-defined and builtin nn module (#130416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130416
Approved by: https://github.com/jansel
ghstack dependencies: #130285, #130368
2024-07-11 14:13:24 +00:00
fed8b0055f [dynamo][bufgix] Fix the value for key manager (#130368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130368
Approved by: https://github.com/jansel
ghstack dependencies: #130285
2024-07-11 14:13:19 +00:00
9c612df504 [dynamo][cpp-guards][QOL] Print NO_TENSOR_ALIASING guard once (#130285)
NO_TENSOR_ALIASING guard lists all tensors. Printing it on every occurrence is ugly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130285
Approved by: https://github.com/jansel
2024-07-11 14:13:14 +00:00
bac10cdd6f [DCP] Fix duplicated logging messages when enable both c10d and dcp logger (#130423)

Fixes #129951. Would you take a moment to review it? @LucasLLC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130423
Approved by: https://github.com/Skylion007
2024-07-11 13:43:39 +00:00
0d66ccaf23 [IntraNodeComm] fix an issue where input check fails when running all-reduce on sub groups (#130492)
Tested against the following snippet with `ENABLE_INTRA_NODE_COMM=1`.

```python
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(f"cuda:{local_rank}")
    dist.init_process_group("nccl")

    draft_group = dist.new_group([0, 1, 2, 3])
    target_group = dist.new_group([4, 5, 6, 7])

    inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
    dist.all_reduce(inp)
    expect = sum(range(world_size))
    assert inp.eq(expect).all()

    if 0 <= rank < 4:
        inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
        dist.all_reduce(inp, group=draft_group)
        expect = sum(range(4))
        assert inp.eq(expect).all()
    else:
        inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
        dist.all_reduce(inp, group=target_group)
        expect = sum(range(4, 8))
        assert inp.eq(expect).all()

    torch.cuda.synchronize()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130492
Approved by: https://github.com/Chillee
2024-07-11 13:39:14 +00:00
f261c6ebe8 Revert "[halide-backend] Update CI pin (#130258)"
This reverts commit 4fcfd475bea24b832da32a0c4d464dd87c73a2a9.

Reverted https://github.com/pytorch/pytorch/pull/130258 on behalf of https://github.com/albanD due to Seems to have broken trunk pretty bad 4fcfd475be ([comment](https://github.com/pytorch/pytorch/pull/130258#issuecomment-2222935064))
2024-07-11 13:26:01 +00:00
354edb232a Make public binding test only consider files that are packaged in the wheels (#130497)
In particular, when creating the PyTorch wheel, we use setuptools find_packages 551b3c6dca/setup.py (L1055) which explicitly skips packages without `__init__.py` files (namespace packages) https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#finding-simple-packages.
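
A quick way to see the difference (standard setuptools behavior, not PyTorch-specific code): `find_packages` skips directories without an `__init__.py`, while `find_namespace_packages` includes them.

```python
from setuptools import find_packages, find_namespace_packages

# Directories lacking __init__.py (namespace packages) appear only in the
# second listing, which is why they are absent from the published wheels.
regular = set(find_packages(where="."))
including_namespace = set(find_namespace_packages(where="."))
print(sorted(including_namespace - regular))
```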

So this PR is reverting the change to stop skipping these namespace packages as, even though they are in the codebase, they are not in the published binaries and so we're ok relaxing the public API and importability rules for them.

A manual diff of the two traversal methods:
```
torch._inductor.kernel.bmm
torch._inductor.kernel.conv
torch._inductor.kernel.flex_attention
torch._inductor.kernel.mm
torch._inductor.kernel.mm_common
torch._inductor.kernel.mm_plus_mm
torch._inductor.kernel.unpack_mixed_mm
torch._strobelight.examples.cli_function_profiler_example
torch._strobelight.examples.compile_time_profile_example
torch.ao.pruning._experimental.data_sparsifier.benchmarks.dlrm_utils
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_disk_savings
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_forward_time
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_model_metrics
torch.ao.pruning._experimental.data_sparsifier.lightning.tests.test_callbacks
torch.ao.quantization.experimental.APoT_tensor
torch.ao.quantization.experimental.adaround_fake_quantize
torch.ao.quantization.experimental.adaround_loss
torch.ao.quantization.experimental.adaround_optimization
torch.ao.quantization.experimental.apot_utils
torch.ao.quantization.experimental.fake_quantize
torch.ao.quantization.experimental.fake_quantize_function
torch.ao.quantization.experimental.linear
torch.ao.quantization.experimental.observer
torch.ao.quantization.experimental.qconfig
torch.ao.quantization.experimental.quantizer
torch.csrc.jit.tensorexpr.codegen_external
torch.csrc.jit.tensorexpr.scripts.bisect
torch.csrc.lazy.test_mnist
torch.distributed._tensor.examples.checkpoint_example
torch.distributed._tensor.examples.comm_mode_features_example
torch.distributed._tensor.examples.comm_mode_features_example_argparser
torch.distributed._tensor.examples.convnext_example
torch.distributed._tensor.examples.torchrec_sharding_example
torch.distributed._tensor.examples.visualize_sharding_example
torch.distributed.benchmarks.benchmark_ddp_rpc
torch.distributed.checkpoint.examples.async_checkpointing_example
torch.distributed.checkpoint.examples.fsdp_checkpoint_example
torch.distributed.checkpoint.examples.stateful_example
torch.distributed.examples.memory_tracker_example
torch.fx.experimental.shape_inference.infer_shape
torch.fx.experimental.shape_inference.infer_symbol_values
torch.include.fp16.avx
torch.include.fp16.avx2
torch.onnx._internal.fx.analysis.unsupported_nodes
torch.onnx._internal.fx.passes._utils
torch.onnx._internal.fx.passes.decomp
torch.onnx._internal.fx.passes.functionalization
torch.onnx._internal.fx.passes.modularization
torch.onnx._internal.fx.passes.readability
torch.onnx._internal.fx.passes.type_promotion
torch.onnx._internal.fx.passes.virtualization
torch.utils._strobelight.examples.cli_function_profiler_example
torch.utils.benchmark.examples.sparse.compare
torch.utils.benchmark.examples.sparse.fuzzer
torch.utils.benchmark.examples.sparse.op_benchmark
torch.utils.tensorboard._convert_np
torch.utils.tensorboard._embedding
torch.utils.tensorboard._onnx_graph
torch.utils.tensorboard._proto_graph
torch.utils.tensorboard._pytorch_graph
torch.utils.tensorboard._utils
torch.utils.tensorboard.summary
torch.utils.tensorboard.writer

```

These are all either namespace packages (which we want to remove) or package that are not importable (and tagged as such in the test).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130497
Approved by: https://github.com/aorenste
2024-07-11 13:22:04 +00:00
215013daad [cuDNN][SDPA] Limit cuDNN SDPA head-dim to 128 (#130494)
Limit cuDNN SDPA to head-dim 128 globally. Apparently support for 256 exists only for the forward on sm90+, which would be clunky to maintain as it would mean dispatching differently for the forward and backward.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130494
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2024-07-11 13:21:18 +00:00
cyy
9822fdc354 [7/N] Replace c10::optional with std::optional (#130510)
Follows #130438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130510
Approved by: https://github.com/janeyx99
2024-07-11 13:21:05 +00:00
f52b2ee90f Modularize aten parameter parser and checker (#125308)
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.

```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```

With this PR, we can extend support for other parameter types in a more modular way, like `string`, `int`, `double`.

Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman
2024-07-11 13:17:25 +00:00
2a51ccc77e When translation validation is enabled, assert that hint is consistent (#130478)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130478
Approved by: https://github.com/lezcano
2024-07-11 13:02:31 +00:00
cyy
c9551a3f50 Make c10::string_view an alias of std::string_view (#130417)
Follows #130009 to further facilitate the migration from c10::string_view to std::string_view. The old c10::string_view was renamed to c10::string_view_ext.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417
Approved by: https://github.com/ezyang
2024-07-11 12:31:06 +00:00
cyy
c5b66c3fe1 Enable -Werror=pedantic on torch targets (#130319)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130319
Approved by: https://github.com/ezyang
2024-07-11 12:27:32 +00:00
5db9bd467e Skip test_nnc_correctness for new op _unsafe_masked_index (#130375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130375
Approved by: https://github.com/lezcano
2024-07-11 08:17:16 +00:00
b1942a1af4 [fbgemm_gpu] Break up fbgemm_cuda_utils.cuh, pt 10 (#130468)
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/2814

X-link: https://github.com/facebookresearch/FBGEMM/pull/19

- Break up `fbgemm_cuda_utils.cuh`, pt 10

Test Plan:
```
buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/jagged/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255'

buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/tbe/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255'

buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/sparse/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255'

buck2 build --config fbcode.enable_gpu_sections=true --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//smart/inference_platform_sp/llm_predictor_amd:service

buck2 build --flagfile fbcode//mode/amd-gpu fbcode//hpc/ops:sparse_ops

buck2 build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2/benchmarks/operator_benchmark/pt:add_test
```

Reviewed By: spcyppt

Differential Revision: D59545097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130468
Approved by: https://github.com/ezyang
2024-07-11 07:10:27 +00:00
79c41bb58a [inductor] switch CppCodeCache to new cpp_builder. (#130132)
Changes:
1. switch CppCodeCache to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-11 07:03:43 +00:00
75ab027fbb [dtensor] move bernolli to op strategy (#130286)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130286
Approved by: https://github.com/awgu, https://github.com/yifuwang
2024-07-11 06:43:11 +00:00
fdc83610f2 Support for expandable segments with cuda graph trees (#128068)
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.

With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.

As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/zdevito, https://github.com/eqy
2024-07-11 05:33:09 +00:00
da24823e06 [BE][EZ] Migrate to new dcp save and load APIs (#130475)
While playing with DCP for distributed inference, I found that we are still using deprecated APIs for DCP, even in the unit tests. This PR switches to the new API with the unified lowercase "dcp" naming.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130475
Approved by: https://github.com/wz337
2024-07-11 04:13:39 +00:00
5835ff1ed5 [Easy][Inductor] Add comment for .min_order and .max_order (#130390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130390
Approved by: https://github.com/anijain2305
2024-07-11 03:58:03 +00:00
a4576dad34 [reland][custom ops] infer schema (#130079)
Fixes #129617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079
Approved by: https://github.com/zou3519
2024-07-11 03:39:07 +00:00
9f401187c7 [pipelining] Refactor test_schedule to fix "-k" (#130294)
This is kind of a short-sighted workaround and we should actually come
up with a way to fix this in general, but I got annoyed that I can't use
-k to filter tests in test_schedule, and realized it's because we jam
tests using the new MultiProcContinuousTest fixture together with
old-style tests.

For now I separate the two types of tests so -k works again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294
Approved by: https://github.com/H-Huang
2024-07-11 03:18:02 +00:00
dfd1d1971e Fix warning when pickle.load torch.Storage (#130246)
Fixes https://github.com/pytorch/pytorch/issues/130242

Since `torch.save` does not use pickle for storages, the `torch.load` in `_load_from_bytes` should not ever be called when `torch.load`-ing a checkpoint. Setting weights_only=False explicitly in `_load_from_bytes` to avoid the weights_only warning when using the pickle module
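
A minimal repro of the code path in question (illustrative; the storage type and contents don't matter): pickling a storage with the plain pickle module round-trips through `_load_from_bytes`, which is where the explicit `weights_only=False` now applies.

```python
import pickle
import torch

storage = torch.UntypedStorage(8)               # any storage will do
restored = pickle.loads(pickle.dumps(storage))  # goes through the storage reduce -> _load_from_bytes
print(type(restored))
```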

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130246
Approved by: https://github.com/albanD
2024-07-11 02:40:29 +00:00
4fcfd475be [halide-backend] Update CI pin (#130258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130258
Approved by: https://github.com/eellison
2024-07-11 02:26:16 +00:00
df9d1b44e7 Preserve _numeric_debug_handle throguh deepcopy and re-export (#129287)
Summary:
* Added support for preserving it during deepcopy; we need to remap the args since _numeric_debug_handle refers to the nodes in the graph

TODO: need to fully support re-export, currently the metadata for output node is not preserved

Test Plan:
python test/test_quantization.py -k test_deepcopy_preserve_handle
python test/test_quantization.py -k test_copy_preserve_handle

all related tests:
python test/test_quantization.py -k TestGenerateNumericDebugHandle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129287
Approved by: https://github.com/zhxchen17
2024-07-11 02:19:41 +00:00
a205a53c50 Make sym_node log more useful (#130436)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130436
Approved by: https://github.com/Skylion007
2024-07-11 01:42:53 +00:00
79e34800c3 Suppress guards generated by empty_strided in ir_node_to_tensor (#130431)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130431
Approved by: https://github.com/IvanKobzarev
2024-07-11 01:19:11 +00:00
cyy
798b9652f7 [6/N] Replace c10::optional with std::optional (#130438)
Follows #130408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130438
Approved by: https://github.com/janeyx99
2024-07-11 01:15:37 +00:00
5bc18ec0a1 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
When checking the vectorization status across the 3 test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support for this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #130405
2024-07-11 00:50:50 +00:00
6adc725157 doc - fix the max_norm value in a note (#129687)
`max_norm=True` is currently written in the note, but `max_norm` can be a `float`, NOT a `bool` (as the [docstring](ec284d3a74/torch/nn/modules/sparse.py (L30)) says).
That note was created in #45595

The current pull request cleans it up.
The value `True` in the note can confuse users into thinking the argument is a boolean.

In fact, counter-intuitive behavior occurs if users try to set it to `False`:
it is interpreted as 0, so the embedding values become 0 - not what users expect when setting it to `False`.
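
For reference, the intended usage with a float value (values here are arbitrary):

```python
import torch
import torch.nn as nn

# max_norm is a float: rows fetched by the lookup are renormalized in place
# so their norm never exceeds this value.
emb = nn.Embedding(10, 4, max_norm=1.0)
out = emb(torch.tensor([1, 3]))
print(out.norm(dim=-1) <= 1.0 + 1e-6)
```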

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129687
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2024-07-11 00:01:17 +00:00
358da54be5 [inductor] Better messaging when triton version is too old (#130403)
Summary:
If triton is available, but we can't import triton.compiler.compiler.triton_key, then we see some annoying behavior:
1) If we don't actually need to compile triton, the subprocess pool will still spew error messages about the import failure; it's unclear to users if this is an actual problem.
2) If we do need to compile triton, we a) see the error messages from above and b) get a vanilla import exception without the helpful "RuntimeError: Cannot find a working triton installation ..."

Test Plan: Ran with and without torch.compile for a) recent version of triton, b) triton 2.2, and c) no triton. In all cases, verified expected output (success or meaningful error message)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130403
Approved by: https://github.com/eellison
2024-07-10 23:45:50 +00:00
ceedee23ec [DTensor] Included meshes in cross-mesh error msg (#130454)
The current error message is not actionable since we do not know which meshes are involved. Including the `__repr__` of each mesh in the error helps but is not always sufficient.

7d4cb21098/torch/distributed/device_mesh.py (L395-L408)

The problem is that `DeviceMesh.__eq__` is actually pretty involved, and we cannot see all parts of the `__eq__` criteria just from the `__repr__` (e.g. the thread ID).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130454
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-07-10 22:40:57 +00:00
2abc7cc21b [inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-10 22:28:29 +00:00
551b3c6dca Use irange to avoid -Wsign-compare errors (#130388)
Fixes meta-internal errors after importing #128753

(see [D59498679](https://www.internalfb.com/diff/D59498679))
```
fbcode/caffe2/aten/src/ATen/Context.cpp:286:34: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare]
      for (auto index = 0; index < at::getNumGPUs(); index++) {
                           ~~~~~ ^ ~~~~~~~~~~~~~~~~
1 error generated.
```
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130388
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-07-10 22:07:51 +00:00
ce499eee0c Revert "Add API for open registration between operators and subclasses (and modes) (#130064)"
This reverts commit c23d103afae65588772cb30037ea4110f01f6f41.

Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/izaitsevfb due to fails internal builds, see [D59553526](https://www.internalfb.com/diff/D59553526) ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2221587575))
2024-07-10 21:50:32 +00:00
83c95c48f7 Flight recoder data as JSON (#129505)
Summary:
Provide a new API to retrieve flight recorder data as JSON.
The one minor difference between the flight recorder as Pickle vs. JSON is
that the JSON API does not retrieve stack traces at the moment, as they
end up being far too much data.

Test Plan:
unit test

Differential Revision: [D59536460](https://our.internmc.facebook.com/intern/diff/D59536460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129505
Approved by: https://github.com/wconstab, https://github.com/d4l3k
2024-07-10 21:50:27 +00:00
86bca69c5f Revert "[custom_ops] expose torch.library.register_torch_dispatch (#130261)"
This reverts commit bb9a73f767526e0d23c60360db5212b6bed0e8bc.

Reverted https://github.com/pytorch/pytorch/pull/130261 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130261#issuecomment-2221569707))
2024-07-10 21:43:28 +00:00
e14a0f45ed Revert "[reland][custom ops] infer schema (#130079)"
This reverts commit bef085bdfa62cc14589c70279de17108b2c2089f.

Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2221561483))
2024-07-10 21:40:16 +00:00
46c52661bc Use a better cherry-pick strategy for stable pytorch w/ distribute changes (#129987)
1. Update the branch name from internal feedback
2. Only cherry-pick in the changes to these folders
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129987
Approved by: https://github.com/seemethere
2024-07-10 20:55:36 +00:00
80a421a54d [TD] Pin numpy to 1.26.0 in indexer (#130442)
Temporarily pin 1.26.0 to get the workflow working while I go sort out which dependencies need to be updated

Succeeding run: https://github.com/pytorch/pytorch/actions/runs/9877733366/job/27280052419?pr=130442

Tested by adding my branch to the trust relationship for the policy and removing the environment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130442
Approved by: https://github.com/atalman, https://github.com/malfet
2024-07-10 20:52:24 +00:00
cd2638be09 Revert "[pipelining] Refactor test_schedule to fix "-k" (#130294)"
This reverts commit 1352f13f7827cd1862a6e0507fb17dccddf73dc2.

Reverted https://github.com/pytorch/pytorch/pull/130294 on behalf of https://github.com/clee2000 due to broke lint https://github.com/pytorch/pytorch/actions/runs/9879591538/job/27286156803 ([comment](https://github.com/pytorch/pytorch/pull/130294#issuecomment-2221376073))
2024-07-10 20:26:58 +00:00
b81767161e Revert "[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)"
This reverts commit 08d5423d339ac4b302f8ae6b63b334e032104753.

Reverted https://github.com/pytorch/pytorch/pull/128890 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9879109008/job/27286339304 08d5423d33 test was not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/128890#issuecomment-2221368245))
2024-07-10 20:22:24 +00:00
1b3b4c2fb9 [runtime asserts] deduplicate runtime asserts & CSE (#128599) (#130380)
original PR: https://github.com/pytorch/pytorch/pull/128599 (re-created after revert + poisoned diff train)

Summary:
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]

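# after the pass, the same size is recomputed from the input shapes, so z and w can be freed earlier: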
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```

Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

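# after deduplication, the accumulated range info is replaced by min/max bound checks: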
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```

Test Plan:
contbuild & OSS CI, see 940e4477ab

Original Phabricator Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D59543603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130380
Approved by: https://github.com/izaitsevfb
2024-07-10 19:23:37 +00:00
1352f13f78 [pipelining] Refactor test_schedule to fix "-k" (#130294)
This is kind of a short-sighted workaround and we should actually come
up with a way to fix this in general, but I got annoyed that I can't use
-k to filter tests in test_schedule, and realized it's because we jam
tests using the new MultiProcContinuousTest fixture together with
old-style tests.

For now I separate the two types of tests so -k works again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294
Approved by: https://github.com/H-Huang
2024-07-10 18:32:51 +00:00
cf090e222e Update torch-xpu-ops pin (ATen XPU implementation) (#130333)
1. Fixing a compilation error due to a PyTorch update: the prototype of the helper function `checkIndexTensorTypes` changed.
2. Fixing a compilation error due to a PyTorch update: PyTorch now forces -Werror=unused-function.
3. Fixing an inductor case failure due to the CUDA bias implementation in the case. https://github.com/pytorch/pytorch/issues/130426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130333
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-07-10 18:10:53 +00:00
4b7ee51260 [BE][MPS] Cleanup optimizers code (#130453)
- Fix C++20 forward compatibility warnings, namely
```
warning: use of function template name with no prior declaration in function call with explicit template arguments is a C++20 extension [-Wc++20-extensions]
  multi_tensor_apply_for_fused_optimizer<2, 512>(kernel_name,
```
- Use nested namespaces
- Do not explicitly specify `at::` namespace for functions already implemented inside of that namespace
- Use more convenience methods (rather than call by hand)
- Use C++14 `return f();` for void functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130453
Approved by: https://github.com/Skylion007
2024-07-10 18:00:05 +00:00
08d5423d33 [aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)
Reland of:  https://github.com/pytorch/pytorch/pull/128016

Summary from previous PR:
We assume only two possible, mutually exclusive scenarios:

1. Running the compiled region for training (any input has requires_grad):
   produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (no input has requires_grad):
   no outputs have requires_grad.

Even if the user runs the region under no_grad() but passes an input Tensor with requires_grad, we go with the training scenario (1).

With the current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any input requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under enable_grad()
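
A minimal sketch of that rule (hypothetical helper name; the real check lives in AOTAutograd and considers more than shown here):

```python
import torch

def needs_autograd(flat_args):
    # Training vs. inference is decided only by the inputs' requires_grad,
    # not by torch.is_grad_enabled() at trace time.
    return any(isinstance(a, torch.Tensor) and a.requires_grad for a in flat_args)
```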

Changes in the partitioner?

Inference and training graphs differed in their return container (list vs. tuple).
The partitioner changes unify this so a tuple is always returned.
As a result, there are some changes in test_aotdispatch.py for graph contents (list -> tuple).

Why was it reverted?

There was an inference regression of the hf_Reformer model:
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```

Because one of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True).

Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically expand, for hf_Reformer) preserve requires_grad.

As a result we started compiling a training graph instead of an inference graph.

Fix for view ops:

If we have outputs that are aliases of inputs that require grad, the fact that those outputs require grad is not a reason to generate a training graph.

This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
2024-07-10 17:56:32 +00:00
0beeac35fa Revert "[cond] inlining into one of the branches when pred is a python constant (#128709)"
This reverts commit fe3e6878c4bb2a6001045c179fd7fa9838242558.

Reverted https://github.com/pytorch/pytorch/pull/128709 on behalf of https://github.com/ydwu4 due to causing error on truck due to a land racing: fe3e6878c4 ([comment](https://github.com/pytorch/pytorch/pull/128709#issuecomment-2221104043))
2024-07-10 17:47:19 +00:00
b4b7477d3f Fix CPU Annotation Overlapping with Python Events (#129599)
Summary:
Currently we have an issue where CPU User annotations can overlap with python events in the event that a python event calls step() within the function itself. To combat this, we can move the left side of the user annotation to the beginning of the parent python function. We do this because when instantiating the profiler we already start on step 0.
To implement this, we start by collecting all instances of ProfilerStep during post processing. Since TorchOps and Python events are sorted already, we can easily check if the current python event partially overlaps with the current ProfilerStep and, if so, alter the start time of the current ProfilerStep. We then move to the next ProfilerStep and continue iterating through all the python events. This keeps the time complexity of adding events to 'out' at O(s + n) -> O(n) post sorting, where "s" is the number of ProfilerSteps and "n" is the length of all events.

Test Plan:
Added unit test in which step() is called midway through a function. Afterwards, we print out a trace and then load the json to check that there are no overlaps. Also make sure that there is no regression in performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129599
Approved by: https://github.com/aaronenyeshi
2024-07-10 17:33:56 +00:00
6b3460ae0d fix discrepancy from the export of #126601 (#130296)
#126601 (internally [D58103182](https://www.internalfb.com/diff/D58103182)) was exported missing one class definition. This PR brings github repo in sync with fbcode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130296
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-07-10 17:26:44 +00:00
7d4cb21098 Decompose expand_copy and permute_copy (#129476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 17:12:01 +00:00
a7aa066b09 Fix link to dynamo in torch/fx readme (#130233)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130233
Approved by: https://github.com/janeyx99
2024-07-10 17:00:49 +00:00
a09910d3a9 add strobelight profile links to tlparse (#129703)
Summary: title.

Test Plan:
TORCH_TRACE=~/my_trace_log_dir buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compile_time_profiler_example

tlparse ~/my_trace_log_dir

result
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpBrQJcL/index.html
 {F1726980413}

Differential Revision: D59130581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129703
Approved by: https://github.com/aorenste
2024-07-10 16:53:21 +00:00
fe3e6878c4 [cond] inlining into one of the branches when pred is a python constant (#128709)
When the input predicate is a Python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was to bake True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the predicate is a constant.
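
A hedged illustration of the new behavior (shapes and the predicate value are arbitrary):

```python
import torch

def f(x, pred):
    # With a plain Python bool, cond now specializes to the taken branch
    # (and warns that dynamism is not preserved) rather than baking the
    # constant into a cond operator in the graph.
    return torch.cond(pred, lambda x: x.sin(), lambda x: x.cos(), (x,))

out = torch.compile(f)(torch.randn(3), True)  # only the sin branch is traced
```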

We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.

Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a Python bool. To fix them, we either change the predicate to be a data-dependent tensor or change the test to check that cond specializes into one of the branches.

Differential Revision: [D59589709](https://our.internmc.facebook.com/intern/diff/D59589709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
2024-07-10 16:44:27 +00:00
9d94b122f0 Fix usage of USE_ROCM when calling cudaFuncGetAttributes (#130441)
This fixes MSVC build regression introduced by https://github.com/pytorch/pytorch/pull/129710 as VC++ fails to unroll nested defines in the specific order and fails with
```
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\int4mm.cu(984): error: "#" not expected here
    do { const cudaError_t __err = cudaFuncGetAttributes( &funcAttr, #if defined(USE_ROCM) (void *)func #else func #endif ); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\cuda\\int4mm.cu", __func__, static_cast<uint32_t>(991), true); } while (0);
```

Fixes https://github.com/pytorch/pytorch/issues/130437

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130441
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-07-10 16:30:43 +00:00
ae73489b7d [codemod] Use C++17 [[fallthrough]] in 1 file inc caffe2/aten/src/ATen/native/cuda/DistributionTemplates.h (#130433)
Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D59528276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130433
Approved by: https://github.com/malfet
2024-07-10 16:30:37 +00:00
bef085bdfa [reland][custom ops] infer schema (#130079)
Fixes #129617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079
Approved by: https://github.com/zou3519
2024-07-10 16:18:36 +00:00
ce4d95143f Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250)
After this PR, our numerical error is within 3% of FA2 for forward and gradients. Prior, for `dq` our numerical error was 30% higher. I also added a `PRESCALE_QK` kernel option that increases perf by about 3-4% but incurs about 20-30% more numerical error.

![image](https://github.com/pytorch/pytorch/assets/6355099/7b5ff44e-219b-4a05-8a1b-2a0182c01ab2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250
Approved by: https://github.com/drisspg
ghstack dependencies: #130227
2024-07-10 16:14:45 +00:00
a7715e36de Add block mask utility support for batches and heads > 1 (#130227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227
Approved by: https://github.com/yanboliang
2024-07-10 16:14:45 +00:00
c83b941141 [export] add dynamic shapes argument and infer from graph nodes (#129928)
Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`.

On a high level, the issue is caused by not detecting fake_mode when there's no input.

Change plan:

1) we add a  `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`.

2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg.

3) If the input is a graph module, then we can traverse the graph and check.

4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type.

Fixes #129927

Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there’s no input to the graph. So in `_strict_export_lower_to_aten_ir`, we create another fake_mode, and `dynamo_fake_mode` is not the same as the fake_mode used by dynamo.

Change plan:
additionally check the `gm_torch_level` graph's node meta "example_value" for a fake mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928
Approved by: https://github.com/angelayi
2024-07-10 15:51:05 +00:00
cyy
d31f866b33 [BE] [CMake] Remove AT_CORE_STATIC_WINDOWS option (#130409)
AT_CORE_STATIC_WINDOWS was inherited from torch and is not used anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130409
Approved by: https://github.com/malfet
2024-07-10 15:50:47 +00:00
81ea298600 Wrap the test func with try/except to always call destroy_process_group (#124961)
This avoids the PG warning about not calling destroy_process_group.
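
A sketch of the wrapper pattern described above (names are illustrative, not the actual test-harness code):

```python
import functools
import torch.distributed as dist

def ensure_pg_cleanup(test_fn):
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        try:
            return test_fn(*args, **kwargs)
        finally:
            # Always tear down the process group, even if the test body raised,
            # so the "destroy_process_group was not called" warning never fires.
            if dist.is_available() and dist.is_initialized():
                dist.destroy_process_group()
    return wrapper
```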

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124961
Approved by: https://github.com/wanchaol, https://github.com/wz337
2024-07-10 15:36:38 +00:00
81df076bfd Fix Apple crash when running PyTorch with Metal API validation turned on (#130377)
Fixes #130376 (at least, for my usage)

There may be other places in the code base where `-setBytes:length:` is called with a length of 0 besides this, but this is the case that has triggered for me. Please let me know if there are any specific tests I should run.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130377
Approved by: https://github.com/malfet
2024-07-10 15:07:47 +00:00
417c83e7cf [ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966)
The needle has moved quite a bit on the ROCm backend front. This PR examines the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560

This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069,

unskipping the next batch of tests referenced by the aforementioned issue. No explicit source changes were needed, as the tests worked immediately after unskipping.

The tests previously marked with xfail have been modified to not expect a failure if running on ROCm, as they now pass. Behavior is unchanged for them on other architectures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966
Approved by: https://github.com/malfet
2024-07-10 14:53:41 +00:00
b38de2f9e2 [decomps] Fix aten._to_copy decomp (#130381)
`aten._to_copy` can receive a python number as input. This occurs in
torch.compile support for vmap (see #130188). Previously, this would
raise an assertion error. This PR changes it so that if we see a python
number, we call torch.scalar_tensor on it first (h/t @bdhirsh).
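
Roughly, the new input handling looks like the following simplified sketch (not the exact decomp code; the helper name is made up):

```python
import torch

def _coerce_to_tensor(x, dtype=None, device=None):
    # aten._to_copy may now see a plain Python number (e.g. from vmap under
    # torch.compile); wrap it with torch.scalar_tensor before copying.
    if not isinstance(x, torch.Tensor):
        x = torch.scalar_tensor(x, dtype=dtype, device=device)
    return x
```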

Fixes #130362

Fixes #130188

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130381
Approved by: https://github.com/Chillee
2024-07-10 14:34:28 +00:00
cyy
bd3452f431 [5/N] Change #include <c10/util/Optional.h> to #include <optional> (#130408)
Follows  #130329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130408
Approved by: https://github.com/malfet
2024-07-10 14:29:43 +00:00
99967e1119 [MPS][TYPE_PROMOTION] Fix Clamp (#130226)
Summary:
1. Fixed #130201 by adding type promotion.
2. Added proper tests.
3. Found torch's type promotion is different from numpy as follows:

```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype  # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype  # torch.float32
```

~Not sure the proper way to handle it, it causes numpy ref tests to fail.~
Reason here, so I think I'm going to xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet
2024-07-10 14:27:39 +00:00
6ce0bd7d3b [HOP] Use user directed names for variables where possible (#130271)
Afaict the previous check was too strict. Removing it passes all the
mutation tests (mutation checks happen via the TensorVariable's mutable_local).

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271
Approved by: https://github.com/Chillee, https://github.com/ydwu4
2024-07-10 13:59:20 +00:00
637cc8d27f Revert "update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)"
This reverts commit 6367f02a0e136ced05c665301bcdaa4d76690457.

Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main 6367f02a0e ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))
2024-07-10 13:48:32 +00:00
a1590e16df Add single Python 3.10, single Cuda 12.1 build with dependencies included (#130349)
Build large wheel for Python 3.10, CUDA 12.1 that will be used in Colab. Build name: ``manywheel-py3_11-cuda12_1-full-build``

We still have all code to support the full build in builder repo, here:
https://github.com/pytorch/builder/blob/main/manywheel/build_cuda.sh#L151

Test:
```
import sys
import torch

print(torch.__version__)   # 2.3.0+cu121
print(sys.version_info)    # sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130349
Approved by: https://github.com/malfet
2024-07-10 12:57:39 +00:00
cb2bce98de [MPS][BE] Reduce the number of parameters encoded for no momentum fused SGD (#130131)
Summary:

1. Reduce the number of parameters encoded for no momentum fused SGD
2. Use convenience functions `mtl_setBuffer` and `mtl_setBytes`.

Just a BE, no significant performance difference is observed.

Test plan: Relying on CI signals
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130131
Approved by: https://github.com/janeyx99, https://github.com/malfet
2024-07-10 07:58:38 +00:00
6367f02a0e update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA and MPS, which helps decouple the int4 model checkpoint from different ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without re-generating it on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`.

Before this PR, packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. Weight was strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a generated weight on a different ISA or platform, because the compute format differs when loading the weight onto devices.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-10 07:38:42 +00:00
e29657efb6 [Inductor][CPP] Fix typo in merge rules (#130405)
**Summary**
There is a typo in the `CPU Inductor` group in `merge_rules.yaml`: it should be `test/inductor/test_cpu_repro.py` instead of `test/inductor/test_cpu_repo.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130405
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-07-10 07:13:03 +00:00
cyy
10c7f037fe Simplify c10::string_view (#130009)
Make c10::basic_string_view a subclass of std::basic_string_view for easier replacement in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130009
Approved by: https://github.com/ezyang
2024-07-10 05:02:16 +00:00
a17d1e5322 Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.

The following code will leak a `py::object` whose destructor runs when the program shuts down. The destructor will call `Py_DECREF(obj.m_ptr)`, which may cause a segmentation fault.

```c++
void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}
```

The correct approach is to hold the object through a raw pointer (intentionally leaking it) rather than a static instance.

```c++
void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}
```

This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.

```c++
void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang
2024-07-10 04:23:37 +00:00
5abe7ebd41 Add new (private) capture_triton API (#130178)
When applied to a triton kernel, capture_triton allows the triton kernel
to be captured when tracing with make_fx. It does this by transforming the
call to the triton kernel into a call to the
triton_kernel_wrapper_mutation HOP, which can actually be traced into a
graph via make_fx.

We have two main uses cases for this:
- non-strict export doesn't use Dynamo, but people want to use
  non-strict export to export programs with triton kernels.
  non-strict export uses make_fx tracing, so this is a necessary step in
  that direction.
- People want to write inductor passes that replace a sequence of
  operators with a call to a function that may contain a triton kernel.
  The way these passes work today is that we have a FX graph and want to
  replace a subgraph of it with a new subgraph. We obtain said subgraph
  from calling make_fx on the function; this won't work on raw triton
  kernels but will work if one uses capture_triton.
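A hedged usage sketch of what this enables (the import path of the private `capture_triton` API and the exact call syntax are assumptions based on the description above; running it requires triton and a CUDA device):

```python
import torch
import triton
import triton.language as tl
from torch.fx.experimental.proxy_tensor import make_fx
# Import path for the private API is an assumption for this sketch.
from torch._higher_order_ops.triton_kernel_wrap import capture_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    # Wrapping the kernel turns this call into the
    # triton_kernel_wrapper_mutation HOP when traced with make_fx.
    capture_triton(add_kernel)[grid](x, y, out, n, BLOCK=1024)
    return out

x = torch.randn(8, device="cuda")
gm = make_fx(add)(x, x)
print(gm.graph)
```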

Test Plan:
- I wrote some manual tests to run make_fx over two of the triton
  kernels in test_triton_kernels. It would be nice to be able to run
  make_fx through all of the tests in the file but I'm not sure how to
  do that refactor right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130178
Approved by: https://github.com/oulgen
ghstack dependencies: #130177
2024-07-10 03:09:29 +00:00
99c68f7bea Refactor TritonKernelVariable's logic so it can be shared (#130177)
TritonKernelVariable's logic tells us how to go from a user-defined
triton kernel and a grid to a call to the triton_kernel_wrapper_mutation
HOP. We want to re-use this in a setting without Dynamo; in the next PR
up, we create a new decorator (capture_triton) that, when applied to a
triton kernel, transforms a call to the triton kernel into a call
to the triton_kernel_wrapper_mutation HOP.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130177
Approved by: https://github.com/oulgen, https://github.com/ydwu4
2024-07-10 03:09:29 +00:00
868d9a4f12 [cpu][flash attention] fix nan issue (#130014)
Fixes #127055.

NaNs were generated in flash attention by the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding these calculations.
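For illustration, the same expressions reproduce NaN from Python:

```python
import torch

neg_inf = torch.tensor(float("-inf"))
print(torch.exp(neg_inf - neg_inf))  # (-inf) - (-inf) = nan, so exp gives nan
print(neg_inf * torch.tensor(0.0))   # -inf * 0 = nan
```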

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130014
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-07-10 02:33:26 +00:00
68751799b8 Add decompositions for copy variants of view ops (#128416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 01:39:09 +00:00
cyy
007e75958f [4/N] Change #include <c10/util/Optional.h> to #include <optional> (#130329)
Follows #130300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130329
Approved by: https://github.com/ezyang
2024-07-10 01:26:50 +00:00
9912209743 check if the input fx graph of aot_compile return tuple (#129824)
Fixes https://github.com/pytorch/pytorch/issues/129719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129824
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2024-07-10 01:18:55 +00:00
cyy
85b8503621 [Caffe2] Remove Caffe2 documentation (#130089)
Due to the removal of Caffe2 code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130089
Approved by: https://github.com/r-barnes, https://github.com/albanD
2024-07-10 00:52:16 +00:00
cyy
7a3ab1fe79 [structural binding][7/N] Replace std::tie with structural binding (#130216)
Follows #120353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130216
Approved by: https://github.com/albanD
2024-07-10 00:52:04 +00:00
fb696bf264 Revert "Add block mask utility support for batches and heads > 1 (#130227)"
This reverts commit 64139987c0588f2eef198a0b9fd6904783b37b2c.

Reverted https://github.com/pytorch/pytorch/pull/130227 on behalf of https://github.com/izaitsevfb due to breaks internal builds, please see D59498662 ([comment](https://github.com/pytorch/pytorch/pull/130227#issuecomment-2218842579))
2024-07-09 22:34:39 +00:00
44815ed67e Revert "Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250)"
This reverts commit 3e48d927332915e1ecbd3c7f2c6b9680428f181e.

Reverted https://github.com/pytorch/pytorch/pull/130250 on behalf of https://github.com/izaitsevfb due to depends on #130227 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130250#issuecomment-2218840674))
2024-07-09 22:32:54 +00:00
5b5a1f5202 Add on to Mark some test_decomp tests as slow on win #130260 (#130337)
An add on to https://github.com/pytorch/pytorch/pull/130260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130337
Approved by: https://github.com/malfet
2024-07-09 22:30:53 +00:00
fd43a2ba27 Forward fix for test_compare_cpu_cuda_float32 (#130360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130360
Approved by: https://github.com/malfet
ghstack dependencies: #128238
2024-07-09 22:28:39 +00:00
3be4922a9d Revert "[HOP] Use user directed names for variables where possible (#130271)"
This reverts commit adb65682affdfc37f724c02ea8c8930d3925fc07.

Reverted https://github.com/pytorch/pytorch/pull/130271 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9863205414/job/27236960046 adb65682af Test not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130271#issuecomment-2218832643))
2024-07-09 22:24:39 +00:00
37d4d04309 [torchscript] Add logging for model id. (#130118)
Summary: as title.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D59348256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130118
Approved by: https://github.com/BoyuanFeng
2024-07-09 22:24:16 +00:00
fb5cb17fbe [torch][fx] Add normalize_args constructor argument to FxGraphDrawer (#130348)
Summary:
When writing out Graphviz files for graphs, the arguments sometimes appear all
in a row and it's unclear which is which. For `aten.conv2d`, for example, someone might not
remember the stride, padding, dilation order.

Add an option `normalize_args` (defaults to False) to normalize all args into kwargs.
This should help the readability of a graph.
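A hedged sketch of how the flag might be used (the `normalize_args` keyword comes from this PR's description; the rest is standard `FxGraphDrawer` usage and needs pydot/graphviz installed):

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.graph_drawer import FxGraphDrawer

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)

gm = symbolic_trace(M())
# With normalize_args=True, positional args are rendered as kwargs,
# so stride/padding/dilation are labeled in the drawn graph.
drawer = FxGraphDrawer(gm, "conv_example", normalize_args=True)
drawer.get_dot_graph().write_svg("conv_example.svg")
```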

Differential Revision: D59529417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130348
Approved by: https://github.com/mcremon-meta
2024-07-09 22:16:54 +00:00
df83142131 [CCA][Memory Snapshot] Stop duplicating annotations to all device_traces (#130315)
Summary: This diff fixes a bug where every record_annotation saved a TraceEntry to each of the device_traces. Instead, we should only save the annotation to the device_trace for the device of the thread calling the native allocator's recordAnnotation.

Test Plan: CI and ran workloads on MVAI WPR FBR.

Reviewed By: zdevito

Differential Revision: D59477339

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130315
Approved by: https://github.com/zdevito
2024-07-09 21:38:47 +00:00
bb9a73f767 [custom_ops] expose torch.library.register_torch_dispatch (#130261)
This is the API for defining the interaction between a torch_dispatch
class and a custom op. Taking API bikeshedding.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261
Approved by: https://github.com/albanD
ghstack dependencies: #130064
2024-07-09 21:11:27 +00:00
c23d103afa Add API for open registration between operators and subclasses (and modes) (#130064)
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.

I'll make this public in a follow-up PR if we think the approach and API
is good.

Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
2024-07-09 21:11:27 +00:00
9c9744c3ac Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)"
This reverts commit 940e4477ab0b81eea25051447cf5f599080c903f.

Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/izaitsevfb due to breaking internal APS tests, see D59498864 ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2218724762))
2024-07-09 21:03:49 +00:00
f85bda8bdd c10d/Handlers: expose running handlers from Python (#130149)
This adds a `_run_handler` method that will invoke a specific handler.

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130149
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
2024-07-09 20:20:59 +00:00
1d93367cfa Fix typo (#130305)
Fixes #130241

This is a reopened PR of #130244, to possibly fix the failed job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130305
Approved by: https://github.com/Skylion007
2024-07-09 20:02:00 +00:00
721a798886 add bits16 to graph dtype_abbrs (#130339)
As title, patch the dtype in torch.fx.graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130339
Approved by: https://github.com/angelayi
2024-07-09 19:58:51 +00:00
42f647219a [ROCm] Add int4 support (#129710)
- Add AMD support for int4 kernel
  - Only supports CDNA2 and CDNA3 gpus for now
  - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
  - Uses `v_and_or_b32` instruction and `__hfma2` intrinsic for unpacking bf16 values
  - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus
- Fix torchscript issues due to hipify for `__nv_bfloat16` type
  - TorchScript has its own implementation for bfloat16 type
    - Implemented in the `__nv_bfloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
    - So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation
    - Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify

Fixes #124699
Fixes pytorch-labs/gpt-fast/issues/154

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-09 19:49:12 +00:00
adb65682af [HOP] Use user directed names for variables where possible (#130271)
Afaict the previous check was too strict. Removing it passes all the
mutation tests (mutation checks happen via the TensorVariable's mutable_local).

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271
Approved by: https://github.com/Chillee, https://github.com/ydwu4
ghstack dependencies: #130255, #130268
2024-07-09 19:42:52 +00:00
cyy
a6345d3477 [CMake] [3/N] Remove unused code (#130322)
Some functions used by Caffe2 were removed along with some outdated checks. Follows #130006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130322
Approved by: https://github.com/r-barnes
2024-07-09 19:33:33 +00:00
3477ee38e4 fix the use of initial learning rate in the OneCycleLR example (#130306)
Fixes #127649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130306
Approved by: https://github.com/janeyx99
2024-07-09 18:58:07 +00:00
3689471ea4 [inductor] Add FileCheck to flex attention epilogue test (#129343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129343
Approved by: https://github.com/lezcano
2024-07-09 18:15:55 +00:00
c6cce976b2 Fix an issue where ENABLE_INTRA_NODE_COMM=1 + multiple process groups leads to failure (#130269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130269
Approved by: https://github.com/Chillee
2024-07-09 17:42:09 +00:00
cb4bec311a Fix nodes has more than one output users after replace_set_grad_with_hop pass (#129716)
Summary: Previously, when we inlined subgraphs that don't have a different require_grad environment, we didn't clean up the nodes' users in the subgraph and directly used them to replace the outputs of the call_modules. This recorded dead dependencies in node.users. This PR fixes this.

Test Plan:
Added a new test.

Also see the torchrec tests:
Step 1:
buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 934687114 --output /tmp/934687114.zip --use-torchrec-eager-mp --use-manifold

Step 2:
buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true aimp/cli:cli --  --platform=aps --template=disagg_gpu_aps_pt2 --pt2 --model-entity-id=934687114 non-request-only-tagging torchrec-shard-and-quantize gpu-disagg-split assign-device materialize-weights script-and-save

Differential Revision: D59132214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129716
Approved by: https://github.com/angelayi
2024-07-09 17:04:03 +00:00
e4c51d22c5 [cuDNN] Cleanup < 8.5 #ifdefs (#130283)
We've said cuDNN 8.5 is the minimum supported version for a bit now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130283
Approved by: https://github.com/Skylion007
2024-07-09 16:35:39 +00:00
cab90b0049 [custom ops] disable kernel temporarily (#130190)
Fixes #128621

Sometimes we want to disable the backend implementation for testing/benchmarking purposes.

For example:

```python
@custom_op("mylib::f", mutates_args=())
def f(x: Tensor) -> Tensor:
    return torch.zeros(1)

print(f(torch.randn(1))) # tensor([0.])

@f.register_kernel("cpu")
def _(x):
    return torch.ones(1)

print(f(torch.randn(1)))  # tensor([1.])

with f.set_kernel_enabled("cpu", enabled=False):
    print(f(torch.randn(1)))  # tensor([0.])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130190
Approved by: https://github.com/williamwen42, https://github.com/zou3519
2024-07-09 16:13:50 +00:00
edf273edf4 Revert some PRs (#130303)
Summary:
Revert https://github.com/pytorch/pytorch/pull/129346 thru
https://github.com/pytorch/pytorch/pull/128893

For S430832

Test Plan: Tests

Differential Revision: D59503843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130303
Approved by: https://github.com/bdhirsh
2024-07-09 14:46:00 +00:00
cyy
71efbf701d [3/N] Change #include <c10/util/Optional.h> to #include <optional> (#130300)
Follows #130236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130300
Approved by: https://github.com/ezyang
2024-07-09 13:32:57 +00:00
a5f816df18 Add more dtypes to __cuda_array_interface__ (#129621)
`__cuda_array_interface__` was missing some unsigned integer dtypes as well as BF16.

numba doesn't support BF16 so I skip tests for that one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129621
Approved by: https://github.com/lezcano
2024-07-09 10:47:19 +00:00
3e48d92733 Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250
Approved by: https://github.com/drisspg
ghstack dependencies: #130160, #130106, #130224, #130227
2024-07-09 09:24:06 +00:00
eqy
86fb76e871 [SDPA] Clean up print in test/test_transformers.py (#130302)
Left this in #125343, oops...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130302
Approved by: https://github.com/awgu
2024-07-09 09:20:52 +00:00
953c6476bd [CMAKE] Look for Development.Module instead of Development (#129669)
Based on the [cmake issue](https://gitlab.kitware.com/cmake/cmake/-/issues/23716) and [manylinux issue](https://github.com/pypa/manylinux/issues/1347), when building a Python module we should look for the `Development.Module` component rather than `Development`, which includes both `Development.Module` and `Development.Embed` and therefore expects the shared Python library. After this PR (and before #124613), PyTorch could be built with a static libpython (e.g. in manylinux).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129669
Approved by: https://github.com/malfet
2024-07-09 09:16:43 +00:00
b139b5090f [pytorch] Name threads in thread pools for better debugging (#130270)
Threads inside the thread pools are not named, so they inherit the main process name or the name of the first thread. In our case, if we set `pt_main_thread` as the thread name when a thread does `import torch`, that name is inherited by all the threads in the pools it creates.

This PR names the threads in the pools I was able to find. There are other pools created, like OpenMP ones and we need to follow-up on those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130270
Approved by: https://github.com/d4l3k, https://github.com/albanD
2024-07-09 08:03:47 +00:00
312652c325 [RFC] Add support for device extension autoloading (#127074)
Fixes #122468

- Load device extensions at the end of `torch/__init__.py`
- Enabled by default, or you can disable it with `TORCH_DEVICE_BACKEND_AUTOLOAD=0`

run test:

```python
python test/run_test.py -i test_autoload_enable
python test/run_test.py -i test_autoload_disable
```

doc:

https://docs-preview.pytorch.org/pytorch/pytorch/127074/miscellaneous_environment_variables.html

co-author:  @jgong5 @bsochack @bkowalskiINTEL @jczaja @FFFrog @hipudding

Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127074
Approved by: https://github.com/albanD, https://github.com/jgong5
2024-07-09 06:14:13 +00:00
6c4efd4e95 [Memory Snapshot][BE] Clean up record function callback scope (#130265)
Summary: We can directly set the scope to at::RecordScope::USER_SCOPE for the at::RecordFunctionCallback object, rather than performing a check inside of the callback.

Test Plan:
Ran locally, works fine.

https://www.internalfb.com/pytorch_memory_visualizer/mvai_gpu_traces/tree/gpu_snapshot/fire-aaronshi-20240704-1709-7a80b83b/0/rank-0_itrn-1503.Jul_04_17_24_02.3577.snapshot.pickle

Differential Revision: D59477046

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130265
Approved by: https://github.com/davidberard98
2024-07-09 05:23:48 +00:00
ded469cfbd [issue scrubbing] Fix imports in test_memory_planning.py to work with pytest (#130275)
Summary: I actually don't grok why this pattern works; I guess pytest expects a different import syntax for these relative imports?? But this pattern is used in many other tests here (notably `test_aot_inductor.py`), so it must be right ;)

Test Plan:
Ran both ways:
* `python test/inductor/test_memory_planning.py`
* `pytest test/inductor/test_memory_planning.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130275
Approved by: https://github.com/zou3519
2024-07-09 05:20:56 +00:00
e235db98c9 [Inductor] Add aot_mode UT to new cpp_builder. (#130105)
Changes:
1. Add `aot_mode` parameter to `validate_new_cpp_commands` UT.
2. Switch AotCodeCompiler vec isa command gen to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130105
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-09 04:08:35 +00:00
31df1d235e Support tensor stride (#129297)
Summary:
X-link: https://github.com/facebookresearch/param/pull/126

Support tensor stride for execution trace.

Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda profiler.test_execution_trace.TestExecutionTrace

Differential Revision: D58900476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129297
Approved by: https://github.com/sanrise, https://github.com/izaitsevfb
2024-07-09 03:55:46 +00:00
e836ee1955 Enhancements to recompiles logs (#130043)
----

- We now record on the CacheEntry the compile id that populated it, so we can now say why a specific frame was rejected
- Add a structured log for recompiles under the artifact name "recompile_reasons". As it stands, it's not terribly structured, but this was the easiest thing I could do to start
- Slightly reformat multi-reason printing; since we only report one guard failure, it seems better to have it as a single line

Example output:

```
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] Recompiling function f in /data/users/ezyang/a/pytorch/b.py:3
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles]     triggered by the following guard failure(s):
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles]     - 0/0: tensor 'L['x']' size mismatch at index 0. expected 4, actual 5
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130043
Approved by: https://github.com/anijain2305
2024-07-09 03:40:56 +00:00
cyy
29861779ce [2/N] Change #include <c10/util/Optional.h> to #include <optional> (#130236)
Follows  #128301. The changes were made by grep and sed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130236
Approved by: https://github.com/ezyang
2024-07-09 03:17:24 +00:00
d1e0653fad [fx][easy] print_readable should recursively apply options (#130268)
For example, print_readable(colored=True) should also print submodules
with colors.

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130268
Approved by: https://github.com/Chillee
ghstack dependencies: #130255
2024-07-09 02:50:20 +00:00
f2c9f0c0db [HOP] improve naming for subgraph inputs (#130255)
Previously, subgraph input names were whatever the input proxies' names
were, which was confusing. This PR changes those names to the argument
names of the function being speculate_subgraph'ed. This is best-effort:
if we can't figure it out, we fall back to the previous strategy.

Test Plan:
- existing expecttests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130255
Approved by: https://github.com/ydwu4
2024-07-09 02:46:40 +00:00
abe81d5d05 Fix the rest of foreach flakers (#130277)
Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed https://github.com/pytorch/pytorch/pull/129004 for the same effect.

Regardless, this makes the foreach tests more robust against future disruptions anyway. The fix is similar in flavor to https://github.com/pytorch/pytorch/pull/129003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130277
Approved by: https://github.com/soulitzer
2024-07-09 02:08:21 +00:00
d44c30e2f9 Revert "Add API for open registration between operators and subclasses (and modes) (#130064)"
This reverts commit 922d2737d5e0ad22ee1dcf91c48ab09d641de840.

Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_profiler_tree is failing in trunk after this lands 922d2737d5, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2216135497))
2024-07-09 01:48:38 +00:00
75fa10066d Mark some test_decomp tests as slow on win (#130260)
Auto slow test detection keeps marking and then unmarking these as slow, so permanently mark them as slow on Windows.

These tests take >500s on windows.

This is part of the reason why test_decomp keeps failing on windows (ex da66e50e6e)

The other part is something to do with reruns + thresholds that I am still investigating
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130260
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-07-09 00:16:31 +00:00
7f08d3d9a0 [C10D] Fix corrupt log due to uint_8 printing as char (#130184)
Previously, jobs would log lines like this because a uint8 value was interpreted as a signed char when streamed out.

"ProcessGroupNCCL created ncclComm_ 0x94960120 on CUDA device: ^@"

We need a better solution for avoiding this systematically, but at least
for now fix the spot we know about.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130184
Approved by: https://github.com/eeggl, https://github.com/Skylion007
2024-07-08 23:37:50 +00:00
4c19623800 Change numeric_debug_handle to store per-node id (#129811)
Summary:
Previously we stored an edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack.
According to customer feedback, people prefer the simpler per-node id; they are fine with losing the additional
support for numerical debugging of inputs and are willing to hack around it to achieve this.

This PR changes the structure of numeric_debug_handle to store unique_id for each node instead.

e.g.
graph:
```
node = op(input_node, weight_node)
```
Before:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3}
```

After:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1
```

Test Plan:
python test/test_quantization.py -k TestGenerateNumericDebugHandle

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811
Approved by: https://github.com/tarun292
2024-07-08 23:36:19 +00:00
a28bb3268d [Pipelining] Reorder _Action from F1_1 to 1F1 (#129786)
Also steers away from accessing _Action via positional unpacking since
that is error prone.

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129786
Approved by: https://github.com/H-Huang
2024-07-08 23:07:51 +00:00
60d9f3f7d9 Set the epoch timestamp when uploading data to dynamoDB (#130273)
This is to move away from the Rockset `_event_time` field, which we cannot use when reimporting the data.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130273
Approved by: https://github.com/clee2000
2024-07-08 22:58:32 +00:00
b4cc25f126 [custom_op]Fix self in mutation_args (#130179)
Fixes #124933

## Issue Summary
If users list `self` in mutates_args, an error occurs: `TypeError: AutoFunctionalized.__call__() got multiple values for argument 'self'`. For the following example, the schema for mutates_args is parsed as {"self": FakeTensor}.  6df963a2c8/torch/_higher_order_ops/auto_functionalize.py (L234)
In the above line it is unwrapped as `self=FakeTensor`, which passes the wrong argument because `self` is the implicit first parameter of class methods, such as https://github.com/pytorch/pytorch/compare/main...findhao/fix-self-custom-ops#diff-9453b6b52a54783beec3dd1c60248620f61c3a524d404a188af17bbdf6be3d9eR292 .
```python
import torch

@torch.library.custom_op("mylib::foo", mutates_args={"self"})
def foo(self: torch.Tensor) -> None:
    self.sin_()

x = torch.randn(3)

@torch.compile(backend="inductor", fullgraph=True)
def f(x):
    foo(x)

f(x)
```
## Fix
This PR renames all related `self` default arguments to `self_`, following the existing convention in 6fc771d19b/torch/_ops.py (L667)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130179
Approved by: https://github.com/zou3519
2024-07-08 22:55:50 +00:00
17ca0d0edf Add linux manywheel python 3.13 binary workflows (#130030)
Test with passing linux manywheel workflows is here: https://github.com/pytorch/pytorch/pull/121979
Builder PR already merged: https://github.com/pytorch/builder/pull/1910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130030
Approved by: https://github.com/albanD
2024-07-08 22:50:15 +00:00
00335a27b4 Accept min / max sequence length in nested_tensor_from_jagged() constructor (#130175)
This PR updates the public API for NJT construction `torch.nested.nested_tensor_from_jagged()` to accept values for min / max sequence length. It's useful to provide these ahead of time to avoid GPU -> CPU syncs from on-demand computation later on.
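A hedged sketch of the updated constructor call (the keyword names follow this PR's description and may differ slightly from the final API):

```python
import torch

values = torch.randn(10, 8)
offsets = torch.tensor([0, 3, 7, 10])  # sequence lengths 3, 4, 3
# Providing min/max seqlen up front populates the cache and avoids a
# GPU -> CPU sync when e.g. SDPA needs these values later.
nt = torch.nested.nested_tensor_from_jagged(
    values, offsets, min_seqlen=3, max_seqlen=4
)
print(nt.is_nested)  # True
```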

NB: The test changes are extensive because I reworked the existing `_validate_nt()` helper function used throughout our NJT construction tests to verify more (specifically: expected cached min / max seq len and contiguity).

API design question: should we additionally provide an option to compute these from `offsets` at construction time? I can think of three possible cases during construction:
1. Min / max seq len has already been obtained from *somewhere* (manual calculation, static values, etc.) and they should be used in the cache
2. Min / max seq len should be computed immediately at construction time for use in the cache (ideally, the caller wouldn't have to do this computation manually)
3. Min / max seq len are not needed at all (i.e. SDPA isn't ever called) and computation should be skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130175
Approved by: https://github.com/davidberard98, https://github.com/soulitzer
2024-07-08 22:14:52 +00:00
922d2737d5 Add API for open registration between operators and subclasses (and modes) (#130064)
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.

I'll make this public in a follow-up PR if we think the approach and API
is good.

Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
2024-07-08 22:13:05 +00:00
44a773c121 Revert "[custom ops] infer schema (#130079)"
This reverts commit 3fe324ffb612c8712f6af7639c1e7bcec5f3b4fd.

Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/huydhn due to The test_public_bindings failure looks legit 3fe324ffb6 ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2215420957))
2024-07-08 22:02:29 +00:00
f9bb258892 Revert "[Inductor] Add aot_mode UT to new cpp_builder. (#130105)"
This reverts commit 21eeedb4554edab22b42bcb2f75f19e85652b72e.

Reverted https://github.com/pytorch/pytorch/pull/130105 on behalf of https://github.com/izaitsevfb due to Breaks 46 tests internally at meta with: OSError: CUDA_HOME environment variable is not set ([comment](https://github.com/pytorch/pytorch/pull/130105#issuecomment-2215392198))
2024-07-08 21:40:03 +00:00
5e467604c3 Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit dc5f37193f8d144d3de8525bf64eb1775d91e932.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to Depends on #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2215355259))
2024-07-08 21:25:28 +00:00
09d57f577b Revert "[inductor] switch CppCodeCache to new cpp_builder. (#130132)"
This reverts commit 3957b3b34976896e0b13e1d09cf19e1da5b8292e.

Reverted https://github.com/pytorch/pytorch/pull/130132 on behalf of https://github.com/izaitsevfb due to Depends on  #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130132#issuecomment-2215352180))
2024-07-08 21:22:39 +00:00
856fe230c7 [AOTI] better approach to generating runtime checks for symbolic dimensions (#130220)
Previously, we only handled cases where the symbolic dimension is a plain
Symbol. We should use bound_sympy, which handles more general expressions for us.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130220
Approved by: https://github.com/aakhundov
2024-07-08 20:46:38 +00:00
3fe324ffb6 [custom ops] infer schema (#130079)
Fixes #129617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079
Approved by: https://github.com/zou3519
2024-07-08 20:46:23 +00:00
1e61cb8c87 Revert "[3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185)"
This reverts commit b428f1ad77aedfd150e920c8b0d23b7e6393ad6f.

Reverted https://github.com/pytorch/pytorch/pull/129185 on behalf of https://github.com/huydhn due to dr ci categorization is wrong, the test_linalg xsuccess is real, theres also a test_jit failure https://github.com/pytorch/pytorch/actions/runs/9844339391/job/27178009798 b428f1ad77 ([comment](https://github.com/pytorch/pytorch/pull/129185#issuecomment-2215230345))
2024-07-08 20:37:07 +00:00
f059201e0d [dtensor][debug] added deviceMesh for relevant operations and module parameter sharding and module fqn (#130072)
**Summary**
In order to give users more information, I have added the deviceMesh for operations with DTensor inputs, as well as module parameter sharding and module FQN. These changes have only been placed in the operation tracing log. In the future, I plan to have a single logging function with an argument controlling how detailed the log should be, and to get rid of the module tracing log function. This information has also been added to the JSON dump and can be seen in the browser visual. I have also edited the test case file, as the module_depth dictionary has been replaced with module_helper_dict, and updated the example output for the MLP operation tracing, which can be seen below:

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130072
Approved by: https://github.com/XilunWu
ghstack dependencies: #129994
2024-07-08 20:12:52 +00:00
3e53cae0fc Release 2.4 matrix update. Future releases dates (#130267)
Added Release Compatibility Matrix for release 2.4
Updated future release dates for 2.6-2.9
Updated possible patch release date for 2.4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130267
Approved by: https://github.com/malfet, https://github.com/albanD
2024-07-08 20:09:17 +00:00
36e2608783 [Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667)
**Description**
Add fusion path for dynamic quant and for QAT.
The following patterns can be matched for static quant with QAT cases:
`qx -> qlinear -> add -> optional relu -> optional type convert -> optional quant`

The following patterns can be matched for dynamic quant cases:
`qx -> qlinear -> add -> optional relu`

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear
python test/test_quantization.py -k test_linear_unary
python test/test_quantization.py -k test_linear_binary

Differential Revision: [D57655830](https://our.internmc.facebook.com/intern/diff/D57655830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122667
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-07-08 20:04:39 +00:00
a8985a97f9 elastic/store: use wait instead of get for barrier (#130148)
Summary: We call `.get` in the elastic store barrier operation but don't need the result. This switches it to `.wait`, which eliminates one network round trip since `get` internally does a wait first.
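A minimal sketch of the pattern (not the actual torchelastic barrier code): each rank sets its key, then `wait()` blocks until all keys exist without fetching their values.

```python
from datetime import timedelta

import torch.distributed as dist

# Single-process illustration; in practice host/port/world_size come from
# the elastic agent.
store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                      timeout=timedelta(seconds=30))
rank, world_size = 0, 1
store.set(f"barrier/rank_{rank}", "done")
# wait() only blocks for key existence; get() would additionally fetch the
# value, costing an extra round trip.
store.wait([f"barrier/rank_{r}" for r in range(world_size)])
```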

Test Plan:

CI + existing tests -- no behavior change

Differential Revision: D59396199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130148
Approved by: https://github.com/kurman, https://github.com/wconstab
2024-07-08 19:53:42 +00:00
22c809aa73 [FSDP] Runtime Error on Checkpoint Loading for optimizer state (#129110)
When loading the optimizer state from a checkpoint, tensors are created on CUDA even when other backends are used. This is because, by default, a torch.device() constructed from a single device ordinal is treated as a CUDA device.

In _alloc_tensor, empty tensors are created using device = cast(torch.device, _get_device_module(device_type).current_device()). The above returns only the index, which creates the empty tensor on CUDA by the default behavior. So, change it to torch.device(device_type, device_module(device_type).current_device()) to get a device with both the type and the index.
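A small illustration of the behavior described above (`"xpu"` is only an example backend and the snippet assumes that backend is available):

```python
import torch

# A device constructed from a bare ordinal defaults to CUDA:
print(torch.device(0))  # device(type='cuda', index=0)

# Passing the device type explicitly keeps the tensor on the intended backend:
device_type = "xpu"  # example non-CUDA backend
device_module = getattr(torch, device_type)
dev = torch.device(device_type, device_module.current_device())
t = torch.empty(4, device=dev)
```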

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110
Approved by: https://github.com/fegin
2024-07-08 18:52:13 +00:00
9158bb7837 Ignore functional tensor wrapper when caching (#128335)
This PR makes it so that we don't try to serialize FunctionalTensorWrappers. FunctionalTensorWrappers don't pickle well because they have no underlying storage. This should be fixable at a later point, but I might not be the right author for implementing the serialization for it. If there's a way to avoid actually saving the FunctionalTensorWrappers themselves and just saving the ViewMetadata so we can replay it, that would also work.

To do this, we disable view_replay_input_mutations when using AOTAutogradCache, and then only keep the functional tensor in the ViewAndMutationMeta if we need it for view_replay_input_mutations (i.e. the cache is off).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128335
Approved by: https://github.com/bdhirsh
2024-07-08 18:39:20 +00:00
6dc64026cb Restrict fusions in foreach if there are dependencies on multiple subkernels (#130046)
In https://www.internalfb.com/intern/sevmanager/view/s/429861/, a downstream consuming buffer `buf486_buf526` had two read dependencies, `buf373` and `buf394`, both of which were at separate indices of the upstream foreach op. `buf486_buf526` was fused into `buf373`; in the usual fused case this is completely fine as long as all dependencies are met in the upstream fused buffer. However, in the foreach case, and in this case specifically, a foreach op can be partitioned when it has many arguments, in order to stay under CUDA driver argument limits. As a result, this large foreach op was split into two, and the latter split had `buf394` in its node schedule for allocation while the earlier split did not, even though `buf486_buf526` uses `buf394`; as a result we would hit the unbound-local error.

@eellison provided this repro to help debug the issue (https://www.internalfb.com/phabricator/paste/view/P1453035092)

To fix this, we no longer return a valid producer subnode if there are multiple producer subnodes for a downstream consuming op. In short we should not fuse if there are dependencies on multiple foreach subkernels because 1) their execution order is non-deterministic and 2) (this issue) we may not properly handle dependencies in the presence of foreach partitioning.

Co-authored-by: David Berard <dberard@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130046
Approved by: https://github.com/eellison
2024-07-08 18:25:16 +00:00
64139987c0 Add block mask utility support for batches and heads > 1 (#130227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160, #130106, #130224
2024-07-08 18:15:35 +00:00
cd683212a2 Fix indexing twice with score_mod (#130224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130224
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160, #130106
2024-07-08 18:15:35 +00:00
e16276b9bf [ROCm] Check supported archs before setting preferred blas backend to hipblasLT (#128753)
This PR is needed to resolve usability issues with PyTorch ROCm nightly wheels on non-gfx90a/gf94x architectures as a result of https://github.com/pytorch/pytorch/pull/127944.

Addresses https://github.com/pytorch/pytorch/issues/119081#issuecomment-2166504992

### With this PR's changes, I get the following on a gfx908 (unsupported by hipblasLT) architecture:
_Using setter function:_
```
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W617 19:58:58.286088851 Context.cpp:280] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
[W617 19:59:02.125161985 Context.cpp:291] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```

_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_CUBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
[W619 06:14:11.627715807 Context.cpp:274] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```

### and the following on a gfx90a (supported by hipblasLT) architecture:
_Using setter function:_
```
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
>>> torch.backends.cuda.preferred_blas_library(backend="cublas")
<_BlasBackend.Cublas: 0>
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W620 18:38:29.404265518 Context.cpp:293] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
<_BlasBackend.Cublaslt: 1>
```

_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_HIPBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
```
(Same result for _Using `TORCH_BLAS_PREFER_CUBLASLT` env var:_)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128753
Approved by: https://github.com/malfet
2024-07-08 17:43:41 +00:00
b428f1ad77 [3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185)
Construct frame localsplus in 3.12+ in our own simplified way rather than copy-pasting from CPython.

This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.

This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
2024-07-08 17:39:05 +00:00
d325aaef39 [halide-backend] Use get_reduction_combine_fn for reduction ops (#130212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130212
Approved by: https://github.com/eellison
2024-07-08 17:23:32 +00:00
a18568f293 [dtensor][debug] Added functionality to convert log into a json file (#129994)
**Summary**
Currently, users have two options for viewing the tracing data. The first is the console, where colored text helps users read the information. The second is logging the information to a text file, which is useful when the log is too long to fit in the console. However, depending on the model's complexity, these logs can run to thousands of lines, making it difficult for the user to find specific information. To fix this, I have added the ability to convert the log into a JSON file, which is used to create a tree view in a browser, allowing the user to collapse parts of the log that are not useful to them. Users can pass their own file path, with a default used if none is provided. The expected output of the beginning of the JSON file and the browser view for the MLP model are shown below:

<img width="542" alt="Screenshot 2024-07-02 at 3 40 41 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b9570540-e1d2-4777-b643-db4801b60ed8">

<img width="777" alt="Screenshot 2024-07-02 at 3 41 43 PM" src="https://github.com/pytorch/pytorch/assets/50644008/9296e255-c3ae-48a4-8be7-4273f69ee178">

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129994
Approved by: https://github.com/XilunWu
2024-07-08 17:15:34 +00:00
61017eb77b Add missing mapping between DLDevice and ATenDevice for MAIA (#129615)
This PR adds the missing mapping between `DLDevice` and `ATenDevice` for the MAIA device. These changes are necessary for `dlpack` support for MAIA tensors.

[MAIA is added to the DLDeviceType enum in the dlpack repo](bbd2f4d324/include/dlpack/dlpack.h (L120)) already.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129615
Approved by: https://github.com/albanD
2024-07-08 17:08:39 +00:00
63743b223c [AO] catch qparam mismatch for cat (#123769)
Summary:
- use &= instead of |= since |= ignores incorrect scale/zp (see the illustration below)
- change the scale check to use float comparison instead of int comparison

Issue warning instead of error for backward compatibility: ex: P1204628034
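A tiny illustration of the first point (purely illustrative values):

```python
checks = [True, False, True]  # e.g. per-input scale/zero_point comparisons

ok_or, ok_and = False, True
for c in checks:
    ok_or |= c   # one passing check keeps this True: the mismatch is ignored
    ok_and &= c  # one failing check makes this False: the mismatch is caught

print(ok_or, ok_and)  # True False
```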

Test Plan: see warning in: P1204628034

Reviewed By: jerryzh168

Differential Revision: D55699212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123769
Approved by: https://github.com/jerryzh168
2024-07-08 16:47:14 +00:00
f4774d64bf Skip test_profile_memory on windows (#130037)
The test was introduced in https://github.com/pytorch/pytorch/pull/128743
It is failing on windows cuda a9a744e442/1 (it is skipped on cpu jobs)

After talking with the author and Aaron, I have been advised to skip it on windows, as windows support for kineto is not a high priority
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130037
Approved by: https://github.com/huydhn, https://github.com/aaronenyeshi
2024-07-08 16:11:51 +00:00
d7b7f8b79f Revert "[ROCm] Add int4 support (#129710)"
This reverts commit d0ad13fa42fc2e9935bd3bda2937a3491276d274.

Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))
2024-07-08 16:07:53 +00:00
c8ab2e8b63 Set seed per sample for OpInfo tests + support for restricting to a single sample input (#128238)
This PR:
* Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed).
    * Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`:
* Uncovered a bunch of test issues:
    * Test breakdown (>100 total)
        * A lot of tolerance issues (tweaked tolerance values to fix)
        * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype)
        * 3 actually broken semantics (for masked tensor; added xfails)
        * 4 Jacobian mismatches (added xfails)
        * 2 nan results (skip for now, need fixing)
        * 3 results too far from reference result (add xfails)
* Skips MPS tests for now (there are so many failures!). Those will default to the old behavior.

**before (no seed setting):**
```
real	0m21.306s
user	0m19.053s
sys	0m5.192s
```

**after (with seed setting):**
```
real	0m21.905s
user	0m19.578s
sys	0m5.390s
```

* Utilizing the above for reproducible sample input generation, this PR adds support for restricting the iterator to a single sample input. This is done via the env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX`, and its usage is included in the repro command.

```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
    return test(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
    self.assertFalse(True)
AssertionError: True is not false

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
    fn(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
    raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.037s

FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238
Approved by: https://github.com/janeyx99, https://github.com/justinchuby
2024-07-08 16:06:38 +00:00
acf9e31cf8 adding MTIA to supported activities (#130052)
Summary: Put the hasMTIA check inside the if condition as well so that MTIA activities are added to the supported activities.

Test Plan: Tested with auto-trace

Differential Revision: D59280848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130052
Approved by: https://github.com/aaronenyeshi
2024-07-08 15:20:05 +00:00
16d53cb7d5 Only run mixed_mm heuristic if shapes are static (#130081)
If we have dynamic shapes, the heuristic in mixed_mm will cause a crash, because it cannot compare m, k and n to integer values. This PR makes it so that the heuristic only runs if we have static shapes.
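An illustrative sketch of the guard (not the actual inductor code; the threshold values are made up):

```python
import sympy

def should_apply_heuristic(m, k, n) -> bool:
    # With dynamic shapes, m/k/n may be sympy expressions; comparing them to
    # integer thresholds is not well-defined, so skip the heuristic entirely.
    if not all(isinstance(d, int) for d in (m, k, n)):
        return False
    return m >= 16 and k >= 32 and n >= 32  # illustrative thresholds only

print(should_apply_heuristic(64, 64, 64))                  # True
print(should_apply_heuristic(sympy.Symbol("s0"), 64, 64))  # False
```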

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130081
Approved by: https://github.com/Chillee
2024-07-08 14:20:55 +00:00
010009e642 [compiled autograd] c++ autograd function saved_data: lift tensors (#130057)
avoid recompiles when custom C++ autograd functions use ctx->saved_data to save tensors

iv.toTensor can return reference for `after(iv.toTensor())`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130057
Approved by: https://github.com/jansel
2024-07-08 07:42:07 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
f053be2a97 [dynamo] Graph break on random_ op (#130222)
Fixes https://github.com/pytorch/pytorch/issues/121621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130222
Approved by: https://github.com/jansel
2024-07-08 06:10:24 +00:00
31bb65de19 [Inductor] Fix conditional codegen (#129492)
Summary:
We have a cache to guarantee each `sym` is codegen'ed only once; see the following code:
```
def ensure_size_computed(self, sym: sympy.Symbol):
    if isinstance(sym, sympy.Symbol) and symbol_is_type(sym, SymT.PRECOMPUTED_SIZE):
        if sym in self.computed_sizes:
            return
        self.computed_sizes.add(sym)
        expr = V.graph.sizevars.inv_precomputed_replacements[sym]
        self.writeline(
            f"{self.declare}{sym} = {self.expr_printer(expr)}{self.ending}"
        )
```
However, we didn't consider the case where the same `sym`s need to be codegen'ed in both branches of a condition (the true branch and the false branch), which caused `undefined symbol` issues: P1441378833

To fix the issue, we use a stack to capture the state before doing the conditional codegen and restore the state after the codegen is done.
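An illustrative sketch of the push/restore pattern described above (not the actual inductor code):

```python
class PrecomputedSizeState:
    """Track which precomputed size symbols have already been emitted."""

    def __init__(self):
        self.computed_sizes = set()
        self._stack = []

    def push(self):
        # Snapshot the state before generating one branch of the conditional.
        self._stack.append(set(self.computed_sizes))

    def restore(self):
        # Restore the snapshot before generating the other branch, so a symbol
        # needed in both branches is emitted in each of them.
        self.computed_sizes = self._stack.pop()
```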

Test Plan:
TORCH_LOGS="+inductor" buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100 -c fbcode.enable_gpu_sections=true --config 'cxx.extra_cxxflags=-g1' -c fbcode.platform010_cuda_version=12 //scripts/hhh:repro_cond_torch_compile

PYTORCH_TEST_FBCODE=1 TORCH_COMPILE_DEBUG=1 buck2 run  mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true //caffe2/test/inductor:control_flow -- -r test_cond_control_flow_with_precomputed_size

Differential Revision: D58973730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129492
Approved by: https://github.com/aakhundov
2024-07-08 05:33:47 +00:00
c5c9dbece1 [dynamo][user-defined] Simplify and improve scope of UserDefinedObject var_getattr (#130169)
Fixes https://github.com/pytorch/pytorch/issues/122649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130169
Approved by: https://github.com/jansel
ghstack dependencies: #118448, #130159
2024-07-08 04:10:56 +00:00
d0ad13fa42 [ROCm] Add int4 support (#129710)
Add AMD support for int4 kernel using mfma_f32_16x16x16bf16 instruction.
Only supports CDNA2 and CDNA3 gpus for now.
Fixes #124699

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-07 23:54:22 +00:00
d1b832e739 [inductor][mkl][inline-inbuilt-nn-modules] Change assertion (#130219)
Fixes the test in the next PR - `python test/inductor/test_mkldnn_pattern_matcher.py -k TestDynamicPatternMatcher.test_conv3d_unary_dynamic_shapes`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130219
Approved by: https://github.com/leslie-fang-intel
2024-07-07 21:32:07 +00:00
940e4477ab [runtime asserts] deduplicate runtime asserts & CSE (#128599)
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]
# something with _w ...

# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```

Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
2024-07-07 20:10:14 +00:00
0c44684901 [Typo] Fix typo in DispatchKeyExtractor.h (#130221)
Summary: typo_helper

Test Plan: ci

Differential Revision: D59424671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130221
Approved by: https://github.com/Skylion007
2024-07-07 19:43:31 +00:00
e423224546 Revert "[Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)"
This reverts commit 98929ceae3873f18f4747b88cdff708fde107aa7.

Reverted https://github.com/pytorch/pytorch/pull/126967 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/126967#issuecomment-2212337926))
2024-07-07 06:16:32 +00:00
1b57dce35f Revert "[Inductor][CPP] Support more than one LocalBuffer (#129121)"
This reverts commit f794cf59bd0891ff4a4337e0d919ee68ba1f0472.

Reverted https://github.com/pytorch/pytorch/pull/129121 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/129121#issuecomment-2212337590))
2024-07-07 06:13:40 +00:00
f794cf59bd [Inductor][CPP] Support more than one LocalBuffer (#129121)
**Summary**
Support more than one LocalBuffer in an outer-loop fused node, and also the case where multiple global buffers share the same local buffer.

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```

**Next Step**

- [✓] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
2024-07-07 05:43:08 +00:00
98929ceae3 [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)
**Summary**
Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with BF16 data type is significantly slower than the [ATen Implementation](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)). Upon comparing the generated code with ATen, the performance bottleneck appears to be related to the usage of [local buffer in ATen](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)).

In the current implementation, Inductor uses the output buffer of the Kernel Group Args to store and load temporary results (such as `exp`), since this buffer corresponds to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (take `exp` as an example) is only utilized internally within the decomposed `softmax`, it can be replaced with a thread-local buffer similar to ATen's approach.

In this PR, we have introduced the optimizations of `LocalBuffer`. Following this enhancement, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` demonstrates significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) to test this BF16 `Softmax` case on an 8480 Xeon server shows similar performance between the Inductor CPP Backend and the ATen implementation.

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion
```

**Next Step**

- [ ] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-07 05:34:57 +00:00
a3ce9eddd6 [BE][Easy] apply autofix for ruff rule unnecessary-literal-set (C405) and unnecessary-map (C417) (#130198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130198
Approved by: https://github.com/Skylion007
2024-07-07 00:58:22 +00:00
9983242c8e [inductor] support adding a new inductor backend using PrivateUse1 (#129953)
Add handling of custom devices registered via PrivateUse1 in the init_backend_registration() function

Fixes #129952
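
A minimal sketch of how an out-of-tree device could plug into this path; the `register_backend_for_device` signature and the placeholder classes below are assumptions for illustration, not part of this PR:
```
# Hedged sketch: exposing a PrivateUse1 device to Inductor.
# register_backend_for_device's signature is assumed; the scheduling/codegen
# classes are empty placeholders standing in for a real out-of-tree backend.
import torch
from torch._inductor.codegen.common import register_backend_for_device

class MyDeviceScheduling:      # placeholder; a real backend implements Inductor's scheduling hooks
    pass

class MyDeviceWrapperCodegen:  # placeholder; a real backend implements wrapper codegen
    pass

torch.utils.rename_privateuse1_backend("my_device")   # expose PrivateUse1 as "my_device"
register_backend_for_device("my_device", MyDeviceScheduling, MyDeviceWrapperCodegen)
```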

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129953
Approved by: https://github.com/jansel
2024-07-06 21:15:40 +00:00
3d138af943 [Inductor] First implementation of the B2B-GEMM pass with tests (#129995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129995
Approved by: https://github.com/eellison
2024-07-06 19:10:22 +00:00
3957b3b349 [inductor] switch CppCodeCache to new cpp_builder. (#130132)
Changes:
1. switch CppCodeCache to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-06 18:57:44 +00:00
dc5f37193f [inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-06 18:44:13 +00:00
cyy
dfe3534134 [1/N] Fix NVCC warnings (#130191)
Fixes NVCC warnings as a required step toward enabling Werror on CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130191
Approved by: https://github.com/Skylion007
2024-07-06 18:25:04 +00:00
3f50e197c4 [BE] annotate torch.autograd.graph (#129558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129558
Approved by: https://github.com/soulitzer
2024-07-06 18:14:16 +00:00
01ec03bac6 [inductor] switch HalideCodeCache to new cpp_builder. (#130146)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130146
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-06 17:35:17 +00:00
cyy
2f219f7d79 Enforce unused-{variable/function} checks to all torch targets (#130189)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130189
Approved by: https://github.com/ezyang
2024-07-06 16:03:01 +00:00
cyy
096eca2f9a [2/N] Replace exceptions with static_assert(false) in some templates (#130116)
Follows #127371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130116
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-06 13:23:05 +00:00
520a4642bf [CI] Enable build with asserts (#129924)
Not a standard CMake config, as far as I can tell, but it introduces an important concept: an optimized build without `NDEBUG`. Test by running `python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)"`, which is a no-op unless debug asserts are enabled.

Add the recently added `_unsafe_masked_index`/`_unsafe_masked_index_put_accumulate` to DONT_ENFORCE_SAME_TENSOR_IMPL_OR_STORAGE to avoid all tests involving those ops failing with an internal assert.
Suppress a number of internal asserts to make CI green; see https://github.com/pytorch/pytorch/issues/130073

Fixes https://github.com/pytorch/pytorch/issues/102105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129924
Approved by: https://github.com/atalman, https://github.com/albanD
2024-07-06 13:14:32 +00:00
da66e50e6e Added compile option to create_block_mask (#130106)
Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. This would have been a 1 *trillion* element tensor if fully materialized.

```
print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160
2024-07-06 08:09:56 +00:00
963f430d13 Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)"
This reverts commit 0267b2ddcb58aa66b2b62336216da7df4f9939d8.

Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a landrace and fails inductor/test_cudagraph_trees in trunk 0267b2ddcb ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2211690518))
2024-07-06 07:20:05 +00:00
aa4899eee9 [CCA][Memory Snapshot] Fix race on alloc_trace vector - S430480 (#130180)
Summary:
Multiple threads can be accessing the alloc_trace std::vector concurrently, which will result in SIGSEGVs when objects are double-freed, accessed after free, or inserted at the same time.

We need to lock when inserting, accessing or removing TraceEntry in alloc_trace.
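
A Python analogue of the locking pattern (the actual fix is in the C++ caching allocator; the class below is illustrative only):
```
# Python analogue of guarding a shared trace buffer with a lock; the real fix
# applies the same idea to the C++ alloc_trace vector, not this code.
import threading

class TraceBuffer:
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []

    def record(self, entry):
        with self._lock:              # serialize concurrent inserts
            self._entries.append(entry)

    def snapshot(self):
        with self._lock:              # read a consistent copy
            return list(self._entries)
```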

Test Plan:
This is a rare crash, which was exposed when we introduced recordAnnotations, which saves record_function annotations into the snapshot files. Saving a lot of annotations can trigger this bug. A few jobs that crashed before are fixed by this diff.

Differential Revision: D59380507

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130180
Approved by: https://github.com/eqy, https://github.com/kit1980
2024-07-06 06:14:54 +00:00
e019540c9e Revert "Fix the SDPA AOT export issue (#130164)"
This reverts commit 1927c406844affbfe3496d5cbc31d4ebe11c8bfb.

Reverted https://github.com/pytorch/pytorch/pull/130164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking ExecuTorch tests in trunk 1927c40684 ([comment](https://github.com/pytorch/pytorch/pull/130164#issuecomment-2211667777))
2024-07-06 05:59:49 +00:00
bf609630ae Fix a bunch of stride issues with FlexAttention (#130160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130160
Approved by: https://github.com/yanboliang
2024-07-06 03:58:14 +00:00
10c831567b Make sympify'ing SymInt/etc produce their sympy expression (#130166)
There is one huge problem this fixes: today, sympify(symint)
produces a float(!!) because Sympy attempts to see if you can
coerce the symint to float in sympify and of course this works on
SymInt.

However, this also has another nontrivial effect: anywhere in Inductor
where sympy expressions are passed around, it is also valid to pass
around a SymInt now.  I'm ambivalent about this: it's currently a
mistake to be passing around a SymInt when a sympy expression is
expected.  But maybe this is fine?
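
A minimal sketch of the behavior change, assuming a `torch.SymInt` `s` obtained from a dynamic-shapes trace (not an exact repro):
```
# Hedged sketch: `s` is assumed to be a torch.SymInt coming from a dynamic-shapes trace.
import sympy

def inspect_symint(s):
    expr = sympy.sympify(s)
    # before this change: a sympy Float (the SymInt was coerced via float())
    # after this change:  the SymInt's underlying sympy expression, e.g. Symbol('s0')
    return expr
```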

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130166
Approved by: https://github.com/yf225
2024-07-06 03:56:45 +00:00
acd03ca2d9 [halide-backend] Support scan kernels (#129035)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129035
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #130129
2024-07-06 03:49:50 +00:00
c5110f6388 [halide-backend] Use 0D scalar inputs/outputs (#130129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130129
Approved by: https://github.com/shunting314
2024-07-06 03:49:50 +00:00
0267b2ddcb [runtime asserts] deduplicate runtime asserts & CSE (#128599)
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]
# something with _w ...

# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```

Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
2024-07-06 03:44:49 +00:00
7c43f59a45 [audio hash update] update the pinned audio hash (#129429)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129429
Approved by: https://github.com/pytorchbot
2024-07-06 03:34:12 +00:00
bd0252fb98 [dynamo][user-defined] Support method descriptors (#130159)
Fixes https://github.com/pytorch/pytorch/issues/120650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130159
Approved by: https://github.com/jansel
ghstack dependencies: #118448
2024-07-06 02:03:09 +00:00
a1a2023eb8 Back out "Pass device to is_pinned call inside TensorProperties.create_from_tensor" (#129972)
Summary:
It turns out the device used as a param in is_pinned is meant to be the accelerator device with respect to which pinning is expected. Passing 'cpu' always makes the return value false, regardless of whether the actual tensor is a cpu tensor pinned to Cuda.

Besides, there is a PR https://github.com/pytorch/pytorch/pull/126376 about to be merged which automatically uses the correct accelerator device; this obviates the need for users to pass any explicit device and doesn't create a Cuda context for pure cpu tensors.

Note, https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0 test is expected to be broken by this diff, but it should be fixed forward by https://github.com/pytorch/pytorch/pull/126376
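
A hedged illustration of the behavior described above (assumes a CUDA-enabled build, since pinning needs an accelerator):
```
# Hedged illustration; requires a CUDA-enabled build for pin_memory to succeed.
import torch

t = torch.empty(4).pin_memory()   # CPU tensor pinned with respect to the CUDA accelerator
print(t.is_pinned())              # True: checked against the accelerator device
print(t.is_pinned(device="cpu"))  # False: 'cpu' is never the accelerator the memory is pinned for
```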

Test Plan: Sandcastle.

Differential Revision: D59283190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129972
Approved by: https://github.com/LucasLLC
2024-07-06 01:07:32 +00:00
1927c40684 Fix the SDPA AOT export issue (#130164)
Summary:
## Context
TL;DR: aot_export failed for SDPA memory efficient backend when using `inference_mode`

The CMF AOTI lowering started to fail on trunk. We have a script (https://fburl.com/code/kfk64i5s) to reproduce the issue quickly (log: P1469307638). By bisecting the stack, we found that the issue started with D58701607.

## Root Cause
In `inference_mode()`,
`aten::scaled_dot_product_attention` was not decomposed before `functionalization`, and the op itself is an out-of-place op, so `functionalization` makes no change; it was then decomposed into `masked_fill_`, and further into `copy_`.
So it's `aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (decompose) ---> `copy_` ---> failure

Under `torch.no_grad()`,
`aten::sdpa` was decomposed before `functionalization`, so the story is
`aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` --- (decompose) ---> out-of-place ops ---> good

## How to fix
Long-term:
The issue was tracked in the ticket (https://github.com/pytorch/pytorch/issues/129418). The long-term fix could be we do one more round of `functionalization` after the `decompose`, like

`aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` ---> good

Short-term:
It would be a big change, I guess. To unblock the production use case, this diff marks `aten::sdpa` to be decomposed.

Test Plan:
local repro works now

buck run mode/opt scripts/sijiac/prototypes:sdpa_aoti

Differential Revision: D59385876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130164
Approved by: https://github.com/zou3519
2024-07-06 00:57:47 +00:00
c5ede865c4 [pt2-bench] raise tolerance for squeezenet1_1 (#130165)
The training accuracy for this model has started to regress. It does not show up in the weekly run yet, but:
1. it shows up in my MA runs [here](https://hud.pytorch.org/benchmark/torchbench/inductor_max_autotune?dashboard=torchinductor&startTime=Fri,%2028%20Jun%202024%2006:53:45%20GMT&stopTime=Fri,%2005%20Jul%202024%2006:53:45%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/shunting314/162/head&lCommit=cb236e8c198b54901e4fb19698f91be786f72e25&rBranch=main&rCommit=4ee1cb9b955fcc5d75a421b19393998122136f2c)
2. I can repro it locally

Command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --training --amp --backend
 inductor --device cuda --only squeezenet1_1
```

Raise the tolerance to fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130165
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005, #130163
2024-07-06 00:49:15 +00:00
0fcbca9adb [pt2-bench] use eval mode for vision_maskrcnn (#130163)
Try to fix https://github.com/pytorch/pytorch/issues/130161

The reason that `--accuracy` works is that we use eval mode, while `--training` does not work because we use training mode and TorchBench does not return targets tensors; in training mode, vision_maskrcnn requires targets tensors.

I fix that by always using eval mode for vision_maskrcnn training.

With the fix, I start to see a segfault: https://gist.github.com/shunting314/5a70df3463b2a4421b2c34aa88e78d1f

I'm not sure if that's due to my local setup, but I think the fix in this PR is something we need anyway. We can check the dashboard after the PR is in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130163
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005
2024-07-06 00:49:15 +00:00
cyy
e5841bb8d5 [3/N] Enforce unused-function and unused-variable checks (#130084)
Follows #129878.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130084
Approved by: https://github.com/ezyang
2024-07-05 23:56:00 +00:00
126796d239 [c10d] fixing an UT after a change in eager mode new group (#130167)
Summary:
After https://github.com/pytorch/pytorch/pull/129284, new_group is now eager if device_id is specified, which broke one UT.
This PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130167
Approved by: https://github.com/wconstab
2024-07-05 23:18:30 +00:00
d1d0a7080f [torchgen] reference generated comment to actual location of the generator and template (#130020)
As per title.

```diff
# torch/_VF.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/return_types.pyi

- # @generated from torch/_C/return_types.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/return_types.pyi.in
```

```diff
# torch/_C/__init__.pyi

- # @generated from torch/_C/__init__.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/__init__.pyi.in
```

```diff
# torch/_C/_nn.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_nn.pyi.in
```

```diff
# torch/_C/_VariableFunctions.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/nn/functional.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/nn/functional.pyi.in
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130020
Approved by: https://github.com/ezyang
2024-07-05 21:47:14 +00:00
6fc771d19b Revert "Change deprecation warning on dispatch_on_subclass to warn once (#130047)"
This reverts commit 8ff243bcf190bab62348310693f0ad2f90061c89.

Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function on multiple jobs 8ff243bcf1 https://github.com/pytorch/pytorch/actions/runs/9812489165/job/27097342443.  Dr CI is doing something weird about the unstable failures ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2211409090))
2024-07-05 21:03:36 +00:00
df50452279 Pin optree==0.11.0 on windows CI (#130155)

Affected tests: doctests, test_testing

Failing run has 0.12.0 https://github.com/pytorch/pytorch/actions/runs/9804335516/job/27072891998
Succeeding run has 0.11.0 https://github.com/pytorch/pytorch/actions/runs/9798330845/job/27057359554

It is already pinned for mac and linux
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130155
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-07-05 20:28:58 +00:00
18e75c098b [DCP] Adds Checkpointing Team (dcp) to merge rules (#129582)
[DCP] Adds Checkpointing Team (dcp) to merge rules. Please comment to this PR if you think you should be added as well!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129582
Approved by: https://github.com/fegin
2024-07-05 20:09:31 +00:00
739fc01ac9 [NCCL] Make sure current device is correct in torch.distributed.barrier()'s streamSynchronize (#129908)
The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream's device (this happens if a user is running distributed code without e.g., `torch.cuda.set_device(mylocalrank)`) then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally:
a21d4363d2/c10/cuda/CUDAStream.h (L132)

OUTDATED below:
The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following:
```
import logging
import os
import time
import torch
import torch.distributed as dist

def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")

    backend = 'nccl'
    group = torch.distributed.init_process_group(backend=backend)
    rank = torch.distributed.get_rank(group=group)

    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        logging.info(f"Rank {rank}: exit barrier {i}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead.

The issue can be worked-around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization which deliberately tried to avoid a device synchronization.

This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device.

CC @wujingyue @Aidyn-A @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908
Approved by: https://github.com/kwen2501
2024-07-05 19:53:54 +00:00
faebaef089 [EZ] Fix typo in upload stats OIDC rolename (#130168)
My mistake from https://github.com/pytorch/pytorch/pull/129544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130168
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/atalman
2024-07-05 19:38:24 +00:00
3d56673b24 [Split Build][BE] remove extraneous .py, .a, and .so files (#130053)
Removes extraneous .a, .so, and .py files from the split build. From here, we can also clean up the builder script that produces the binary. That PR is https://github.com/pytorch/builder/pull/1912

Verification:

The built wheel with BUILD_LIBTORCH_WHL=1 has the following files only (with .a, .so, and .py extensions)

```
sahanp@devgpu086 ~/p/dist (viable/strict)> pwd                                                                                                                                                                                                                            (pytorch-3.10)
/home/sahanp/pytorch/dist
sahanp@devgpu086 ~/p/dist (viable/strict)> find . -type f \( -name "*.py" -o -name "*.a" -o -name "*.so" \)                                                                                                                                                               (pytorch-3.10)
./torch/__init__.py
./torch/lib/libbackend_with_compiler.so
./torch/lib/libc10.so
./torch/lib/libjitbackend_test.so
./torch/lib/libtorch.so
./torch/lib/libtorch_cpu.so
./torch/lib/libtorch_global_deps.so
./torch/lib/libtorchbind_test.so
sahanp@devgpu086 ~/p/dist (viable/strict)>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130053
Approved by: https://github.com/atalman
2024-07-05 19:05:32 +00:00
8ff243bcf1 Change deprecation warning on dispatch_on_subclass to warn once (#130047)
Summary:
Right now the deprecation warning fires on every operator that calls into torch_function. Change it to TORCH_WARN_ONCE instead.

More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/

Test Plan: Sandcastle

Differential Revision: D59338775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047
Approved by: https://github.com/XilunWu
2024-07-05 18:52:49 +00:00
784e3b4123 Revert "Change numeric_debug_handle to store per-node id (#129811)"
This reverts commit a9a744e442975cfbc6f4b26a532e5c1b3d9d5692.

Reverted https://github.com/pytorch/pytorch/pull/129811 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129811#issuecomment-2211245852))
2024-07-05 18:14:02 +00:00
889ed48a22 Fix missing id-token write in upload stats (#130153)
Fix the mistake from https://github.com/pytorch/pytorch/pull/129544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130153
Approved by: https://github.com/clee2000
2024-07-05 18:05:46 +00:00
7c5f3cd049 Add explain function to TSConverter. (#129968)
Summary: The explain function does a conversion dry run to give users feedback on which operators are not supported or fail the conversion.
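
A hedged usage sketch; the `TS2EPConverter` import path and constructor arguments are assumed from the converter tests below, and the toy module is a placeholder:
```
# Hedged usage sketch; TS2EPConverter's import path and arguments are assumed,
# and the toy module is only a placeholder.
import torch
from torch._export.converter import TS2EPConverter

class Toy(torch.nn.Module):
    def forward(self, x):
        return x.relu() + 1

ts_module = torch.jit.script(Toy())
converter = TS2EPConverter(ts_module, (torch.randn(3),))
converter.explain()   # dry-run conversion; reports which ops are unsupported or fail to convert
```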

Test Plan: * `pytest test/export/test_converter.py`

Differential Revision: D59251934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129968
Approved by: https://github.com/angelayi
2024-07-05 18:04:29 +00:00
7ea8a3c9b8 [dynamo] Validate check_fn (#118448)
Fixes - https://github.com/pytorch/pytorch/issues/128090

Tracker issue here - https://github.com/pytorch/pytorch/issues/129937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118448
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-07-05 18:04:12 +00:00
7192ee0735 Default to input tensor device for as_nested_tensor(t) (#130050)
Fixes #129647
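
A hedged sketch of the expected behavior (assumes a CUDA device is available):
```
# Hedged sketch; assumes a CUDA device is available.
import torch

t = torch.randn(2, 3, device="cuda")
nt = torch.nested.as_nested_tensor(t)   # no device passed
assert nt.device == t.device            # defaults to the input tensor's device, not CPU
```
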
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130050
Approved by: https://github.com/YuqingJ
2024-07-05 17:50:08 +00:00
a33ee73a28 Upload perf stats to both Rockset and dynamoDB (#129544)
To avoid an outage on HUD, I plan to migrate perf stats to dynamoDB as follows:

1. Upload perf stats to both Rockset and dynamoDB
2. Copy all the existing content from Rockset to dynamoDB
3. Create new Rockset tables to map to dynamoDB
4. Switch HUD to use the new Rockset tables (temporarily)
5. Delete the existing tables

This depends on https://github.com/pytorch-labs/pytorch-gha-infra/pull/422

### Testing

```
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9770217910 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "gh/shunting314/162/head" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --dynamodb-table torchci-dynamo-perf-stats --match-filename "^inductor_"
...
Writing 1607 documents to DynamoDB torchci-dynamo-perf-stats
```

And confirm the same number of documents is on the table

![Screenshot 2024-07-03 at 18 10 35](https://github.com/pytorch/pytorch/assets/475357/6c055c96-00ca-4cb3-bbe5-fe4914f9da9b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129544
Approved by: https://github.com/clee2000
2024-07-05 16:31:49 +00:00
e7ab7b83bc Have torch_key hash entire torch directory (#129250)
Summary:
Title. This way, both FXGraphCache and AOTAutogradCache use the same torch_key, and we don't need to only hash specific files.

There's an argument to be made to only hash *.py and *.cpp files. Maybe we can fix the glob to do that.

We use a buck_filegroup because otherwise $SRCs gets too large. By using `$(location :torch_sources)`, we make the genrule implicitly depend on all files globbed by torch_sources.

Test Plan:
Unit tests still pass on OSS
For torch_key:

```
buck2 build caffe2:src_hash.txt -v 2 --show-output
```
See the output, then make any change to any torch file. See that the hash changes.

Reviewed By: oulgen

Differential Revision: D58875785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129250
Approved by: https://github.com/oulgen
2024-07-05 15:37:16 +00:00
eea4ece256 Revert "[audio hash update] update the pinned audio hash (#129429)"
This reverts commit 30fc4b06f55c7c4a915f938d7d5d6abbbc23bf61.

Reverted https://github.com/pytorch/pytorch/pull/129429 on behalf of https://github.com/jeanschmidt due to pytorch bot should not have allowed this merge, as there are failing jobs ([comment](https://github.com/pytorch/pytorch/pull/129429#issuecomment-2210894639))
2024-07-05 13:38:44 +00:00
4b05d9d233 Revert "[NCCL] Make sure current device is correct in torch.distributed.barrier()'s streamSynchronize (#129908)"
This reverts commit c9f1db265e317829b3a4d3af5be5c9266874dcd4.

Reverted https://github.com/pytorch/pytorch/pull/129908 on behalf of https://github.com/jeanschmidt due to Seems to have introduced windows errors on main ([comment](https://github.com/pytorch/pytorch/pull/129908#issuecomment-2210888890))
2024-07-05 13:34:59 +00:00
8f6765f7a7 [pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005)
This model's accuracy test recently regressed. I had a quite smooth debugging process figuring out the cause, so I'd like to write it down in case it is helpful.

Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model over e.g. the past month. For this model, we can see that it started to fail on June 08:

<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">

What's nice is the dashboard shows the nightly commits for each run.

Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df

Roughly looking thru the PRs, I feel
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . The accuracy test then passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224)

Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid; it removes no-op ops in the joint graph. I think the graph may get changed, causing the partitioner to make different recomputation decisions, which can change numerics slightly.

Since this is not a real issue, I'll raise the tolerance to make it pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
2024-07-05 10:26:39 +00:00
c0735a3dd3 [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models that I think fail for reasons other than a real issue.

## sebotnet33ts_256

The accuracy test for this model start to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

0.02 tolerance should suppress this error.

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.

## mobilenetv3_large_100
Fails in max-autotune (MA) mode. I cannot repro locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same.

## yolov3

Fail on dashboard with error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fail on the dashboard with error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I can not repro locally with command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-05 10:26:39 +00:00
8f1c2e1e28 [pt2-bench] pass acc test if ref is NaN (#129996)
I'm debugging the accuracy failure for training vision_maskrcnn.

Unfortunately I could not manage to run it locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error:
```
eager run fail: AssertionError: targets should not be none when in training mode
```
(Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn )

But look at the log from the dashboard
```
E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

We can see that both the reference number and the PT2 number are NaN. I changed torch._dynamo.utils.same to return True if both RMSE values are NaN.
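
A hedged sketch of the intended check (illustrative only, not the actual `torch._dynamo.utils.same` diff):
```
# Hedged sketch of the NaN handling; the threshold logic below is illustrative only.
import math

def rmse_close(res_rmse, ref_rmse, multiplier=3.0, tol=1e-3):
    if math.isnan(res_rmse) and math.isnan(ref_rmse):
        return True                       # both eager and compiled runs produce NaN: treat as equal
    return res_rmse <= ref_rmse * multiplier + tol
```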

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996
Approved by: https://github.com/jansel
2024-07-05 10:26:39 +00:00
78a0b010eb Refine XPU UTs (#130138)
# Motivation
1. Enable all test cases related to `TestXpu` to run in XPU CI.
2. Make `test_lazy_init` stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130138
Approved by: https://github.com/EikanWang
2024-07-05 09:56:22 +00:00
3240bff56a [benchmarking] Add join_results.py (#129202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129202
Approved by: https://github.com/yanboliang, https://github.com/shunting314
2024-07-05 06:55:30 +00:00
30fc4b06f5 [audio hash update] update the pinned audio hash (#129429)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129429
Approved by: https://github.com/pytorchbot
2024-07-05 03:32:29 +00:00
c9f1db265e [NCCL] Make sure current device is correct in torch.distributed.barrier()'s streamSynchronize (#129908)
The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream's device (this happens if a user is running distributed code without e.g., `torch.cuda.set_device(mylocalrank)`) then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally:
a21d4363d2/c10/cuda/CUDAStream.h (L132)

OUTDATED below:
The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following:
```
import logging
import os
import time
import torch
import torch.distributed as dist

def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")

    backend = 'nccl'
    group = torch.distributed.init_process_group(backend=backend)
    rank = torch.distributed.get_rank(group=group)

    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        logging.info(f"Rank {rank}: exit barrier {i}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead.

The issue can be worked-around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization which deliberately tried to avoid a device synchronization.

This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device.

CC @wujingyue @Aidyn-A @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908
Approved by: https://github.com/kwen2501
2024-07-04 20:36:58 +00:00
7128504424 [inductor] Add Triton template for Conv3D (#129518)
This commit adds a Triton template for Conv3D ops, following the same logic as Conv2D. Conv3D ops aren't used as frequently as Conv2D, so they may receive fewer optimizations in various libraries. Having a Triton-based inductor implementation can therefore improve performance in some cases.

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129518
Approved by: https://github.com/jansel, https://github.com/jataylo
2024-07-04 20:30:50 +00:00
e590168865 Enable sharing meta tensors between processes (#129520)
Fixes #129436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129520
Approved by: https://github.com/ezyang
2024-07-04 20:29:48 +00:00
21eeedb455 [Inductor] Add aot_mode UT to new cpp_builder. (#130105)
Changes:
1. Add `aot_mode` parameter to `validate_new_cpp_commands` UT.
2. Switch AotCodeCompiler vec isa command gen to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130105
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-04 19:08:56 +00:00
d496145534 [CD] Add triton xpu wheel build (#129730)
Enable the Triton XPU wheel build first, then add the PyTorch XPU nightly wheel build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129730
Approved by: https://github.com/atalman
2024-07-04 17:55:20 +00:00
f78b79daaa Forward fix the missing torch.nn.Module.set_submodule from D59140215 (#130075)
Summary: This forward-fixes D59140215 from a PyTorch open source contributor, T194074371. On the PyTorch side, we need to use isinstance instead of type when checking for nn.Module, which is the same way get_submodule is currently implemented.
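
A hedged sketch of the kind of check being changed (illustrative, not the actual diff):
```
# Hedged sketch: isinstance accepts nn.Module subclasses, while an exact
# type() comparison rejects them; this mirrors how get_submodule behaves.
import torch.nn as nn

def _assert_is_module(obj, path):
    if not isinstance(obj, nn.Module):
        raise AttributeError(f"`{path}` is not an nn.Module (got {type(obj).__name__})")
```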

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//dper3/dper3/core/tests:module_test`

Differential Revision: D59254638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130075
Approved by: https://github.com/mikaylagawarecki
2024-07-04 17:46:56 +00:00
5b5f4b02c2 [pipelining] [BE] Move pipeline_order validation to schedules.py (#129369)
# Changes
* small fix in stage error message
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369
Approved by: https://github.com/wconstab
ghstack dependencies: #129368
2024-07-04 16:38:30 +00:00
6dfa53ca76 Revert "[pt2-bench] pass acc test if ref is NaN (#129996)"
This reverts commit 51fa0bd436cf627bd0c8ccf3a3a8b9c07d260622.

Reverted https://github.com/pytorch/pytorch/pull/129996 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
fa3953a2e1 Revert "[pt2-bench] fix accuracy failure for a few models (#129941)"
This reverts commit dafbd603ee6672d9592ec72b59300a2631f431d2.

Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
54da35a2e0 Revert "[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005)"
This reverts commit 0af8c8a981e79b05767089e57e81262dbbf2b1b4.

Reverted https://github.com/pytorch/pytorch/pull/130005 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
57d05f2616 [RELAND] Add xpu to getAccelerator (#129205)
# Motivation
Add `xpu` support to `getAccelerator`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129205
Approved by: https://github.com/albanD, https://github.com/gujinghui
ghstack dependencies: #129463
2024-07-04 10:26:52 +00:00
551f3b92b2 [Dynamo] Add assertion for tensor unpack shape mismatch (#130077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130077
Approved by: https://github.com/Chillee
2024-07-04 09:25:08 +00:00
f3962cfd9c [RELAND] XPUHooksInterface inherits from AcceleratorHooksInterface (#129463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129463
Approved by: https://github.com/gujinghui, https://github.com/albanD
2024-07-04 08:46:34 +00:00
fa4e489d70 [dynamo][dynamic-shapes] Graph break if out shape changes on out= variants (#130074)
Fixes https://github.com/pytorch/pytorch/issues/130068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130074
Approved by: https://github.com/ezyang
ghstack dependencies: #129913, #129914
2024-07-04 08:36:12 +00:00
e98587c58d Update torch-xpu-ops pin (ATen XPU implementation) (#129353)
188 new ATen operators/variants are added in the pin update, covering eager and torch.compile usage on HuggingFace, TIMM and TorchBench models. 16 new unit tests were ported to enhance functionality coverage. The source file directory structure was aligned with ATen native. Fixed corner-case failures in aten::resize, aten::index_add and aten::index_put.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129353
Approved by: https://github.com/EikanWang
2024-07-04 07:36:17 +00:00
bffb278700 [ONNX] Add artifacts_dir to torch-onnx-patch in benchmark (#130069)
Add `artifacts_dir` to torch-onnx-patch to save error report for debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130069
Approved by: https://github.com/justinchuby
2024-07-04 07:11:02 +00:00
d62d351107 [Optim][BE] Change str(device) to _get_device_type(device) (#129984)
Prevent using vague expressions like `"cuda" in str(device)`
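
A hedged illustration of the difference; `_get_device_type` is an internal helper whose behavior is assumed to reduce to checking `device.type`:
```
# Hedged illustration; _get_device_type is assumed to boil down to device.type.
import torch

device = torch.device("cuda:0")
print("cuda" in str(device))   # fragile: substring match on the string "cuda:0"
print(device.type == "cuda")   # explicit: compares the device type directly
```
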
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129984
Approved by: https://github.com/janeyx99
ghstack dependencies: #129451, #129552
2024-07-04 06:44:48 +00:00
42f3d7e948 [MPS] Add mps profiler env vars to docs (#129552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129552
Approved by: https://github.com/malfet
ghstack dependencies: #129451
2024-07-04 06:44:48 +00:00
cyy
07b06f0f0a [2/N] Remove outdated CMake code (#130006)
Follows #129851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130006
Approved by: https://github.com/drisspg
2024-07-04 06:24:22 +00:00
26be691e6b Unify shard logic for inductor and dynamo test_config (#129508)
Addresses https://github.com/pytorch/pytorch/pull/129480#issuecomment-2189954552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129508
Approved by: https://github.com/clee2000, https://github.com/huydhn
2024-07-04 06:04:29 +00:00
9c9ac670a0 [dtensor][be] Reduced redundant LOC by creating functions to set up models used in example (#129613)
**Summary**
As the CommModeFeature example file grew, there were to many LOC that was repeated for setting up the models used. I created two functions, one to handle MLP and MLPStacked models and the other for transformer models. The output of the examples will not have changed.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129613
Approved by: https://github.com/XilunWu
ghstack dependencies: #129602
2024-07-04 06:00:58 +00:00
0b9995c1ce [dtensor][debug] Added forward and backward differentiation for module level tracing (#129602)
**Summary**
Previously, comm_mode only allowed users to differentiate between forward and backward passes at the operational level. I modified the code so that users can now see the collective counts for the passes at the module level. I also slightly changed how the output is formatted, making it easier to differentiate between a collective count and an operation. I have designed the operational trace table function so that, in the future, a user can use command-line arguments to determine the level of information they want to display, instead of having two similar functions. Finally, I have updated the new output and test cases for the comm_mode example and test files. The expected output for the first 3 examples is shown below:

<img width="320" alt="Screenshot 2024-06-26 at 2 30 25 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b8e88075-a07f-4e84-b728-a08959df3661">

<img width="497" alt="Screenshot 2024-06-26 at 2 29 15 PM" src="https://github.com/pytorch/pytorch/assets/50644008/5ef4bea7-1355-4089-bfb0-c7e3f588ac77">

<img width="615" alt="Screenshot 2024-06-26 at 2 31 05 PM" src="https://github.com/pytorch/pytorch/assets/50644008/feacae51-76f7-403b-b6cd-dd15e981770e">

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129602
Approved by: https://github.com/XilunWu, https://github.com/wz337
2024-07-04 06:00:58 +00:00
e2e624a02f [AOTAutograd] Micro-optimize runtime_wrapper (#128188)
This moves a bunch of runtime inspection of the `output_info` for alias handling into the construction of fixed output handlers that are created during compilation and captured by the runtime wrapper.
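
A hedged, generic sketch of the pattern (not the actual AOTAutograd code; `OutputInfo` and its field are hypothetical stand-ins): build a fixed handler once at compile time and capture it, instead of inspecting metadata on every call.
```
# Hedged, generic sketch; OutputInfo and the alias fixup are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class OutputInfo:
    alias_fixup: Optional[Callable] = None   # set during compilation if the output aliases an input

def make_output_handler(info: OutputInfo) -> Callable:
    # The branch happens once here, at wrapper-construction time; the runtime
    # wrapper only calls the captured closure, with no per-call inspection.
    if info.alias_fixup is not None:
        return info.alias_fixup
    return lambda out: out
```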

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188
Approved by: https://github.com/bdhirsh
2024-07-04 03:53:06 +00:00
a7a7363be0 [dynamo] Skip side effect tracking for c wrappers/descriptors (#129914)
Fixes PYTORCH_TEST_WITH_DYNAMO=1 pytest -vs test/test_python_dispatch.py::TestPythonDispatch::test_deepcopy_wrapper_subclass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129914
Approved by: https://github.com/jansel
ghstack dependencies: #129913
2024-07-04 03:14:45 +00:00
da8af685ac [dynamo] Skip ID_MATCH guard on GetSetDescriptorType (#129913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129913
Approved by: https://github.com/jansel
2024-07-04 03:14:45 +00:00
8405ba21c1 [inductor][cpp] fix the vec convertion between float and int64 on AVX2 (#130013)
Fix https://github.com/pytorch/pytorch/issues/129863

There is no single-instruction support on AVX2 for converting between fp and int64, so the conversion has to be emulated. The original fast implementation (see https://stackoverflow.com/questions/41144668) assumes the data range is within [-2^51, 2^51]. The issue reported in https://github.com/pytorch/pytorch/issues/129863 has input data outside this range and failed the test. This PR supports the full range of the conversion.
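
A hedged, pure-Python illustration of why the old fast path is limited to roughly [-2^51, 2^51]; this mirrors the double/int64 bit trick, not the actual AVX2 intrinsics:
```
# Pure-Python illustration of the magic-number double->int64 trick and its range limit.
import struct

MAGIC_DOUBLE = 6755399441055744.0         # 2**52 + 2**51
MAGIC_BITS = 0x4338000000000000           # bit pattern of MAGIC_DOUBLE

def fast_double_to_int64(x: float) -> int:
    # adding the magic constant places the integer value in the low mantissa bits
    bits = struct.unpack("<q", struct.pack("<d", x + MAGIC_DOUBLE))[0]
    return bits - MAGIC_BITS

print(fast_double_to_int64(1234.0))       # 1234: exact inside +/- 2**51
print(fast_double_to_int64(2.0**52))      # wrong: outside the window the trick assumes
```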

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130013
Approved by: https://github.com/lezcano
2024-07-04 03:01:49 +00:00
cyy
99ec7bbee7 Force inconsistent-missing-override for torch targets (#130010)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130010
Approved by: https://github.com/ezyang
2024-07-04 02:37:57 +00:00
0af8c8a981 [pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005)
This model's accuracy test recently regressed. I had a quite smooth debugging process figuring out the cause, so I'd like to write it down in case it is helpful.

Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model over e.g. the past month. For this model, we can see that it started to fail on June 08:

<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">

What's nice is the dashboard shows the nightly commits for each run.

Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df

Roughly looking thru the PRs, I feel
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . The accuracy test then passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224)

Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid; it removes no-op ops in the joint graph. I think the graph may get changed, causing the partitioner to make different recomputation decisions, which can change numerics slightly.

Since this is not a real issue, I'll raise the tolerance to make it pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
2024-07-04 01:14:29 +00:00
dafbd603ee [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models that I think fail for reasons other than a real issue.

## sebotnet33ts_256

The accuracy test for this model start to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

0.02 tolerance should suppress this error.

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.

## mobilenetv3_large_100
Fails in max-autotune (MA) mode. I cannot repro locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same.
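
A minimal sketch of the idea (the helper names and the exact comparison formula are assumptions, not the actual torch._dynamo.utils.same code):

```python
def pick_multiplier(numel: int, base_multiplier: float = 3.0) -> float:
    # Small tensors are noisier, so tolerate a larger deviation for them.
    # The cut-off and scale factor here are illustrative, not the real values.
    return base_multiplier * 3 if numel <= 10 else base_multiplier


def rmse_passes(res_rmse: float, ref_rmse: float, tol: float, numel: int) -> bool:
    # Hypothetical acceptance check in the spirit of torch._dynamo.utils.same:
    # the compiled result's RMSE may exceed the eager baseline's RMSE by at
    # most the multiplier, with `tol` as an absolute floor.
    return res_rmse <= pick_multiplier(numel) * max(ref_rmse, tol)
```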

## yolov3

Fails on the dashboard with this error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with this error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro it locally with this command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-04 01:14:29 +00:00
51fa0bd436 [pt2-bench] pass acc test if ref is NaN (#129996)
I'm debugging the accuracy failure for training vision_maskrcnn.

Unfortunately I could not get it to run locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error:
```
eager run fail: AssertionError: targets should not be none when in training mode
```
(Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn )

But looking at the log from the dashboard:
```
E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

We can see that both the reference number and the PT2 number are NaN. I changed torch._dynamo.utils.same to return True if both RMSE values are NaN.
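
A minimal sketch of that change (only the NaN handling mirrors the description above; the surrounding comparison logic is elided):

```python
import math

def both_rmse_nan(res_rmse: float, ref_rmse: float) -> bool:
    # New early-exit condition: if the eager reference RMSE and the compiled
    # result's RMSE are both NaN, the model produces NaN regardless of
    # compilation, so treat the accuracy comparison as a pass.
    return math.isnan(res_rmse) and math.isnan(ref_rmse)
```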

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996
Approved by: https://github.com/jansel
2024-07-04 01:14:29 +00:00
9108b74bbc Updates to scaled_mm for rowwise scaling (#130059)
# Summary

This updates _scaled_mm's API to enforce that input scales are always 2-dimensional. This resolves ambiguity around the scaling scheme.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130059
Approved by: https://github.com/vkuzo
2024-07-04 00:53:17 +00:00
cd70ac884f c10d/Utils: better error message on 0 bytes (#130056)
This improves the error messages on 0 bytes sent/received. We currently log it as a connection reset even when it's caused by other reasons.

Test plan:

```
python test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130056
Approved by: https://github.com/kurman, https://github.com/rsdcastro
2024-07-04 00:48:20 +00:00
cyy
efb73eda51 [2/N] Fix some violations of unused-function and unused-variable checks in torch_cpu (#129878)
Follows #128670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129878
Approved by: https://github.com/ezyang
2024-07-04 00:39:28 +00:00
d95a019704 [export] construct empty graph when there's no tensor computation (#129541)
Fixes [#127110](https://github.com/pytorch/pytorch/issues/127110).

When the input module does not contain any tensor computation, we now create a graph with only the inputs and outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129541
Approved by: https://github.com/angelayi
2024-07-04 00:26:17 +00:00
2fe7c1fe04 [custom ops] Support factory function (#129978)
Fixes #129389

If a user registers a device-specific implementation for an operator that accepts no Tensors, then we require the operator to have a "device: torch.device" argument.

We switch on the device argument to select the correct backend to dispatch to.
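
A hypothetical sketch of what this enables (the op name and implementations are made up, and the decorator usage assumes the torch.library.custom_op API):

```python
import torch
from torch.library import custom_op

# Factory op with no Tensor inputs: the `device` argument is what allows the
# correct backend implementation to be selected.
@custom_op("mylib::ones_plus", mutates_args=())
def ones_plus(n: int, device: torch.device) -> torch.Tensor:
    # default (CPU) implementation
    return torch.ones(n, device=device) + 1

@ones_plus.register_kernel("cuda")
def _(n: int, device: torch.device) -> torch.Tensor:
    # CUDA-specific implementation, chosen when `device` is a CUDA device
    return torch.ones(n, device=device) + 1
```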

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129978
Approved by: https://github.com/zou3519
2024-07-04 00:10:52 +00:00
779fc8119e Revert "XPUHooksInterface inherits from AcceleratorHooksInterface (#129463)"
This reverts commit 6353a12e6a80f06217645b10fb69cffeac08a8d0.

Reverted https://github.com/pytorch/pytorch/pull/129463 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129463#issuecomment-2207529072))
2024-07-03 23:43:15 +00:00
8a9725bedb Revert "Add xpu to getAccelerator (#129205)"
This reverts commit 3e2df3ca9d0a593e09bc94c14bbf2b213413cbf3.

Reverted https://github.com/pytorch/pytorch/pull/129205 on behalf of https://github.com/kit1980 due to Need to revert https://github.com/pytorch/pytorch/pull/129463 which breaks Meta builds ([comment](https://github.com/pytorch/pytorch/pull/129205#issuecomment-2207514346))
2024-07-03 23:37:24 +00:00
a9a744e442 Change numeric_debug_handle to store per-node id (#129811)
Summary:
Previously we stored an edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack,
but according to feedback from customers, people prefer the simpler per-node id; they are fine with not having the additional
support for numerical debugging of inputs and are willing to hack around to achieve this.

This PR changes the structure of numeric_debug_handle to store a unique id for each node instead.

e.g.
graph:
```
node = op(input_node, weight_node)
```
Before:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3}
```

After:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1
```

Test Plan:
python test/test_quantization.py -k TestGenerateNumericDebugHandle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811
Approved by: https://github.com/tarun292
2024-07-03 22:03:31 +00:00
b0d0114f5b Enable automigration for windows jobs (#129977)
Enable Windows jobs to automatically use LF runners when the author is opted-in

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129977
Approved by: https://github.com/clee2000
2024-07-03 22:02:56 +00:00
a79bb8db91 Make _embedding_bag_backward explicitly dispatch to CPU and CUDA. (#129691)
This PR modifies the `_embedding_bag_backward` entry inside _native_functions.yaml_ so that it
dispatches to CPU and CUDA directly, instead of to `CompositeImplicitAutograd`.

*Context:* PyTorch operations that use the `CompositeImplicitAutograd` dispatch do not
allow third-party backends (e.g. XLA) to modify their implementation, since this dispatch
key has higher priority. When calling the `_embedding_bag_backward` operation using XLA, a
dispatch error is thrown, since PyTorch/XLA doesn't support sparse tensors.

*Problem:* `_embedding_bag_backward` has a `sparse` parameter that controls whether the
operation should return a sparse or dense tensor. However, at the moment, PyTorch/XLA does
not support sparse tensors. In order to fall back to the dense path, i.e. change the
flag at runtime, we need to be able to modify its implementation.

*Solution:* we have changed the dispatch of `_embedding_bag_backward` to CPU and CUDA,
which allowed us to introduce our own kernel for it.

Additionally, this PR refactors the representation of its mode from constant integers
into an enum class. It also introduces two additional operators: `int == EmbeddingBagMode`
and `int != EmbeddingBagMode`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129691
Approved by: https://github.com/lezcano
2024-07-03 21:54:49 +00:00
7bbd6cf931 [custom_ops] Mark older custom ops prototypes as deprecated (#130032)
I've had at least one person try to call APIs from here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130032
Approved by: https://github.com/yushangdi, https://github.com/williamwen42
2024-07-03 21:11:05 +00:00
a21d4363d2 [Profiler] Remove all instances of TMP_USE_TSC_AS_TIMESTAMP (#129973)
Summary: Now that D56584521 is in, we can remove all instances of TMP_USE_TSC_AS_TIMESTAMP

Test Plan:
Ran resnet. Trace looks good
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jun_27_14_46_01.1967733.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi, swolchok

Differential Revision: D59132793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129973
Approved by: https://github.com/aaronenyeshi
2024-07-03 19:28:52 +00:00
042d764872 [export] Update example inputs format for DB. (#129982)
Summary: To give users simpler example code, we are getting rid of ExportArgs in favor of example_args and example_kwargs.

Test Plan: CI

Differential Revision: D59288920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129982
Approved by: https://github.com/angelayi
2024-07-03 17:53:15 +00:00
9b902b3ee3 AOTI: dont treat views of buffers as constants (#129688)
More context [here](https://github.com/pytorch/pytorch/issues/129682#issuecomment-2195463838), but this change was enough to get this AOTI + float8 repro running for me (below).

Previously, it would fail an assertion [here](https://github.com/pytorch/pytorch/blob/main/torch/_meta_registrations.py#L5387) at inductor lowering time. It looks like during lowering, we were supposed to pass `param.transpose(1, 0)` as the second argument to the scaled_mm kernel. But in the inductor IR, this object is a `ReinterpretView` with `get_name()` equal to one of the param constants, so we would end up passing the constant directly into the kernel, instead of performing the view first.

I'm not totally sure if this is the right place to make the change, so interested in any thoughts from inductor folks (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @eellison )

```
import torch
from torch.export import export
from torch.export._trace import _export
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD 3-Clause license found in the
# LICENSE file in the root directory of this source tree.
import copy
import io
import random
import unittest
import pytest
import torch
import torch.nn as nn
import torch.nn.functional as F
from float8_experimental.float8_dynamic_linear import Float8DynamicLinear
from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear
from float8_experimental.float8_tensor import Float8Tensor
from float8_experimental.float8_utils import compute_error
random.seed(0)
torch.manual_seed(0)
is_H100 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)
import torch.nn.utils.parametrize as parametrize
# NOTE: we should upstream this directly into export and make it more automatic!
class UnwrapTensorSubclass(torch.nn.Module):
    def forward(self, *tensors):
        todo = list(tensors)
        for tp, meta, inner_tensors in reversed(self.rebuild_stack):
            nb_tensor = len(inner_tensors)
            inner_tensors = {a: b for a, b in zip(inner_tensors, todo[-nb_tensor:])}
            todo = todo[nb_tensor:]
            rebuilt = tp.__tensor_unflatten__(inner_tensors, meta, None, None)
            todo.append(rebuilt)
        assert len(todo) == 1
        return todo[0]
    def right_inverse(self, tensor):
        assert type(tensor) is not torch.Tensor
        rebuild_stack = []
        plain_tensors = []
        todo = [tensor]
        while todo:
            obj = todo.pop()
            inner_tensors, metadata = obj.__tensor_flatten__()
            rebuild_stack.append((type(obj), metadata, inner_tensors))
            for attr_name in inner_tensors:
                val = getattr(obj, attr_name)
                if type(val) is torch.Tensor:
                    plain_tensors.append(val)
                else:
                    assert isinstance(val, torch.Tensor)
                    todo.append(val)
        self.rebuild_stack = rebuild_stack
        return plain_tensors
def unwrap_tensor_subclass(model, filter_fn=None):
    for name, child in model.named_children():
        if (
            isinstance(child, Float8DynamicLinear) and
            hasattr(child, "weight") and
            type(child.weight) is not torch.Tensor and
            isinstance(child.weight, torch.Tensor)
        ):
            parametrize.register_parametrization(child, "weight", UnwrapTensorSubclass())
        unwrap_tensor_subclass(child)
    return model
class FeedForward(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.w1 = nn.Linear(4096, 14336, bias=False)
        self.w3 = nn.Linear(4096, 14336, bias=False)
        self.w2 = nn.Linear(14336, 4096, bias=False)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
    def reset_parameters(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                m.reset_parameters()
export_model = FeedForward().to("cuda")
swap_linear_with_float8_linear(
    export_model,
    Float8DynamicLinear,
    from_float_kwargs={"pre_quantize_weight": True},
)
export_model = unwrap_tensor_subclass(export_model)
batch_size = 4
num_tokens = 1024
embedding_dim = 4096
input_tensor = torch.randn(
    batch_size, num_tokens, embedding_dim, device="cuda", dtype=torch.float32
)
example_args = (input_tensor,)
# NOTE: this breaks unless we use strict=False, pre_dispatch=False!
exported_program: torch.export.ExportedProgram = _export(
    export_model,
    example_args,
    strict=False,
    pre_dispatch=False,
)
with torch.no_grad():
    so_path = torch._inductor.aot_compile(exported_program.module(), example_args)
    print(so_path)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129688
Approved by: https://github.com/eellison
2024-07-03 17:24:08 +00:00
35600bcaad Print float with full precision, don't truncate (#130027)
Fixes https://github.com/pytorch/pytorch/issues/119338

Exercised in https://github.com/pytorch/pytorch/pull/118448

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130027
Approved by: https://github.com/lezcano, https://github.com/Skylion007
2024-07-03 17:20:19 +00:00
01e41f1814 Modified autotuning for flex_attention to pass in (proper) fake inputs for the block sparse entries (#129915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129915
Approved by: https://github.com/yanboliang, https://github.com/eellison
ghstack dependencies: #129846, #129950
2024-07-03 17:08:45 +00:00
e2eb33b089 Added methods to blockmask to visualize them (#129950)
<img width="319" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/319b10f4-f6fe-4ff8-9529-d366ff411b95">
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/27a8953a-3c50-4922-b5d0-4ea5630a133a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129950
Approved by: https://github.com/yanboliang, https://github.com/drisspg
ghstack dependencies: #129846
2024-07-03 17:08:45 +00:00
29c68df600 Stop immediately specializing common constants 0/1 for plain int (#128327)
Fixes https://github.com/pytorch/pytorch/issues/128319

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128327
Approved by: https://github.com/lezcano
ghstack dependencies: #129983
2024-07-03 16:41:51 +00:00
9e1e58e052 Support allowlisted modules and op overloads in AOTAutogradCache (#128329)
Ops in torch, torch.functional, and torch.nn.functional are cache safe by default (at least, based on my cursory audit of the ops). This fixes a few tests that use these ops with the cache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128329
Approved by: https://github.com/bdhirsh
2024-07-03 14:59:24 +00:00
64a04d2225 Make sparse empty constructors specialize instead of fail on symbolic inputs (#129983)
Exercised in https://github.com/pytorch/pytorch/pull/128327

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129983
Approved by: https://github.com/anijain2305
2024-07-03 13:27:19 +00:00
735044191f [Easy] Add whitespace after comma when re-rendering tuple default value in schema (#129884)
The default value of `rot90()` in the schema registry is `[0,1]` because we split the function schema by `", "`, so there must be no space after the `,` in `[0,1]`.

5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)

Then the default value is formatted to `(0,1)` in `pyi` files. This PR manually adds an extra whitespace when re-rendering the default value to a string.

```python
", ".join(string.split(","))
```

```python
# before
def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ...
# after
def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884
Approved by: https://github.com/ezyang
2024-07-03 11:45:24 +00:00
8f70bf7a94 Skip TestSDPAPrivateUse1Only on FBCODE (#129997)
Summary: The test is from D59181111, but I couldn't figure out a way to make it pass on FBCODE because loading a PyTorch C++ extension requires Ninja, which is not going to work with BUCK

Test Plan: `buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test:transformers`

Differential Revision: D59304327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129997
Approved by: https://github.com/drisspg
2024-07-03 06:48:51 +00:00
62b710782d change LayoutLMForSequenceClassification inference accuracy tolerance (#129728)
Fixes #128510.

https://github.com/pytorch/pytorch/pull/124451 makes LayoutLMForSequenceClassification hit SDPA pattern 1 and then encounter the accuracy issue. The issue only happens with single-thread BF16 inference. This PR increases the model tolerance to make the check pass. Note that even the math-version SDPA could have the issue because of some small implementation differences.

The test log:
Single thread
```
correct_result:  SequenceClassifierOutput(loss=tensor(0.5998), logits=tensor([[0.3301, 0.1338]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result:  SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
E0627 01:09:16.762789 140281313759104 torch/_dynamo/utils.py:1476] RMSE (res-fp64): 0.00151, (ref-fp64): 0.00046 and shape=torch.Size([1, 2]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000
E0627 01:09:16.762972 140281313759104 torch/_dynamo/utils.py:1390] Accuracy failed for key name logits
fail_accuracy
```

Multiple threads
```
correct_result:  SequenceClassifierOutput(loss=tensor(0.6007), logits=tensor([[0.3301, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result:  SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129728
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-03 06:28:27 +00:00
4fc9157e90 [halide-backend] Disable split reductions for Halide (#129320)
In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #129321
2024-07-03 05:56:40 +00:00
0abcca85b7 [halide-backend] Support manual schedules (#129321)
Currently using this for some by-hand hacking, but might need to implement our own scheduler later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321
Approved by: https://github.com/shunting314
2024-07-03 05:56:40 +00:00
8af58f66bb Fix typo in floordiv solver code that affects flipped relation (#129888)
Fixes https://github.com/pytorch/pytorch/issues/123535

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129888
Approved by: https://github.com/lezcano
2024-07-03 04:47:32 +00:00
424cd1e1df Enable TORCH_TRACE by default on Conda on Mast (#129988)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129988
Approved by: https://github.com/kunalb
2024-07-03 03:35:45 +00:00
1026b0f687 Use setup-miniconda step from test-infra for llm retrival workflow (#129720)
Undo https://github.com/pytorch/pytorch/pull/129722

Use the setup-miniconda step written in test-infra to install miniconda in the llm retrieval workflow.  It comes with a cache so we don't have to worry about hitting cache limits.  The llm retrieval job was failing due to too many requests https://github.com/pytorch/pytorch/issues/129718#issue-2379260544

2aba8f107a/.github/actions/setup-miniconda/action.yml (L1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129720
Approved by: https://github.com/PaliC, https://github.com/malfet, https://github.com/huydhn
2024-07-03 03:02:23 +00:00
31fc5b8966 Add support for inline_asm_elementwise in Inductor lowerings (#129846)
This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it.

<img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846
Approved by: https://github.com/shunting314
2024-07-03 02:34:03 +00:00
9ee8c18309 TCPStore: add ping to verify network connectivity on connect (#129985)
This does a round trip request on socket connect -- this allows for detecting connection resets etc and retrying before the non-retryable application requests are sent.

This adds support for PING to both the libuv and legacy backend.

Example error:
```
[trainer85612|12]:W0701 13:41:43.421574  4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer
[trainer85612|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first):
...
[trainer85612|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637
[trainer85612|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868
[trainer85612|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775
```

Test plan:

```
python test/distributed/test_store.py -v
```

```
tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py
starting pool
started 90000
started 30000
started 70000
started 20000
started 80000
started 60000
started 0
[W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
init 20000
set 20000
init 80000
set 80000
init 70000
set 70000
init 60000
set 60000
init 30000
set 30000
init 90000
set 90000
started 40000
init 40000
set 40000
started 50000
init 50000
set 50000
started 10000
init 10000
set 10000
init 0
set 0
run finished 617.2992351055145
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985
Approved by: https://github.com/rsdcastro, https://github.com/kurman
2024-07-03 02:09:44 +00:00
91a8376d47 run_test: Unset cpp stacktraces after reruns (#129004)
Rerun the failing test singly with the env var set.  If it succeeds, start a new process without the cpp stack traces env var

We don't want to waste time generating these if we don't have to

They can also show up in assertion errors, which may cause unexpected failures if a test wants to check these

Adds a new --rs (run single) option to be used the same way --scs and --sc are.  It will only run that single test in the current file

https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs

In the above:
* test_checkpoint_valid failed, then passed in another subprocess.  The testing continued in a different new subprocess from the test right after it (test_checkpointing_without_reentrant_early_free)
* test_format_traceback_short failed consistently, but it continued to run because keep-going was set

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129004
Approved by: https://github.com/PaliC
2024-07-03 01:50:15 +00:00
c77c139878 [Intel Triton] Update Intel Triton to resolve installation issue on manylinux. (#129847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129847
Approved by: https://github.com/Skylion007, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #129782
2024-07-03 01:46:32 +00:00
c686304277 Enable UFMT on test/test_public_bindings.py (#128389)
Part of: https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
> test/test_public_bindings.py

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128389
Approved by: https://github.com/malfet
2024-07-03 01:43:41 +00:00
3b77b122c5 [Inductor UT] update rtol for convoluton on XPU. (#129782)
[Inductor UT] update rtol for convoluton on XPU.
Fix https://github.com/pytorch/pytorch/issues/129974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129782
Approved by: https://github.com/atalman
2024-07-03 01:37:16 +00:00
1e27af335e [easy] enhance local model loading (#129897)
Summary:
1. add one more model lib dep.
2. add an error message when TorchScript fails to find a class in the Python compilation unit.

Test Plan: CI

Reviewed By: jingsh

Differential Revision: D59243250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129897
Approved by: https://github.com/jingsh
2024-07-03 00:29:02 +00:00
be2d79a16b [dynamic] config to disable duck sizing (#129804)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129804
Approved by: https://github.com/ezyang
2024-07-03 00:20:54 +00:00
111f9b5d44 [Dynamo] Add config to skip/inline torchrec (#129912)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129912
Approved by: https://github.com/anijain2305
2024-07-03 00:14:51 +00:00
89646ebb11 Revert "[export] make with_effect mark op has_effect to prevent them from DCEed. (#129680)"
This reverts commit 4b8a5e03745924c8f987dc072fa4d41f4cb6f103.

Reverted https://github.com/pytorch/pytorch/pull/129680 on behalf of https://github.com/kit1980 due to breaking internal builds, see D59181183 ([comment](https://github.com/pytorch/pytorch/pull/129680#issuecomment-2204737227))
2024-07-03 00:03:50 +00:00
921c116089 [inductor] Kill mark_node_as_mutating (#129346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129346
Approved by: https://github.com/lezcano
ghstack dependencies: #128893, #129325, #129343, #129344
2024-07-02 23:50:07 +00:00
b2ac8d2af3 [inductor] Use multiple outputs for flex-attention (#129344)
This fixes the DCE issue for attention output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129344
Approved by: https://github.com/lezcano
ghstack dependencies: #128893, #129325, #129343
2024-07-02 23:50:07 +00:00
45844e0d4e [inductor] Add FileCheck to flex attention epilogue test (#129343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129343
Approved by: https://github.com/lezcano
ghstack dependencies: #128893, #129325
2024-07-02 23:50:04 +00:00
7955cd3e83 [inductor] Make UserDefinedTritonKernel a multi-output operation (#129325)
Previously each mutation was represented by a `MutationOutput` operation, which
was a new scheduler node that had to be scheduled immediately afterwards.

Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129325
Approved by: https://github.com/lezcano
ghstack dependencies: #128893
2024-07-02 23:50:00 +00:00
fb078c20c1 [inductor] Separate Buffer and Operation into two concepts (#128893)
Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893
Approved by: https://github.com/lezcano
2024-07-02 23:49:57 +00:00
872d972e41 [custom_op] better error message on no returns (#129896)
I run into this a lot. I can imagine that it would look opaque to users,
so I made it more friendly.

Old error message: "ValueError: infer_schema(func): Return has unsupported type <class 'inspect._empty'>."

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129896
Approved by: https://github.com/yushangdi
2024-07-02 23:34:23 +00:00
aa0352ca38 [custom ops] add default value support for device types (#129792)
Fixes #129371

I think the first case in issue #129371 is already supported in the current code, since it takes care of string default values. This PR adds support for device-type default values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129792
Approved by: https://github.com/zou3519
2024-07-02 23:31:29 +00:00
d7680a564b Bug fixes for disabling 0/1 specialization on plain int (#129961)
These bug fixes will be exercised in
https://github.com/pytorch/pytorch/pull/128327 but I separate them from
the actual policy change (which is more risky)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129961
Approved by: https://github.com/lezcano
2024-07-02 23:19:48 +00:00
eqy
29ffa20bb1 [CUDA] Bump tolerances for test_grad_pca_lowrank (#129902)
The revert of #127199 seems to surface an additional failure on A100; a small tolerance bump accounts for this.

I did find what appears to be a race condition in one of the kernels used in this workload, but I'm not sure it's related here...

CC @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129902
Approved by: https://github.com/ezyang
2024-07-02 23:17:02 +00:00
b5fdbc1a9f Revert "[pipelining] [BE] Move pipeline_order validation to schedules.py (#129369)"
This reverts commit ec789a3c9ddd4e550b3dea6934ce2d41deb98784.

Reverted https://github.com/pytorch/pytorch/pull/129369 on behalf of https://github.com/clee2000 due to broke test/distributed/pipelining/test_schedule.py::ScheduleTest::test_non_symmetric_stage_ids_ScheduleClass0 on distributed cuda https://github.com/pytorch/pytorch/actions/runs/9766039400/job/26959115773 ec789a3c9d.  You can see the error on the PR, but Dr. CI classified it wrong ([comment](https://github.com/pytorch/pytorch/pull/129369#issuecomment-2204568418))
2024-07-02 22:30:53 +00:00
b6f781e433 Bug fix for captuing execution trace grid function (#129832)
Summary:
The inputs to the grid function are variadic: it can be one number, two numbers, or three numbers. The current implementation captured them as a tuple, for example "grid((16,))". The fix is to capture them as a varying number of elements; in the previous example, the capture becomes "grid(16,)".
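
A tiny illustration of the two captured formats (illustrative only; this is not the actual capture code, and only the single-dimension example comes from this PR):

```python
grid_dims = (16,)

# Old capture: the whole tuple as a single argument.
old = f"grid({grid_dims})"                                  # "grid((16,))"
# New capture: a varying number of elements, one per dimension.
new = "grid(" + "".join(f"{d}," for d in grid_dims) + ")"   # "grid(16,)"
```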

PARAM et-replay code will be modified to reflect this change in a follow-up diff.

Test Plan: buck2 test  mode/dev-nosan caffe2/test:profiler -- -- test_execution_trace_with_pt2

Differential Revision: D59195933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129832
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2024-07-02 22:23:57 +00:00
39357ba06f [dynamo] don't constrain range on the replacement for a symbol (#129907)
# Error
```
  File "/data/users/colinpeppler/pytorch/torch/_meta_registrations.py", line 704, in sym_constrain_range
    constrain_range(size, min=min, max=max)
  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 898, in constrain_range
    a.node.shape_env._constrain_range(a.node.expr, min, max)
  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/recording.py", line 245, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 2813, in _constrain_range
    assert isinstance(a, sympy.Symbol), f"constraining non-Symbols NYI, {a} is {type(a)}"
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: constraining non-Symbols NYI, s1 + s2 is <class 'sympy.core.add.Add'>
```

# Context
I ran into the following scenario:
```
getitem = ...
sym_size_int = torch.ops.aten.sym_size.int(getitem, 0) # this is u0 = s0 + s1
_check_is_size = torch._check_is_size(sym_size_int)
# we fail at this guy
sym_constrain_range_default = torch.ops.aten.sym_constrain_range.default(sym_size_int, min = 4, max = 1234)

# runtime assertion
add = sym_size_int + sym_size_int_1
eq = add == sym_size_int
_assert_scalar_default = torch.ops.aten._assert_scalar(eq, "Runtime assertion failed for expression Eq(s0 + s1, u0) on node 'eq'")
```

everything except getitem was inserted into the FX graph by insert_deferred_runtime_asserts()
7e4329c258/torch/fx/passes/runtime_assert.py (L38-L52)

In the above scenario, we fail trying to constrain the range on `s0 + s1`, which is not a `sympy.Symbol`.

And why exactly are we constraining the range on `s0 + s1`? Because it's the replacement for `u0`.

# Approach
Whenever we try to constrain the range on the replacement of ~~an unbacked symint~~ a non-symbol, just ignore it.

In the scenario above, we'll be okay to ignore it because whenever there's a replacement on an unbacked symint, we will update its range. Hence, there is no need to constrain the range on `s1 + s2`. We can confirm this with `TORCH_LOGS="+dynamic"`.
```
torch/fx/experimental/symbolic_shapes.py:4737: _update_var_to_range u0 = VR[4, 198] (update)
torch/fx/experimental/symbolic_shapes.py:4856: set_replacement u0 = s1 + s2 (trivial_lhs) VR[4, 198]
```

600bf978ba/torch/fx/experimental/symbolic_shapes.py (L4759-L4764)
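
A minimal sketch of the approach (illustrative names, not the actual symbolic_shapes.py code):

```python
import sympy

def _constrain_range_sketch(expr, min_val, max_val):
    if not isinstance(expr, sympy.Symbol):
        # `expr` is a replacement such as s1 + s2; its range was already
        # updated when the replacement was set, so just ignore the request.
        return
    # ... existing logic that records the [min_val, max_val] range for `expr` ...
```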

Differential Revision: [D59257079](https://our.internmc.facebook.com/intern/diff/D59257079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129907
Approved by: https://github.com/jingsh
2024-07-02 21:46:40 +00:00
c22e66896f Revert "Fix typo in floordiv solver code that affects flipped relation (#129888)"
This reverts commit 3c6c3b94486d49614bae5e76e7bd6b9579f643d4.

Reverted https://github.com/pytorch/pytorch/pull/129888 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the updated test starts to fail flakily in trunk somehow, so I am reverting the change to see if it helps ([comment](https://github.com/pytorch/pytorch/pull/129888#issuecomment-2204442653))
2024-07-02 21:16:59 +00:00
1ddb100318 [FSDP1][Easy] Remove Spammy Log Lin in _runtime_utils.py (#129967)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129967
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/Skylion007
2024-07-02 21:08:57 +00:00
deefc10dd3 [executorch hash update] update the pinned executorch hash (#129428)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129428
Approved by: https://github.com/pytorchbot
2024-07-02 20:39:39 +00:00
cyy
26de2c2487 [3/N] Enable clang-tidy on torch/csrc/jit/serialization/* (#129850)
Follows #129300.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129850
Approved by: https://github.com/ezyang
2024-07-02 20:08:48 +00:00
8ec5ba960f [MPS] Add tensor_lr overloads to fused adam & adamw (#129451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129451
Approved by: https://github.com/janeyx99
2024-07-02 19:46:30 +00:00
2631a96f2a Stop updating hints (#129893)
Some profiling suggests that the repeated `maybe_evaluate_static` calls are expensive.

Ref: https://github.com/pytorch/pytorch/issues/123964

With test script:

```
import torch
import torch._dynamo.config

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(a, b):
    xs = b.tolist()
    for x in xs:
        torch._check_is_size(x)
        torch._check(x <= 20)
    return a.split(xs)

N = 20

splits = torch.randint(10, (N,))
sz = splits.sum().item()

f(torch.randn(sz), splits)
```

Before:

```
real    0m18.526s
user    0m16.555s
sys     0m11.031s
```

After:

```
real    0m13.831s
user    0m12.152s
sys     0m10.941s
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129893
Approved by: https://github.com/lezcano
2024-07-02 19:24:33 +00:00
1f6c1fcd36 [dtensor][debug] add operation tracing to comm_mode (#129017)
**Summary**
I have added an even more detailed module tracker that now includes the collective counts and operations that happen in each submodule, making it easier for users to debug. The tracing now includes the operation's DTensor arguments' input shape and sharding. Like the module collective tracing, the user also has the option to log the tracing table to an output.txt file. I have decided not to include the example output for the transformer as it is too many lines. The expected output for MLP_operation_tracing is shown below:

<img width="574" alt="Screenshot 2024-06-25 at 3 33 16 PM" src="https://github.com/pytorch/pytorch/assets/50644008/a09e2504-19d5-4c69-96e8-f84e852d7786">

<img width="467" alt="Screenshot 2024-06-25 at 3 33 45 PM" src="https://github.com/pytorch/pytorch/assets/50644008/55c07d2d-6cb6-410f-82ac-2849bb7bfbbb">

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129017
Approved by: https://github.com/XilunWu
2024-07-02 19:05:05 +00:00
bf05ea2bab Re-generate Linux build workflows after #124014 (#129976)
This looks like a landrace as lint passed on #124014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129976
Approved by: https://github.com/kit1980
2024-07-02 18:57:20 +00:00
080149cb38 [Inductor][FlexAttention] Add helper functions of converting score_mod to block_mask (#129909)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129909
Approved by: https://github.com/Chillee, https://github.com/drisspg
ghstack dependencies: #129831, #129859
2024-07-02 18:48:16 +00:00
1f3e2d7877 [Inductor] Rename TemplatedAttention to FlexAttention (#129859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129859
Approved by: https://github.com/Chillee, https://github.com/drisspg
ghstack dependencies: #129831
2024-07-02 18:48:16 +00:00
aa7ea6b45c Add wraps back (#129933)
Fixes https://github.com/pytorch/pytorch/issues/129922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129933
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-07-02 18:24:02 +00:00
ec789a3c9d [pipelining] [BE] Move pipeline_order validation to schedules.py (#129369)
# Changes
* small fix in stage error message
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369
Approved by: https://github.com/wconstab
ghstack dependencies: #129368
2024-07-02 18:19:28 +00:00
4eb449f7dc [pipelining] add small logging section to docs (#129368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129368
Approved by: https://github.com/wconstab
2024-07-02 18:19:28 +00:00
34e94c507a [Inductor] Make FlexAttention block_mask argument as tuple (#129831)
Re-organize the ```block_mask```-related arguments into a tuple to reduce the number of individual arguments. I was trying to use a named tuple, but AOT autograd doesn't work well with named tuples. The only downside of using a tuple rather than a named tuple is that we need to use an index to access its elements. But we only need this in one place, so it should be fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129831
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-07-02 17:18:33 +00:00
9105d54c6b [dynamo][sparse] Graph break on sparse tensors (#129883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129883
Approved by: https://github.com/ezyang
ghstack dependencies: #129830, #129858, #129857, #129881
2024-07-02 16:51:56 +00:00
75443d3daf [dynamic-shapes] Dont create symbol if .item() is a nan (#129881)
Passes ` PYTORCH_TEST_WITH_DYNAMO=1 pytest test/torch_np/numpy_tests/lib/test_function_base.py::TestInterp::test_scalar_interpolation_point` in the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129881
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #129830, #129858, #129857
2024-07-02 16:51:56 +00:00
d146a62e77 [MPS][BE] Introduce mtl_setBytes (#129910)
Which for primitive types calls `[encoder setBytes:&val length:sizeof(val) index:idx];` and for container types passes a number of elements equal to the size of the container

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129910
Approved by: https://github.com/Skylion007
2024-07-02 16:36:57 +00:00
9fb2dec7a6 [custom ops] Add unknown arg (#129614)
Fixes #129372

Add a mutated_args="unknown" option that pessimistically assumes that all inputs to the operator are being mutated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129614
Approved by: https://github.com/zou3519
2024-07-02 16:10:14 +00:00
e3b3431c42 Fix for HistogramObserver (#129387)
Summary:
There were two problems with the HistogramObserver:
1. It does not work when someone passes a batch_size 1, tensor_size 1 data-point.
2. The Histogram doesn't seem to actually update if the range of the new x falls within the old one

These issues were both fixed.

On top of this, I greatly simplified the logic for the histogram updating. Now, it doesn't do the downsampling anymore, which saves a ton of memory and code. The accuracy can still be controlled with the upsampling ratio. This ratio was also too high for the accuracy we generally need here, so I reduced the default.
Also, the code is cleaner now and much easier to follow.

test_histogram_observer_same_inputs was likely wrong: if I pass 0s and 1s to my HistogramObserver, I want them to actually count! The current test assumes it's fine to discard and ignore these values.
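
A small usage sketch (illustrative only) that exercises the two reported failure modes:

```python
import torch
from torch.ao.quantization.observer import HistogramObserver

obs = HistogramObserver()
obs(torch.tensor([0.7]))          # problem 1: batch_size 1 / tensor_size 1 data point
obs(torch.tensor([-2.0, 3.0]))    # establishes an initial range
obs(torch.tensor([0.1, 0.9]))     # problem 2: new values fall within the old range
scale, zero_point = obs.calculate_qparams()
```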

Test Plan: You can run the included tests.

Differential Revision: D58931336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129387
Approved by: https://github.com/jerryzh168
2024-07-02 15:41:44 +00:00
03440a1c13 Revert "Add support for inline_asm_elementwise in Inductor lowerings (#129846)"
This reverts commit badc638eb68c0b07ae3b857e885e6d0137b218aa.

Reverted https://github.com/pytorch/pytorch/pull/129846 on behalf of https://github.com/jeffdaily due to introduced ROCm breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/129846#issuecomment-2203519554))
2024-07-02 15:25:34 +00:00
3fd128361e [traced-graph][sparse] add relay override for layout_impl (#129930)
In the "layout()" method of "TensorImpl" defined in the file core/TensorImpl.h, the following code and documentation can be found:

```
  Layout layout() const {
  ...
  if .. {
  ...
  } else if (is_sparse_compressed()) {
      // Typically, the tensor dispatch keys define the tensor layout
      // uniquely. This allows using non-virtual layout method for
      // better performance. However, when tensor's layout depends,
      // say, on tensor attributes, one must use this execution path
      // where the corresponding tensor impl class overwrites virtual
      // layout_impl() method.
      return layout_impl();
    } else {
    ...
    }
  }

```
However, this override was never implemented. This PR puts the override in place, to prepare for sparsity propagation in another PR.

https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129930
Approved by: https://github.com/ezyang
2024-07-02 15:24:34 +00:00
dacc33d2fa Make sym_min/sym_max handle Numpy scalars (#129917)
Internal xref:
https://fb.workplace.com/groups/1069285536500339/posts/7773876449374514/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129917
Approved by: https://github.com/Skylion007
2024-07-02 14:59:20 +00:00
f1df13f023 [BE][Easy] Fix PYI001: unprefixed-type-param in torch/utils/data/datapipes (#129885)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129885
Approved by: https://github.com/ezyang
2024-07-02 14:56:27 +00:00
257b9c7936 Fix layout for *_like() factories on NJTs (#129879)
Background: this bug was triggering DEBUG=1 asserts in the backward for `unbind()`, which calls `empty_like()`. I found that the NJT implementation of `empty_like()` was redispatching on `values` while blindly passing along all kwargs. This resulted in `empty_like(values, ..., layout=torch.jagged)`, which is incorrect since `values` is strided, tripping the debug assert here:

433b691f98/aten/src/ATen/EmptyTensor.cpp (L305)

This PR explicitly sets `layout=torch.strided` when redispatching `*_like()` factories on `values`.
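
A rough sketch of the fix's effect (the wrapper name is hypothetical, and re-wrapping the result back into an NJT is elided):

```python
import torch

def njt_empty_like_values(njt, **kwargs):
    # `values` is a plain strided tensor, so the redispatched *_like call must
    # not inherit layout=torch.jagged from the NJT; force strided instead.
    kwargs["layout"] = torch.strided
    return torch.empty_like(njt.values(), **kwargs)
```
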
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129879
Approved by: https://github.com/soulitzer
2024-07-02 14:51:23 +00:00
6c2a8b6b38 [Ez][BE]: Enable new stable ruff rules (#129825)
Applies a bunch of new ruff lint rules that are now stable. Some of these improve efficiency or readability. Since I already did passes on the codebase for these when they were in preview, there should be relatively few changes to the codebase. This is just more for future hardening of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129825
Approved by: https://github.com/XuehaiPan, https://github.com/jansel, https://github.com/malfet
2024-07-02 14:47:10 +00:00
2926655761 [inductor] optimize cpp builder configuration code (#129577)
Changes:
1. Combine the ISA-selection condition dispatch code.
2. Unify the macOS OpenMP configuration code.
3. Clean up useless code.

Co-authored-by: Jason Ansel <jansel@jansel.net>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129577
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-02 14:41:59 +00:00
6cb0ad3375 [BE]: Update NCCL submodule to 2.21.5 (#124014)
Update NCCL to the latest version. This release is mostly bugfixes with a few new minor features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124014
Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman
2024-07-02 14:39:33 +00:00
dc75ec252a [inductor] Fix can_merge check for expr=q0*q1 (#129806)
Fixes #111884

In the minimised reproducer, we have a loop with the index expression `-q0*q1`
for which in the merge tester we get:
```
expr1 = - 0 * (_merge_tester * 16) = 0
expr2 = - _merge_tester * 0 = 0
```
so it decides we can merge the dimensions and `q0` is set to `0`, meaning `-q0*q1` is always zero!

Here I change the test so we have at least one case where no zeros are
substituted so we can catch this situation. In the normal strided case we get
e.g.
```
expr = 16 * q0 + q1
expr1 = 16 * _merge_tester2 + (16 * _merge_tester1)
expr2 = 16 * (_merge_tester2 + _merge_tester1)
```
which are still equivalent expressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129806
Approved by: https://github.com/lezcano
2024-07-02 14:30:02 +00:00
37e3c60897 [Inductor][CPP] Remove redundant INT8-specific logic in the INT8 GEMM template (#129470)
**Summary**
Remove redundant INT8-specific logic in the INT8 GEMM template to unify the code structure with FP32/BF16/FP16 GEMM Template.

**Test Plan**
```
numactl -C 56-111 -m 1 python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129470
Approved by: https://github.com/jgong5
ghstack dependencies: #128825, #129048, #129049, #129103, #129220, #129221
2024-07-02 13:15:15 +00:00
b6379591a9 [Inductor][CPP] Pass weight dtype explicitly for cpp gemm template (#129221)
**Summary**
This PR mainly refactors two things:

1. Pass in the weight's data type explicitly in `create_micro_gemm` as `input2.dtype`. When registering `CppMicroGemmConfig`, we will reuse `input.dtype` if `input2.dtype` is not explicitly registered.
2. Add a util function to get the output data type and compute data type from the input data type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129221
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049, #129103, #129220
2024-07-02 13:06:32 +00:00
72fa864098 [Inductor][CPP] Enable Quantized Linear with AMX MicroGEMM (#129220)
**Summary**
Add the AMX micro gemm kernel with int8 data type.

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_amx
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [✓] Binary Fusion
- [✓] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129220
Approved by: https://github.com/jgong5
ghstack dependencies: #128825, #129048, #129049, #129103
2024-07-02 12:53:35 +00:00
a796358330 [Inductor][CPP] Enable Quantized Linear GEMM Template with Binary Fusion (#129103)
**Summary**
Based on the previous PR, add the config to support quantized linear binary + optional(unary) post-op fusion.

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16/uint8
- Post Op Fusion: with binary and optional[Unary] post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise_binary
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [✓] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129103
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049
2024-07-02 12:45:10 +00:00
86e2d16ba0 [Inductor][Quant] Change the schema of QLinear Binary (#129049)
**Summary**
We change the schema of QLinear Binary so that it will be easier to enable the corresponding GEMM template.

- The extra input of the binary post-op is a tensor that needs to be an input node for autotuning, so we move it in front of `output_scale`, which is a scalar.
- We also move it in front of `bias`, since `bias` is an optional tensor for this fusion, while `other` is required for linear binary fusion.

**Test Plan**
```
python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129049
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048
2024-07-02 12:36:38 +00:00
07450e9713 Revert "[MPS] Add support for autocast in MPS (#99272)"
This reverts commit 6240cfd5c751bea6ca91dc765085e1d871b22345.

Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/jeanschmidt due to introduced breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2203033719))
2024-07-02 12:29:51 +00:00
0441173ab2 Add slowTest marker to test_linalg_solve_triangular_large (#129903)
In nvidia internal testing, for slower devices such as Orin NX, on large dtypes like complex128, test_linalg_solve_triangular_large is taking multiple hours to complete and timing out CI. This PR adds a slowTest marker so it can be skipped due to speed issues. cc @nWEIdia
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129903
Approved by: https://github.com/lezcano
2024-07-02 12:27:12 +00:00
95a5958db4 [ROCm] Update nightly triton-rocm pin to release branch (#129361)
Update pin to tip of https://github.com/triton-lang/triton/commits/release/3.0.x/ following upstream strategy here https://github.com/pytorch/pytorch/pull/126098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129361
Approved by: https://github.com/peterbell10
2024-07-02 11:49:52 +00:00
3c6c3b9448 Fix typo in floordiv solver code that affects flipped relation (#129888)
Fixes https://github.com/pytorch/pytorch/issues/123535

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129888
Approved by: https://github.com/lezcano
2024-07-02 11:15:03 +00:00
8ef8240172 Don't mark conversion to float as is_integer = False (#129890)
Zero is an integer, so if you say is_integer = False, you are also
saying the result cannot be zero, which is undesirable.
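
A quick way to see that interaction, assuming the standard SymPy assumption propagation that the shape code relies on:

```python
import sympy

x = sympy.Symbol("x", integer=False)
print(x.is_zero)  # False: declaring "not an integer" already rules out zero
```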

This is exercised by next PR in the stack.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129890
Approved by: https://github.com/lezcano
2024-07-02 11:08:09 +00:00
eb1ff76f23 Make are_strides_like_channels_last size oblivious (#129677)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129677
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #129869
2024-07-02 11:05:20 +00:00
ebeeb22669 Correctly put mark_unbacked symbols in shape_env_to_source_to_symbol_cache (#129869)
Internal xref:
https://www.internalfb.com/intern/anp/view/?source=version_selector&id=5534845

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129869
Approved by: https://github.com/albanD
2024-07-02 11:05:20 +00:00
567dd1a3ca [inductor] unificate toolchain code. (#129816)
This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 2, and it is a follow-up PR to https://github.com/pytorch/pytorch/pull/129789

Changes:
1. Unify the cpp builder's toolchain code.
2. Move all build-related code to `cpp_builder.py`.
3. Optimize the import logic of `codecache.py`, `cpp_builder.py` and `cpu_vec_isa.py`, following: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129816
Approved by: https://github.com/jansel
2024-07-02 09:52:06 +00:00
badc638eb6 Add support for inline_asm_elementwise in Inductor lowerings (#129846)
This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it.

<img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846
Approved by: https://github.com/shunting314
2024-07-02 09:31:38 +00:00
ccc4ee7793 check boolean alpha and beta of Fake tensor impl for Tensor.addr (#129839)
Fixes https://github.com/pytorch/pytorch/issues/127043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129839
Approved by: https://github.com/lezcano
2024-07-02 09:20:49 +00:00
5c9d5272e4 fixes #124582 (#128483)
Added a check to make_graphed_callables for the existence of outputs requiring grad.

Added a new test case and updated an existing test case to include parameterless modules.

Fixes #124582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128483
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-02 08:45:59 +00:00
1ad683033b Implemented flexible PP schedule (#129597)
Enabled some cases to work where num_microbatches % pp_size != 0. Using the flex_pp schedule, we will have

num_rounds = max(1, n_microbatches // pp_group_size), and it works as long as n_microbatches % num_rounds == 0. As a few examples (see the sketch below), we support:

pp_group_size = 4, n_microbatches = 10. We will have num_rounds = 2 and n_microbatches % 2 is 0.
pp_group_size = 4, n_microbatches = 3. We will have num_rounds = 1 and n_microbatches % 1 is 0.
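
A minimal sketch of the round computation described above (illustrative only; the function name is hypothetical, not the actual schedule code):

```python
def flex_pp_num_rounds(pp_group_size: int, n_microbatches: int) -> int:
    # num_rounds = max(1, n_microbatches // pp_group_size); the schedule is
    # valid as long as the microbatches split evenly into the rounds.
    num_rounds = max(1, n_microbatches // pp_group_size)
    assert n_microbatches % num_rounds == 0
    return num_rounds

print(flex_pp_num_rounds(4, 10))  # 2
print(flex_pp_num_rounds(4, 3))   # 1
```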

Moved over from PiPPy (https://github.com/pytorch/PiPPy/pull/1129)

Tested using the config in (1), schedule looks like the following graph:

```
=========== ALL_RANK_ACTIONS ===========
         Rank 0  Rank 1  Rank 2  Rank 3
Step 00: F0_s0   None    None    None
Step 01: F1_s0   F0_s1   None    None
Step 02: F2_s0   F1_s1   F0_s2   None
Step 03: F3_s0   F2_s1   F1_s2   F0_s3
Step 04: F4_s0   F3_s1   F2_s2   F1_s3
Step 05: F0_s4   F4_s1   F3_s2   F2_s3
Step 06: F1_s4   F0_s5   F4_s2   F3_s3
Step 07: F2_s4   F1_s5   F0_s6   F4_s3
Step 08: F3_s4   F2_s5   F1_s6   F0_s7
Step 09: F4_s4   F3_s5   None    B0_s7
Step 10: F5_s0   None    F2_s6   F1_s7
Step 11: None    None    B0_s6   B1_s7
Step 12: None    F4_s5   F3_s6   F2_s7
Step 13: None    B0_s5   B1_s6   B2_s7
Step 14: F6_s0   F5_s1   F4_s6   F3_s7
Step 15: B0_s4   B1_s5   B2_s6   B3_s7
Step 16: F7_s0   F6_s1   F5_s2   F4_s7
Step 17: B1_s4   B2_s5   B3_s6   B4_s7
Step 18: F8_s0   F7_s1   F6_s2   F5_s3
Step 19: B2_s4   B3_s5   B4_s6   B0_s3
Step 20: F9_s0   F8_s1   F7_s2   F6_s3
Step 21: B3_s4   B4_s5   B0_s2   B1_s3
Step 22: F5_s4   F9_s1   F8_s2   F7_s3
Step 23: B4_s4   B0_s1   B1_s2   B2_s3
Step 24: F6_s4   F5_s5   F9_s2   F8_s3
Step 25: B0_s0   B1_s1   B2_s2   B3_s3
Step 26: F7_s4   F6_s5   F5_s6   F9_s3
Step 27: B1_s0   B2_s1   B3_s2   B4_s3
Step 28: F8_s4   F7_s5   F6_s6   F5_s7
Step 29: B2_s0   B3_s1   B4_s2   B5_s7
Step 30: F9_s4   F8_s5   F7_s6   F6_s7
Step 31: B3_s0   B4_s1   B5_s6   B6_s7
Step 32: None    F9_s5   F8_s6   F7_s7
Step 33: B4_s0   B5_s5   B6_s6   B7_s7
Step 34: None    None    F9_s6   F8_s7
Step 35: B5_s4   B6_s5   B7_s6   B8_s7
Step 36: None    None    None    F9_s7
Step 37: B6_s4   B7_s5   B8_s6   B9_s7
Step 38: None    None    None    None
Step 39: B7_s4   B8_s5   B9_s6   B5_s3
Step 40: None    None    None    None
Step 41: B8_s4   B9_s5   B5_s2   B6_s3
Step 42: None    None    None    None
Step 43: B9_s4   B5_s1   B6_s2   B7_s3
Step 44: None    None    None    None
Step 45: B5_s0   B6_s1   B7_s2   B8_s3
Step 46: None    None    None    None
Step 47: B6_s0   B7_s1   B8_s2   B9_s3
Step 48: None    None    None
Step 49: B7_s0   B8_s1   B9_s2
Step 50: None    None
Step 51: B8_s0   B9_s1
Step 52: None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129597
Approved by: https://github.com/H-Huang
2024-07-02 07:54:38 +00:00
3e2df3ca9d Add xpu to getAccelerator (#129205)
# Motivation
Add `xpu` support to `getAccelerator`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129205
Approved by: https://github.com/albanD, https://github.com/gujinghui
ghstack dependencies: #129463
2024-07-02 06:48:24 +00:00
6353a12e6a XPUHooksInterface inherits from AcceleratorHooksInterface (#129463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129463
Approved by: https://github.com/gujinghui, https://github.com/albanD
2024-07-02 06:48:24 +00:00
76259ebfdd [inductor] split cpu vec isa to dedicate file (keep git history) (#129789)
This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 1

Changes:
1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`.
<img width="745" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/106533da-ce80-4825-8271-35ffb3141f92">

2. Make `cpu_vec_isa.py` a dedicated file for CPU vec ISAs. This also makes it easier to extend to more architectures and vec ISAs.
3. Update code for the above changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-02 05:29:05 +00:00
f6edd1f7c9 [BE] Make ActivationWrapper an abstract class (#129808)
Fixes #95481

Test Plan:
Unit tested checkpoint_wrapper.py by instantiating ActivationWrapper and got a TypeError as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129808
Approved by: https://github.com/Skylion007
2024-07-02 04:29:43 +00:00
c2d0b7b96d Revert "[ROCm] std::clamp work-around for hip-clang compiler (#127812)"
This reverts commit 8c2c3a03fb87c3568a22362d83b00d82b9fb3db2.

Reverted https://github.com/pytorch/pytorch/pull/127812 on behalf of https://github.com/ezyang due to windows trunk job failing ([comment](https://github.com/pytorch/pytorch/pull/127812#issuecomment-2201653245))
2024-07-02 01:52:31 +00:00
6240cfd5c7 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415
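
A minimal usage sketch (assuming an MPS-capable machine and that `torch.autocast(device_type="mps", ...)` is the entry point enabled by this change; illustrative only):

```python
import torch

x = torch.randn(8, 8, device="mps")
w = torch.randn(8, 8, device="mps")
with torch.autocast(device_type="mps", dtype=torch.float16):
    y = x @ w
print(y.dtype)  # expected: torch.float16 under autocast
```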

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet
2024-07-02 01:49:52 +00:00
600bf978ba [Pipelining] Add to/from CSV format and improved __repr__ (#129264)
_Action.__repr__ gets rearranged so it doesn't require an underscore or
a 's' prefix, but still keeps multi-digit stage and microbatch indices
separated by an alpha character indicating the action type.

to/from CSV methods allow dumping a generated schedule to CSV format for
offline visualization or manual editing in a spreadsheet and reloading
to use at runtime.

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129264
Approved by: https://github.com/H-Huang
2024-07-02 01:26:23 +00:00
83e6ec2ccd [FSDP2+TP] Disable 2D state_dict (#129519)
Fixes #ISSUE_NUMBER

Gonna fill in the RFC but just want to run CI to see if anything else breaks.

Test:
```
python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_raise_not_implemented_state_dict_if_2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129519
Approved by: https://github.com/awgu
2024-07-02 01:26:14 +00:00
cyy
46366888d7 Remove outdated CMake code (#129851)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129851
Approved by: https://github.com/ezyang
2024-07-02 00:40:37 +00:00
7e4329c258 [EZ][BE] Bump min cmake version to 3.18 (#129906)
As this is the minimum CMake version supported by top-level PyTorch

Hides
```
CMake Deprecation Warning at aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt:7 (cmake_minimum_required):
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129906
Approved by: https://github.com/kit1980
2024-07-01 23:06:49 +00:00
9645eaaaec [BE] Improve logging for runner-determinator (#129679)
This lets us be more flexible about what data we output and about throwing exceptions. It's also less likely to break when others make changes (e.g., any print statement would have broken this code before, since the printed output was expected to be only JSON).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129679
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt, https://github.com/Skylion007
2024-07-01 22:31:35 +00:00
eeef68671d [autograd] Do not detach when unpacking tensors that do not require grad (#127959)
In this PR:
- Ensure that if a tensor not requiring grad is saved for backward, unpacking it does not trigger a detach (unless the user installs a saved-tensor pack hook that returns a tensor requiring grad).
- Update non-reentrant checkpoint to also no longer detach in this case.

Alternatives:
- For a custom autograd Function, you could save directly on ctx to work around this, but that would not work once we switch to using custom ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127959
Approved by: https://github.com/YuqingJ
ghstack dependencies: #125795, #128545, #129262
2024-07-01 21:57:36 +00:00
87693b534c [ROCm] Use AOTriton as a dynamic library (#129094)
This PR enables using AOTriton as a shared library dependency instead of a static one.

Resolves linker errors when building PyTorch for many (>7 or so) gfx archs due to the huge size of the AOTriton static library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129094
Approved by: https://github.com/malfet
2024-07-01 21:39:27 +00:00
8c2c3a03fb [ROCm] std::clamp work-around for hip-clang compiler (#127812)
Fixes #127666.

Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library; std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, the std implementation uses a host-side assert, which is not supported. Work around this by replacing std::clamp with min and max for USE_ROCM builds.

Patch comes from @lamikr. Modified to use #ifndef USE_ROCM.

https://github.com/lamikr/rocm_sdk_builder/pull/37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812
Approved by: https://github.com/hongxiayang, https://github.com/malfet
2024-07-01 21:00:33 +00:00
750c701e49 [ROCm] Update xlogy comment detailing issue (#128151)
Update the skip-reason comment with a more accurate description.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128151
Approved by: https://github.com/zou3519
2024-07-01 20:58:58 +00:00
78cda9a810 [symbolic-shapes] Add FloatPow in the symbolic shape guard closure (#129857)
Fixes test failure raised in the next diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129857
Approved by: https://github.com/ezyang
ghstack dependencies: #129830, #129858
2024-07-01 20:44:59 +00:00
53d67165c0 [dynamo] Skip FUNCTION_MATCH guards for descriptors (#129858)
Hard to write tests for this. This PR makes many tests in the stack pass, such as

`PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_ao_sparsity.py::TestComposability::test_convert_without_squash_mask`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129858
Approved by: https://github.com/mlazos
ghstack dependencies: #129830
2024-07-01 20:44:59 +00:00
f86dbae247 Fix typo in lxml requirement (#129695)
Extra period at the end throws off pip:
```
root@f04177cab5af:/data/pytorch# pip install -r .ci/docker/requirements-ci.txt
ERROR: Invalid requirement: 'lxml==5.0.0.': Expected end or semicolon (after version specifier)
    lxml==5.0.0.
        ~~~~~~~^ (from line 309 of .ci/docker/requirements-ci.txt)
```

Not sure why CI docker builds do not have an issue with this period.

Typo comes from f73b1b9388
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129695
Approved by: https://github.com/huydhn
2024-07-01 19:43:37 +00:00
fdd0a7f9b4 Run test_mps_allocator_module serially (#129340)
Not sure why this test started to fail (maybe a runner update) 8a2fed7e6a/1, or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-07-01 18:44:48 +00:00
b02186ffc1 Revert "Allow get attributes on DDP similar to FSDP (#128620)"
This reverts commit 065c386990dce444db17eff7b254bf79e82450ef.

Reverted https://github.com/pytorch/pytorch/pull/128620 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if the trunk error on inductor is fixed ([comment](https://github.com/pytorch/pytorch/pull/128620#issuecomment-2200717876))
2024-07-01 17:57:00 +00:00
bb0f3df562 Fix index issues in torch.fx.interpreter (#129527)
Summary: Fix index issues in torch.fx.interpreter by changing the slice from `[:i]` to `[:i+1]`. If there are `n` elements, the last index `i` of the `for` loop is `n-1`, so `[:i]` only accesses elements from index `0` to `n-2` and misses the last element; `[:i+1]` accesses all elements correctly.
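
A tiny illustration of the off-by-one being fixed here (not the actual interpreter code):

```python
items = ["a", "b", "c"]
i = len(items) - 1          # last loop index for n elements is n - 1
print(items[:i])            # ['a', 'b']       -- misses the last element
print(items[:i + 1])        # ['a', 'b', 'c']  -- includes all elements
```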

Test Plan: Test with Node API

Differential Revision: D59028395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129527
Approved by: https://github.com/dulinriley
2024-07-01 17:46:13 +00:00
1956d87c1f Increase riscv implementation in DepthwiseConvKernel (#127867)
**Summary:**

Add a RISC-V implementation to DepthwiseConvKernel.

**Compile:**

export USE_CUDA=0
export USE_DISTRIBUTED=0
export USE_MKLDNN=0
export MAX_JOBS=4
export CMAKE_CXX_COMPILER=clang++
export CMAKE_C_COMPILER=clang
export CMAKE_C_FLAGS=-march=rv64gcv
export CMAKE_CXX_FLAGS=-march=rv64gcv
python3 setup.py develop --cmake

**Test Plan:**

**Correctness** - Compare the results of running test_convolution.py before and after the change

python3 test/run_test.py --include nn/test_convolution --keep-going

**Before:**
===== 9 passed, 13 skipped, 564 deselected in 46.55s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32

**After:**
===== 9 passed, 13 skipped, 564 deselected in 48.13s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32

**Performance** - Compare mobilenet_v2 results before and after the change

python3 run.py mobilenet_v2  -d cpu -t eval

**Before:**
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 19590.647 milliseconds
CPU Wall Time:       19590.647 milliseconds
Time to first batch:         5271.3518 ms
CPU Peak Memory:                0.3809 GB

**After:**
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 13523.530 milliseconds
CPU Wall Time:       13523.530 milliseconds
Time to first batch:         2696.0304 ms
CPU Peak Memory:                0.3408 GB

**Versions:**
Clang version: 17.0.2
Platform: CanMV-K230
Architecture: riscv64
OS: Ubuntu 23.10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127867
Approved by: https://github.com/malfet
2024-07-01 17:11:21 +00:00
c9dc9887db Revert "Enable UFMT on test/test_public_bindings.py (#128389)"
This reverts commit fe5424d0f8604f6e66d827ae9f94b05cb7119d55.

Reverted https://github.com/pytorch/pytorch/pull/128389 on behalf of https://github.com/clee2000 due to broke test_mps.py::TestMPS::test_mps_allocator_module? https://github.com/pytorch/pytorch/actions/runs/9730750763/job/26854426294 fe5424d0f8 Not sure how this change can do that.  Build failed on PR so test didn't run ([comment](https://github.com/pytorch/pytorch/pull/128389#issuecomment-2200589719))
2024-07-01 16:34:04 +00:00
433b691f98 Revert "[inductor] optimize cpp builder configuration code (#129577)"
This reverts commit 2e3ff394bf94d3b9cbab0fe8a93a9ea7c9cb4267.

Reverted https://github.com/pytorch/pytorch/pull/129577 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see D59181128 ([comment](https://github.com/pytorch/pytorch/pull/129577#issuecomment-2200554824))
2024-07-01 16:14:06 +00:00
19e17216a2 Revert "[inductor] split cpu vec isa to dedicate file (keep git history) (#129789)"
This reverts commit 58f346c874a8a982679b4d4f3876602cc05d66d4.

Reverted https://github.com/pytorch/pytorch/pull/129789 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/129577 ([comment](https://github.com/pytorch/pytorch/pull/129789#issuecomment-2200545144))
2024-07-01 16:08:44 +00:00
b6dc37bb4e Revert "[inductor] unificate toolchain code. (#129816)"
This reverts commit 67c9ec2b6d12ffd0e83861dcc16c1cd1a9b74d35.

Reverted https://github.com/pytorch/pytorch/pull/129816 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #129577 ([comment](https://github.com/pytorch/pytorch/pull/129816#issuecomment-2200539687))
2024-07-01 16:06:22 +00:00
cyy
ca5d13c672 [1/N] Enable unused variable warnings on torch_cpu and fix some violations (#128670)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128670
Approved by: https://github.com/ezyang
2024-07-01 14:56:46 +00:00
e385bf8ef8 Revert "[halide-backend] Disable split reductions for Halide (#129320)"
This reverts commit a18eb651d352e45860a96869abaf9fb7b215eac6.

Reverted https://github.com/pytorch/pytorch/pull/129320 on behalf of https://github.com/jeanschmidt due to This PR is breaking internal builds, please check comments on it D59204360 ([comment](https://github.com/pytorch/pytorch/pull/129320#issuecomment-2200351678))
2024-07-01 14:44:35 +00:00
a83eaf1c3a Revert "[halide-backend] Support manual schedules (#129321)"
This reverts commit 9ae78a578caff195821ad535a9e8d8ef59552142.

Reverted https://github.com/pytorch/pytorch/pull/129321 on behalf of https://github.com/jeanschmidt due to Reverting, as it is required to do so in order to revert #129320 ([comment](https://github.com/pytorch/pytorch/pull/129321#issuecomment-2200345664))
2024-07-01 14:42:33 +00:00
cc9b005bf2 Enable torchao nightly workflow (#129779)
Summary:
Make the following improvements:
* Schedule the torchao benchmark nightly
* Enable torchbench, timm, and huggingface models
* Refactor the benchmarking script to better arrange the benchmarking groups

Test workflow: https://github.com/pytorch/benchmark/actions/runs/9705589352

X-link: https://github.com/pytorch/benchmark/pull/2336

Differential Revision: D59074571

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129779
Approved by: https://github.com/jerryzh168
2024-07-01 14:28:38 +00:00
75f64e1203 Fix test test_type_hints.py::TestTypeHints::test_doc_examples (#129829)
As per the title, this test was broken for months.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129829
Approved by: https://github.com/ezyang
2024-07-01 13:28:37 +00:00
e1b426b345 [ROCm] CUDA_VISIBLE_DEVICES fallback option for device_count (#129650)
Updating `_parse_visible_devices` to allow use of CUDA_VISIBLE_DEVICES if HIP_VISIBLE_DEVICES is unset, to avoid any unnecessary code changes in workloads that already rely on CUDA_VISIBLE_DEVICES.
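
A conceptual sketch of the described fallback (the function name below is illustrative; this is not the actual `_parse_visible_devices` implementation):

```python
import os
from typing import Optional


def _visible_devices_fallback() -> Optional[str]:
    # On ROCm, prefer HIP_VISIBLE_DEVICES, and fall back to
    # CUDA_VISIBLE_DEVICES when HIP_VISIBLE_DEVICES is unset.
    hip = os.environ.get("HIP_VISIBLE_DEVICES")
    if hip is not None:
        return hip
    return os.environ.get("CUDA_VISIBLE_DEVICES")
```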

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129650
Approved by: https://github.com/hongxiayang, https://github.com/malfet
2024-07-01 11:40:09 +00:00
cyy
313eec02cc Add hash function of std::string_view to torch/csrc/lazy/core/hash.h (#128800)
To make it easier to move c10::string_view to std::string_view in PyTorch/XLA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128800
Approved by: https://github.com/ezyang
2024-07-01 09:53:34 +00:00
f6a0be5023 Add warpSize to Device properties (#128449)
Adding warp_size to CudaDeviceProperties.

>>> import torch
>>> prop = torch.cuda.get_device_properties(torch.cuda.current_device())
>>> prop.warp_size
64
>>>

@jeffdaily @pruthvistony @jithunnair-amd @ROCmSupport

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128449
Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/jithunnair-amd, https://github.com/malfet
2024-07-01 09:13:32 +00:00
04a0d85620 [BE] Print all pip packages installed on the system after TorchChat (#129809)
To make it easier to debug regressions like the one that happened last Wednesday, when a new version of torchao was released and caused TorchBench to downgrade the PyTorch version to 2.3.1.

Test plan: Look at the log output for example https://github.com/pytorch/pytorch/actions/runs/9720408234/job/26832794157?pr=129809#step:20:1158 contains
```
+ echo 'Print all dependencies after TorchBench is installed'
Print all dependencies after TorchBench is installed
+ python -mpip freeze
absl-py==2.1.0
accelerate==0.31.0
aiohttp==3.9.5
aiosignal==1.3.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.2.0
audioread==3.0.1
beautifulsoup4==4.12.3
boto3==1.19.12
botocore==1.22.12
bs4==0.0.2
cachetools==5.3.3
certifi==2024.6.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129809
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-07-01 04:51:53 +00:00
cyy
eb1583dbc1 [2/N] Fix clang-tidy warnings in torch/csrc/jit/serialization (#129300)
Follows #129055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129300
Approved by: https://github.com/ezyang
2024-07-01 01:09:00 +00:00
e62073d799 [dynamo] Skip FUNCTION_MATCH on method-wrapper objects (#129830)
Fixes https://github.com/pytorch/pytorch/issues/118563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129830
Approved by: https://github.com/jansel
2024-06-30 20:21:18 +00:00
eqy
24b6c5a41f [cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587)
Fix for #129579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129587
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-06-30 19:37:44 +00:00
eqy
f845a7a91a [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-30 19:22:16 +00:00
eqy
7b0e9a27ba Restore allowed_info in OOM message when applicable (#129546)
Seems to be removed following #99699?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129546
Approved by: https://github.com/Skylion007
2024-06-30 17:22:32 +00:00
8755e035d2 [CUDA][Pooling] Fix 64-bit indexing in avg_pool_2d backward attempt 2 (#129818)
Somehow the original PR was missing the `CUDA_KERNEL_LOOP_TYPE` change???

Thanks @johnc-keen @Chillee for the great repro! (#129785)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129818
Approved by: https://github.com/Chillee, https://github.com/Skylion007
2024-06-30 16:52:33 +00:00
eqy
4dd3cff234 [CUDA] Fix more DeviceIndex printing (#128540)
The same `char` dtype issue causes device index `0` to be interpreted as a null terminator; see also #123984.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128540
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2024-06-30 16:44:14 +00:00
eqy
68484621fe [cuDNN][functorch] Bump tolerances for nn.functional.conv2d in test_vmap_autograd_grad (#129796)
Newer versions of cuDNN can dispatch to a Winograd kernel here on A100, which affects numerics a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129796
Approved by: https://github.com/Skylion007
2024-06-30 16:36:12 +00:00
fff633f087 [CI] Enable AOT inductor FP32 accuracy test for CPU (#129040)
This PR enables the AOT Inductor backend FP32 accuracy check for CPU in the CI workflow, which can catch AOT Inductor issues at an early stage.

**Test Time cost:**
| Suite       | Precision | Time cost |
|-------------|-----------|-----------|
| Huggingface | FP32      | 1h12m     |
| Timm models | FP32      | 1h32m     |
| Torchbench  | FP32      | 1h40m     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129040
Approved by: https://github.com/chuanqi129, https://github.com/desertfire, https://github.com/malfet
2024-06-30 14:00:09 +00:00
8a5fda0377 added type hints for __contains__ (#129653)
- Fixes #129646
- Added test in test/typing/reveal/tensor_constructors.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129653
Approved by: https://github.com/ezyang
2024-06-30 11:49:11 +00:00
1a689ea38c [Inductor][CPP] Enable Quantized Linear GEMM Template with INT8 output and Unary Post Op (#129048)
**Summary**
Based on the previous PR, add the config to support int8 output and unary post-op fusion with `ReLU` and `GeLU`.

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16/uint8
- Post Op Fusion: with unary post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [ ] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129048
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825
2024-06-30 09:53:55 +00:00
35a197defa [Inductor][CPP] Enable Quantized Linear GEMM Template with FP32 output (#128825)
**Summary**
Support the int8 GEMM template with the reference MicroInt8GEMM kernel for the following case:

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16
- Post Op Fusion: without unary post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [ ] Unary post op fusion
- [ ] Int8 output
- [ ] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128825
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-30 09:45:43 +00:00
fe5424d0f8 Enable UFMT on test/test_public_bindings.py (#128389)
Part of: https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
> test/test_public_bindings.py

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128389
Approved by: https://github.com/ezyang
2024-06-30 08:49:51 +00:00
4ee1cb9b95 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
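An illustrative before/after of the import change (hypothetical snippet, not taken from the PR):

```python
# before
import pathlib

path = pathlib.Path("torch") / "version.py"

# after
from pathlib import Path

path = Path("torch") / "version.py"
```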
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-30 01:36:07 +00:00
2effbcfcd8 Revert "[BE][Easy] replace import pathlib with from pathlib import Path (#129426)"
This reverts commit 6d75604ef135925e8c85363c2f4a5e0b6f7fef28.

Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to recognize `Path` as new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))
2024-06-29 23:24:06 +00:00
67c9ec2b6d [inductor] unificate toolchain code. (#129816)
This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 2, and is a follow-up PR to https://github.com/pytorch/pytorch/pull/129789

Changes:
1. Unify the cpp builder's toolchain code.
2. Move all build-related code to `cpp_builder.py`.
3. Optimize the import logic of `codecache.py`, `cpp_builder.py`, and `cpu_vec_isa.py`, following: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129816
Approved by: https://github.com/jansel
2024-06-29 23:21:13 +00:00
3fec0efd34 [Inductor][CPP] Support vectorization of bitwise fn (#129733)
**Summary**
When checking the vectorization status across the 3 test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: bitwise_and`. In this PR, we add vectorization support for 6 bitwise functions.

In this PR, we also remove `bitwise_xor` from the `ops_to_bool` list, which sets the output data type to bool during data type propagation. This seems wrong since, according to https://pytorch.org/docs/stable/generated/torch.bitwise_xor.html, it should return the same integral data type as the input; the test case `test_bitwise3` failed due to this issue.
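
For reference, a quick check (per the linked docs; illustrative snippet, not part of the PR) that `bitwise_xor` on integral inputs returns an integral dtype rather than bool:

```python
import torch

a = torch.tensor([1, 2, 3], dtype=torch.int32)
b = torch.tensor([3, 2, 1], dtype=torch.int32)
print(torch.bitwise_xor(a, b))        # tensor([2, 0, 2], dtype=torch.int32)
print(torch.bitwise_xor(a, b).dtype)  # torch.int32, not torch.bool
```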

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_bitwise
python -u -m pytest -s -v test/inductor/test_torchinductor.py -k test_bitwise3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129733
Approved by: https://github.com/jgong5, https://github.com/Skylion007
2024-06-29 17:25:27 +00:00
6d75604ef1 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-29 15:42:09 +00:00
7837a12474 [BE] enforce style for empty lines in import segments (#129751)
This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet:

> Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one)

`usort` allows empty lines within import segments. For example, `usort` does not change the following code:

```python
import torch.aaa
import torch.bbb
import torch.ccc

x = ...  # some code
```

```python
import torch.aaa

import torch.bbb
import torch.ccc

x = ...  # some code
```

```python
import torch.aaa

import torch.bbb

import torch.ccc

x = ...  # some code
```

This PR first sorts imports via `isort`, then re-sorts the file using `ufmt` (`usort` + `black`). This enforces the following import style:

1. no empty lines within segments.
2. single empty line between segments.
3. two spaces after import statements.

All the code snippets above will be formatted to:

```python
import torch.aaa
import torch.bbb
import torch.ccc

x = ...  # some code
```

which produces a consistent code style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751
Approved by: https://github.com/malfet
2024-06-29 14:15:24 +00:00
9ae78a578c [halide-backend] Support manual schedules (#129321)
Currently using this for some by-hand hacking, but might need to implement our own scheduler later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321
Approved by: https://github.com/shunting314
ghstack dependencies: #126417, #129025, #129026, #127506, #129036, #129320
2024-06-29 14:06:28 +00:00
a18eb651d3 [halide-backend] Disable split reductions for Halide (#129320)
In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506, #129036
2024-06-29 14:06:28 +00:00
4cb8cb04a7 [halide-backend] Enable bfloat16 support (#129036)
Requires https://github.com/halide/Halide/pull/8255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129036
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506
2024-06-29 14:06:25 +00:00
b93bf55b6a [halide-backend] Add GPU support (#127506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127506
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026
2024-06-29 14:06:21 +00:00
86cadc6385 [halide-backend] Dimension-based indexing (#129026)
Prior to this the generated Halide code was a rather literal translation of the Triton code, with XBLOCK/YBLOCK/RBLOCK and 1D inputs.  Halide prefers dimensions, and this 1D index triggers a lot of bugs and perf issues.  This PR infers dimensions and changes the indexing in the generated code.

Before
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        xindex = hl.Var('xindex')
        rindex = hl.Var('rindex')
        r1 = rindex
        x0 = xindex
        idom = hl.RDom([hl.Range(0, 16), hl.Range(0, 32)])
        odom = hl.RDom([hl.Range(0, 16)])
        rdom = hl.RDom([hl.Range(0, 32)])
        xindex_idom = idom.x
        xindex_odom = odom.x
        rindex_idom = idom.y
        r1_idom = rindex_idom
        x0_idom = xindex_idom
        x0_odom = xindex_odom
        tmp0 = hl.Func('tmp0')
        tmp0[rindex, xindex] = in_ptr0[r1 + (32*x0)]
        tmp1 = hl.Func('tmp1')
        tmp1[xindex] = hl.maximum(rdom, tmp0[rdom, xindex])
        tmp2 = hl.Func('tmp2')
        tmp2[rindex, xindex] = tmp0[rindex, xindex] - tmp1[xindex]
        tmp3 = hl.Func('tmp3')
        tmp3[rindex, xindex] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[rindex, xindex])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[rindex, xindex])
        tmp4 = hl.Func('tmp4')
        tmp4[xindex] = hl.sum(rdom, tmp3[rdom, xindex])
        tmp5 = hl.Func('tmp5')
        tmp5[rindex, xindex] = tmp3[rindex, xindex] / tmp4[xindex]
        out_ptr3_i0 = hl.Var('out_ptr3_i0')
        out_ptr3_i1 = hl.Var('out_ptr3_i1')
        out_ptr3[out_ptr3_i0, out_ptr3_i1] = hl.cast(out_ptr3.type(), tmp5[out_ptr3_i0, out_ptr3_i1])

        assert g.using_autoscheduler()
        in_ptr0.set_estimates([hl.Range(0, 512)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

After
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 2)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        h0 = hl.Var('h0')
        h1 = hl.Var('h1')
        rdom = hl.RDom([hl.Range(0, 32)])
        hr1 = rdom[0]
        tmp0 = hl.Func('tmp0')
        tmp0[h0, h1] = in_ptr0[h0, h1,]
        tmp1 = hl.Func('tmp1')
        tmp1[h1] = hl.maximum(rdom, tmp0[hr1, h1])
        tmp2 = hl.Func('tmp2')
        tmp2[h0, h1] = tmp0[h0, h1] - tmp1[h1]
        tmp3 = hl.Func('tmp3')
        tmp3[h0, h1] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[h0, h1])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[h0, h1])
        tmp4 = hl.Func('tmp4')
        tmp4[h1] = hl.sum(rdom, tmp3[hr1, h1])
        tmp5 = hl.Func('tmp5')
        tmp5[h0, h1] = tmp3[h0, h1] / tmp4[h1]
        out_ptr3[h0, h1,] = hl.cast(hl.Float(32), tmp5[h0, h1])

        assert g.using_autoscheduler()
        in_ptr0.dim(0).set_min(0)
        in_ptr0.dim(0).set_stride(1)
        in_ptr0.dim(0).set_extent(32)
        in_ptr0.dim(1).set_min(0)
        in_ptr0.dim(1).set_stride(32)
        in_ptr0.dim(1).set_extent(16)
        in_ptr0.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129026
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025
2024-06-29 14:06:16 +00:00
da5f37515e [halide-backend] Generate standalone runtime (#129025)
This puts the halide runtime in a global shared object, rather than copying it to each kernel.  Having many copies of the runtime causes many issues with cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417
2024-06-29 14:06:12 +00:00
e34b7e6af3 [halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-06-29 14:06:08 +00:00
13d4be1dc7 [pipelining] Support W action for schedules (#129233)
Add support for the `W` action in `_step_microbatches`.

## TODO:
- Clean up the tests; there's a lot of copy-pasted code there

Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129233
Approved by: https://github.com/wconstab
ghstack dependencies: #128983, #128976
2024-06-29 11:51:40 +00:00
a6da01bd01 [pipelining] Support arbitrary stage ordering on ranks (#128976)
Fixes based on discussion in https://github.com/pytorch/pytorch/issues/128665

Our previous assumption was that for looped schedules `stage_ids = range(rank, total_stages, num_local_stages)`. This is not true for all schedules. This change relaxes that assumption and allows arbitrary ordering of stages. For example, in the added test we use rank 0: [stage0, stage3] and rank 1: [stage1, stage2]. The test also adds a schedule registry (for testing) which runs 1 microbatch through this schedule:

```
F0_0 None None F0_3 B0_3 None None B0_0
None F0_1 F0_2 None None B0_2 B0_1 None
```

Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128976
Approved by: https://github.com/wconstab
ghstack dependencies: #128983
2024-06-29 11:51:39 +00:00
18ae3bab2f [Pipelining] Support separate dw_runner for PipelineStage (#128983)
Fixes #128974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128983
Approved by: https://github.com/H-Huang
2024-06-29 11:51:34 +00:00
b0e5c9514d use shutil.which in check_compiler_ok_for_platform (#129069)
Same as https://github.com/pytorch/pytorch/pull/126060.
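
For reference, a minimal sketch of what `shutil.which` provides (illustrative; not the patched function):

```python
import shutil

# shutil.which returns the absolute path of an executable found on PATH,
# or None if it is not available.
print(shutil.which("c++"))
print(shutil.which("definitely-not-a-compiler"))  # None
```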
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129069
Approved by: https://github.com/ezyang
2024-06-29 11:38:51 +00:00
56935684c3 Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)
------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449
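
An illustrative before/after for a `.pyi` stub signature (hypothetical function, not an actual PyTorch stub):

```python
# before (typing generics and Union/Optional)
from typing import Dict, List, Optional, Union

def f(xs: List[int], m: Dict[str, int], y: Optional[Union[int, str]]) -> None: ...

# after (PEP 585 generic aliases + PEP 604 union syntax)
def f(xs: list[int], m: dict[str, int], y: int | str | None) -> None: ...
```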

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419
Approved by: https://github.com/ezyang
ghstack dependencies: #129375, #129376
2024-06-29 09:23:39 +00:00
9120992c72 [BE][Easy] enable postponed annotations in torchgen (#129376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376
Approved by: https://github.com/ezyang
ghstack dependencies: #129375
2024-06-29 09:23:39 +00:00
8a67daf283 [BE][Easy] enable postponed annotations in tools (#129375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-29 09:23:35 +00:00
58f346c874 [inductor] split cpu vec isa to dedicate file (keep git history) (#129789)
This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 1

Changes:
1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`.
<img width="745" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/106533da-ce80-4825-8271-35ffb3141f92">

2. Make `cpu_vec_isa.py` a dedicated file for CPU vec ISAs. This also makes it easier to extend to more architectures and vec ISAs.
3. Update code for the above changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-29 07:19:54 +00:00
a676b7c5f3 Add XGLMForCausalLM to the flaky model list (#129776)
Not failing on devGPU. Went to CI machine ... flaky. So adding to the flaky list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129776
Approved by: https://github.com/mlazos
ghstack dependencies: #129583, #129610, #129775
2024-06-29 05:47:28 +00:00
5d1763d159 Add lcnet to the inline_inbuilt_nn_module list (#129775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129775
Approved by: https://github.com/mlazos
ghstack dependencies: #129583, #129610
2024-06-29 05:47:28 +00:00
89696db4b0 Revert "[LLVM/TensorExpr] Update for an API change in LLVM 18." (#129797)
This reverts commit 20f394f10a389bcf13485929be8862f98ad4b322 (https://github.com/pytorch/pytorch/pull/117086)

LLVM upstream changed the pass builder API again, so registerPassBuilderCallbacks no longer takes an extra boolean for PopulateClassToPassNames. Update accordingly.

Relevant LLVM upstream change:
https://github.com/llvm/llvm-project/pull/96321
https://github.com/llvm/llvm-project/pull/96462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129797
Approved by: https://github.com/dcci
2024-06-29 05:17:20 +00:00
3ef44df667 [ts-migration] support prim::SetAttr and fix prim::GetAttr (#129440)
- Lifting tensor constant attributes to buffers: TorchScript does not automatically lift tensor constant attributes to buffers, so the previous converter could not access them. This PR fixes the issue.
- Add SetAttr support for tensor attributes via copy_.
- Add SetAttr support for non-tensor attributes. In particular, we maintain the current value of non-tensor attributes in `name_to_non_tensor_attribute_node`, similar to an interpreter pass over non-tensor attributes. This lets us support the following use case:
```python
 def forward(self, x):
      c1 = self.count
      self.count += 1
      c2 = self.count
      return x + c1 + c2
```
- Fixed a bug in GetAttr to support the following use case:
```python
def forward(self, inp):
  x = self.buffer
  self.buffer += 1
  y = self.buffer
  return x + y + inp
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129440
Approved by: https://github.com/angelayi
2024-06-29 05:08:13 +00:00
ec47d4d9a8 [Inductor] FlexAttention supports block sparse mask (#129216)
Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286
Initial perf number:
* fwd speedup: 0.44 -> 0.72
* bwd speedup: 0.38 -> 0.71

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216
Approved by: https://github.com/Chillee
2024-06-29 04:44:38 +00:00
7b5a8424a1 [GPT-fast] Update micro benchmark numbers as A100-50G (#129799)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129799
Approved by: https://github.com/Chillee
2024-06-29 04:36:07 +00:00
065c386990 Allow get attributes on DDP similar to FSDP (#128620)
FSDP implements the following logic, but it is missing from DDP.
This PR adds an equivalent function for DDP.

```python
    def __getattr__(self, name: str) -> Any:
        """Forward missing attributes to the wrapped module."""
        try:
            return super().__getattr__(name)  # defer to nn.Module's logic
        except AttributeError:
            return getattr(self._fsdp_wrapped_module, name)
```
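
For reference, a minimal runnable sketch of the same forwarding pattern applied to a wrapper that stores its inner module as `self.module` (as DDP does); this is illustrative and not necessarily the exact code added in this PR:

```python
from typing import Any

import torch.nn as nn


class SimpleWrapper(nn.Module):
    """Toy stand-in for DDP's attribute forwarding (illustrative only)."""

    def __init__(self, module: nn.Module) -> None:
        super().__init__()
        self.module = module  # DDP stores the wrapped module as `self.module`

    def __getattr__(self, name: str) -> Any:
        """Forward missing attributes to the wrapped module."""
        try:
            return super().__getattr__(name)  # defer to nn.Module's logic
        except AttributeError:
            return getattr(self.module, name)


wrapped = SimpleWrapper(nn.Linear(4, 4))
print(wrapped.out_features)  # forwarded from the wrapped nn.Linear -> 4
```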

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128620
Approved by: https://github.com/awgu
2024-06-29 01:57:22 +00:00
2bc6f329b2 Make PyTorch argparser understand complex (#129580)
It understands float and int, so why not `complex`.

Test plan: `python -c "import torch;print(torch.rand(3, dtype=complex))"`

Fixes https://github.com/pytorch/pytorch/issues/126837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129580
Approved by: https://github.com/albanD
2024-06-29 01:21:12 +00:00
dfd55d1714 Revert "[cond] inlining into one of the branches when pred is a python constant (#128709)"
This reverts commit 23adf166e166bd56e3446284939af7e46a181079.

Reverted https://github.com/pytorch/pytorch/pull/128709 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking one ExecuTorch test ([comment](https://github.com/pytorch/pytorch/pull/128709#issuecomment-2197806850))
2024-06-29 01:03:55 +00:00
3d96217891 Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 9e1f3ecaa710785a1ab03c6ad5093a5566d6c5e5.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is still failing with the same error ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2197801405))
2024-06-29 00:47:15 +00:00
c0782e7c81 Kineto profiler: collecting observer traces from C++ child threads (#128743)
Summary:
In a C++ program, if we have child threads doing GPU work, it would be nice to get traces of those threads as well. The problem is that pushProfilingCallbacks() is not called on child threads, so no observer traces are collected on those threads and they are entirely missing from the final output.

This diff provides a new API that a child thread may elect to call to register itself with the profiler that was started in the main thread (or whichever Python thread manages the profiler).

Test Plan:
```
buck2 test @mode/opt //caffe2/test:profiler_test_cpp_thread
```

Reviewed By: aaronenyeshi

Differential Revision: D56669942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128743
Approved by: https://github.com/aaronenyeshi
2024-06-29 00:44:30 +00:00
a32ce5ce34 Revert "[BE][Easy] enable postponed annotations in tools (#129375)"
This reverts commit 59eb2897f1745f513edb6c63065ffad481c4c8d0.

Reverted https://github.com/pytorch/pytorch/pull/129375 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:25 +00:00
6063bb9d45 Revert "[BE][Easy] enable postponed annotations in torchgen (#129376)"
This reverts commit 494057d6d4e9b40daf81a6a4d7a8c839b7424b14.

Reverted https://github.com/pytorch/pytorch/pull/129376 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:25 +00:00
83caf4960f Revert "Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)"
This reverts commit e40f50cb87bcd176a380b729af5dda13dbe9c399.

Reverted https://github.com/pytorch/pytorch/pull/129419 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:24 +00:00
00d7bba2fa Revert "[BE] enforce style for empty lines in import segments (#129751)"
This reverts commit f5ff1a3ab9ef279655308266029faf6543a8a1ca.

Reverted https://github.com/pytorch/pytorch/pull/129751 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129751#issuecomment-2197799814))
2024-06-29 00:41:41 +00:00
fa6c0fe3e4 Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749)"
This reverts commit 9450e198aa0bdf6f81ccb8ad2f74c06e81d1af6e.

Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2197790226))
2024-06-29 00:16:47 +00:00
24f69eef6a [FSDP2] Ran reduce-scatter copy-in in default stream (#129721)
This PR runs the reduce-scatter copy-in in the default stream, allowing the reduce-scatter input (large allocation proportional to unsharded gradients) to be allocated in the default stream to avoid fragmenting that memory across stream memory pools.
- In general, minimizing memory usage spikes in non-default-stream memory pools helps because otherwise, that memory cannot be reused by the default stream outside of that spike. This reduce-scatter input allocation represents one such spike. The reduce-scatter outputs are still allocated in the separate `reduce_scatter` stream since they are small and have a non-spiky allocation/free pattern (we iteratively allocate them through backward and free them altogether after optimizer).
- This PR should not have any impact on overlap (I sanity checked Llama3-8B traces from torchtitan; plus we have the `test_fully_shard_overlap.py` unit tests).

**Experiment**
**(Before)** Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1:
```
[rank0]:2024-06-27 16:38:56,620 - root - INFO - step:  1  loss: 12.2764  memory: 71.99GiB(75.75%)  wps: 1,436  mfu: 8.41%
[rank0]:2024-06-27 16:38:56,620 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-06-27 16:38:57,943 - root - INFO - step:  2  loss: 12.1001  memory: 79.82GiB(83.98%)  wps: 6,195  mfu: 36.28%
[rank0]:2024-06-27 16:38:59,266 - root - INFO - step:  3  loss: 11.7697  memory: 79.82GiB(83.98%)  wps: 6,193  mfu: 36.27%
[rank0]:2024-06-27 16:39:00,587 - root - INFO - step:  4  loss: 11.2807  memory: 79.82GiB(83.98%)  wps: 6,203  mfu: 36.32%
[rank0]:2024-06-27 16:39:01,910 - root - INFO - step:  5  loss: 10.9494  memory: 79.82GiB(83.98%)  wps: 6,198  mfu: 36.30%
```

**(After)** Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1:
```
[rank0]:2024-06-27 16:41:12,106 - root - INFO - step:  1  loss: 12.2560  memory: 69.46GiB(73.08%)  wps: 1,158  mfu: 6.78%
[rank0]:2024-06-27 16:41:12,106 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-06-27 16:41:13,502 - root - INFO - step:  2  loss: 12.0949  memory: 77.29GiB(81.32%)  wps: 5,870  mfu: 34.37%
[rank0]:2024-06-27 16:41:14,839 - root - INFO - step:  3  loss: 11.7770  memory: 77.29GiB(81.32%)  wps: 6,130  mfu: 35.90%
[rank0]:2024-06-27 16:41:16,154 - root - INFO - step:  4  loss: 11.3188  memory: 77.29GiB(81.32%)  wps: 6,230  mfu: 36.48%
[rank0]:2024-06-27 16:41:17,474 - root - INFO - step:  5  loss: 10.9443  memory: 77.29GiB(81.32%)  wps: 6,211  mfu: 36.37%
```
**2.53 GiB reduction in peak reserved memory.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129721
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
2024-06-28 23:55:12 +00:00
f06e3a1569 [Split Build] Make script not crash if split build is not set (#129774)
Fixes issue causing https://github.com/pytorch/pytorch/actions/runs/9704484834/job/26801889463 to crash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129774
Approved by: https://github.com/atalman
2024-06-28 23:50:18 +00:00
7bda23ef84 [BE]: Update ruff to 0.5.0 (#129744)
Update ruff to 0.5.0 so we can enable some of the new checks I've been wanting to add to the codebase. First, just update the code to comply with some rule changes and a couple of minor API changes / deprecations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129744
Approved by: https://github.com/ezyang
2024-06-28 21:49:56 +00:00
0a337613f8 Fix typo in stack_module_state doc (#129126)
I think there is a typo in the first example of the `torch.func.stack_module_state` documentation. The first parameter in the function call in the `wrapper` return is missing an 's'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129126
Approved by: https://github.com/zou3519
2024-06-28 21:36:40 +00:00
f5ff1a3ab9 [BE] enforce style for empty lines in import segments (#129751)
This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet:

> Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one)

`usort` allows empty lines within import segments. For example, `usort` does not change the following code:

```python
import torch.aaa
import torch.bbb
import torch.ccc

x = ...  # some code
```

```python
import torch.aaa

import torch.bbb
import torch.ccc

x = ...  # some code
```

```python
import torch.aaa

import torch.bbb

import torch.ccc

x = ...  # some code
```

This PR first sorts imports via `isort`, then re-sorts the file using `ufmt` (`usort` + `black`). This enforces the following import style:

1. no empty lines within segments.
2. single empty line between segments.
3. two spaces after import statements.

All the code snippets above will be formatted to:

```python
import torch.aaa
import torch.bbb
import torch.ccc

x = ...  # some code
```

which produces a consistent code style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751
Approved by: https://github.com/malfet
2024-06-28 21:02:59 +00:00
5b96a552df Add a check and error message for no support on MPS for conv with output_channels > 2^16 (#129484)
Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on MPS device with an unsupported value.

The fix for the missing support will come later, as it requires work on the kernel side and will take some more time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484
Approved by: https://github.com/kulinseth
2024-06-28 20:57:40 +00:00
bc8883a7c4 fix the error msg in device_mesh (#129747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129747
Approved by: https://github.com/awgu, https://github.com/wconstab
2024-06-28 20:12:09 +00:00
45f3e20527 Improve error message for weights_only load (#129705)
As @vmoens pointed out, the current error message does not make the "either/or" between setting `weights_only=False` and using `add_safe_globals` clear enough, and it should print the code the user needs to call `add_safe_globals`.

New formatting looks like such

In the case that `add_safe_globals` can be used

```python
>>> import torch
>>> from torch.testing._internal.two_tensor import TwoTensor
>>> torch.save(TwoTensor(torch.randn(2), torch.randn(2)), "two_tensor.pt")
>>> torch.load("two_tensor.pt", weights_only=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TwoTensor])` to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
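
For reference, option (2) above boils down to the following (a minimal sketch that reuses the `TwoTensor` test class and the `two_tensor.pt` file from the snippet above):

```python
import torch
from torch.testing._internal.two_tensor import TwoTensor

# Allowlist the class named in the error, then retry the weights_only load.
torch.serialization.add_safe_globals([TwoTensor])
obj = torch.load("two_tensor.pt", weights_only=True)  # now succeeds
```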

For other issues (unsupported bytecode)
```python
>>> import torch
>>> t = torch.randn(2, 3)
>>> torch.save(t, "protocol_5.pt", pickle_protocol=5)
>>> torch.load("protocol_5.pt", weights_only=True)
/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py:359: UserWarning: Detected pickle protocol 5 in the checkpoint, which was not the default pickle protocol used by `torch.load` (2). The weights_only Unpickler might not support all instructions implemented by this protocol, please file an issue for adding support if you encounter this.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
 Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Unsupported operand 149

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

Old formatting would have been like:
```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/users/mg1998/pytorch/torch/serialization.py", line 1203, in load
    raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you get the file from a trusted source. Alternatively, to load with `weights_only` please check the recommended steps in the following error message. WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals` to allowlist this global if you trust this class/function.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129705
Approved by: https://github.com/albanD, https://github.com/vmoens
ghstack dependencies: #129239, #129396, #129509
2024-06-28 19:36:31 +00:00
99456a612b [AOTI] Properly indent launchKernel calls in AOTInductor (#129616)
Summary:
There is a small cosmetic issue in the C++ wrapper file generated by AOTInductor: the launchKernel() call isn't properly indented.

Added indentation for the launchKernel() code block when there is an "if" condition, i.e., when `grid_uses_symbolic_shapes` is `True`.

Test Plan:
Test cmd ran (in pytorch oss):

`TORCH_LOGS="output_code" TORCH_COMPILE_DEBUG=1 python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols_abi_compatible_cuda`

And then manually verified the output code generated in a path like
`/tmp/torchinductor_guorachel/coraisesuchpl3qabrazn7ydydszcit6lwpn7ckd3b4wej4rep5l/cba5g5ajeh5sym3tp5iqn7kkokimj7qqd4krs2rruhupbfqgppge.cpp`

Similarly, also verified for test case:`test_zero_grid_with_unbacked_symbols_abi_compatible_cuda`

Differential Revision: D58897157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129616
Approved by: https://github.com/ColinPeppler
2024-06-28 19:16:18 +00:00
6120aa3718 [nn-module] Use standard dict for _parameters, _modules and _buffers (#129164)
TorchDynamo's guard mechanism guards on the key order of a dictionary if the user iterates over it. For a standard dict, we can write a fast C++ implementation using PyDict_Next. But with OrderedDict, we have to rely on the `keys` Python API to get the key ordering. This makes guard evaluation slow.

With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason to not use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model.

Functionality impact
- The only difference between dict and OrderedDict is the `move_to_end` method of OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks, but this PR keeps the OrderedDict for hooks untouched (we should still follow up on hooks, but in a separate PR).

Perf impact
- I don't anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is a Python wrapper over dict with only a few methods overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)).

Typing impact
- I don't anticipate any. For all the user-visible methods of nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes are required.
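
To illustrate the ordering point above with a small standalone snippet (not code from this PR): plain dicts preserve insertion order, and `move_to_end` is the only OrderedDict-specific method in question.

```python
from collections import OrderedDict

# Plain dicts preserve insertion order (guaranteed since Python 3.7), which is
# the property nn.Module relies on for _parameters/_modules/_buffers.
d = {"weight": 1, "bias": 2}
assert list(d) == ["weight", "bias"]

# move_to_end is OrderedDict-only, but it is not used for _parameters,
# _modules, or _buffers (only for hooks, which keep OrderedDict for now).
od = OrderedDict(d)
od.move_to_end("weight")
assert list(od) == ["bias", "weight"]
```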
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #129163
2024-06-28 18:30:13 +00:00
db4c7bb7fc Refine typing annotation for compile (#129136)
before
![image](https://github.com/pytorch/pytorch/assets/46243324/91372d0f-ad0e-4abe-9582-7fe892f99ec8)

after
![image](https://github.com/pytorch/pytorch/assets/46243324/175066ff-78f9-44a1-a3bb-5df809f7e86d)

Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129136
Approved by: https://github.com/ezyang
2024-06-28 17:57:44 +00:00
FEI
59e4e92556 sdp::SDPBackend::flash_attention support PrivateUse1 (#126392)
Fixes https://github.com/pytorch/pytorch/issues/124271

cc  @cpuhrsch @drisspg @albanD @soulitzer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126392
Approved by: https://github.com/drisspg
2024-06-28 17:48:40 +00:00
26d633b721 [BE] Correctly catch skip signals emitting from sys.exit in Sandcastle (#129731)
https://github.com/pytorch/pytorch/pull/129581 does not work correctly in the Sandcastle environment. This PR fixes the issue.

Differential Revision: [D59144062](https://our.internmc.facebook.com/intern/diff/D59144062/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129731
Approved by: https://github.com/wz337
2024-06-28 17:24:12 +00:00
c12a4f2e65 Add decomposition for slice_scatter (#123744)
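For context, a rough functional reference for `slice_scatter` (an illustrative sketch, not the actual decomposition added in this PR) looks like:

```python
import torch

def slice_scatter_ref(inp, src, dim=0, start=None, end=None, step=1):
    start = 0 if start is None else start
    end = inp.size(dim) if end is None else end
    idx = torch.arange(start, end, step, device=inp.device)
    out = inp.clone()
    out.index_copy_(dim, idx, src)  # overwrite the sliced positions with src
    return out

x, y = torch.zeros(4, 6), torch.ones(4, 3)
assert torch.equal(slice_scatter_ref(x, y, dim=1, start=0, end=6, step=2),
                   torch.slice_scatter(x, y, dim=1, start=0, end=6, step=2))
```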
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123744
Approved by: https://github.com/peterbell10
2024-06-28 17:02:10 +00:00
6897631ceb Guard on inner tensor names for traceable wrapper subclasses (#129618)
Fixes #129601

Background: it's possible that a traceable wrapper subclass will have an optional inner tensor constituent (e.g. NJT's cached min / max sequence lengths). To specify this, the subclass's `__tensor_flatten__()` impl should leave out any unspecified optional inner tensors in the returned list of `attrs`.

This PR guards on the list of inner tensor `attrs` returned in `subclass.__tensor_flatten__()[0]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129618
Approved by: https://github.com/anijain2305
2024-06-28 16:30:25 +00:00
b84036e3fb [AOTI] Fix test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation (#129173)
Fixes #122978
## Summary
To fix compilation error for test test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation

- Error 1
```
error: no matching function for call to ‘torch::aot_inductor::ArrayRefTensor<float>::ArrayRefTensor(float [1], const int64_t [0], const int64_t [0], int&, int32_t&)’
  613 |     ArrayRefTensor<float> buf3(buf3_storage, int_array_6, int_array_6, cached_torch_device_type_cpu, this->device_idx_);
      |                                                                                                                       ^
...
torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:188:35: note:   no known conversion for argument 2 from ‘const int64_t [0]’ {aka ‘const long int [0]’} to ‘torch::aot_inductor::MiniArrayRef<const long int>’
  188 |       MiniArrayRef<const int64_t> sizes,
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
```
Fix: added constructor for empty array in arrayref_tensor.h
- Error 2
```
error: cannot convert ‘torch::aot_inductor::ArrayRefTensor<float>’ to ‘AtenTensorHandle’ {aka ‘AtenTensorOpaque*’}
  625 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
      |                                                         ^~~~
      |                                                         |
      |                                                         torch::aot_inductor::ArrayRefTensor<float>
```
Fix: in cpp_wrapper_cpu.py, added codegen to convert the ArrayRefTensor to an AtenTensorHandle first.
## Test Plan
```
python test/inductor/test_aot_inductor.py -k AOTInductorTestABICompatibleCpuWithStackAllocation.test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation
```

Before the fix, detailed in  #122978:
```
 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
      |                                                         ^~~~
      |                                                         |
      |                                                         torch::aot_inductor::ArrayRefTensor<float>
/home/yingzhaoseattle/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/utils.h:34:8: note: in definition of macro ‘AOTI_TORCH_ERROR_CODE_CHECK’
Ran 1 test in 4.377s
FAILED (errors=1)
```
After the fix

```
/home/yingzhaoseattle/pytorch/torch/backends/cudnn/__init__.py:107: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('extern_calls', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 9.633s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129173
Approved by: https://github.com/chenyang78
2024-06-28 16:27:42 +00:00
04264efab6 Add structured logging on FXGraphCache hit (#129588)
We'll also want to do this for AOTAutogradCache once that's ready

Differential Revision: [D59144226](https://our.internmc.facebook.com/intern/diff/D59144226)
Co-authored-by: Oguz Ulgen <oulgen@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129588
Approved by: https://github.com/oulgen, https://github.com/xmfan
2024-06-28 16:06:22 +00:00
e40f50cb87 Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)
------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419
Approved by: https://github.com/ezyang
ghstack dependencies: #129375, #129376
2024-06-28 15:37:57 +00:00
494057d6d4 [BE][Easy] enable postponed annotations in torchgen (#129376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376
Approved by: https://github.com/ezyang
ghstack dependencies: #129375
2024-06-28 15:37:57 +00:00
59eb2897f1 [BE][Easy] enable postponed annotations in tools (#129375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-28 15:37:54 +00:00
2e3ff394bf [inductor] optimize cpp builder configuration code (#129577)
Changes:
1. Combine the ISA-selection dispatch code.
2. Unify the macOS OpenMP configuration code.
3. Clean up unused code.

Co-authored-by: Jason Ansel <jansel@jansel.net>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129577
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-28 15:08:54 +00:00
eabe6574c0 [metal] Parameterize group_size in int4_mm test, fix int4mm shader for group_size > 128 (#129628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129628
Approved by: https://github.com/kimishpatel
2024-06-28 15:01:30 +00:00
635d6c9d66 [FSDP2] Ran post-acc-grad hooks manually (#129450)
FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually.

**Discussion**
Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity.

Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance of FSDP2 incorrectly re-running the hook when it should not.

Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually.
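
For context, the eager-mode optimizer-in-backward pattern that these hooks enable looks roughly like the sketch below (plain `nn.Linear`, not FSDP2; FSDP2 simply runs such hooks manually for its sharded parameters):

```python
import torch

model = torch.nn.Linear(16, 16)
# One foreach=False optimizer per parameter, stepped as soon as its grad is ready.
optims = {p: torch.optim.AdamW([p], lr=1e-3, foreach=False) for p in model.parameters()}

def optim_step_in_backward(param: torch.nn.Parameter) -> None:
    optims[param].step()
    optims[param].zero_grad()

for p in model.parameters():
    p.register_post_accumulate_grad_hook(optim_step_in_backward)

loss = model(torch.randn(4, 16)).sum()
loss.backward()  # optimizer steps run during backward, right after each grad lands
```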

**Caveats**
- Running `foreach=False` optimizer _per parameter tensor_  incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass).
    - On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be.
    - One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers.
    - If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`.
- The post-accumulate-grad hook runs in the reduce-scatter stream. Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream.
    - This means that optimizer compute will overlap with backward compute, which may slow down end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about a ~3% decrease in MFU when running the optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues.
- This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope.

**Experiments (torchtitan)**
- Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision:
    - Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU
    - With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped)
    - With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450
Approved by: https://github.com/weifengpy, https://github.com/yf225
2024-06-28 14:50:09 +00:00
fe4032fe20 [BE][CMake] Do not use EXEC_PROGRAM (#129714)
It was deprecated since CMake-3.0 in favor of `execute_process`, see https://cmake.org/cmake/help/v3.18/command/exec_program.html

This makes the following warning disappear:
```
CMake Warning (dev) at cmake/Modules/FindARM.cmake:5 (EXEC_PROGRAM):
  Policy CMP0153 is not set: The exec_program command should not be called.
  Run "cmake --help-policy CMP0153" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  Use execute_process() instead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129714
Approved by: https://github.com/kit1980
2024-06-28 13:29:52 +00:00
98d34d849d Add a XPU UT to ensure lazy init (#129638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129638
Approved by: https://github.com/gujinghui
2024-06-28 13:22:17 +00:00
22a06869f2 include jit/*.pyi (#129654)
Fixes #108781, see https://github.com/pytorch/pytorch/pull/108782#issuecomment-1927321532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129654
Approved by: https://github.com/ezyang
2024-06-28 12:40:11 +00:00
424068d0d2 [Windows] remove mkl shared library dependency. (#129493)
# Background
I have fixed pytorch Windows missing mkl shared library dependency issue: https://github.com/pytorch/pytorch/issues/124009
The solution is to change the torch_cpu module to statically link the mkl library:
1. pytorch static link mkl PR: https://github.com/pytorch/pytorch/pull/124925
2. builder install mkl static library: https://github.com/pytorch/builder/pull/1790

Double confirmed current build is using mkl static link: https://github.com/pytorch/pytorch/issues/124009#issuecomment-2160941802

# Goal
Remove the mkl shared library from setup.py `install_requires` on pytorch Windows. It is not required now because we statically link mkl.
This reduces the pytorch install network traffic and avoids installing a useless mkl shared library package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129493
Approved by: https://github.com/malfet
2024-06-28 11:42:21 +00:00
a0dac3de31 Noise tensor using same size/stride with input to promote performance when channel last situation. (#129467)
All ops in the `_dropout_impl` function are point-wise. When the input and output tensors have the same size and stride, those operators get better performance. So I removed the memory-format argument from the `at::empty_like` call that creates the noise tensor.

@ezyang

Test code:
```
import torch

input1 = torch.randn((50, 20, 50 ,30)).cuda()
input2 = torch.randn((50, 20, 50 ,30)).cuda().to(memory_format=torch.channels_last)
input3 = torch.randn((50, 20, 50 , 50)).cuda()[...,10:40]
dropout = torch.nn.Dropout(p=0.5, inplace=True)

# warmup:
for i in range(20):
    output = dropout(input1)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
num = 10000
start_event.record()
for i in range(num):
    output = dropout(input1)
end_event.record()
end_event.synchronize()
time = start_event.elapsed_time(end_event)
print("input1 each time: {0}.".format(time * 1.0/num), flush =True)

start_event.record()
for i in range(num):
    output = dropout(input2)
end_event.record()
end_event.synchronize()
time = start_event.elapsed_time(end_event)
print("input2 each time: {0}.".format(time * 1.0/num), flush =True)

start_event.record()
for i in range(num):
    output = dropout(input3)
end_event.record()
end_event.synchronize()
time = start_event.elapsed_time(end_event)
print("input3 each time: {0}.".format(time * 1.0/num), flush =True)
```

Test result:

  | Op | Input size / stride | Memory-format arg passed to empty_like | Time (ms) | Notes
-- | -- | -- | -- | -- | --
1 | dropout | (50, 20, 50, 30) / (30000, 1500, 30, 1) | LEGACY_CONTIGUOUS_MEMORY_FORMAT | 0.0426735 |
2 | dropout | (50, 20, 50, 30) / (30000, 1, 600, 20) | LEGACY_CONTIGUOUS_MEMORY_FORMAT | 0.0461689 |
3 | dropout | (50, 20, 50, 30) / (50000, 2500, 50, 1) | LEGACY_CONTIGUOUS_MEMORY_FORMAT | 0.0512882 |
4 | dropout | (50, 20, 50, 30) / (30000, 1500, 30, 1) | none (size/stride follow the input) | 0.0426598 | vs. row 1: essentially unchanged
5 | dropout | (50, 20, 50, 30) / (30000, 1, 600, 20) | none (size/stride follow the input) | 0.0422751 | vs. row 2: ~8.4% faster
6 | dropout | (50, 20, 50, 30) / (50000, 2500, 50, 1) | none (size/stride follow the input) | 0.0509037 | vs. row 3: essentially unchanged

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129467
Approved by: https://github.com/ezyang
2024-06-28 10:06:13 +00:00
999eec8dea Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit b7e7a4cb01de394af7686ab6feb216a8a5c716bb.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))
2024-06-28 06:03:54 +00:00
d21993bbb8 Revert "[cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587)"
This reverts commit 7854d84acbfb7a4e3e807951188535a0316b585e.

Reverted https://github.com/pytorch/pytorch/pull/129587 on behalf of https://github.com/huydhn due to Sorry for revert yet another of your change but I need to revert this to cleanly revert https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196187332 ([comment](https://github.com/pytorch/pytorch/pull/129587#issuecomment-2196198756))
2024-06-28 06:01:07 +00:00
c43923a116 Revert "[Inductor] FlexAttention supports block sparse mask (#129216)"
This reverts commit b9d3cedd648d4ed9d0bf5b918893341e5f95289a.

Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is still failing in trunk b9d3cedd64, maybe a landrace given that TD has been turned off ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2196182882))
2024-06-28 05:44:46 +00:00
73eb4503cc Enable UFMT for numpy_test files, test_xnnpack_integration.py (#129023)
Fixes #123062

Run lintrunner on files:
test/test_xnnpack_integration.py

```bash
$ lintrunner

  FLAKE8 success!
  CLANGFORMAT success!
  MYPY success!
  MYPYSTRICT success!
  CLANGTIDY success!
  TYPEIGNORE success!
  TYPENOSKIP success!
  NOQA success!
  NATIVEFUNCTIONS success!
  NEWLINE success!
  CONSTEXPR success!
  SPACES success!
  TABS success!
  INCLUDE success!
  PYBIND11_INCLUDE success!
  ERROR_PRONE_ISINSTANCE success!
  PYBIND11_SPECIALIZATION success!
  PYPIDEP success!
  EXEC success!
  CUBINCLUDE success!
  RAWCUDADEVICE success!
  RAWCUDA success!
  ROOT_LOGGING success!
  DEPLOY_DETECTION success!
  CMAKE success!
  SHELLCHECK success!
  ACTIONLINT success!
  TESTOWNERS success!
  TEST_HAS_MAIN success!
  CALL_ONCE success!
  ONCE_FLAG success!
  WORKFLOWSYNC success!
  UFMT success!
  COPYRIGHT success!
  BAZEL_LINTER success!
  LINTRUNNER_VERSION success!
  ATEN_CPU_GPU_AGNOSTIC success!
  MERGE_CONFLICTLESS_CSV success!
  RUFF success!
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129023
Approved by: https://github.com/ezyang
2024-06-28 05:40:31 +00:00
b019f38fdd [inductor] Fix pattern replacements with multiple users (#129689)
Fixes #129685

After matching a pattern, we currently try to remove all the nodes of that
pattern, which doesn't work if any intermediate node has users outside of the
pattern; in that case we can't delete those particular nodes.
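
A small standalone FX example of the failure mode (the add -> mul "pattern" here is made up; this is not the pattern matcher itself):

```python
import operator
import torch
import torch.fx as fx

def f(x):
    y = x + 1      # intermediate node of a hypothetical add -> mul pattern
    z = y * 2
    return z, y    # y escapes the pattern: it also has the graph output as a user

gm = fx.symbolic_trace(f)
add_node = next(n for n in gm.graph.nodes if n.target is operator.add)
print(len(add_node.users))       # 2: the mul node and the graph output
# gm.graph.erase_node(add_node)  # would raise because the node still has users,
#                                # so the replacement pass must keep such nodes
```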

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129689
Approved by: https://github.com/shunting314
2024-06-28 05:16:17 +00:00
eqy
7854d84acb [cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587)
Fix for #129579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129587
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-06-28 04:42:45 +00:00
8d4216af8c Fix compile error with Intel oneAPI compiler (#129589)
I am building PyTorch with the Intel oneAPI 2024.0.0 compiler, and encountered this compile error:
```
[ 85%] Building CXX object caffe2/CMakeFiles/cpu_rng_test.dir/__/aten/src/ATen/test/cpu_rng_test.cpp.o
In file included from /home/src/pytorch/aten/src/ATen/test/cpu_rng_test.cpp:2:
/home/src/pytorch/aten/src/ATen/test/rng_test.h:119:41: error: loop variable 'to' creates a copy from type 'const ::std::optional<int64_t>' (aka 'const optional<long>') [-Werror,-Wrange-loop-construct]
  119 |     for (const ::std::optional<int64_t> to : tos) {
      |                                         ^
/home/src/pytorch/aten/src/ATen/test/rng_test.h:119:10: note: use reference type 'const ::std::optional<int64_t> &' (aka 'const optional<long> &') to prevent copying
  119 |     for (const ::std::optional<int64_t> to : tos) {
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                         &
1 error generated.
```

This change makes the compiler happy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129589
Approved by: https://github.com/colesbury
2024-06-28 02:35:10 +00:00
4b8a5e0374 [export] make with_effect mark op has_effect to prevent them from DCEed. (#129680)
Before this PR, custom ops that don't return outputs would get eliminated after calling `.module()`, because the effect token that keeps the operator alive is removed in the remove_effect_token pass. The reason we want to remove the effect token is that we don't want it to be part of the input. However, the DCE calls in remove_effect_token itself and in unlift then remove the custom op from the graph, causing an error in the exported graph.

This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident.
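
The FX-level mechanism being leveraged looks roughly like the sketch below (`log_tensor` is a made-up stand-in for a custom op with no used outputs; this is not the export `with_effects` machinery itself):

```python
import torch
import torch.fx as fx
from torch.fx.node import has_side_effect

@fx.wrap
def log_tensor(x):  # traced as a call_function leaf; its result is never used
    print(x.sum())

has_side_effect(log_tensor)  # without this, eliminate_dead_code drops the call

def f(x):
    log_tensor(x)
    return x + 1

gm = fx.symbolic_trace(f)
gm.graph.eliminate_dead_code()
print(gm.graph)  # the log_tensor call survives DCE because it is marked impure
```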

Test Plan:
Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680
Approved by: https://github.com/angelayi
2024-06-28 02:22:30 +00:00
4b598d87d3 Fix FindBLAS.cmake (#129713)
Fixes regression introduced by https://github.com/pytorch/pytorch/pull/125227 by adding `INCLUDE(CheckFunctionExists)` that fixes
```
CMake Error at cmake/Modules/FindBLAS.cmake:413 (check_function_exists):
  Unknown CMake command "check_function_exists".
```

Fixes https://github.com/pytorch/pytorch/issues/129693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129713
Approved by: https://github.com/kit1980
2024-06-28 02:15:16 +00:00
b9d3cedd64 [Inductor] FlexAttention supports block sparse mask (#129216)
Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286
Initial perf number:
* fwd speedup: 0.44 -> 0.72
* bwd speedup: 0.38 -> 0.71

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216
Approved by: https://github.com/Chillee
2024-06-28 01:32:54 +00:00
c07a799ed5 [Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247)
Test command:
`pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127247
Approved by: https://github.com/bdhirsh
ghstack dependencies: #129502
2024-06-28 01:04:49 +00:00
36b9d9cfcd [Inductor UT] Generalize device-bias code in newly added UT test_scatter_optimization.py (#129622)
[Inductor UT] Generalize device-bias code in newly added UT test_scatter_optimization.py and test_torchinductor_dynamic_shapes.py
Fixes issues #129624 and #129642.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129622
Approved by: https://github.com/EikanWang, https://github.com/peterbell10
2024-06-28 01:04:21 +00:00
deaab33f3f [custom op] add error message (#129417)
Fixes [#129370](https://github.com/pytorch/pytorch/issues/129370)

Suggest a corrected List type annotation when the input is annotated as a Tuple. To avoid confusion, we only suggest a type if that type is supported.

Example:
Tuple[int, int] -> List[int]
Tuple[Tensor, Tensor, Optional[Tensor]] -> List[Optional[Tensor]]
Tuple[int, ...] -> List[int]

ValueError: infer_schema(func): Parameter y has unsupported type typing.Tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]. Tuple type annotation is not supported. Please try to use a List instead. For example, typing.List[typing.Optional[torch.Tensor]].
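
As a concrete sketch of the supported spelling (`mylib::masked_scale_sum` is a made-up op, assuming a PyTorch build that provides `torch.library.custom_op`):

```python
import torch
from typing import List, Optional
from torch import Tensor

# Annotating xs as Tuple[Tensor, Tensor, Optional[Tensor]] would now raise the
# ValueError above; the List spelling below is the supported one.
@torch.library.custom_op("mylib::masked_scale_sum", mutates_args=())
def masked_scale_sum(xs: List[Optional[Tensor]], w: float) -> Tensor:
    out = torch.zeros_like(next(x for x in xs if x is not None))
    for x in xs:
        if x is not None:
            out = out + w * x
    return out

print(masked_scale_sum([torch.ones(3), None, torch.ones(3)], 2.0))
```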
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129417
Approved by: https://github.com/zou3519
2024-06-28 01:03:14 +00:00
8ba0f6c7c2 Revert "[nn-module] Use standard dict for _parameters, _modules and _buffers (#129164)"
This reverts commit f2840bb22079a6952c61446a3d0dfc12f6452852.

Reverted https://github.com/pytorch/pytorch/pull/129164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some internal dper3 tests ([comment](https://github.com/pytorch/pytorch/pull/129164#issuecomment-2195888838))
2024-06-28 00:49:39 +00:00
9e1f3ecaa7 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-28 00:35:15 +00:00
d4b6ff6fbe Disable llm-td step (#129722)
As it often fails during conda install step with `Unexpected HTTP response: 429`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129722
Approved by: https://github.com/kit1980, https://github.com/clee2000
2024-06-28 00:12:32 +00:00
0ffb17547e [Simple FSDP] Add unit test for torch.compile + reparameterization + SAC (#129641)
This can reproduce the error in https://github.com/pytorch/pytorch/issues/129684. Adding a unit test so that we hold the line for torch.compile + reparameterization + SAC to always be working, to pave the path for Tianyu's intern's project.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129641
Approved by: https://github.com/tianyu-l
2024-06-28 00:00:36 +00:00
169b4ca07e add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy, https://github.com/malfet
2024-06-27 23:53:13 +00:00
cyy
fb5888c719 Remove unused type traits in torch/csrc/utils (#128799)
Follows  #127852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128799
Approved by: https://github.com/ezyang
2024-06-27 23:51:18 +00:00
3fc279633b [ATen] Make argsort.stable CompositeImplicitAutograd (#129529)
It literally just calls `at::sort` and returns the indices, so it is composite compliant.
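
A Python-level sketch of what the composite implementation amounts to (illustrative only; the actual change lives in the ATen dispatch metadata):

```python
import torch

def argsort_stable(x, dim=-1, descending=False):
    # argsort.stable == the indices of a stable sort
    return torch.sort(x, dim=dim, descending=descending, stable=True).indices

x = torch.tensor([3.0, 1.0, 2.0, 1.0])
assert torch.equal(argsort_stable(x), torch.argsort(x, stable=True))
```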

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129529
Approved by: https://github.com/lezcano
2024-06-27 23:49:16 +00:00
7cf0b90e49 [BE] enable UFMT in torch.utils.data (#127705)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127705
Approved by: https://github.com/ezyang
ghstack dependencies: #127706, #127704
2024-06-27 23:16:24 +00:00
f911957573 [BE] sort imports in torch.utils.data (#127704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127704
Approved by: https://github.com/ezyang
ghstack dependencies: #127706
2024-06-27 23:16:24 +00:00
d80939e5e9 [BE] enable UFMT for torch/storage.py (#127706)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127706
Approved by: https://github.com/ezyang
2024-06-27 23:16:24 +00:00
67416a2996 [c10d] Introduce a util for detecting DMA connectivity among devices (#129510)
This PR introduces `_detect_dma_connectivity` - a utility for detecting DMA connectivity among devices.

The "DMA connectivity" in this context is more stringent than the ability to perform memory copy without CPU involvement. We define it as the ability for a device to issue load/store instructions and perform atomic operations on memory that resides on connected devices. The ability translates to the ability to run most aten GPU operations with operands backed by remote memory. `_detect_dma_connectivity` can help PyTorch and its users to determine whether certain DMA-based optimizations are possible.

`_detect_dma_connectivity` takes a `(device_type, connection_type)` pair and returns a matrix describing the connectivity. Connectivity detectors are statically registered on a `(device_type, connection_type)` basis. This PR implements the detector for `(CUDA, "nvlink")`. Later, detectors for pairs such as `(ROCM, "infinity_fabric")` can be introduced.

Example:

```python3
>>> from torch._C._autograd import DeviceType
>>> from torch._C._distributed_c10d import _detect_dma_connectivity
>>> connectivity = _detect_dma_connectivity(DeviceType.CUDA, "nvlink")
>>> for row in connectivity.matrix:
...     print(row)
...
[0, 18, 18, 18, 18, 18, 18, 18]
[18, 0, 18, 18, 18, 18, 18, 18]
[18, 18, 0, 18, 18, 18, 18, 18]
[18, 18, 18, 0, 18, 18, 18, 18]
[18, 18, 18, 18, 0, 18, 18, 18]
[18, 18, 18, 18, 18, 0, 18, 18]
[18, 18, 18, 18, 18, 18, 0, 18]
[18, 18, 18, 18, 18, 18, 18, 0]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129510
Approved by: https://github.com/weifengpy
2024-06-27 23:02:07 +00:00
305ba62906 Add support to GradScaler for respecting an already set grad_scale value (#123429)
Fixes #123428

Co-authored-by: Yousuf Mohamed-Ahmed <youmed.tech@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123429
Approved by: https://github.com/ezyang
2024-06-27 22:40:54 +00:00
83a4a8b510 [C10D] clean up pointless 'or None' clause (#129522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129522
Approved by: https://github.com/awgu
2024-06-27 22:40:11 +00:00
5e7ac69a67 [Dynamic Shapes] fixed dynamic shape inference (#128807)
Make a dynamic dimension that is indirectly bound to an integer become constrained: after each `ShapeEnv._refine_ranges`, check whether the new ValueRange is a singleton; if it is, replace the symbol with that value.

Fixes #122307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128807
Approved by: https://github.com/ezyang
2024-06-27 22:33:32 +00:00
b8398b771c Upload test stats when workflow regardless of conclusion (#129694)
Always upload test stats regardless of the workflow conclusion, so that we can get status for cancelled workflows (especially ones that were cancelled manually).

There aren't that many workflow conclusions, so we might as well always run it and see what happens.

Undos [this old PR](https://togithub.com/pytorch/pytorch/pull/79180)

Notable pitfalls from the above: this might cause noise if things can't be downloaded, but since this workflow doesn't show up on PRs, I think it's OK to deal with issues as they come.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129694
Approved by: https://github.com/huydhn
2024-06-27 21:12:21 +00:00
1d0efedc85 [Profiler] Add TSC Clock Callback to CUPTI (#125036)
Summary:
Right now we use the default clock for CUPTI, which is neither monotonic nor particularly fast. We have already added the Kineto side of the implementation here: https://www.internalfb.com/diff/D56525885

This diff only adds the compile flags such that the TSC format is used and sets the converter using a libkineto call in the profiler

Test Plan:
Obtained following trace using resnet test:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Apr_25_11_03_18.3862943.pt.trace.json.gz&bucket=gpu_traces

TBD: Add benchmarks

Differential Revision: D56584521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125036
Approved by: https://github.com/aaronenyeshi
2024-06-27 21:07:43 +00:00
602b5cb218 [inductor] switch HalideCodeCache to new cpp_builder. (#129441)
The original PRs were damaged by conflicts and rebases: https://github.com/pytorch/pytorch/pull/128303, https://github.com/pytorch/pytorch/pull/129144

This PR just switches `HalideCodeCache` to the new cpp_builder and is not `fb_code` related, so it can be merged without an `fb_code` test.
Let's land this change first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129441
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-27 20:50:13 +00:00
39427288f4 Taskify training IR + run_decomp flow failures (#129547)
Differential Revision: [D59069088](https://our.internmc.facebook.com/intern/diff/D59069088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129547
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #128077, #129092, #129249
2024-06-27 20:43:22 +00:00
23adf166e1 [cond] inlining into one of the branches when pred is a python constant (#128709)
When the input predicate is a Python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was to bake True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the predicate is a constant.
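
A minimal sketch of the scenario (assuming a recent PyTorch where `torch.cond` is available; the functions and shapes are illustrative):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

def f(x):
    # x.shape[0] > 4 is a plain Python bool here (no dynamic shapes involved),
    # so compiling/exporting f specializes the cond into one branch and warns
    # that the dynamism is not preserved.
    return torch.cond(x.shape[0] > 4, true_fn, false_fn, (x,))

out = torch.compile(f)(torch.randn(8, 3))
```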

We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.

Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a Python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized into one of the branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
2024-06-27 20:28:50 +00:00
71f5ecd1ee Fixed Memory Leaks in tests (#129640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129640
Approved by: https://github.com/clee2000
ghstack dependencies: #129400
2024-06-27 20:26:21 +00:00
dabaebd339 Make run_decomp work (#129249)
In this PR, we implement the first version of the training_ir.run_decomp functionality. Since we don't return the modified buffers as extra outputs in the training IR, our previous strategy of reusing the graph signature won't work. In fact, this run_decomp is more similar to retracing, so I reuse some of the export steps here. After this PR:
export_for_training().run_decomp({}, _preserve_ops=[all 183 ops]) == export_for_predispatch() - autograd_manipulating_ops.

Differential Revision: [D59069090](https://our.internmc.facebook.com/intern/diff/D59069090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129249
Approved by: https://github.com/zhxchen17
ghstack dependencies: #128077, #129092
2024-06-27 19:16:07 +00:00
ec284d3a74 Prototype for export_for_training (#129092)
This PR implements export_for_training where the IR is not-functional, pre-dispatch aten IR. The general strategy:
1. Call dynamo to get torch IR
2. Lift param/buffer
3. call make_fx

TODO:
1. run_decomp doesn't work
2. not-strict is not supported

Differential Revision: [D59069087](https://our.internmc.facebook.com/intern/diff/D59069087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129092
Approved by: https://github.com/zhxchen17
ghstack dependencies: #128077
2024-06-27 18:27:11 +00:00
4dcc1ceff3 [dynamo] Fakify result of delegate (#128752)
Summary: Somehow the delegate returns a real tensor result even though we pass in fake tensors. So here we need to convert the result to fake.

Test Plan: `buck2 run @//mode/dev-nosan //on_device_ai/helios/multi_zion:multi_zion_test -- -r test_single_delegate_dsp_only`

Differential Revision: D58617091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128752
Approved by: https://github.com/ydwu4
2024-06-27 17:59:52 +00:00
389492e264 Fix runner determinator bug (#129612)
Currently the runner determinator is buggy and doesn't let anyone's workflows run against the LF runners (it prefixes a "@" to the user names in the issue instead of either stripping it or prefixing it to the incoming names)

This PR fixes the bug so that people opted in to using LF runners can actually use them. It also puts the python code back into the repo.  Even though the code isn't directly invoked, having it there makes testing and linting easier/possible

Also includes lint fixes

Note: if you just review the .yml file you'll see all the relevant diffs

### Testing:
#### Before
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo
{"label_type": "", "message": "LF Workflows are disabled for ZainRizvi, ZainRizvi. Using meta runners."}
```

#### After
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi, ZainRizvi. Using LF runners."}
```

Aside: updated test case after rebase:
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi2 --github-branch foo  --github-repo python/pythonss --github-ref-type branch
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129612
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-06-27 17:51:09 +00:00
a4d7aa498b [Traceable FSDP2] Add auto-functionalize support for mutable list[Tensor] (copy from Brian's PR #127347); enable E2E inductor unit test for transformer model (#129502)
Copy of Brian's PR: https://github.com/pytorch/pytorch/pull/127347 with additional changes to support mutable `List[Tensor]` in Inductor. Also enable E2E inductor unit test for Traceable FSDP2 + transformer model.

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_set_`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_aot_eager`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_aot_eager`
- `pytest -rA test/dynamo/test_misc.py::MiscTests::test_auto_functionalize_tensorlist`
- `pytest -rA  test/inductor/test_torchinductor.py::GPUTests::test_fallback_mutable_op_list_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129502
Approved by: https://github.com/zou3519
2024-06-27 17:50:57 +00:00
9174d14551 Don't install remaining caffe2 python files (#129067)
It is assumed that they are no longer needed.
And keeping their installation as is breaks
"python setup.py develop --user" workflow
when non-root user is used.

This change is follow up for 3d617333e700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129067
Approved by: https://github.com/cyyever, https://github.com/r-barnes
2024-06-27 17:25:59 +00:00
e0bba37d66 [codemod] Add [[noreturn]] to 2 files inc caffe2/c10/util/TypeCast.cpp (#129575)
Summary: LLVM-15 has a warning `-Wno-return` which can be used to identify functions that do not return. Qualifying these functions with `[[noreturn]]` is a perf optimization.

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D59003594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129575
Approved by: https://github.com/Skylion007
2024-06-27 17:23:22 +00:00
321bdcb372 Fix device propagation for checkpointing (#128671)
Fixes: #128478

In backward() implementation checkpointing code was quering device type from the rng_state tensors saved on forward(). These tensors are CPU only tensors and don't carry device information with them. As a result CUDA device was assumed as a default. Which is not correct if user runs on some other device. For example, on XPU.

This patch saves full device information on forward() and uses it on backward() to get device type. Previously forward save only device index.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128671
Approved by: https://github.com/guangyey, https://github.com/soulitzer
2024-06-27 17:14:13 +00:00
04206d1898 TunableOp hotfix, unit test follow-up (#129606)
PR #129281 was landed to fix critical issues but did not contain unit tests to exercise those issues.  This is a follow-up set of unit tests that would exercise the problems seen previously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129606
Approved by: https://github.com/atalman
2024-06-27 17:01:04 +00:00
5c6af2b583 [cpu] Fix div with rounding_mode="floor" when division overflows (#129536)
Fixes #77742

`Sleef_fmod` returns NaN when the division overflows, where `libm` returns 0. In this narrow case we can drop the `fmod` from the calculation entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129536
Approved by: https://github.com/lezcano
2024-06-27 16:50:47 +00:00
5ceba6a3cb Revert "[Inductor] FlexAttention supports block sparse mask (#129216)"
This reverts commit 4082759925a712b7cb340164d3da3a1dab372d9f.

Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/clee2000 due to broke functorch/aot_dispatch and test_proxy_tensor on windows https://github.com/pytorch/pytorch/actions/runs/9691331440/job/26743164471 4082759925 missed on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2195087274))
2024-06-27 15:57:52 +00:00
82c8fc3a2b [inductor] Add size_hint to conv dilation (#129631)
Summary: [Here](ea588d7fd3/torch/_inductor/kernel/conv.py (L252)) in the `conv` lowering `dilation` is not `size_hint`-ed. This breaks if `dilation` is a symbolic expression (which we see in some internal models). The PR fixes it by adding a `size_hints`.

Test Plan:
```
$ python test/inductor/test_torchinductor.py -k test_convolution5
...
----------------------------------------------------------------------
Ran 2 tests in 7.329s

OK
```

Differential Revision: D59097019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129631
Approved by: https://github.com/chenyang78
2024-06-27 15:27:57 +00:00
483dbfcf2a [BE] Correctly catch skip signals emitting from sys.exit (#129581)
Some tests in test_c10d_nccl.py overwrite `_join_process()` and `_check_return_codes()`, which causes the skip signals not to be caught appropriately. This PR fixes the issue.

Differential Revision: [D59067457](https://our.internmc.facebook.com/intern/diff/D59067457/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129581
Approved by: https://github.com/fduwjj
2024-06-27 15:12:51 +00:00
2d9012ad25 Forward fix internal pyre failure from D58983461 (#129525)
Summary: Somehow, using underscore alias of some builtin types breaks pyre

Test Plan:
All failed tests from D58983461 are passing:

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/utils/tests:gpu_memory_utils_test-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:device_util-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:thompson_samplers_gpu-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:combined_sampling_diversifier_test-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:submodular_opt_test-type-checking
```

Differential Revision: D59029768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129525
Approved by: https://github.com/XuehaiPan, https://github.com/clee2000, https://github.com/malfet
2024-06-27 14:41:20 +00:00
0680e6cd1c [Profiler] Add sraikund16 to profiler paths in CODEOWNERS (#129591)
Summary: Add Shivam to the list of code owners for the profiler code paths, so that Shivam gets added to reviewers for PRs too.

Test Plan: CI

Differential Revision: D59072152

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129591
Approved by: https://github.com/sraikund16
2024-06-27 14:22:09 +00:00
ad607b91f4 [dynamo][onnx] Skip some dynamic=True test with inlining in built nn modules (#129610)
These tests fail with dynamic=True when inlining in built nn modules. There are a few more recompilations. Since `dynamic=True` is not a recommended usage, I am skipping these tests for now. This is the tracking issue to come back later and fix/update these tests - https://github.com/pytorch/pytorch/issues/129456
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129610
Approved by: https://github.com/yanboliang
ghstack dependencies: #129583
2024-06-27 10:56:24 +00:00
a028e5862d [profiler] Directly use end_ns to create the FunctionEvent instead of using start_ns + duration_ns in pytorch profiler post processing for checking parent-child precisely (#129554)
Use the raw end_ns directly, instead of the sum of start_ns and duration_ns, in order to avoid negative CPU time in profiler.

Fix https://github.com/pytorch/pytorch/issues/101861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129554
Approved by: https://github.com/gujinghui, https://github.com/aaronenyeshi
2024-06-27 10:46:05 +00:00
ff026f3d0a Fix an issue in meta_scaled_mm (#129521)
Summary:
To fix the following failure cases:

For example, when `M, K, N = 245760, 656, 6560`, fp8 with compile fails due to `RuntimeError: mat2 must be col_major`.

---------
From the inductor generated code (https://fburl.com/everpaste/epcagkrd)
```
V0625 01:38:55.551000 140329914449920 torch/_inductor/scheduler.py:1623] [0/0] scheduling ComputedBuffer(name='buf12', layout=FixedLayout('cuda', torch.float8_e4m3fn, size=[656, 6560], stride=[6656, 1]),
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code]         buf12 = empty_strided_cuda((656, 6560), (6656, 1), torch.float8_e4m3fn)
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code]     return (buf10, buf2, buf5, buf6, reinterpret_tensor(buf11, (245760, 656), (1, 245760), 0), reinterpret_tensor(buf12, (6560, 656), (1, 6656), 0), )
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code]     assert_size_stride(permute_10, (6560, 656), (1, 6656))
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code]         buf8 = aten._scaled_mm.default(buf6, permute_10, buf7, reciprocal_3, None, None, torch.bfloat16)
```

Inductor gives the mat2 (`permute_10`, of shape `(6560, 656)`) a stride of `(1, 6656)` rather than `(1, 6560)`, i.e. `stride[1]` does not equal `shape[0]`.

Therefore, the `stride[1] == shape[0]` condition fails.

To fix the issue, simply modify the `is_col_major` check to exclude this condition as it doesn't hold for all valid cases.

Test Plan:
Run the failed case again. It works with the fix.
-----
Sandcastle / GitHub CI will make sure the existing tests could still pass.

Reviewed By: vkuzo

Differential Revision: D58994704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129521
Approved by: https://github.com/drisspg
2024-06-27 07:03:34 +00:00
9f29a2291c Feat: Updated torch.nn.Modules.set_submodules() (#127714)
modified:   torch/nn/modules/module.py

Implemented feature request by #127712.
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127714
Approved by: https://github.com/mikaylagawarecki
2024-06-27 06:38:54 +00:00
c9798d123b [dynamo][compile-time] Manually trace torch.nn.Module.parameters (#129583)
With this PR, we are not worse than no-inlining for Dynamo-only compilation time (there is a litte bit of noise, so outlier of 0.89 is probably ok here). For most of the models, we see positive numbers because of better caching in `UserDefinedObjectVariable`.

![image](https://github.com/pytorch/pytorch/assets/13822661/719d34fd-3e7f-4886-b7e0-1dbfc7141aa5)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129583
Approved by: https://github.com/jansel
2024-06-27 06:06:04 +00:00
cf392d8a89 [pytorch][cuda] Generate kernels for 5x5 filters on depth wise convolution backward (#129609)
In #125362 we improved the default implementation of depth wise convolution 2D forward pass by precomputing boundaries of accessed slices instead of doing expensive edge checks in the inner loops. We also generated kernels for 5x5 filters as this is a common problem size.

In this PR we tried to apply the same strategy to the backward kernel, but we only saw good gains from generating code for 5x5 filters. A fallback implementation that precomputes access boundaries when the filter size and stride are not known at compile time might bring some additional speedup, but that kernel would very rarely be called.

This PR also hints the thread count at compile time and leaves only the unroll directive that seems to help performance.

Before:

```
         B      C      iH      iW    kH    kW  conv2d-backward (cuda)  conv2d-fp16-backward (cuda)
0      8.0   64.0  1024.0  1008.0   5.0   5.0               89.002686                    26.400480
1      8.0   64.0  1008.0  1008.0   5.0   5.0               88.885025                    25.995296
2      4.0   48.0   720.0   539.0   6.0   1.0                9.488832                     9.091136
3      4.0  120.0   379.0   283.0   6.0   1.0                4.194640                     3.844432
4      4.0   32.0   713.0   532.0   6.0   1.0                8.027296                     7.700064
5      4.0    3.0   712.0   542.0  31.0  31.0               15.618095                    15.097760
6      4.0  120.0   379.0   288.0   1.0   6.0                3.788224                     3.499648
7   1024.0  384.0     1.0   928.0   1.0   3.0               18.988289                    14.152768
8      4.0   24.0   687.0   512.0   6.0   1.0                6.902704                     6.685056
9     96.0   96.0   112.0   112.0   5.0   5.0               15.672400                     4.953984
10    96.0   80.0    56.0    56.0   5.0   5.0                3.261152                     1.250320
11    64.0  128.0    64.0    84.0   3.0   3.0                3.172192                     1.515648
12    16.0  960.0     7.0     7.0   5.0   5.0                0.197024                     0.072736
13    16.0   64.0   112.0   112.0   3.0   3.0                1.126240                     0.650304
```

After
```
conv2d-performance:
         B      C      iH      iW    kH    kW  conv2d-backward (cuda)  conv2d-fp16-backward (cuda)
0      8.0   64.0  1024.0  1008.0   5.0   5.0               76.278656                    26.418720
1      8.0   64.0  1008.0  1008.0   5.0   5.0               73.211617                    26.018433
2      4.0   48.0   720.0   539.0   6.0   1.0                8.901312                     9.322912
3      4.0  120.0   379.0   283.0   6.0   1.0                3.815616                     3.992208
4      4.0   32.0   713.0   532.0   6.0   1.0                7.753024                     8.032433
5      4.0    3.0   712.0   542.0  31.0  31.0               15.244144                    15.277296
6      4.0  120.0   379.0   288.0   1.0   6.0                3.503264                     3.552976
7   1024.0  384.0     1.0   928.0   1.0   3.0               16.682976                    14.167969
8      4.0   24.0   687.0   512.0   6.0   1.0                6.802576                     7.019040
9     96.0   96.0   112.0   112.0   5.0   5.0               12.713024                     4.958656
10    96.0   80.0    56.0    56.0   5.0   5.0                2.648352                     1.254752
11    64.0  128.0    64.0    84.0   3.0   3.0                3.213568                     1.517952
12    16.0  960.0     7.0     7.0   5.0   5.0                0.182208                     0.076256
13    16.0   64.0   112.0   112.0   3.0   3.0                1.139952                     0.652432
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129609
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-06-27 06:01:47 +00:00
4082759925 [Inductor] FlexAttention supports block sparse mask (#129216)
Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286
Initial perf number:
* fwd speedup: 0.44 -> 0.72
* bwd speedup: 0.38 -> 0.71

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216
Approved by: https://github.com/Chillee
2024-06-27 05:44:27 +00:00
5ee893a84a Add inductor support for conv3d transpose (#129458)
This PR adds Conv3d Transpose support in Inductor. It basically reuses and extends the Conv2d Transpose implementation and unit tests for Conv3d Transpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129458
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-27 05:27:10 +00:00
9b5b93c58f [CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423)
Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first.

This PR also gives a current read on cu121 and cu124 parity.

Essentially reverting #127150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423
Approved by: https://github.com/atalman, https://github.com/eqy
2024-06-27 05:22:18 +00:00
ea588d7fd3 [SymmetricMemory] use SCM_RIGHTS socket control message to share exported cumem handle (#129412)
`SymmetricMemory` currently uses the `pidfd_getfd` syscall to share the exported cumem fd among devices. The syscall is introduced in linux kernel 5.6 which is relatively new and not available everywhere.

This PR replaces the use of the `pidfd_getfd` syscall with socket + SCM_RIGHTS control message. The approach is demonstrated in [memMapIPCDrv](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/memMapIPCDrv) in [cuda-samples](https://github.com/NVIDIA/cuda-samples/tree/master/Samples) (relevant code: https://github.com/NVIDIA/cuda-samples/blob/master/Common/helper_multiprocess.cpp).
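
For readers unfamiliar with SCM_RIGHTS, here is a minimal, generic Python sketch of passing a file descriptor over a Unix domain socket (illustration only; the PR implements this in C++ for the exported cumem handle, and `socket.send_fds`/`recv_fds` require Python 3.9+):
```python
import os
import socket

# Generic illustration of SCM_RIGHTS fd passing; this only shows the mechanism,
# not the PR's C++ code.
parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Exporter" side: open some fd and send it as ancillary data (SCM_RIGHTS).
fd_to_share = os.open("/tmp", os.O_RDONLY)
socket.send_fds(parent_sock, [b"handle"], [fd_to_share])

# "Importer" side: the kernel duplicates the fd into the receiver's fd table.
msg, fds, _flags, _addr = socket.recv_fds(child_sock, 1024, maxfds=1)
received_fd = fds[0]
print(msg, received_fd)

os.close(fd_to_share)
os.close(received_fd)
parent_sock.close()
child_sock.close()
```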

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129412
Approved by: https://github.com/Chillee
2024-06-27 04:38:13 +00:00
84ad5452f6 [MPS] Fused SGD optimizer (#129350)
```
[-------------------------------------- Fused SGD --------------------------------------]
                                                          |  Fused: True  |  Fused: False
1 threads: ------------------------------------------------------------------------------
      numel: 1024, num_tensors: 100, momentum: True       |        2      |       15
      numel: 1024, num_tensors: 100, momentum: False      |        2      |        5
      numel: 65536, num_tensors: 100, momentum: True      |        3      |       16
      numel: 65536, num_tensors: 100, momentum: False     |        2      |        5
      numel: 1048576, num_tensors: 100, momentum: True    |       11      |       16
      numel: 1048576, num_tensors: 100, momentum: False   |        8      |        6
      numel: 1024, num_tensors: 500, momentum: True       |       29      |       70
      numel: 1024, num_tensors: 500, momentum: False      |       20      |       24
      numel: 65536, num_tensors: 500, momentum: True      |       33      |       76
      numel: 65536, num_tensors: 500, momentum: False     |       22      |       26
      numel: 1048576, num_tensors: 500, momentum: True    |       70      |       80
      numel: 1048576, num_tensors: 500, momentum: False   |       43      |       40
      numel: 1024, num_tensors: 1000, momentum: True      |      108      |      139
      numel: 1024, num_tensors: 1000, momentum: False     |       72      |       48
      numel: 65536, num_tensors: 1000, momentum: True     |      116      |      150
      numel: 65536, num_tensors: 1000, momentum: False    |       77      |       52
      numel: 1048576, num_tensors: 1000, momentum: True   |      190      |      170
      numel: 1048576, num_tensors: 1000, momentum: False  |      120      |       50
```

```python
def profile_fused_sgd():
    from torch.optim.sgd import sgd
    import torch.utils.benchmark as benchmark

    import itertools

    def profile(fn, params, grads, momentum_buffer_list, fused):
        fn(
            params,
            grads,
            momentum_buffer_list,
            momentum=True if len(momentum_buffer_list) > 0 else False,
            dampening=0.0,
            nesterov=False,
            foreach=False,
            fused=fused,
            lr=1e-3,
            weight_decay=.0,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, momentum in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False]):
        sublabel = f"numel: {numel}, num_tensors: {num_tensors}, momentum: {momentum}"
        print(sublabel)
        params, grads = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(2)]
        momentum_buffer_list = [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] if momentum else []
        fn = sgd

        for fused in [True, False]:

            t = benchmark.Timer(
                    stmt='profile(fn, params, grads, momentum_buffer_list, fused)',
                    label='Fused SGD',
                    sub_label=sublabel,
                    globals=locals(),
                    description= f"Fused: {fused}",
                ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129350
Approved by: https://github.com/janeyx99
ghstack dependencies: #129006, #129008, #129007, #129105
2024-06-27 04:37:14 +00:00
e19042481b [cuDNN][cuDNN Frontend] Bump cuDNN FE submodule to 1.5.2 (#129592)
Some relevant fixes include stride-0 support 👀

CC @drisspg @Skylion007 @vedaanta

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129592
Approved by: https://github.com/Skylion007
2024-06-27 04:01:23 +00:00
9450e198aa Conversions between strided and jagged layouts for Nested Tensors (#115749)
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
2024-06-27 03:41:28 +00:00
c9ceae3fac Use JK for mast rdzv handler tcpstore handling and additional logging (#129603)
Summary:
Use JK to control the release instead of using env variable to toggle the feature.

Note: sharing the store reduces shutdown races, as the TCPStore lifecycle is managed outside of trainer rank execution time.

Test Plan: CI

Differential Revision: D59071544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129603
Approved by: https://github.com/d4l3k
2024-06-27 03:34:52 +00:00
b9697eacd3 [torchbind] support tensor ops inside of __obj_flatten__ (#129605)
As titled. Previously, __obj_flatten__ could run under a fake tensor mode, e.g. in process_input of aot_autograd, which is surrounded by a fake tensor mode. This causes the tensor ops inside __obj_flatten__ to run under fake tensor mode. However, the tensors inside a script object are real tensors, so the fake tensor mode errors out saying that we need to first fakify all the tensors (because allow_non_fake_inputs is set to True).

In this PR, we disable all the dispatch modes when running to_fake_obj.

Note that the output of `__obj_flatten__` will be fakified and stored inside the corresponding FakeScriptObject, so during tracing we'll be using a FakeScriptObject that has fake tensor contents.
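
A rough sketch of the idea, with a hypothetical helper name (the actual fakification code lives elsewhere in the torchbind support):
```python
from torch.utils._python_dispatch import _disable_current_modes

# Sketch only: run the script object's __obj_flatten__ with all ambient dispatch
# modes (e.g. an active FakeTensorMode) disabled, so the real tensors stored
# inside the ScriptObject are not intercepted; the flattened contents can then
# be fakified and stored on the FakeScriptObject afterwards.
def flatten_with_modes_disabled(script_obj):  # hypothetical helper name
    with _disable_current_modes():
        return script_obj.__obj_flatten__()
```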

Test Plan:
Add a new test: pytest test/export/test_torchbind.py -k test_compile_tensor_op_in_tensor_flatten

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129605
Approved by: https://github.com/angelayi
2024-06-27 03:07:31 +00:00
cdbd6542d0 Fix inductor benchmarks (#129620)
By installing torchao explicitly, as torchao-0.3.0, which was released to PyPI recently, introduced a hard dependency on torch-2.3.1, resulting in the following cryptic error: `RuntimeError: operator torchvision::nms does not exist`

TODOs:
 - Figure out what installs torchao from pypi rather than builds from source
 - Add proper CI pin for torchao
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129620
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-06-27 02:59:08 +00:00
27a14405d3 enable device index check for all device types (#126767)
Enable the device index check for all device types in the grad setter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126767
Approved by: https://github.com/albanD
2024-06-27 01:09:53 +00:00
0b7e8df7d8 [CUDAGraph Trees] Enable input mutation support in OSS (#129184)
Summary: Enable input mutation support for cudagraph trees in OSS.

Test Plan: CI

Differential Revision: D58847850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129184
Approved by: https://github.com/eellison
2024-06-27 00:49:45 +00:00
7bb558fd6e add _flash_attention_forward and _efficient_attention_forward to compute intensive ops in partitioner (#129533)
Avoid recompute of SDPA during the backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129533
Approved by: https://github.com/drisspg
2024-06-27 00:49:00 +00:00
b6689e0fb8 [ts migration] add logging as part of torch logging system (#129405)
#### Description
Add more verbose logging of the conversion process: output which IR is being converted, which function is used to do the conversion, and whether it succeeds.

#### Example
`TORCH_LOGS="+export,ts2ep_conversion" pytest test/export/test_converter.py -s -k test_prim_tolist`
```
test/export/test_converter.py I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] TorchScript graph
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] graph(%x.1 : Long(3, strides=[1], requires_grad=0, device=cpu)):
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   %1 : __torch__.export.test_converter.___torch_mangle_1.Module = prim::CreateObject()
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   %2 : int = prim::Constant[value=1](), scope: export.test_converter.Module::
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   %3 : int = prim::Constant[value=0](), scope: export.test_converter.Module::
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   %4 : int[] = prim::tolist(%x.1, %2, %3), scope: export.test_converter.Module::
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   return (%4)
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%1 : __torch__.export.test_converter.___torch_mangle_1.Module = prim::CreateObject()]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_CreateObject] succeeds
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%2 : int = prim::Constant[value=1](), scope: export.test_converter.Module::]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_Constant] succeeds
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%3 : int = prim::Constant[value=0](), scope: export.test_converter.Module::]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_Constant] succeeds
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%4 : int[] = prim::tolist(%x.1, %2, %3), scope: export.test_converter.Module::]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_tolist] succeeds
I0624 13:19:26.427000 140608224474112 torch/_export/converter.py:760] TS2EPConverter IR-to-IR conversion succeeds
```

#### Test Plan
`pytest test/export/test_converter`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129405
Approved by: https://github.com/angelayi
2024-06-27 00:20:20 +00:00
90f6043368 Don't decompose functional composite ops in export inference IR (#128077)
Recently we decided to split export IR into two different IRs (training vs inference). In the inference IR, one major change we decided to introduce was we wanted to keep the composite ops that user specified in the IR. This PR does that by overriding the CompositeImplicitAutograd decomp in export inference path.

Differential Revision: [D58701607](https://our.internmc.facebook.com/intern/diff/D58701607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128077
Approved by: https://github.com/bdhirsh
2024-06-26 23:07:55 +00:00
64f1111d38 Expose nlohmann json to torch (#129570)
Summary:

Expose nlohmann json library so that it can be used from inside Pytorch. The library already exists in the `third_party` directory. This PR is making `nlohmann/json.hpp` header available to be used from `torch.distributed`.
The next PR makes actual use of this header.

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D59035246

Pulled By: c-p-i-o

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570
Approved by: https://github.com/d4l3k, https://github.com/malfet
2024-06-26 21:59:26 +00:00
5ad2ad5921 Update start_, end_ and retired only for the right entry when retiring a work (#128948)
Fixes #128805
If the buffer size of NCCLTraceBuffer is 10 and the pg has recorded 11 works, the entry for work 0 will have been overwritten by work 10, so when the watchdog retires work 0, the start_ and end_ of entry 0 shouldn't be set to nullptr.
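
A simplified, hypothetical Python sketch of the guard (illustrating the idea, not the actual C++ code):
```python
# Sketch: a fixed-size trace ring buffer where each entry remembers which work
# id it currently describes. Retiring a work only clears the entry if the entry
# still belongs to that work, i.e. it has not been overwritten by a newer one.
BUFFER_SIZE = 10
entries = [None] * BUFFER_SIZE  # each entry: {"id": int, "start": ..., "end": ...}

def record(work_id, start, end):
    entries[work_id % BUFFER_SIZE] = {"id": work_id, "start": start, "end": end}

def retire(work_id):
    entry = entries[work_id % BUFFER_SIZE]
    if entry is not None and entry["id"] == work_id:  # skip if already overwritten
        entry["start"] = None
        entry["end"] = None

# With 11 recorded works, entry 0 now describes work 10, so retiring work 0
# must not touch it.
for i in range(11):
    record(i, start=i, end=i + 1)
retire(0)
assert entries[0]["start"] is not None
```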

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128948
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
2024-06-26 21:58:00 +00:00
b8e5678ad2 Delete lazy ddp optimizer (#120727)
This is no longer necessary now that the normal ddp optimizer works correctly with inductor strides.

Differential Revision: [D54858819](https://our.internmc.facebook.com/intern/diff/D54858819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120727
Approved by: https://github.com/jansel, https://github.com/yf225
2024-06-26 21:53:54 +00:00
13316a8d46 [Profiler] Add Rank to NCCL Debug Info (#129528)
Summary: We need to add the rank information to the NCCL debug data so that Kineto can infer all the necessary process group info and on-demand can create the distributedInfo metadata. The Kineto portion will be added in a follow-up diff.

Test Plan: Tested in D58736045, this diff just splits the kineto and profiler instances

Differential Revision: D59028819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129528
Approved by: https://github.com/aaronenyeshi
2024-06-26 21:24:05 +00:00
7b1988f922 [ez] Give trymerge id token write permissions after #129503 (#129594)
Forgot to do this in #129503

Also fix minor typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129594
Approved by: https://github.com/huydhn
2024-06-26 20:33:14 +00:00
795db80975 Upload release tag source code to s3 (#128842)
Upload tarball containing source code to s3 for release tags

Can be found here https://us-east-1.console.aws.amazon.com/s3/buckets/pytorch?region=us-east-1&bucketType=general&prefix=source_code/test/&showversions=false

D58695048 for adding permissions to allow uploading to the s3 folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128842
Approved by: https://github.com/atalman, https://github.com/malfet
2024-06-26 20:32:40 +00:00
28480dd7dc [CI] Fix runner determinator for ciflow (#129500)
In case of ciflow, runs are triggered by a tag which is created by @pytorchbot, which breaks the logic of the runner determinator.

In case of tag triggers, extract the pr number from the tag name, fetch the pr and extract the user login from it.

Both the inline and standalone python scripts have been updated for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129500
Approved by: https://github.com/malfet, https://github.com/zxiiro
2024-06-26 20:27:06 +00:00
d3d6764082 [pytorch][logging] add fb internal ODS implementation of wait counter (#128605)
* created fb internal implementation in `caffe2/torch/csrc/monitor/fb/instrumentation.cpp`
    * uses `facebook::data_preproc::WaitCounterUs` under the hood by having `WaitCounterImpl` trivially subclass it.
    * this makes `WaitCounterHandle` a glorified pointer to `facebook::data_preproc::WaitCounterUs` which is statically defined in the `STATIC_WAIT_COUNTER` macro making these pointers Meyer's singletons.
        * `facebook::data_preproc::WaitCounterUs` uses 3 singletons:
             1. `std::unique_ptr<DynamicCounter::State>` map — leaky singleton
             2. `std::weak_ptr<WaitCounterUs::State>` map — leaky singleton
             3. publisherSingleton — normal singleton since it manages resources (threads)
        * `facebook::data_preproc::WaitCounterUs` actually owns shared pointers to the state and its destructor will remove it from the `std::weak_ptr<WaitCounterUs::State>` map when the reference count for the state hits 0.
* linked `caffe2/torch/csrc/monitor/fb/instrumentation.cpp` and added `//data_preproc/common:counters` (dpp dependency) to `caffe2/fb/fbcode/target_definitions.bzl`
* wrapped OSS null implementation in `#ifndef FBCODE_CAFFE2` so that internally we use the fb internal implementation.

as a follow-up I might move the counter implementation out of the data_preproc/counters library to a more common ai infra library?

Differential Revision: [D58458751](https://our.internmc.facebook.com/intern/diff/D58458751/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128605
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #128466
2024-06-26 19:11:21 +00:00
90f82426b9 RS migration - trymerge to upload merge records to s3 (#129503)
Uploads merge records to the ossci-raw-job-status (public) bucket instead of directly to Rockset.

The runner used by trymerge is a GH runner, so it doesn't have access to s3.  Instead, I save the record as a json and upload the json to s3 in a different step that runs after the aws credentials are configured.

The role is defined [here](https://togithub.com/pytorch-labs/pytorch-gha-infra/pull/421)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129503
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/malfet
2024-06-26 19:06:52 +00:00
895316119d Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 0314c4c101c44d5d89b4fad9d37a012dc6f31128.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes lots of internal build failures where they fail to find hipify module ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2192437052))
2024-06-26 19:03:57 +00:00
e9aefad641 Revert "[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423)"
This reverts commit 551e4127185195ae8a5331dc8bbfdffd5d4dd1b8.

Reverted https://github.com/pytorch/pytorch/pull/128423 on behalf of https://github.com/nWEIdia due to Sorry for reverting your change but I need to revert it to cleanly revert https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/128423#issuecomment-2192423840))
2024-06-26 18:54:41 +00:00
cca85c96cd [export] minor typo fix (#129543)
Fixes a typo in torch.export doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129543
Approved by: https://github.com/angelayi
2024-06-26 18:35:31 +00:00
87d14ad419 [inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)
Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR:
* Fix the with_fresh_cache_if_config() decorator
* Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257
Approved by: https://github.com/oulgen
2024-06-26 18:34:48 +00:00
61bf1452a3 Add one more shard for CPU jobs (#129299)
The first shard is very close to 3.5h and sometimes times out now: 1c75ddff35 (26540310592)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129299
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2024-06-26 18:32:10 +00:00
b9a1c2c991 [ROCm] Enable F8 Inductor Unit tests (#128353)
First batch of inductor unit test enablement on ROCm for the fnuz f8 variant on MI300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128353
Approved by: https://github.com/jansel, https://github.com/eellison
2024-06-26 18:30:43 +00:00
8e4f7f742f [DCP] Capture reader, writer and planner components in the DCP API logger (#129548)
Summary: Capture reader, writer and planner components in the DCP API logger

Test Plan:
logs can be found in scuba pytorch_dcp_logging

https://fburl.com/scuba/pytorch_dcp_logging/ruqez1ki

Differential Revision: D59040866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129548
Approved by: https://github.com/wz337, https://github.com/fegin
2024-06-26 18:11:16 +00:00
7373492c9b Use _unsafe_masked_index in masked_scatter decomposition (#123667)
and remove masked_scatter_with_index inductor prims

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123667
Approved by: https://github.com/peterbell10
2024-06-26 17:18:24 +00:00
1b1fd0f4fe [ROCm] Use additional shard for inductor workflow to resolve timeouts (#129480)
This will help timeouts on inductor workflow. The cuda equivalent job also moved to 2 shards since e0aa992d73

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129480
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/malfet
2024-06-26 17:18:20 +00:00
bc68907caa [EZ][BE] Replace assertTrue with more appropriate checks (#129569)
Based on this https://github.com/pytorch/pytorch/pull/129340#issuecomment-2191228046 I.e.
- `assertTrue(x == y)` -> `assertEqual(x, y)`
- `assertTrue(not x)` -> `assertFalse(x)`
- `assertTrue(x > y)` -> `assertGreater(x, y)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129569
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007
2024-06-26 16:29:59 +00:00
9cf8e5dd32 chore(quantization): Enable PT2E symmetric dynamic quantization (#124615)
In the `_find_choose_qparams_node` function, check whether the current node is affine or symmetric.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124615
Approved by: https://github.com/kimishpatel, https://github.com/malfet
2024-06-26 16:14:58 +00:00
f7708ffebb Revert "[AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#129378)"
This reverts commit 52009068bc39ebc846bd37b44f5f9c5f62257778.

Reverted https://github.com/pytorch/pytorch/pull/129378 on behalf of https://github.com/clee2000 due to broke inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_triton_kernel_sympy_expr_arg_abi_compatible_cuda and a few other tests https://github.com/pytorch/pytorch/actions/runs/9680978494/job/26713689249 52009068bc. The tests were added in https://github.com/pytorch/pytorch/pull/129301 which is before your base ([comment](https://github.com/pytorch/pytorch/pull/129378#issuecomment-2192032697))
2024-06-26 15:46:17 +00:00
474d743dba [torchao][benchmark] Skip all accuracy tests by returning pass_due_to_skip (#129545)
Summary: As the title says.

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --quantization noquant --inference --bfloat16 --accuracy
```

Differential Revision: D59040593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129545
Approved by: https://github.com/HDCharles
2024-06-26 14:21:53 +00:00
25cec43678 Remove dependency on private _compat_pickle in CPython (#129509)
Use the IMPORT_MAPPING and NAME_MAPPING from here https://github.com/python/cpython/blob/main/Lib/_compat_pickle.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129509
Approved by: https://github.com/malfet
ghstack dependencies: #129239, #129396
2024-06-26 14:20:27 +00:00
3b531eace7 Add example for torch.serialization.add_safe_globals (#129396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129396
Approved by: https://github.com/albanD, https://github.com/malfet
ghstack dependencies: #129239
2024-06-26 14:20:27 +00:00
303ad8d7f5 Add warning for weights_only (#129239)
Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239
Approved by: https://github.com/albanD, https://github.com/malfet
2024-06-26 14:20:19 +00:00
52009068bc [AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#129378)
Summary: Unify the UserDefinedTritonKernel argument codegen logic between python wrapper and cpp wrapper. This prepares for later PRs that will simplify AOTI codegen.

Differential Revision: [D59002226](https://our.internmc.facebook.com/intern/diff/D59002226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129378
Approved by: https://github.com/oulgen, https://github.com/chenyang78
ghstack dependencies: #129267
2024-06-26 13:53:27 +00:00
42d490d41d [AOTI][refactor] Move generate_user_defined_triton_kernel (#129267)
Summary: Move generate_user_defined_triton_kernel from cpp_wrapper_cpu to cpp_wrapper_cuda as it's for CUDA only

Differential Revision: [D58953005](https://our.internmc.facebook.com/intern/diff/D58953005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129267
Approved by: https://github.com/chenyang78
2024-06-26 13:50:39 +00:00
53fafdd0c3 [BE] Runner determinator: more resilient user matching (#129462)
Small improvements to the runner determinator script:

* Don't split the issue comment unless necessary;
* Match the username against a set rather than a list;
* Match both the triggering_actor and the issue owner instead of only the actor (to avoid edge cases where we get `pytorch-bot[bot]`);
* Add stripping to remove potentially breaking, invisible whitespace;
* Don't use linux.4xlarge as a runner: it should not depend on Meta runners, for reliability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129462
Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi
2024-06-26 13:47:52 +00:00
211f38e742 Revert "[ALI] [Reland] Use LF runners for Lint (#129071)"
This reverts commit 1b92bdd0ea326cd30bc3945602701ffe28c85fd5.

Reverted https://github.com/pytorch/pytorch/pull/129071 on behalf of https://github.com/malfet due to All LF jobs are backlogged, so revert this one ([comment](https://github.com/pytorch/pytorch/pull/129071#issuecomment-2191676677))
2024-06-26 13:19:00 +00:00
92be3403ea Fix an issue in oneShotAllReduce where different ranks perform reduction in different order (#129501)
In `oneShotAllReduce`, ranks read data from peers in a round-robin fashion to load-balance NVLinks. However, the subsequent reduction is also performed in this order, which differs across ranks. This can result in slight numerical differences across ranks, which can lead to a hang in data-dependent applications like speculative decoding.
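
A quick illustration of why the reduction order matters, since floating-point addition is not associative (generic example, not the kernel code):
```python
import torch

torch.manual_seed(0)
# Pretend these are the 8 peers' contributions to the all-reduce.
chunks = [torch.randn(1024, dtype=torch.float16) for _ in range(8)]

# Rank 0 reduces peers in order 0,1,...,7; rank 3 starts its round-robin at 3.
order_rank0 = sum(chunks[i] for i in range(8))
order_rank3 = sum(chunks[(3 + i) % 8] for i in range(8))

# The results can differ by a few ULPs because fp addition is not associative,
# so different ranks can end up with slightly different "identical" outputs.
print((order_rank0 - order_rank3).abs().max())
```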

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129501
Approved by: https://github.com/Chillee
2024-06-26 08:43:10 +00:00
f2840bb220 [nn-module] Use standard dict for _parameters, _modules and _buffers (#129164)
TorchDynamo's guard mechanism guards on the key order of dictionaries if the user iterates over them. For a standard dict, we can write a fast C++ implementation using PyDict_Next, but with OrderedDict we have to rely on the `keys` Python API to get the key ordering. This makes guard evaluation slow.

With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason to not use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model.

Functionality impact
- The only difference between dict and OrderedDict is the `move_to_end` method on OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn.Module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks, but this PR keeps the OrderedDict for hooks untouched (we should still follow up on hooks in a separate PR). See the short example below.
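
A small standard-library illustration of the point above (plain dicts preserve insertion order since Python 3.7; `move_to_end` is the OrderedDict extra that only hooks rely on):
```python
from collections import OrderedDict

d = {"weight": 1, "bias": 2}          # plain dict: insertion order is preserved
od = OrderedDict([("weight", 1), ("bias", 2)])

assert list(d.keys()) == list(od.keys()) == ["weight", "bias"]

# move_to_end is the OrderedDict-specific method used for hooks;
# _parameters/_modules/_buffers never need it.
od.move_to_end("weight")
assert list(od.keys()) == ["bias", "weight"]

# A plain dict has no move_to_end; reordering would require rebuilding the dict.
```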

Perf impact
- I don't anticipate any perf impact. `dict` is completely implemented in C; OrderedDict is a Python wrapper over dict with only a few methods overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)).

Typing impact
- I don't anticipate any. For all the user-visible methods of nn.Module, we don't expose the underlying `_modules` etc.; we have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes are required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #129163
2024-06-26 07:59:42 +00:00
ead97ee486 [Compile+SAC] Only warn for in-place ops once (#129397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129397
Approved by: https://github.com/tianyu-l
2024-06-26 07:25:02 +00:00
c422a9549d [easy][DCP] Fix test_fsdp_ep.py for _MeshEnv.create_child_mesh API change (#129445)

Update test/distributed/checkpoint/e2e/test_fsdp_ep.py for #127465 change.
Failure info:
```bash
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] Caught exception:
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] Traceback (most recent call last):
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 657, in run_test
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     getattr(self, test_name)()
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 539, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     fn()
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_utils.py", line 2744, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     method(*args, **kwargs)
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 369, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     func(self, *args, **kwargs)  # type: ignore[misc]
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 180, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     return func(*args, **kwargs)
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/distributed/checkpoint_utils.py", line 44, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     func(self, *args, **kwargs)
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/test/distributed/checkpoint/e2e/test_fsdp_ep.py", line 76, in test_e2e
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     mesh_fsdp_ep = _mesh_resources.create_child_mesh(mesh_fsdp_tp, 0, "dp")
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] TypeError: _MeshEnv.create_child_mesh() takes 3 positional arguments but 4 were given
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] To execute this test, run the following from the base repo dir:
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]      python test/distributed/checkpoint/e2e/test_fsdp_ep.py -k TestFSDPWithEP.test_e2e
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129445
Approved by: https://github.com/fegin, https://github.com/wz337
2024-06-26 06:43:30 +00:00
8b8e2fcdda [DCP] Fix Optimizer Learning Rate not being loaded correctly (#129398)
Fixes #129079

Currently, tensor objects are loaded correctly in place, but non-tensor objects such as the learning rate are not loaded correctly after f518cf811d, which is a regression introduced in 2.3.

This PR replaces tree_map_only and manual replacement of the state dict items with _tree_map_only and fixes the regression of non-tensor loading.

Test:
```
# test to make sure lr is loading correctly
python3 test/distributed/checkpoint/e2e/test_e2e_save_and_load.py -k test_init_state_dict
# test to make sure load on meta device model still works
python3 test/distributed/checkpoint/test_tp_checkpoint.py -k test_tp_checkpoint_load_on_meta_device
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129398
Approved by: https://github.com/fegin
2024-06-26 06:41:47 +00:00
000f2d637b Refactoring the code to make it lint clean (#129424)
Summary: Refactoring the code to make it lint clean

Test Plan: buck2 build mode/dev-tsan caffe2/test:test_profiler_cuda

Differential Revision: D58971175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129424
Approved by: https://github.com/aaronenyeshi
2024-06-26 06:12:01 +00:00
610894e978 [MPS][BE] Generalize Fused optimizers (#129105)
This PR generalizes the multi_tensor_apply function for other fused optimizers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129105
Approved by: https://github.com/malfet
ghstack dependencies: #129006, #129008, #129007
2024-06-26 06:00:41 +00:00
d02bba519c [export] match fake mode for _decompose_exported_program() (#129421)
Summary:
_decompose_exported_program() ran into an issue with trace_joint, where trace_joint() produces values with mismatching FakeModes. Adding fake mode context to aot_export_module() so this doesn't happen.

#thanks to tugsbayasgalan for the fix!

Test Plan: test_experimental

Differential Revision: D58977694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129421
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2024-06-26 05:52:31 +00:00
7420bad74c [BE] Do not assert if the barrier is not created (#129497)
The folder will be created as long as TEMP_DIR is set and the program has write permission. This ensures that some test environments can run the spawn tests.

Differential Revision: [D59020736](https://our.internmc.facebook.com/intern/diff/D59020736/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129497
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-06-26 05:51:36 +00:00
c04cec609d [dtensor][debug] fixing CommDebugMode module collective tracing (#128887)
**Summary**
The logic for CommDebugMode module collective tracing was incorrect, as it only worked for leaf modules in the model's module tree. If a sub-module had a collective call along with a nested module inside it, the sub-module was not removed from the module_tracker parent set, leading to double-counting of collectives. This is addressed by checking that the current sub-module is not already in the parent set. The output of the test cases below should remain the same.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128887
Approved by: https://github.com/XilunWu
ghstack dependencies: #128729
2024-06-26 05:25:57 +00:00
bd3a11776f [dtensor][test] test case suite for comm_mode features (#128729)
**Summary**
Currently, there is only an example file for comm_mode and its features. I have created test cases that mirror the examples while the more complicated test cases also ensure that comm_mode resets all variables when used multiple times in the same function. This test case suite will also help developers ensure that new code they add to comm_mode does not affect correctness of old features.
#128536

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128729
Approved by: https://github.com/XilunWu
2024-06-26 05:25:57 +00:00
6181e65cd8 Nested tensor subclass support (#127431)
When we have nested tensor subclasses, we need to recursively flatten/unflatten in fake tensor creation and AOTAutograd. Most of the PR is a mechanical change that turns today's single-level flatten logic into a recursive one.

Differential Revision: [D58533224](https://our.internmc.facebook.com/intern/diff/D58533224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127431
Approved by: https://github.com/bdhirsh
2024-06-26 04:45:22 +00:00
cda4d4887d Skip signals from older runs of the same workflows (#129291)
I discovered this bug in trymerge when debugging https://github.com/pytorch/pytorch/pull/129013 in which Dr.CI reported no relevant failures while mergebot complained about some unrelated ROCm failures https://github.com/pytorch/pytorch/pull/129013#issuecomment-2183009217.

It turns out that mergebot took into account stale signals from older runs of the same workflow here.  For example,
* https://github.com/pytorch/pytorch/actions/runs/9604985361 was the first run where it had a ROCm failure
* While https://github.com/pytorch/pytorch/actions/runs/9608926565 was the second attempt and it was all green

Notice that both runs came from the same push to commit [be69191](be69191f2d) with [ciflow/rocm/129013](https://github.com/pytorch/pytorch/tree/ciflow/rocm/129013).  So, we just need to check the signals from the newer run.

Note that Dr.CI handles this part correctly using the logic in https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/drci/drci.ts#L1079-L1088.  So, the fix in this PR is to bring the same logic to trymerge.

### Testing

`pytest -v test_trymerge.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129291
Approved by: https://github.com/ZainRizvi
2024-06-26 03:49:09 +00:00
c718e2f43b [pytorch][logging] add empty wait counter implementation (#128466)
Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128466
Approved by: https://github.com/c-p-i-o
2024-06-26 03:47:17 +00:00
54f27b886e [Inductor UT] Reuse test_distributed_patterns.py for Intel GPU (#129437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129437
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-06-26 02:58:45 +00:00
555f71a15b Fix test_auto_simd in machine with AMX support (#129444)
Fixes #129438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129444
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-06-26 02:50:55 +00:00
a89a1ed072 [easy][DCP] make BroadcastingTorchSaveReader device generic (#129231)
Tests in test/distributed/checkpoint/test_format_utils.py pass on GPU and other devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129231
Approved by: https://github.com/fegin
2024-06-26 02:37:30 +00:00
90d5a6f001 [inductor] Add lowering and codegen for aten.sort (#128458)
Closes #125633

Benchmarks:
| Shape       | dim | stable | compiled | eager   | speedup |
|-------------|-----|--------|----------|---------|---------|
| (256, 4096) | 0   | False  | 0.73 ms  | 1.26 ms | 1.7     |
| (256, 4096) | 0   | True   | 0.75 ms  | 1.27 ms | 1.7     |
| (4096, 256) | 1   | False  | 0.20 ms  | 0.73 ms | 3.7     |
| (4096, 256) | 1   | True   | 0.21 ms  | 0.73 ms | 3.5     |
| (255, 4096) | 0   | False  | 1.05 ms  | 1.48 ms | 1.4     |
| (255, 4096) | 0   | True   | 1.03 ms  | 1.47 ms | 1.4     |
| (4096, 255) | 1   | False  | 0.52 ms  | 0.98 ms | 1.9     |
| (4096, 255) | 1   | True   | 0.54 ms  | 1.00 ms | 1.9     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128458
Approved by: https://github.com/lezcano, https://github.com/eellison
2024-06-26 01:36:39 +00:00
b7e7a4cb01 [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-26 00:49:18 +00:00
9554a9af87 [GPT-benchmark] Distinguish LLM models and micro-benchmarks (#129498)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129498
Approved by: https://github.com/huydhn
2024-06-26 00:25:05 +00:00
0d0d42c4a7 test_qat_mobilenet_v2 succeeding on dynamo (#129532)
https://github.com/pytorch/pytorch/actions/runs/9669572961/job/26677024995

The test is usually marked as slow, so it doesn't get run on dynamo since dynamo doesn't have a slow equivalent.

However, it is succeeding, so we might as well do what the logs tell us to do and remove the failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129532
Approved by: https://github.com/malfet, https://github.com/kit1980
2024-06-25 23:55:12 +00:00
112ef79f29 [inductor] Remove comm-specific node attributes from scheduler (#129084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129084
Approved by: https://github.com/lezcano
2024-06-25 23:52:19 +00:00
d1f9e822dd [DTensor][Test] Update implicit replication unit tests for tensor arg being the first in args list (#127803)
Change the operands order so we can have test coverage for when the first arg is a tensor arg instead of DTensor arg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127803
Approved by: https://github.com/XilunWu
2024-06-25 23:51:58 +00:00
575bc1e3af [Reopen #114036] Allow "must recompute" in torch.compile + selective checkpointing (SAC) (#129295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129295
Approved by: https://github.com/Chillee
2024-06-25 23:47:08 +00:00
f389541ce0 Add Strided Input test for flex attention (#128915)
Test strided inputs to the flex_attention HOP. Similar to how inputs are generated in
https://github.com/pytorch/pytorch/blob/main/benchmarks/transformer/score_mod.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128915
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-06-25 23:26:34 +00:00
87ebd627a7 RS migration - upload sccache stats to s3 instead of rockset (#129490)
Upload sccache stats to s3 instead of rockset

I don't think we use these anywhere, so it's ok to cut off the ingest into rockset right now.

We should consider deleting this entirely if we don't plan on using it

I will work on copying existing data over from rockset to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129490
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-06-25 23:23:16 +00:00
52341c28e8 Revert "[FSDP2] Ran post-acc-grad hooks manually (#129450)"
This reverts commit 7ebffef4d02a3cc68dbbcf44b92d63c7fe0ebb67.

Reverted https://github.com/pytorch/pytorch/pull/129450 on behalf of https://github.com/clee2000 due to broke distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_aot_eager 7ebffef4d0 https://github.com/pytorch/pytorch/actions/runs/9667812641/job/26671489454.  Test got added in https://github.com/pytorch/pytorch/pull/129157 which is before your mergebase ([comment](https://github.com/pytorch/pytorch/pull/129450#issuecomment-2190174363))
2024-06-25 23:13:57 +00:00
bbd47f7b2f Remove ProcessGroupCudaP2P and change async-TP to use SymmetricMemory (#128762)
This PR removes `ProcessGroupCudaP2P` and changes async-TP to use `SymmetricMemory`. The async-TP implementation is still workspace-based, but it now doesn't require a buffer size to be specified upfront.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128762
Approved by: https://github.com/wanchaol
2024-06-25 22:32:21 +00:00
1c5df9107d [BE] Fix several incorrect skip tests (#129488)
These tests may not be skipped properly if the NCCL library exists but CUDA is not available.

Differential Revision: [D59013855](https://our.internmc.facebook.com/intern/diff/D59013855/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129488
Approved by: https://github.com/wz337, https://github.com/fduwjj
2024-06-25 22:10:31 +00:00
fd414d6189 [inductor] don't materialize the large sparse matrix in CE bwd (#129043)
Inductor currently materializes a large sparse matrix in the backward pass for CrossEntropyLoss and loads it to compute gradients of the Softmax input. If we could fuse the sparse matrix computation into the consumer side, we would get both perf and memory usage wins.

The Fx graph snippets that construct this aforementioned sparse matrix looks like:
```
       full_default_3: "bf16[32768, 50257]" = torch.ops.aten.full.default([32768, 50257], 0, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=0), pin_memory = False)
       scatter: "bf16[32768, 50257]" = torch.ops.aten.scatter.value(full_default_3, 1, where_2, -1.0);  full_default_3 = where_2 = None
```
Leveraging the following observations:
- the scatter is applied upon a all zero (or more generally a const tensor)
- the index tensor for the scatter has a single element on the scatter dimension. In this case it's the label tensor

allow us to lower this 'scatter_upon_const_tensor' pattern to a pointwise kernel that can be easily fused with downstream kernels:

```
    def inner_fn(idx):
        selector_idx = list(idx)
        selector_idx[dim] = 0  # can do this since the index tensor has a single element on the scatter dimension

        selector = selector_loader(selector_idx)
        return ops.where(
            selector == ops.index_expr(idx[dim], torch.int64),
            ops.constant(val, dtype),
            ops.constant(background_val, dtype),
        )
```
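
A small eager-mode sanity check of the equivalence this lowering relies on (hypothetical small shapes, not the 32768 x 50257 case above):
```python
import torch

# Hypothetical small shapes for illustration: 8 rows ("tokens"), 16 classes.
labels = torch.randint(0, 16, (8,))

# Materialized form: scatter -1.0 into an all-zero matrix at the label column.
dense = torch.zeros(8, 16).scatter(1, labels.unsqueeze(1), -1.0)

# Pointwise form: compare each column index against the label instead of
# materializing the big tensor; this is what the lowering fuses into consumers.
cols = torch.arange(16).unsqueeze(0)  # shape (1, 16), broadcasts over rows
pointwise = torch.where(cols == labels.unsqueeze(1),
                        torch.tensor(-1.0), torch.tensor(0.0))

assert torch.equal(dense, pointwise)
```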

## Test result on microbenchmark

For the microbenchmark added as `test_cross_entropy_loss`, we improve latency from 47.340ms to 42.768ms, memory footprint from 10.524GB to 7.227GB on A100. (on H100, we improve latency from 27.54ms to 23.51ms, memory footprint from 10.574GB to 7.354GB).

The saving matches the back-of-envelope calculation. We avoid storing a BF16 tensor with shape [30K, 50K], which is about 3GB in size. On A100, avoiding loading and storing such a tensor can roughly save 3GB x 2 / 1.5 TB/s = 4ms.

## Test result on llm.c

We also tested this on llm.c and the saving is much larger, especially for memory footprint. The reason is that autotuning allocates extra memory for benchmarking. (Check https://github.com/pytorch/pytorch/issues/129258 and https://github.com/pytorch/pytorch/pull/129399 for more details).

For the llm.c PyTorch implementation on A100, we improve from
171K tokens/s, 33.6G peak memory usage to
180K tokens/s, 18.6G peak memory usage (a **45%** saving of peak memory).

## Test on PyTorch 2.0 Dashboard

The optimization is quite general especially for transformers. We tested this on PyTorch2.0 dashboard. Here is the [result](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2017%20Jun%202024%2018%3A07%3A51%20GMT&stopTime=Mon%2C%2024%20Jun%202024%2018%3A07%3A51%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/158/head&lCommit=c62c55e29c65497d495217b6574bb36b0c4da7d4&rBranch=main&rCommit=0d25f096c1beaf8749932a3d6083ad653405ed71).

TLDR, for Huggingface benchmark suite, we get **6%** geomean perf improvement and **10%** geomean memory footprint improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129043
Approved by: https://github.com/jansel, https://github.com/Chillee
2024-06-25 21:25:50 +00:00
e1499f6342 [C10D] Make new_group eager when used with comm_split (#129284)
If users pass `device_id` to init_process_group, they enable eager init
for the default group.  Then if they subsequently call `new_group`, the
device_id argument is not required as it should be assumed to match the
one used for init_process_group.

However, both `init_process_group` and `new_group` apis share a helper
function, which expects a `device_id` value that defaults to None.  When
it's None, eager initialization is disabled.

This PR ensures that if a device_id was passed to init_process_group,
the same device_id will automatically be fed into the helper function
for any new_group calls that follow.

**Test plan**
I found an existing test in CI  `test_comm_split_subgroup` that failed after my change, because it was asserting that backend comm_split counter did not increment eagerly, and its behavior had changed to increment eagerly.  I updated the test in the PR to pass with my change.

I also tested locally via simple program with TORCH_CPP_LOG_LEVEL=INFO and
observed eager initialization of the 'lows' and 'highs' PGs before the
'Here' print.

```
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl", device_id =torch.device(f"cuda:{torch.distributed.get_node_local_rank(0)}"))
dist.new_group([0, 1], group_desc="lows")
dist.new_group([2, 3], group_desc="highs")
print("Here")
torch.distributed.destroy_process_group()
```

Output:
https://gist.github.com/wconstab/88a5ba0b970244ca1f79133f989e0349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129284
Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj, https://github.com/d4l3k, https://github.com/nvcastet
2024-06-25 21:09:34 +00:00
e58ef5b65f [export] Rewrite exportdb formatting. (#129260)
Summary: It'll be easier to generate examples if the code doesn't depend on exportdb library.

Test Plan: CI

Differential Revision: D58886554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129260
Approved by: https://github.com/tugsbayasgalan
2024-06-25 21:04:53 +00:00
551e412718 [CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423)
Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first.

This PR also gives a current read on cu121 and cu124 parity.

Essentially reverting #127150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423
Approved by: https://github.com/atalman, https://github.com/eqy
2024-06-25 20:59:49 +00:00
79959d707c [Inductor][ROCm] Composable Kernel backend for Inductor (#125453)
This PR adds an alternative backend for Inductor, adding Composable Kernel Universal GEMM instances to the autotune instance selection.

The implementation is heavily influenced by the series of PRs that add the CUTLASS backend (https://github.com/pytorch/pytorch/issues/106991). The main differences are
 (1) customizing the compiler for the ROCm platform
 (2) customizing template code generation for Composable Kernel Universal GEMM instances.

We provide config tuning knobs for balancing between instance sources compilation time and finding the best instance.

### Testing
Install the ck library
```
pip install git+https://github.com/rocm/composable_kernel@develop
```
Run the test
```
TORCH_LOGS=+torch._inductor \
pytest --capture=tee-sys test/inductor/test_ck_backend.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125453
Approved by: https://github.com/eellison, https://github.com/jansel
2024-06-25 20:54:14 +00:00
ae0f84d89c [CI] Enable amp accuracy check for inductor cpu (#127758)
This enables the Inductor AMP accuracy check on CPU in the CI workflow to capture issues early. Three suites are included: timm, huggingface, and torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127758
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-25 20:34:18 +00:00
45f2876934 [Fix] NumToTensor resulting from numel() and size() in TSConverter (#128761)
#### Issue
In jit.trace, torch.numel() is automatically cast to a `LongTensor`, but during conversion we lost the casting part: `prim::NumToTensor` was previously converted to `torch.ops.aten.scalar_tensor`, which uses the same `dtype` as the input tensor instead of `LongTensor`. In this PR, we add a cast to convert it to the correct `dtype`.
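
A minimal illustration of the dtype mismatch (the actual converter code differs; this only shows the gist):
```python
import torch

x = torch.randn(3, 4)  # float32 input

# What jit.trace produces for numel(): an integer (Long) tensor.
expected = torch.ops.aten.scalar_tensor(x.numel(), dtype=torch.int64)

# What the converter previously emitted: scalar_tensor following the input's
# dtype, i.e. a float32 tensor, hence the extra cast added in this PR.
previous = torch.ops.aten.scalar_tensor(x.numel(), dtype=x.dtype)

assert expected.dtype == torch.int64 and previous.dtype == torch.float32
```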

#### Test Plan
We activate a previously failing test case.
* `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128761
Approved by: https://github.com/angelayi
2024-06-25 20:20:03 +00:00
e68ee2cadb TunableOp hotfix (#129281)
Fixes.
- PYTORCH_TUNABLEOP_NUMERICAL_CHECK=1 had a memory leak.
- The strided batched gemm size calculation for buffer rotation was incorrect resulting in a mem fault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129281
Approved by: https://github.com/xw285cornell, https://github.com/eqy, https://github.com/mxz297
2024-06-25 20:12:46 +00:00
1865fe282f Log whenever we sleep (#129197)
Summary:
Log whenever we sleep for heartbeatTimeout.
Useful for debugging stuck jobs.
This will eventually turn into a metric.

Test Plan:
none.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/129197
Approved by: https://github.com/Skylion007, https://github.com/d4l3k, https://github.com/wconstab
2024-06-25 20:09:41 +00:00
b1f486aff9 Revert "Add warning for weights_only (#129239)"
This reverts commit 381ce0821c3fa2b342f0b8660c76cc27f48543c4.

Reverted https://github.com/pytorch/pytorch/pull/129239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing some test_nn failures from ROCm 381ce0821c, trying to revert this to see if trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129239#issuecomment-2189812903))
2024-06-25 19:30:07 +00:00
7cf454ec52 Revert "Add example for torch.serialization.add_safe_globals (#129396)"
This reverts commit f18becaaf1c7a7bf851e3ae8d215eee8dba688b6.

Reverted https://github.com/pytorch/pytorch/pull/129396 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing some test_nn failures from ROCm 381ce0821c, trying to revert this to see if trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129239#issuecomment-2189812903))
2024-06-25 19:30:07 +00:00
0298560ca2 TCPStore: improve connect and retry logic (#129261)
We've been facing issues where TCPStore can successfully connect but then fails in the validate() function due to resets from listen backlog queue overflow (when combined with reset enabled), as well as long init times.

This PR does a few things:
* Retry the connect and validate steps up to the specified timeout.
* Use exponential backoff with jitter for the retry logic instead of a fixed 1s sleep (a sketch of the backoff idea follows this list).
* Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init, which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141
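
A minimal Python sketch of the backoff idea referenced above (the actual change is in the C++ TCPStore client; names and constants here are illustrative):
```python
import random
import time

def connect_with_retry(connect, timeout_s=30.0, base_s=0.05, cap_s=1.0):
    """Retry `connect()` with exponential backoff and jitter until `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise
            # Exponential backoff with random jitter instead of a fixed 1s sleep.
            delay = min(cap_s, base_s * (2 ** attempt)) * random.uniform(0.5, 1.5)
            time.sleep(min(delay, remaining))
            attempt += 1
```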

Test plan:

```
python test/distributed/test_store.py -v
./build/bin/BackoffTest
```

Will do internal testing with some large scale jobs to ensure TCPStore works correctly.

At 4k scale: 4x improvement

```
tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py                                                                                                   (pytorch-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    1.98 secs    fish           external
   usr time    0.93 secs   91.00 micros    0.93 secs
   sys time    1.98 secs  954.00 micros    1.97 secs

tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10                                                                                                                                              (pytorch-3.10)
tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py                                                                                                          (torchdrive-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    8.20 secs    fish           external
   usr time    2.15 secs    0.00 micros    2.15 secs
   sys time    2.76 secs  843.00 micros    2.76 secs
```

```py
import time
import os
import threading
from multiprocessing import Pool

WORLD_SIZE = 10000

import torch.distributed as dist

def run(rank):
    should_log = rank % (WORLD_SIZE // 10) == 0
    if should_log:
        print(f"started {rank}")
    store = dist.TCPStore(
        host_name="devvm4382.nao0.facebook.com",
        port=29500,
        world_size=WORLD_SIZE,
        is_master=rank == 0,
        use_libuv=True,
    )
    if should_log:
        print(f"init {rank}")
    store.set(f"key{rank}", "1234")
    if should_log:
        print(f"set {rank}")
    del store

def noop(rank):
    pass

print("starting pool")
with Pool(WORLD_SIZE) as pool:
    pool.map(noop, range(WORLD_SIZE), 1)
    print("pool hot")
    start = time.time()
    pool.map(run, range(WORLD_SIZE), 1)
    print("run finished", time.time()-start)
```

```
tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py                                                                                                                                (pytorch-3.10)
starting pool
pool hot
started 0
[W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
started 1000
init 1000
set 1000
started 2000
init 2000
set 2000
started 3000
init 3000
set 3000
started 4000
init 4000
set 4000
started 5000
init 5000
set 5000
started 6000
init 6000
set 6000
started 7000
init 7000
set 7000
started 8000
init 8000
set 8000
started 9000
init 9000
set 9000
init 0
set 0
run finished 0.705092191696167
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261
Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o
2024-06-25 19:24:22 +00:00
816e8a3f21 [MacOS] Improve libomp packaging (#129473)
Instead of replacing `@rpath/libomp.dylib` with `@loader_path/libomp.dylib`, keep it in place and add `@loader_path` as a new rpath.

This should prevent double-loading of the OpenMP runtime: with `@rpath` the loader is allowed to reuse already-loaded libraries, while the `@loader_path` directive forces it to load the library from the location relative to the executable.

Test plan:
- Prepare the environment
```shell
conda create -n py310-cf python=3.10 numpy pip -c conda-forge
conda activate py310-cf
pip install torch --index-url https://download.pytorch.org/whl/test/cpu
```
- Verify that OpenMP is loaded twice and then crashes
```shell
KMP_VERSION=true python -c "import numpy as np; import torch; print(torch.__version__, torch.backends.openmp.is_available()); print(torch.rand(300, 300).abs().max())"
```
output:
```
LLVM OMP version: 5.0.20140926
LLVM OMP library type: performance
LLVM OMP link type: dynamic
LLVM OMP build time: no_timestamp
LLVM OMP build compiler: Clang 16.0
LLVM OMP alternative compiler support: yes
LLVM OMP API version: 5.0 (201611)
LLVM OMP dynamic error checking: no
LLVM OMP thread affinity support: no
LLVM OMP version: 5.0.20140926
LLVM OMP library type: performance
LLVM OMP link type: dynamic
LLVM OMP build time: no_timestamp
LLVM OMP build compiler: Clang 12.0
LLVM OMP alternative compiler support: yes
LLVM OMP API version: 5.0 (201611)
LLVM OMP dynamic error checking: no
LLVM OMP thread affinity support: no
2.4.0 True
zsh: segmentation fault  KMP_VERSION=true python -c
```
- Install artifact from this PR and make sure it passes the same test
```shell
python -mpip install ~/Downloads/torch-2.5.0.dev20240625-cp310-none-macosx_11_0_arm64.whl
KMP_VERSION=true python -c "import numpy as np; import torch; print(torch.__version__, torch.backends.openmp.is_available()); print(torch.rand(300, 300).abs().max())"
```
output
```
LLVM OMP version: 5.0.20140926
LLVM OMP library type: performance
LLVM OMP link type: dynamic
LLVM OMP build time: no_timestamp
LLVM OMP build compiler: Clang 16.0
LLVM OMP alternative compiler support: yes
LLVM OMP API version: 5.0 (201611)
LLVM OMP dynamic error checking: no
LLVM OMP thread affinity support: no
2.5.0.dev20240625 True
tensor(1.0000)
```
- Make sure it still uses bundled OpenMP if none is available in the environment
```
conda uninstall numpy -c conda-forge
KMP_VERSION=true python -c "from ctypes import cdll, c_char_p, c_uint32; import torch; from ctypes import cdll, c_char_p, c_uint32; libdyld = cdll.LoadLibrary('libSystem.dylib'); libdyld._dyld_image_count.restype = c_uint32; libdyld._dyld_get_image_name.restype = c_char_p; libdyld._dyld_get_image_name.argtypes = [c_uint32]; print(torch.rand(300, 300).abs().max()); libs = [libdyld._dyld_get_image_name(i).decode('ascii') for i in range(libdyld._dyld_image_count())]; print([l for l in libs if 'libomp.dylib' in l])"
```

Fixes https://github.com/pytorch/pytorch/issues/124497 and https://github.com/pytorch/pytorch/issues/126385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129473
Approved by: https://github.com/atalman
2024-06-25 19:12:34 +00:00
b045878f81 Revert "Remove test_mps_allocator_module XFAIL (#129340)"
This reverts commit c888ee36325148ed99db4298bf2ae739ebbeacdc.

Reverted https://github.com/pytorch/pytorch/pull/129340 on behalf of https://github.com/huydhn due to The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation ([comment](https://github.com/pytorch/pytorch/pull/129340#issuecomment-2189701706))
2024-06-25 18:37:54 +00:00
7ebffef4d0 [FSDP2] Ran post-acc-grad hooks manually (#129450)
FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually.

**Discussion**
Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity.

Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance of FSDP2 incorrectly re-running the hook when it should not.

Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually.
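
For reference, a minimal eager-mode sketch of the optimizer-in-backward pattern this enables is shown below, using a plain `nn.Linear` instead of an FSDP2-sharded module purely for illustration.
```python
import torch

model = torch.nn.Linear(16, 16)
optimizers = {}

def optim_hook(param: torch.nn.Parameter) -> None:
    # Step the per-parameter optimizer as soon as this param's grad is accumulated,
    # then free the gradient to save memory.
    optimizers[param].step()
    optimizers[param].zero_grad()

for p in model.parameters():
    optimizers[p] = torch.optim.AdamW([p], lr=1e-3, foreach=False)
    p.register_post_accumulate_grad_hook(optim_hook)

model(torch.randn(4, 16)).sum().backward()  # hooks run the optimizer steps during backward
```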

**Caveats**
- Running `foreach=False` optimizer _per parameter tensor_  incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass).
    - On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be.
    - One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers.
    - If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`.
- The post-accumulate-grad hook runs in the reduce-scatter stream. Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream.
    - This means that optimizer compute will overlap with backward compute, which may slow down end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about a ~3% decrease in MFU when running the optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues.
- This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope.

**Experiments (torchtitan)**
- Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision:
    - Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU
    - With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped)
    - With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450
Approved by: https://github.com/weifengpy
2024-06-25 18:34:56 +00:00
dd00f5e78d Fixes T192448049 (#129146)
Differential Revision: D58767610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129146
Approved by: https://github.com/angelayi
2024-06-25 17:50:15 +00:00
53f462c506 Write dynamo benchmarks performance result to csv when throw exceptions (#126764)
**Performance mode issue**: When the dynamo benchmarks performance warm-up fails, the result is not written into the csv file, but the accuracy result is written as `fail_to_run` even when the dynamo pass fails. So the number of models in the accuracy csv does not match the number in the performance csv.
![image](https://github.com/pytorch/pytorch/assets/84730719/9043d215-130b-46b4-a835-f148c225947c)

- **Fix**: Models that fail warm-up are now recorded in the csv file, as shown below:
![image](https://github.com/pytorch/pytorch/assets/84730719/7907a3c2-c942-42bb-b31c-55424a0e8117)

**Accuracy mode issue**: `detectron2_fasterrcnn_r` models failed in accuracy mode but were tested successfully in performance mode. The accuracy failure is the same as in PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
  File "benchmarks/dynamo/torchbench.py", line 449, in <module>
    torchbench_main()
  File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
    process_entry(0, runner, original_dir, args)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
    return run(runner, args, original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```
![image](https://github.com/pytorch/pytorch/assets/84730719/f25392f0-f982-46c8-8e2c-a8a25d85a21a)

- **Fix**: Same as PR ee557d8f61, skip setting batch_size to 4 when testing dynamic shapes.

The dynamic shapes pass rate improved from 89% to **95%**:
| Comp Item | Compiler | Suite      | Before     | After fix  |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel
2024-06-25 17:49:04 +00:00
e317a8b264 Add guard to use AMX for x86_64 only (#129479)
Trying to mitigate aarch64 and s390 nightly failures as per this comment:
https://github.com/pytorch/pytorch/pull/127195#issuecomment-2189177949

Fixes https://github.com/pytorch/pytorch/issues/129443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129479
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2024-06-25 17:31:28 +00:00
45b2931b7e Revert "[Traceable FSDP2] Don't decompose fsdp.split_with_sizes_copy (#129414)"
This reverts commit b24787b7576c184a54d13c1833ada23a395f5c31.

Reverted https://github.com/pytorch/pytorch/pull/129414 on behalf of https://github.com/ZainRizvi due to This PR is seems to be causing multiple macos failures.  Looks like it was merged before trunk jobs were started, which would have run those tests ([comment](https://github.com/pytorch/pytorch/pull/129414#issuecomment-2189479505))
2024-06-25 17:05:55 +00:00
fb40ba6fc2 Revert "[Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247)"
This reverts commit aa4ee2cb9e1f9be6bbdd27654e0f768b7fe9be6c.

Reverted https://github.com/pytorch/pytorch/pull/127247 on behalf of https://github.com/ZainRizvi due to This PR is seems to be causing multiple macos failures.  Looks like it was merged before trunk jobs were started, which would have run those tests ([comment](https://github.com/pytorch/pytorch/pull/129414#issuecomment-2189479505))
2024-06-25 17:05:55 +00:00
ad76da6c16 Revert "[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)"
This reverts commit 7b57ddd38c6d502ba313c0e6b0c92b6787d69986.

Reverted https://github.com/pytorch/pytorch/pull/129257 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 4c1e4c5f30, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))
2024-06-25 16:48:32 +00:00
b38f6d4cd2 Revert "[inductor] Enable FX graph caching in OSS by default (#125863)"
This reverts commit 4c1e4c5f307f9743014a08cf97d3fa8de7e1ce5f.

Reverted https://github.com/pytorch/pytorch/pull/125863 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 4c1e4c5f30, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))
2024-06-25 16:48:32 +00:00
f8db12a538 Fix logic to find sbgemm in BLAS library (#125227)
The current logic to set the HAS_SBGEMM flag is ignored when the BLAS libraries are already found, i.e., when set from the environment variable BLAS=OpenBLAS. If BLAS_LIBRARIES is already set, the code that checks whether the BLAS library has sbgemm is never executed. This commit moves that check outside the conditional so it always runs.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125227
Approved by: https://github.com/malfet
2024-06-25 16:34:38 +00:00
665d6ea05b [export] Fix IR canonicalization. (#129401)
Summary: as title; we should unpack results from _canonicalize_graph.

Differential Revision: D58963429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129401
Approved by: https://github.com/tugsbayasgalan
2024-06-25 16:33:02 +00:00
e364290718 Support linear backward for NJT with dim > 3 (#129393)
Replaces usage of `torch.mm()` with `torch.matmul()` in NJT's impl of linear_backward to support higher dims. See [here](https://github.com/pytorch/pytorch/issues/125214#issuecomment-2184968703) for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129393
Approved by: https://github.com/soulitzer
2024-06-25 16:06:23 +00:00
0e6bb7f1ce [caffe2][be] migrate global static initializer (#128784)
Summary:
The Caffe2 lib has 200+ global static initializer usages, which are paper cuts for startup perf. Details in this post https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154.

This diff migrates StorageImpl.cpp.

Additional Context: https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154

Test Plan: CI

Differential Revision: D58639283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128784
Approved by: https://github.com/aaronenyeshi
2024-06-25 15:30:49 +00:00
fd4af87855 Fix non-portable path warning (#129474)
macOS uses a case-insensitive filesystem by default, but it's better to specify the include path using proper capitalization

Should fix
```
MultiTensorApply.h:4:10: warning: non-portable path to file '<ATen/native/mps/operations/FusedOptimizerOps.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path]
#include <Aten/native/mps/operations/FusedOptimizerOps.h>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129474
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/qqaatw
2024-06-25 15:17:21 +00:00
cb1c56caba Set target dependencies to always build for sm90a on rowwise scaling (#129402)
# Summary

Instead of landing global builder changes (https://github.com/pytorch/builder/pull/1878), this PR targets only the Rowwise file and adds the sm90a features.

Verified locally by setting:
```
TORCH_CUDA_ARCH_LIST=9.0
```

We can see in the build.ninja file that the proper flags are set:

```
build caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o: CUDA_COMPILER__torch_cuda_unscanned_Release /home/drisspg/meta/pytorch/aten/src/ATen/native/cuda/RowwiseScaledMM.cu || cmake_object_order_depends_target_torch_cuda
  DEFINES = -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS
  DEP_FILE = caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o.d
  FLAGS = -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-unused-function,-Wno-maybe-uninitialized -Wno-deprecated-copy -gencode arch=compute_90a,code=sm_90a
  INCLUDES = -I/home/drisspg/meta/pytorch/build/aten/src -I/home/drisspg/meta/pytorch/aten/src -I/home/drisspg/meta/pytorch/build -I/home/drisspg/meta/pytorch -I/home/drisspg/meta/pytorch/third_party/onnx -I/home/drisspg/meta/pytorch/build/third_party/onnx -I/home/drisspg/meta/pytorch/third_party/foxi -I/home/drisspg/meta/pytorch/build/third_party/foxi -I/home/drisspg/meta/pytorch/aten/src/THC -I/home/drisspg/meta/pytorch/aten/src/ATen/cuda -I/home/drisspg/meta/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/drisspg/meta/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/drisspg/meta/pytorch/build/caffe2/aten/src -I/home/drisspg/meta/pytorch/aten/src/ATen/.. -I/home/drisspg/meta/pytorch/build/nccl/include -I/home/drisspg/meta/pytorch/c10/cuda/../.. -I/home/drisspg/meta/pytorch/c10/.. -I/home/drisspg/meta/pytorch/third_party/tensorpipe -I/home/drisspg/meta/pytorch/build/third_party/tensorpipe -I/home/drisspg/meta/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/drisspg/meta/pytorch/torch/csrc/api -I/home/drisspg/meta/pytorch/torch/csrc/api/include -isystem /home/drisspg/meta/pytorch/build/third_party/gloo -isystem /home/drisspg/meta/pytorch/cmake/../third_party/gloo -isystem /home/drisspg/meta/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/drisspg/meta/pytorch/third_party/protobuf/src -isystem /home/drisspg/meta/pytorch/third_party/ittapi/include -isystem /home/drisspg/meta/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda-12.3/include -isystem /home/drisspg/meta/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/drisspg/meta/pytorch/third_party/ideep/include -isystem /home/drisspg/meta/pytorch/cmake/../third_party/cudnn_frontend/include
  OBJECT_DIR = caffe2/CMakeFiles/torch_cuda.dir
  OBJECT_FILE_DIR = caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129402
Approved by: https://github.com/malfet
2024-06-25 13:54:51 +00:00
71ebe5121a [MPS] Fast math env var (#129007)
Allow users to decide whether they want to have fast math enabled via env var
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129007
Approved by: https://github.com/malfet
ghstack dependencies: #129006, #129008
2024-06-25 13:52:07 +00:00
bbdeff76fc fix add decomposition for complex numbers (#129044)
Fixes #125745

Bug source: When addition requires broadcasting, adding complex numbers is not implemented correctly in `torch/_inductor/decomposition.py` because `x.view(x.real.dtype)` would multiply the last dimension by 2, and then broadcasting wouldn't work.

Fix: re-shape the complex tensors after view and before broadcasting.
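
A small sketch illustrating the issue and the re-shape fix (this mirrors the idea, not the actual decomposition code):
```python
import torch

x = torch.randn(3, 1, dtype=torch.complex64)
y = torch.randn(3, 4, dtype=torch.complex64)

xr = x.view(torch.float32)  # shape (3, 2): the last dim is doubled ...
yr = y.view(torch.float32)  # shape (3, 8): ... so it no longer broadcasts against (3, 8)

# Fix: keep each real/imag pair in its own trailing dim, broadcast, then
# reinterpret the result as complex.
out = torch.view_as_complex(xr.view(3, 1, 2) + yr.view(3, 4, 2))
assert torch.allclose(out, x + y)
```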

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129044
Approved by: https://github.com/zou3519, https://github.com/lezcano
2024-06-25 11:05:41 +00:00
6508f0f5d4 Improved backward tracking and attribution, fixed typing for python < 3.10 (#129400)
For #125323
* Fixes typing for python < 3.10
* Fixes #129390

For #124688
* Improved attribution by registering `register_hook` and `post_accumulate_grad_hook` on params.
* Fixed pre-mature per module bw peak state initialization for AC.
* This improves per-module stats, global `peak_mem` was already accurate and remains unaffected.

For #128508
* When AC is applied to a `mod (nn.Module)`, the backward order of execution is `pre-bw -> pre-fw -> post-fw -> post-bw`. Since `ModTracker` maintains the `parents` attribute as a set, the `post-fw` hook during backward was prematurely removing the module from `parents`.
* With the fix, we now maintain a per-module counter and only remove a module from `parents` when its counter goes to 0 (see the sketch after this list).
* Added tests to ensure this.
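
A tiny sketch of the reference-counting idea (illustrative only, not the actual `ModTracker` code):
```python
from collections import Counter

parents = set()
entry_counts = Counter()

def on_pre_forward(mod_name):
    entry_counts[mod_name] += 1
    parents.add(mod_name)

def on_post_forward(mod_name):
    entry_counts[mod_name] -= 1
    if entry_counts[mod_name] == 0:
        # Only drop the module once every nested (e.g. AC re-forward) entry has exited.
        parents.discard(mod_name)
```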

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129400
Approved by: https://github.com/awgu, https://github.com/huydhn
2024-06-25 10:54:58 +00:00
63474620ab test_jit: Replace plain assert by test assert (#128950)
The plain assert doesn't show the values in case of failure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128950
Approved by: https://github.com/zou3519
2024-06-25 09:04:53 +00:00
0314c4c101 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes, in apply order (a before/after sketch follows this list):

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
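
A hedged before/after sketch of the overall pattern (the path depth is illustrative):
```python
import os
from pathlib import Path

# Before: chained dirname calls (or "..", os.pardir) to reach an ancestor directory.
repo_root_old = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

# After: make the path absolute first, then index the ancestor directly.
repo_root_new = str(Path(__file__).absolute().parents[2])

assert repo_root_old == repo_root_new
```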

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-25 08:28:38 +00:00
4ca8eecca4 skip test_graph_capture_oom for jetson (#128661)
On Jetson IGX, `python test/test_cuda.py -k test_graph_capture_oom` fails with the following error:

```
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 2255, in test_graph_capture_oom
    with self.assertRaisesRegex(RuntimeError, oom_regex):
  File "/usr/lib/python3.10/unittest/case.py", line 239, in __exit__
    self._raiseFailure('"{}" does not match "{}"'.format(
  File "/usr/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
    raise self.test_case.failureException(msg)
AssertionError: "out of memory" does not match "NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. "

```

This is a known issue as nvml support on Jetson is limited, and the OOM reporting in CUDACachingAllocator.cpp requires nvml to be properly loaded, which fails on Jetson.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128661
Approved by: https://github.com/eqy, https://github.com/atalman
2024-06-25 08:25:11 +00:00
8bfd9e9815 [cuDNN] Graph-capturable cuDNN CTCLoss (#128271)
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant

~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-25 06:01:50 +00:00
533c4190f9 [inductor][cpp] support nested kernel with indirect indexing (#129223)
This PR makes sure the current kernel is used for generating CSE variables when nested kernel codegen is involved, e.g., when a nested CppKernel is used to generate the epilogue of a CppTemplateKernel. Without the fix, an epilogue with indirect indexing would fail to run.

```
pytest -k test_linear_with_embedding_bias_False_cpu test_cpu_select_algorithm.py
```

Epilogue code Before:
```c++
                {
                    #pragma GCC ivdep
                    for(long x0=static_cast<long>(0L); x0<static_cast<long>(m_end + ((-1L)*m_start)); x0+=static_cast<long>(1L))
                    {
                        for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1+=static_cast<long>(16L))
                        {
                            auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)];
                            auto tmp11 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<long>(x1 + (N0*x0)), 16);
                            auto tmp1 = 64L;
                            auto tmp2 = c10::convert<int64_t>(tmp1);
                            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
                            auto tmp4 = tmp0 ? tmp3 : tmp0;
                            auto tmp5 = decltype(tmp4)(tmp4 + tmp2);
                            auto tmp6 = tmp1 ? tmp5 : tmp4;
                            auto tmp7 = tmp6;
                            auto tmp8 = c10::convert<int64_t>(tmp7);
                            TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L");
                            auto tmp10 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(n_start + x1 + (384L*tmp6)), 16);
                            auto tmp12 = (tmp11);
                            auto tmp13 = tmp10 + tmp12;
                            tmp13.store(Y + static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0)));
                        }
                        #pragma omp simd simdlen(8)
                        for(long x1=static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1<static_cast<long>(N0); x1+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)];
                            auto tmp11 = local_acc_buf[static_cast<long>(x1 + (N0*x0))];
                            auto tmp1 = 64L;
                            auto tmp2 = c10::convert<int64_t>(tmp1);
                            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
                            auto tmp4 = tmp0 ? tmp3 : tmp0;
                            auto tmp5 = decltype(tmp4)(tmp4 + tmp2);
                            auto tmp6 = tmp1 ? tmp5 : tmp4;
                            auto tmp7 = tmp6;
                            auto tmp8 = c10::convert<int64_t>(tmp7);
                            TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L");
                            TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L");
                            auto tmp10 = in_ptr3[static_cast<long>(n_start + x1 + (384L*tmp6))];
                            auto tmp12 = c10::convert<float>(tmp11);
                            auto tmp13 = decltype(tmp10)(tmp10 + tmp12);
                            Y[static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0))] = tmp13;
                        }
                    }
                }
```

Epilogue code After:
```c++
                {
                    #pragma GCC ivdep
                    for(long x0=static_cast<long>(0L); x0<static_cast<long>(m_end + ((-1L)*m_start)); x0+=static_cast<long>(1L))
                    {
                        for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1+=static_cast<long>(16L))
                        {
                            auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)];
                            auto tmp13 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<long>(x1 + (N0*x0)), 16);
                            auto tmp1 = 64L;
                            auto tmp2 = c10::convert<int64_t>(tmp1);
                            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
                            auto tmp4 = tmp0 < 0;
                            auto tmp5 = tmp4 ? tmp3 : tmp0;
                            auto tmp6 = decltype(tmp5)(tmp5 + tmp2);
                            auto tmp7 = tmp5 < 0;
                            auto tmp8 = tmp7 ? tmp6 : tmp5;
                            auto tmp9 = tmp8;
                            auto tmp10 = c10::convert<int64_t>(tmp9);
                            TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L");
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(n_start + x1 + (384L*tmp8)), 16);
                            auto tmp14 = (tmp13);
                            auto tmp15 = tmp12 + tmp14;
                            tmp15.store(Y + static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0)));
                        }
                        #pragma omp simd simdlen(8)
                        for(long x1=static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1<static_cast<long>(N0); x1+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)];
                            auto tmp13 = local_acc_buf[static_cast<long>(x1 + (N0*x0))];
                            auto tmp1 = 64L;
                            auto tmp2 = c10::convert<int64_t>(tmp1);
                            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
                            auto tmp4 = tmp0 < 0;
                            auto tmp5 = tmp4 ? tmp3 : tmp0;
                            auto tmp6 = decltype(tmp5)(tmp5 + tmp2);
                            auto tmp7 = tmp5 < 0;
                            auto tmp8 = tmp7 ? tmp6 : tmp5;
                            auto tmp9 = tmp8;
                            auto tmp10 = c10::convert<int64_t>(tmp9);
                            TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L");
                            TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L");
                            auto tmp12 = in_ptr3[static_cast<long>(n_start + x1 + (384L*tmp8))];
                            auto tmp14 = c10::convert<float>(tmp13);
                            auto tmp15 = decltype(tmp12)(tmp12 + tmp14);
                            Y[static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0))] = tmp15;
                        }
                    }
                }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129223
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-06-25 05:21:00 +00:00
665dbc2f52 [easy][DCP] Fix test_fine_tuning.py for get/set_state_dict API changes (#129365)
Update test/distributed/checkpoint/e2e/test_fine_tuning.py for https://github.com/pytorch/pytorch/pull/112203 change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129365
Approved by: https://github.com/fegin
2024-06-25 05:12:02 +00:00
0e1e289033 [ONNX] Benchmark refactored ONNX export (#129427)
Reuse torch.onnx.export with the torch_onnx patch to test the ExportedProgram -> ONNX IR exporter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129427
Approved by: https://github.com/justinchuby
2024-06-25 04:47:53 +00:00
f18becaaf1 Add example for torch.serialization.add_safe_globals (#129396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129396
Approved by: https://github.com/albanD
ghstack dependencies: #129244, #129251, #129239
2024-06-25 04:19:44 +00:00
381ce0821c Add warning for weights_only (#129239)
Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239
Approved by: https://github.com/albanD
ghstack dependencies: #129244, #129251
2024-06-25 04:19:44 +00:00
c5f7755e86 Allow BUILD/NEWOBJ instruction for items added via torch.serialization.add_safe_globals (#129251)
Previously, allowlisting functions/classes via `torch.serialization.add_safe_globals(obj)` for the `weights_only` Unpickler had the following effect:

- For a [`GLOBAL`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1926-L1939) instruction, `GLOBAL obj.__module__ obj.__name__` would be allowed and translated back to obj to be pushed back to the stack.
- For a [`REDUCE`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1926-L1982) instruction where we expect the stack to contain `func` and `args`, `func` is allowed if it was added via `add_safe_globals`

However, it did not have an effect on `BUILD` and `NEWOBJ` instructions

Some classes may be rebuilt via [`NEWOBJ`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L2091-L2104) instruction, which indicates that their constructor should be used to rebuild the class.

Further, a [`BUILD`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1984-L2007) instruction might be used if an object's `__reduce__`/`__reduce_ex__` returns a non-None value for `state`, which indicates a `__setstate__` or `__dict__` update.

**This PR makes sure that adding objects to the allowlist will also allow `NEWOBJ` and `BUILD` instructions for them.**

In particular, the update for `NEWOBJ` should unblock allowlisting of [`ScaledMMConfig`](d4ade877df/float8_experimental/float8_tensor.py (L26-L30)) in float8_experimental @drisspg
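
A minimal sketch of what this unblocks; the class here is illustrative, not part of the PR:
```python
import torch

class Config:
    # A plain class: pickling with protocol 2 uses NEWOBJ to construct it and BUILD to set its state.
    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale

torch.serialization.add_safe_globals([Config])

torch.save(Config(scale=2.0), "config.pt")
obj = torch.load("config.pt", weights_only=True)  # allowed now that NEWOBJ/BUILD honor the allowlist
assert obj.scale == 2.0
```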

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129251
Approved by: https://github.com/albanD
ghstack dependencies: #129244
2024-06-25 04:19:44 +00:00
1bb1e3463c Fix allowlisting of builtins for weights_only unpickler (#129244)
Since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), some functions/classes that were renamed from python 2-->3 will be pickled with their python2 name. This PR ensures that when a `GLOBAL <python2_mod>.<python2_name>` instruction is encountered, it is properly mapped to `<python3_mod>.<python3_name>`, [following the strategy used by pickle](https://github.com/python/cpython/blob/main/Lib/pickle.py#L1590C13-L1593C63).

This fix ensures that `add_safe_globals` works properly for such functions/classes (i.e. users will allowlist the python3 func and the weights_only unpickler will do the appropriate translation when checking whether a class was allowlisted).

An example is as follows:
`__builtin__` was renamed to `builtins`, see the [release notes for Python 3.0](https://docs.python.org/3/whatsnew/3.0.html)

> Renamed module `__builtin__` to [`builtins`](https://docs.python.org/3/library/builtins.html#module-builtins) (removing the underscores, adding an ‘s’). The __builtins__ variable found in most global namespaces is unchanged. To modify a builtin, you should use [builtins](https://docs.python.org/3/library/builtins.html#module-builtins), not `__builtins__`!

However, since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), builtins will be pickled with their module string as `__builtin__`.

```python
>>> import pickle
>>> import pickletools
>>> print.__module__
'builtins'
>>> with open('print.pkl', 'wb') as f:
>>>      pickle.dump(print, f, protocol=2) # 2 because this is the default protocol used by pytorch
>>> with open('print.pkl', 'rb') as f:
>>>     pickletools.dis(f)
0: \x80 PROTO      2
2: c    GLOBAL     '__builtin__ print' # pickle saves the module string as __builtin__ !!! :(
21: q    BINPUT     0
23: .    STOP
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129244
Approved by: https://github.com/albanD
2024-06-25 04:19:44 +00:00
aa4ee2cb9e [Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247)
Test command:
`pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127247
Approved by: https://github.com/bdhirsh
ghstack dependencies: #129414
2024-06-25 03:13:38 +00:00
b24787b757 [Traceable FSDP2] Don't decompose fsdp.split_with_sizes_copy (#129414)
This makes it easier to do pattern-matching on `fsdp.split_with_sizes_copy` in Inductor passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129414
Approved by: https://github.com/bdhirsh
2024-06-25 03:08:56 +00:00
e6bfa2958b Add aten._unsafe_masked_index (#116491)
Adds `aten._unsafe_masked_index` to generate masked indexing operations that lower to masked loads in Triton code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-25 02:45:02 +00:00
4d04203852 [BE] Runner determinator: Expect usernames to be prefixed with '@' (#129246)
Expect the username in the runner rollover issue (https://github.com/pytorch/test-infra/issues/5132) to be prefixed with a "@".

This will make typos way less likely since GitHub's autocomplete/autoformatting will help out

For now, I've updated the issue to have usernames both with and without the @ while this change rolls out

Testing:
Ran the script locally on both this issue and a new test issue and verified they both had the expected output:
```
(venv) (base) ➜  ~/pytorch git:(zainr/improve-get-workflow-type)
python .github/scripts/get_workflow_type.py --github-token github_pat_***  --github-issue 5132 --github-user ZainRizvi --github-branch "zainr/stuff"
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129246
Approved by: https://github.com/zxiiro, https://github.com/huydhn
2024-06-25 02:39:33 +00:00
533395e204 Fix build error on s390x (#129326)
This PR fixes the build error on s390x after #127195.

The following is the build log on s390x. It fails because `SYS_arch_prctl` is not defined on s390x.
```
...
[792/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/FlushDenormal.cpp.o
[793/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
/usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/cmake/../third_party/benchmark/include -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -I/pytorch/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src -I/pytorch/build/caffe2/../aten/src -I/pytorch/torch/csrc -I/pytorch/third_party/miniz-2.1.0 -I/pytorch/third_party/kineto/libkineto/include -I/pytorch/third_party/kineto/libkineto/src -I/pytorch/third_party/cpp-httplib -I/pytorch/aten/src/ATen/.. -I/pytorch/c10/.. -I/pytorch/third_party/FP16/include -I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/third_party/fmt/include -I/pytorch/third_party/flatbuffers/include -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/cmake/../third_party/googletest/googlemock/include -isystem /pytorch/cmake/../third_party/googletest/googletest/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/cmake/../third_party/eigen -isystem /pytorch/build/include -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -fPIC -DTORCH_USE_LIBUV -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -c /pytorch/aten/src/ATen/cpu/Utils.cpp
/pytorch/aten/src/ATen/cpu/Utils.cpp: In function 'bool at::cpu::init_amx()':
/pytorch/aten/src/ATen/cpu/Utils.cpp:60:21: error: 'SYS_arch_prctl' was not declared in this scope; did you mean 'SYS_prctl'?
   60 |   long rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
      |                     ^~~~~~~~~~~~~~
      |                     SYS_prctl
[794/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Integration.cpp.o
[795/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/GridSampler.cpp.o
[796/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/CPUGuardImpl.cpp.o
[797/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ThreadLocalState.cpp.o
[798/2147] Building CXX object caffe2/CMakeFiles/vec_test_all_types_DEFAULT.dir/__/aten/src/ATen/test/vec_test_all_types.cpp.o
[799/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Utils.cpp.o
[800/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/VmapModeRegistrations.cpp.o
[801/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ZeroTensorFallback.cpp.o
[802/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/autocast_mode.cpp.o
ninja: build stopped: subcommand failed.
Building wheel torch-2.5.0a0+git94dc325
-- Building version 2.5.0a0+git94dc325
cmake -GNinja -DBUILD_CAFFE2=0 -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/pytorch/torch -DCMAKE_PREFIX_PATH=/usr/local/lib/python3.10/dist-packages -DPython_EXECUTABLE=/usr/bin/python3 -DTORCH_BUILD_VERSION=2.5.0a0+git94dc325 -DUSE_GLOO=0 -DUSE_NUMPY=True /pytorch
cmake --build . --target install --config Release
Build step 'Execute shell' marked build as failure
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129326
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-06-25 02:39:13 +00:00
c4dd752d97 [dynamo][compile-time][inlining-inbuilt-nn-modules] Manually implement nn.Module._call_impl (#129285)
# Compile time for eager backend
## AlbertForMaskedLM
No inlining - 3.65 seconds
Inlining on main - 7.48 seconds
Inlining + this PR - 2.86 seconds

## MobileBertForMaskedLM
No inlining - 26.90 seconds
Inlining on main - 48.21 seconds
Inlining + this PR - 24.25 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129285
Approved by: https://github.com/jansel
ghstack dependencies: #129316, #129315
2024-06-25 01:31:26 +00:00
514f9279f8 [dynamo][compile-time] Manually implement nn.Module.__getattr__ to reduce compile time (#129315)
# Compile time for eager backend
## AlbertForMaskedLM
No inlining - 3.65 seconds
Inlining on main - 7.48 seconds
Inlining + this PR - 6.70 seconds

## MobileBertForMaskedLM
No inlining - 26.90 seconds
Inlining on main - 48.21 seconds
Inlining + this PR - 43.85 seconds

*Next PR in the stack makes the total compile time better/comparable to no inlining*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129315
Approved by: https://github.com/jansel
ghstack dependencies: #129316
2024-06-25 01:31:26 +00:00
c012013aa6 Revert "Add Strided Input test for flex attention (#128915)"
This reverts commit 41bb81b58279f492e72bd270b3b071dd2953ed8c.

Reverted https://github.com/pytorch/pytorch/pull/128915 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its tests are failing in trunk, i.e. 41bb81b582 (26627138290) ([comment](https://github.com/pytorch/pytorch/pull/128915#issuecomment-2187695317))
2024-06-25 00:43:34 +00:00
1315be4893 [aotinductor] only autotune at compile time when enabled via config (#129413)
internal breakage when enabled.

Test Plan: CI

Differential Revision: D58965784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129413
Approved by: https://github.com/jingsh, https://github.com/desertfire
2024-06-25 00:41:10 +00:00
78e40b271b Change index_put on GPU to accept FP8 inputs (#128758)
As the title says, this PR changes the dispatcher for the CUDA index_put_ kernel to accept FP8 inputs. This is useful for Transformers models where the KV cache is FP8 and has been pre-allocated.
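
A usage sketch of what this enables (requires a CUDA device; the shapes and the FP8 dtype choice are illustrative):
```python
import torch

# Pre-allocated FP8 KV cache; index_put_ now dispatches for FP8 inputs on CUDA.
kv_cache = torch.zeros(16, 128, dtype=torch.float8_e4m3fn, device="cuda")
new_vals = torch.randn(4, 128, device="cuda").to(torch.float8_e4m3fn)
slots = torch.tensor([0, 3, 7, 9], device="cuda")

kv_cache.index_put_((slots,), new_vals)
```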

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128758
Approved by: https://github.com/eqy, https://github.com/drisspg
2024-06-25 00:38:03 +00:00
8b6391ee59 [Test][DTensor] Temporarily skip gloo test for test_depthwise_convolution (#129391)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129391
Approved by: https://github.com/awgu
2024-06-25 00:29:50 +00:00
81de71fdc5 [inductor] fix a double clone in coordesc tuning (#129399)
It's embarrassing that there is a hidden double clone bug in coordinate descent tuning.

In `CachingAutotuner.coordinate_descent_tuning`, we clone mutated args to make sure benchmarking does not cause numerical problems. But later on, in `CachingAutotuner.bench`, we do that again.

This double clone is fine if
- the tensor is small
- the allocation of the tensor is not on the critical path for memory footprint.

But neither holds for quite common usage of cross entropy loss.

This is related to the memory usage debugging in https://github.com/pytorch/pytorch/pull/129043 . Note that the general issue of peak memory usage increasing due to autotuning still exists; this bug just makes it worse (since we double-allocate).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129399
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-06-25 00:18:51 +00:00
14dc08ddc7 Inductor to fail gracefully on Voltas for bf16 tensors (#129288)
Volta (sm_7x) GPUs do not have HW support for the bfloat16 datatype; it is emulated in software, so PyTorch eager can use bfloat16 tensors, but Triton cannot. So if a graph with either CUDA bf16 input or output tensors is compiled, raise a warning and skip the frame.

Add an optional parameter `including_emulation` to the `torch.cuda.is_bf16_supported` method and call it from `torch._inductor.compile_fx._check_triton_bf16_support`.
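
A usage sketch of the new flag described above; the behavior noted in the comments assumes an sm_7x device:
```python
import torch

if torch.cuda.is_available():
    # True on Volta because eager mode emulates bf16 in software.
    print(torch.cuda.is_bf16_supported())
    # False on sm_7x when HW support is required (what Inductor's Triton check needs).
    print(torch.cuda.is_bf16_supported(including_emulation=False))
```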

Test plan: Modify `is_bf16_supported` to return False and see that warning is generated

Fixes https://github.com/pytorch/pytorch/issues/118122 and https://github.com/pytorch/pytorch/issues/118581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129288
Approved by: https://github.com/eqy, https://github.com/jansel
2024-06-25 00:04:13 +00:00
4c1e4c5f30 [inductor] Enable FX graph caching in OSS by default (#125863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125863
Approved by: https://github.com/eellison, https://github.com/oulgen
ghstack dependencies: #129257
2024-06-24 23:39:43 +00:00
7b57ddd38c [inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)
Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR:
* Fix the with_fresh_cache_if_config() decorator
* Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257
Approved by: https://github.com/oulgen
2024-06-24 23:39:43 +00:00
b22f0f5f51 [torchbind] fix bug of mutating FakeScriptObjects twice in aot_export (#128844)
This PR does two things:
1. It duplicates the fake script object because aot_export traces the program twice; the mutations from the first trace would otherwise make the result of the second trace wrong.
2. It also adds a new test for methods that return constant outputs. Before this PR, there is no meta["val"] for these nodes because fx won't track these constants. We still need to preserve these constant-returning operators in the graph because torchbind objects are stateful, and deleting them would remove the implicit state mutation inside the object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128844
Approved by: https://github.com/angelayi
2024-06-24 23:14:34 +00:00
41bb81b582 Add Strided Input test for flex attention (#128915)
Test strided inputs to the flex_attention HOP. Similar to how inputs are generated in
https://github.com/pytorch/pytorch/blob/main/benchmarks/transformer/score_mod.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128915
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-06-24 22:56:39 +00:00
00f675bb4c [Nested Tensor]fix sdpa backward for the special case with ragged second batch dim and constant length (#128349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128349
Approved by: https://github.com/jbschlosser
2024-06-24 22:35:07 +00:00
7b7f357042 Fix DEBUG=1 asserts with NJT ops (#129014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129014
Approved by: https://github.com/YuqingJ, https://github.com/soulitzer
2024-06-24 22:32:01 +00:00
5f912f480c Fix max_pool2d decomposition for empty list and integer limits (#129106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129106
Approved by: https://github.com/peterbell10, https://github.com/lezcano, https://github.com/malfet
ghstack dependencies: #129096, #129097
2024-06-24 22:19:42 +00:00
e096faaf30 Fix rot90 decomposition for no rotation (#129097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129097
Approved by: https://github.com/peterbell10
ghstack dependencies: #129096
2024-06-24 22:19:42 +00:00
fbca70718f Fix scatter lowering when src is a Number (#129096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129096
Approved by: https://github.com/peterbell10
2024-06-24 22:19:39 +00:00
8edb7b96b1 Enable dynamic rollout for pull workflow (#129243)
Enables dynamic migration of jobs to the LF AWS account for the pull workflow. For now, it leaves out a few jobs that need a bit more testing, namely Windows and Android runners.

The new runners are only given to people specified in this issue:
https://github.com/pytorch/test-infra/issues/5132

Note: The non-pull jobs updated are the ones that are synced to jobs in pull.yml (via `sync-tag`) and thus have to be updated whenever their corresponding pull.yml jobs are edited

Based on https://github.com/pytorch/pytorch/pull/128597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129243
Approved by: https://github.com/zxiiro, https://github.com/huydhn, https://github.com/malfet
2024-06-24 22:15:53 +00:00
30bfdf1afc Errors when 0-dim tensor of complex or bool type passed to aminmax. (#128404)
Fixes #126742

Added errors for the case of 0-dim tensors of complex or bool types passed to aminmax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128404
Approved by: https://github.com/janeyx99
2024-06-24 21:46:49 +00:00
18fdc0ae5b [executorch hash update] update the pinned executorch hash (#129099)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129099
Approved by: https://github.com/pytorchbot
2024-06-24 21:01:40 +00:00
93a33bf3ac [BE] update type annotations for basic utilities in torch/__init__.py (#129001)
Changes:

1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard` (see the sketch after this list).
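
A minimal sketch of the `TypeGuard` pattern referenced in item 3 (not the exact torch implementation; `typing_extensions.TypeGuard` is needed on Python < 3.10):

```python
from typing import Any
from typing_extensions import TypeGuard  # or typing.TypeGuard on Python 3.10+

import torch

def is_tensor(obj: Any) -> TypeGuard[torch.Tensor]:
    """Return True if obj is a torch.Tensor, narrowing the type for static checkers."""
    return isinstance(obj, torch.Tensor)

def double_if_tensor(x: Any) -> Any:
    if is_tensor(x):
        # Type checkers now treat x as a Tensor in this branch.
        return x * 2
    return x
```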

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
2024-06-24 18:04:38 +00:00
1a54bb0f96 Revert "[halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417)"
This reverts commit 4f9399bd0d2bc0cbd14348b80e32b263de5c6bc0.

Reverted https://github.com/pytorch/pytorch/pull/126417 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/126417#issuecomment-2186999121))
2024-06-24 16:50:15 +00:00
063facf352 Revert "[halide-backend] Generate standalone runtime (#129025)"
This reverts commit 10c64c3b49e2008a50f9229e600c68c8a3d49292.

Reverted https://github.com/pytorch/pytorch/pull/129025 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129025#issuecomment-2186995467))
2024-06-24 16:47:25 +00:00
c888ee3632 Remove test_mps_allocator_module XFAIL (#129340)
Not sure why this test started to fail (maybe a runner update) 8a2fed7e6a/1 or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340
Approved by: https://github.com/kit1980
2024-06-24 16:26:38 +00:00
cb4919344a Revert "[BE] update type annotations for basic utilities in torch/__init__.py (#129001)"
This reverts commit e53d9590287cbf97521f96d055910394f6e9a849.

Reverted https://github.com/pytorch/pytorch/pull/129001 on behalf of https://github.com/XuehaiPan due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/129001#issuecomment-2186944549))
2024-06-24 16:18:43 +00:00
7b910285db Revert "[inductor] Refactor fusion of inplace operations (#128979)"
This reverts commit 72e3aca227ae1e3dc1b91aee415cf27b0cb22f2b.

Reverted https://github.com/pytorch/pytorch/pull/128979 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128979#issuecomment-2186846940))
2024-06-24 15:29:40 +00:00
df51d0b623 [aotinductor][UserDefinedTritonKernel] use appropriate expr printer when printing args (#129301)
Encountered the following C++ compile error.
```
Declared in this scope; did you mean ‘std::max’?
  619 |     auto var_5 = max(1, u0);
```

This PR uses the C++ expression printer when doing C++ codegen; before this PR the Python printer was used even during C++ codegen.

Differential Revision: [D58913123](https://our.internmc.facebook.com/intern/diff/D58913123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129301
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-24 15:23:05 +00:00
e53d959028 [BE] update type annotations for basic utilities in torch/__init__.py (#129001)
Changes:

1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
2024-06-24 14:35:41 +00:00
c89a9f5d17 Allow SAC policy_fn to return bool for backward compatibility (#129262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129262
Approved by: https://github.com/Chillee, https://github.com/fmassa
ghstack dependencies: #125795, #128545
2024-06-24 13:54:30 +00:00
9094248090 [FSDP2] Fixed unshard without lazy init (#129241)
Previously, the `FSDPCommContext` only defined the stream attributes when `FSDPCommContext.init` was called from lazy initialization. This means that if the user called `module.unshard()` before lazy init (e.g. before the first forward pass), it would error in `wait_for_unshard()`. This PR fixes that by making sure the stream attributes are defined at construction time, using only the default stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129241
Approved by: https://github.com/Skylion007, https://github.com/weifengpy
2024-06-24 13:31:54 +00:00
d21f311af8 [Easy][Traceable FSDP2] Skip rocm for the E2E tests (#129339)
The CUDA implementation of `resize_storage_bytes_` doesn't run on ROCm yet, so we need to skip these tests there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129339
Approved by: https://github.com/msaroufim
2024-06-24 06:38:33 +00:00
662e9e1076 [BE] enable UFMT for torch/nn/functional.py (#128592)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128592
Approved by: https://github.com/mikaylagawarecki
2024-06-24 06:24:12 +00:00
8a2fed7e6a [Inductor][CPP] Fallback QLinear Binaryfusion from postop sum to binary add when others is view (#128808)
**Summary**
In the int8 GEMM template, we view the input from 3D to 2D and view the output back to 3D for QLinear, which makes the output of this QLinear a `view`. If that view feeds a QLinear-Binary fusion, it breaks the assumption of QLinear-Binary with the in-place post op `sum`. For this case we change the post op from in-place `sum` to out-of-place `add`, similar to the FP32/BF16 Linear in-place handling in 1208347d09/torch/_inductor/fx_passes/mkldnn_fusion.py (L541-L543).

**TestPlan**
```
clear && numactl -C 56-111 -m 1 python -u -m pytest -s -v inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_cpu_input_dim_exceeds_2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128808
Approved by: https://github.com/jgong5
ghstack dependencies: #128804
2024-06-24 01:12:18 +00:00
287c68c5ec [Inductor][Quant] Use output dtype torch.uint8 explicitly (#128804)
**Summary**
Previously, we implicitly used `None` as the output data type in the lowering of QLinear/QConv for uint8. That is unclear, so we now use `torch.uint8` explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128804
Approved by: https://github.com/Xia-Weiwen, https://github.com/jgong5
2024-06-24 01:08:49 +00:00
7b9e6430ed [Split Build] Add periodic and trunk CI for cuda builds (#129269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129269
Approved by: https://github.com/atalman
2024-06-23 17:04:37 +00:00
f85d1e845a [BE] enable UFMT for torch/nn/*.py (#128593)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128593
Approved by: https://github.com/mikaylagawarecki
2024-06-23 16:05:13 +00:00
dadc0ed4c8 [Traceable FSDP2] Add aot_eager backend E2E tests for transformer model (#129157)
This PR adds Traceable FSDP2 `aot_eager` backend E2E tests for simple MLP as well as transformer model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129157
Approved by: https://github.com/awgu
ghstack dependencies: #129203
2024-06-23 06:11:11 +00:00
b91a9dc328 [Brian's PR #128754] Use torch.ops.fsdp.set_ for FSDP2 storage resize; dont functionalize resize_, set_, split_with_sizes_copy.out (#129203)
This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128754, with some changes in the test_distributed_patterns.py unit tests to more closely reflect FSDP2 patterns. Also disabled two tests `test_input_mutation_storage_resize_up_down` and `test_input_mutation_storage_resize_not_supported` in test_aotdispatch.py until we figure out the right behavior for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129203
Approved by: https://github.com/bdhirsh
2024-06-23 06:07:19 +00:00
62ccf6d7cd [BE] enable UFMT for torch/nn/modules (#128594)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594
Approved by: https://github.com/mikaylagawarecki
2024-06-23 05:37:57 +00:00
440d8fbd4a FSDP2 Memory Tracker (#125323)
* __->__ #125323
### Why do we need the FSDP Memory Tracker?

**Tuning Decisions**

1. What is the expected peak memory with current configuration?
2. If I change my FSDP wrapping, how much effect will it have on peak memory?
3. What is the best batch size to use?
4. What is the maximum sequence length that one can run with current configuration?
5. How does increasing/decreasing the “DP” world size affect peak memory?
6. How much memory do I save if I move the optimizer to the CPU?
7. Which activation checkpointing policy should I use?
8. If I have various SAC policies, how do they compare against each other?
9. What happens if I apply different SAC policies to different FSDP units?
10. If I make my gradient reduction in fp32, what effect will it have on memory?
11. If I want to use a custom mixed precision policy, how will it affect the peak memory?
12. When does it make sense to use HSDP?
13. Can I reshard to a smaller mesh without increasing peak memory substantially?
14. Can I safely disable post-forward reshard without causing an OOM?

**Debugging**

1. Which module contributes most to activation memory?
2. Which FSDP unit is holding a lot of unsharded memory?
3. Why is AC not releasing memory?

The FSDP2 Memory Tracker addresses all of the above. It is based on:
 *  #124688
 *  #128508

Example and Output:

```
if __name__== "__main__":
    from contextlib import nullcontext
    from functools import partial
    import torch
    from torch.distributed._composable import checkpoint
    from torch.distributed._composable.fsdp import (
        CPUOffloadPolicy,
        fully_shard,
        MixedPrecisionPolicy,
    )
    from torch.distributed._tensor import DeviceMesh
    from torch.distributed._tools.fsdp2_mem_tracker import FSDPMemTracker
    from torch._subclasses.fake_tensor import FakeTensorMode
    from torch.testing._internal.distributed._tensor.common_dtensor import (
        ModelArgs,
        Transformer,
        TransformerBlock,
    )
    from torch.testing._internal.distributed.fake_pg import FakeStore
    dev = torch.device("cuda:0")
    torch.cuda.set_device(dev)
    world_size = 4
    store = FakeStore()
    torch.distributed.init_process_group(
        "fake", rank=0, world_size=world_size, store=store
    )
    mesh = DeviceMesh("cuda", torch.arange(0, world_size))
    torch.cuda.empty_cache()
    torch.manual_seed(42)
    use_fake_mode = False
    with FakeTensorMode() if use_fake_mode else nullcontext():
        vocab_size = 8192
        bsz, seq_len = 32, 1024
        with torch.device(dev):
            model_args = ModelArgs(
                n_layers=2,
                n_heads=16,
                vocab_size=vocab_size,
                max_seq_len=seq_len,
                dropout_p=0.1,
            )
            model = Transformer(model_args)
        foreach = True
        mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
        offload_policy = CPUOffloadPolicy(pin_memory=not use_fake_mode)
        reshard_after_forward = True
        fsdp_config = {}
        fully_shard_fn = partial(
            fully_shard,
            mesh=mesh,
            reshard_after_forward=reshard_after_forward,
            offload_policy=offload_policy,
            mp_policy=mp_policy,
        )
        for module in model.modules():
            if isinstance(module, TransformerBlock):
                checkpoint(module, preserve_rng_state=not use_fake_mode)
                fully_shard_fn(module)
        fully_shard_fn(model)
        optim = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=foreach)

        torch.manual_seed(42)
        inp = torch.randint(0, vocab_size, (bsz, seq_len), device=dev)
        torch.cuda.reset_accumulated_memory_stats()
        torch.cuda.reset_peak_memory_stats()
        fmt = FSDPMemTracker(model, optim)
        fmt.track_inputs((inp,))
        with fmt:
            for iter_idx in range(2):
                loss = model(inp).sum()
                loss.backward()
                optim.step()
                optim.zero_grad()
                if iter_idx == 0:
                    fmt.reset_mod_stats()
    mem_stats = torch.cuda.memory_stats()
    tracker_peak = fmt.get_tracker_snapshot("peak")[dev]["Total"]
    cuda_peak_active = mem_stats["active_bytes.all.peak"]
    fmt.display_modulewise_snapshots(depth=4, units="MiB", tabulate=True)
    fmt.display_snapshot("peak", units="MiB", tabulate=True)
    print(
        f"peak active: {cuda_peak_active / (1024**3)} GiB | "
        f"Tracker Max: {tracker_peak / (1024 ** 3)} GiB"
    )
    if not use_fake_mode:
        print(f"Accuracy: {tracker_peak/cuda_peak_active}")

    try:
        torch.distributed.destroy_process_group()
    except Exception as e:
        print(e)
```

<img width="1236" alt="Screenshot 2024-06-21 at 5 16 49 PM" src="https://github.com/pytorch/pytorch/assets/12934972/9be40b8b-e635-4112-b111-418413e6b959">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125323
Approved by: https://github.com/awgu
2024-06-23 05:23:00 +00:00
17d1723aee [dynamo][unspecialized-nn-modules] Remove dead (also incorrect) code (#129316)
This code is unused because we just inline the `.parameters` call. The code was also wrong because side effects only track the first level of mutations; an object might not be marked as mutated if one of its child objects (like a dict) is mutated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129316
Approved by: https://github.com/jansel
2024-06-23 03:02:27 +00:00
cac6f99d41 Fix Windows CUDA periodic inductor/test_pattern_matcher test (#129198)
The check was run on Windows and crashed there because Windows doesn't have triton, i.e. https://github.com/pytorch/pytorch/actions/runs/9606662121/job/26502347998#step:15:13196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129198
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/malfet
2024-06-23 02:32:27 +00:00
749c03406c [metal] Add int4mm weight packing mps kernel, and improved int4mm shader (#128965)
Adds a _convert_weight_to_int4pack MPS kernel.
Replaces the previous int4mm Metal shader with a shader authored by @kimishpatel, which improves perf by ~40%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128965
Approved by: https://github.com/malfet
2024-06-23 02:10:46 +00:00
856541c701 [custom_op] support default dtype values (#129189)
This PR:
- moves some of the dtype-string utilities into ScalarType.{h, cpp}
- adds a new utility to get a mapping from dtype name to the C++ dtype
- the parser now checks if the string is a dtype name; if it is, it pulls the C++ dtype from the mapping (see the sketch below).
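
A hypothetical sketch of what this enables through the `torch.library.custom_op` API (the op name is made up, and exactly which default spellings are accepted is an assumption):

```python
import torch
from torch.library import custom_op

# A dtype argument with a default value; schema inference should now carry the
# default instead of rejecting it (sketch only, not a test from this PR).
@custom_op("mylib::cast_to", mutates_args=())
def cast_to(x: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # copy=True avoids returning an alias of the input when dtypes already match
    return x.to(dtype, copy=True)

out = cast_to(torch.randn(3))                        # uses the float32 default
out_bf16 = cast_to(torch.randn(3), dtype=torch.bfloat16)
```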

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129189
Approved by: https://github.com/albanD
ghstack dependencies: #129177, #129178, #129179
2024-06-23 00:13:23 +00:00
3e02ecd740 Test only one sample with huber_loss (#129245)
Fixes https://github.com/pytorch/pytorch/issues/129238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129245
Approved by: https://github.com/huydhn
2024-06-22 21:15:39 +00:00
94dc3253a0 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-06-22 18:53:28 +00:00
e165a5971f [Traceable FSDP2] Fix support for CUDA resize_storage_bytes_ (#129215)
Currently if `x` is a CUDA tensor, calling `x.untyped_storage().resize_()` seems to always go into the `built without cuda` branch of `resize_storage_bytes_()` regardless of whether PyTorch is built with CUDA. I suspect this is because `inductor_ops.cpp` is only included in `libtorch_cpu.so` and thus doesn't have the `USE_CUDA` information or the ability to link to CUDA-related functions.

This PR moves `resize_storage_bytes_()` related custom op functions out of `inductor_ops.cpp` into its standalone file `resize_storage_bytes.cpp` to be included in `libtorch_python.so` instead. This mimics the setup for `StorageMethods.cpp`. This way, `resize_storage_bytes_()` can have access to the CUDA-related functions, which passes the CUDA unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129215
Approved by: https://github.com/jansel
2024-06-22 18:38:47 +00:00
0e6118a68e [dtensor][debug] added logging module tracing table to file feature (#128721)
**Summary**
Currently, the only way for users to view the module tracing table is to print it in the console, which can be hard to read. I have added functionality to comm_debug_mode so a user can log the module tracing table to an output.txt file, giving the user more options for viewing module tracing. I have implemented the use case in the module tracing examples. The expected output is shown below for MLPModule tracing:
<img width="349" alt="Screenshot 2024-06-14 at 10 39 07 AM" src="https://github.com/pytorch/pytorch/assets/50644008/a05288a9-3cdb-483b-8e27-daab50da6251">

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128721
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #128720
2024-06-22 18:14:13 +00:00
1afd492d88 [dtensor][example] add functionality allowing users to choose which example they'd to run (#128720)
**Summary**
The previous example file would run all examples at the same time, leading to confusing output as the 4 processes would mix up the order. In order to fix this, I have added the functionality to choose which example to run, making it easier for users to read the output. Due to importing from torch.testing._internal.distributed._tensor.common_dtensor, the argparser from a file in the dependency tree would overwrite the argparser that I attempted to place in the example file. As a result, I created an argparser in a different file and imported it above the previously mentioned import.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -h

The first four outputs will be the same as the outputs seen in previous PRs. The expected output for help argument is seen below:
<img width="931" alt="Screenshot 2024-06-14 at 10 25 06 AM" src="https://github.com/pytorch/pytorch/assets/50644008/547ca112-1e7a-4769-857a-558292c6fe7b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128720
Approved by: https://github.com/XilunWu
2024-06-22 18:14:13 +00:00
10c64c3b49 [halide-backend] Generate standalone runtime (#129025)
This puts the Halide runtime in a global shared object, rather than copying it into each kernel. Having many copies of the runtime causes many issues with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417
2024-06-22 17:39:52 +00:00
4f9399bd0d [halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-06-22 17:39:52 +00:00
79aabaf626 [3.13, dynamo] codegen PUSH_NULL when callable is codegen'd (#129172)
Significant bytecode generation API change!

The new suggested convention for generating bytecode to call a function is to wrap the instructions that push a callable onto the stack with `add_push_null`, then call that callable with `create_call_function` passing `push_null=False` (see the diff for examples, and the sketch below).

In Python 3.13, NULL is now expected to be pushed after the callable. In <=3.12, the NULL was pushed before the callable.  This change abstracts away the exact placement of the NULL, but the developer must be aware that a NULL may be needed when codegen'ing a callable.

This abstraction also reduces the need for the `push_null=True` option in `create_call_function`, which removes the need to rotate a NULL to the right place on the stack with a sequence of `SWAP` instructions.
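
A hypothetical sketch of the convention described above, written against an imagined `codegen` helper (the method names follow this description; the exact Dynamo signatures may differ):

```python
# Inside some Dynamo codegen routine (illustrative only, not real Dynamo code):
def codegen_call(codegen, callable_var, arg_vars):
    # Wrap the instructions that push the callable so the NULL lands in the
    # right spot: after the callable on 3.13, before it on <= 3.12.
    codegen.add_push_null(lambda: codegen(callable_var))
    for arg in arg_vars:
        codegen(arg)
    # The NULL has already been emitted by add_push_null, so push_null=False here.
    codegen.extend_output(codegen.create_call_function(len(arg_vars), push_null=False))
```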

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129172
Approved by: https://github.com/jansel
2024-06-22 17:25:23 +00:00
905dfa186c Fix ConstraintViolationError exception string when exprs are int (#129271)
As titled. If `expr1` and `expr2` are ints, we don't need to call `.xreplace`.

See example error:

```
UserError: L['args'][0][0].size()[1] = 35 is not equal to L['args'][0][2].size()[1] = 23
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129271
Approved by: https://github.com/lezcano
2024-06-22 16:33:40 +00:00
920ebccca2 [inductor][cpp] refactor CppTemplateKernel to inherit CppKernel (#129101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129101
Approved by: https://github.com/leslie-fang-intel
2024-06-22 12:50:37 +00:00
72e3aca227 [inductor] Refactor fusion of inplace operations (#128979)
`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.

Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`

So we are first overly aggressive in removing `WeakDep`s, then add an ad-hoc fixup.

This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128979
Approved by: https://github.com/lezcano
ghstack dependencies: #129082, #129083
2024-06-22 12:38:22 +00:00
88a35b5b64 BE: User future annotations in _inductor/comms.py (#129083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129083
Approved by: https://github.com/lezcano
ghstack dependencies: #129082
2024-06-22 12:38:22 +00:00
73ba226d98 [inductor] Linear time dead node elimination (#129082)
The nodes are already topologically sorted by this point, so DCEing a chain of
nodes will take one full iteration per node. Simply reversing the iteration
order means all users will be removed before checking a node.
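
A minimal sketch of the idea with a made-up node structure (not Inductor's actual scheduler classes):

```python
def eliminate_dead_nodes(nodes):
    """nodes are in topological order; each node has .users and .has_side_effects."""
    removed = set()
    kept = []
    # Visiting in reverse topological order guarantees a node's users are examined
    # (and possibly removed) before the node itself, so a whole dead chain is
    # eliminated in a single pass instead of one pass per node.
    for node in reversed(nodes):
        live_users = [u for u in node.users if u not in removed]
        if not live_users and not node.has_side_effects:
            removed.add(node)
        else:
            kept.append(node)
    kept.reverse()
    return kept
```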

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129082
Approved by: https://github.com/lezcano
2024-06-22 12:38:17 +00:00
cb126711cd [merge_rule] add more cpp inductor files (#129192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129192
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2024-06-22 09:04:14 +00:00
b57fa8d9c0 [BE] Remove JNI from libtorch builds (#124995)
Removes jni files from the libtorch build as we do not plan to distribute them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124995
Approved by: https://github.com/malfet
2024-06-22 07:41:54 +00:00
9ffdbb5d12 Forward Fix PR for #128683 (#129037)
Summary:
This forward fixes this diff:
D58699985

Since we have a few things in flight, it would be much better to forward-fix this test.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda -- --exact 'caffe2/test/inductor:test_inductor_cuda - test_red_followed_by_transposed_pointwise (caffe2.test.inductor.test_torchinductor.TritonCodeGenTests)'

Differential Revision: D58767577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129037
Approved by: https://github.com/vkuzo
2024-06-22 05:50:21 +00:00
64743de6d8 [Split Build][BE] consolidate pip install commands (#129253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129253
Approved by: https://github.com/atalman
ghstack dependencies: #129011
2024-06-22 05:49:14 +00:00
7661d1220a [Split Build] Fix typo in pull ci (#129270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129270
Approved by: https://github.com/atalman
2024-06-22 05:48:01 +00:00
b0044e2e18 [Split Build] Support nightly release (#129011)
This PR adds the split build to our binaries workflow. Validation for the workflow is done using the PR above in conjunction with https://github.com/pytorch/builder/pull/1876.

Test Workflow: Check CI in the workflow above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129011
Approved by: https://github.com/atalman
2024-06-22 05:45:14 +00:00
b72ef9df0d Update torchbench model expected accuracy values after pinning numpy (#129213)
After pinning numpy on torchbench, we need to move the torchbench inductor benchmark jobs out of the unstable state asap, so that more failures don't sneak in. I'm updating the expected values here to make trunk green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129213
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/desertfire
2024-06-22 04:59:50 +00:00
f42d5b6dca [Memory Snapshot] Make recordAnnotations callback initialize lazily (#129242)
Summary: Make the recordAnnotations' RecordFunction callback initialize lazily when memory history recording starts. This will help reduce the impact on the Time To First Batch metric.

Test Plan: CI and ran locally.

Differential Revision: D58875576

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242
Approved by: https://github.com/zdevito
2024-06-22 04:05:55 +00:00
858fb05dac Modify ExternKernelAlloc with NoneLayout to not assign its result to anything (#129188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129188
Approved by: https://github.com/yifuwang
2024-06-22 02:57:44 +00:00
2f8b301c32 Clean up distributed/CONTRIBUTING.md (#128450)
Click [here](cf6c88af48/torch/distributed/CONTRIBUTING.md) to see the rendered version of the file in this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128450
Approved by: https://github.com/wanchaol
2024-06-22 02:41:22 +00:00
5b14943213 Run TestAOTAutograd test suite with cache (#128222)
This diff introduces AOTAutogradTestWithCache, which runs AOTAutogradTests with both dynamo and AOTAutogradCache.

To do this, for any verify_aot_autograd() calls in the original tests, we run compiled_f an extra time. We also turn on a new strict mode that throws any time a cache is missed due to weird reasons, like BypassAOTAutogradCache or FxGraphCacheMiss.

We use a mocked version of FXGraphCache to decrease the number of variables for these tests. The normal tests in test_aot_autograd_cache.py will still run with FXGraphCache. I might change my mind and unmock these in the future.

In total, 87 of the tests pass naturally. None of the tests fail in non-strict cache mode, so the cache never crashes; it just misses more often than we'd like. The remaining 27 tests fail due to relatively simple (though not necessarily easy to fix) reasons. I'll fix the remaining test failures in the next few PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128222
Approved by: https://github.com/bdhirsh
2024-06-22 02:13:28 +00:00
c5b9ee7408 [easy][dynamo] Remove try except from call_getattr (#129217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129217
Approved by: https://github.com/lezcano
ghstack dependencies: #129098, #129015
2024-06-21 23:56:00 +00:00
1c75ddff35 Revert "[cuDNN] Graph-capturable cuDNN CTCLoss (#128271)"
This reverts commit 40e8675fcbb233c98ec532607d5cd421ec850253.

Reverted https://github.com/pytorch/pytorch/pull/128271 on behalf of https://github.com/malfet due to This makes PyTorch buildable only with CuDNN v9 ([comment](https://github.com/pytorch/pytorch/pull/128271#issuecomment-2183576996))
2024-06-21 23:29:20 +00:00
ef55446538 [FSDP2] Add 'TORCH_LOGS=+fsdp' to log hooks(pre/post forward/backward) and FQN (_init_fqns) (#128663)
Summary:
Add  '`TORCH_LOGS=+fsdp`' in the CLI to print fsdp logs
Example:
`TORCH_LOGS=+fsdp torchrun --standalone --nproc_per_node=2 run_fsdp.py`
Description:
Add logging to `FSDPParamGroup.pre_forward`, `FSDPParamGroup.post_forward`, `FSDPParamGroup.pre_backward`, and `FSDPParamGroup.post_backward`, `FSDPState._root_pre_forward` if is the root, and `FSDPState._root_post_backward_final_callback`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128663
Approved by: https://github.com/weifengpy, https://github.com/awgu
2024-06-21 23:25:58 +00:00
9d1b65b569 [PT2][Observability] Change the log logic (#129201)
Summary: We only log the multiplier when users change the default value.

Test Plan: see signal

Differential Revision: D58854330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129201
Approved by: https://github.com/Skylion007, https://github.com/dshi7
2024-06-21 21:48:34 +00:00
40e8675fcb [cuDNN] Graph-capturable cuDNN CTCLoss (#128271)
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant

~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang
2024-06-21 21:40:23 +00:00
9103b40a47 Fix small typo in docstring in ParameterList (#129193)
In the docstring of `nn.ParameterList`, ParameterDict.append/extend was being used, which is most likely a typo.
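
For reference, the methods in question belong to `ParameterList` itself:

```python
import torch
import torch.nn as nn

params = nn.ParameterList([nn.Parameter(torch.randn(4)) for _ in range(2)])
params.append(nn.Parameter(torch.randn(4)))    # ParameterList.append
params.extend([nn.Parameter(torch.randn(4))])  # ParameterList.extend
print(len(params))  # 4
```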

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129193
Approved by: https://github.com/mikaylagawarecki
2024-06-21 20:53:52 +00:00
92ca17d85d Update triton pin (#126098)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126098
Approved by: https://github.com/bertmaher
2024-06-21 18:46:15 +00:00
d52684e9a8 [BE]: Update CUDNN_frontend submodule to v1.5.1 (#128612)
Updates submodule to cudnn_frontend v1.5.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128612
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-06-21 18:17:35 +00:00
ebf25e128c [autograd] Do not stash version counter for saved tensor (#128545)
Fixes https://github.com/pytorch/pytorch/issues/128611

We detach using tensor_data, which already preserves the version counter, so there is no reason to save it prior to unpacking:
```
at::TensorBase VariableHooks::tensor_data(const at::TensorBase& self) const {
  TORCH_CHECK(self.defined(), "cannot call tensor_data() on undefined tensor");
  auto self_impl_copy = self.unsafeGetTensorImpl()->shallow_copy_and_detach(
      /*version_counter=*/self.unsafeGetTensorImpl()->version_counter(),
      /*allow_tensor_metadata_change=*/
      self.unsafeGetTensorImpl()->allow_tensor_metadata_change());
  return at::Tensor(self_impl_copy);
}
```
This changes the behavior when hooks are involved (a sketch follows the list):
- Previously, if you had a hook that replaced the saved tensor with an entirely new tensor, we would have smashed the saved version counter onto that tensor during unpack. That is not quite correct, because the tensor returned by the user's pack hook is not necessarily aliased to the tensor originally being saved (unlikely), and even if it were, the version counter would already be shared, assuming the user did their operations outside inference mode (unlikely).
- In this PR, we restore the version counter using the version counter of the unpack hook's output.
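
A sketch of the kind of hook this matters for (offloading to a copy is just an example; the point is that the pack hook returns a brand-new tensor that is not aliased to the saved one):

```python
import torch

def pack(t):
    # Returns a new tensor (e.g. an offloaded copy), not an alias of the saved one.
    return t.detach().clone()

def unpack(t):
    # The version counter used after this change comes from this returned tensor.
    return t

x = torch.randn(4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()
y.backward()
print(x.grad)  # 2 * x
```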

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128545
Approved by: https://github.com/albanD
ghstack dependencies: #125795
2024-06-21 18:03:06 +00:00
58cefaf53b Fix hipify regular expression for AOTI wrapper (#128912)
Summary: We need to redefine RE_PYTORCH_PREPROCESSOR here since hipify_torch applies a positive lookbehind (?<=\W) and lookahead (?=\W) to the pattern to avoid matching a keyword at the beginning or end of a code line. However, that can happen in codegen, which causes the pattern to not match (see the sketch below).
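
A small illustration of the lookbehind/lookahead behavior (the keyword here is just an example):

```python
import re

# hipify wraps patterns with these lookarounds so the keyword must be surrounded
# by non-word characters, which fails when the keyword starts or ends a line.
pattern = re.compile(r"(?<=\W)cudaStream_t(?=\W)")

print(bool(pattern.search("  cudaStream_t s;")))  # True: non-word chars on both sides
print(bool(pattern.search("cudaStream_t s;")))    # False: the keyword starts the line,
                                                  # so the lookbehind has nothing to match
```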

Test Plan:
```
buck2 run //caffe2/test/inductor:test_cpp_wrapper_hipify
```

```
File changed: fbcode//caffe2/test/inductor/test_cpp_wrapper_hipify.py
Buck UI: https://www.internalfb.com/buck2/395155fa-b2dc-4892-8c71-74e52c65fa2f
Note:    Using experimental modern dice
Network: Up: 0B  Down: 0B  (reSessionID-8fcfc520-755c-48f9-bacc-507c62f59231)
Jobs completed: 10947. Time elapsed: 0.5s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
BUILD SUCCEEDED
/data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:282: NCCL_DEBUG env var is set to None
/data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:300: NCCL_DEBUG is forced to WARN from None
test_hipify_aoti_driver_header (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
test_hipify_basic_declaration (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
test_hipify_cross_platform (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.262s

OK
```

e2e test:

```
TORCH_LOGS="output_code,graph_code" buck2 run mode/{opt,amd-gpu,inplace} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //aiplatform/modelstore/model_generation/gpu_lowering_service:gpu_lowering_cli -- --model_input_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/input.merge" --model_output_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/mi300_inductor_output.merge" --lowering_backend AOT_INDUCTOR --is_ads_model False --aot_inductor_lowering_settings_json='{"use_scripting":true,"preset_lowerer":"standalone_hstu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":4,"output_precision":4, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}' 2>&1 | tee local_benchmark_log.txt
```

Differential Revision: D58705216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128912
Approved by: https://github.com/desertfire
2024-06-21 18:00:40 +00:00
2db33054b3 Disable fast path in TransformerEncoderLayer when there are forward (pre-)hooks attached to modules (#128415)
Fixes #128413

Disable fast-path if there are forward hooks or pre-hooks.

An example failure case is given in the issue; a minimal sketch follows.
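A minimal sketch of the scenario (assuming the failure mode described in the issue, where submodule hooks were silently skipped on the fused fast path):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, batch_first=True).eval()

calls = []
layer.self_attn.register_forward_hook(lambda mod, inp, out: calls.append(1))

with torch.no_grad():
    layer(torch.randn(2, 4, 16))

# With the fast path disabled in the presence of hooks, the hook now fires.
print(len(calls))  # expected: 1
```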
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128415
Approved by: https://github.com/mikaylagawarecki
2024-06-21 17:38:08 +00:00
8edd4c71c6 [AOTI][refactor] Remove GridExprCppPrinter (#129142)
Summary: Previously we thought using CppPrinter was not ABI-compatibility safe, but c10/util/generic_math.h has been changed to a header-only implementation, so we can remove GridExprCppPrinter now.

Differential Revision: [D58854214](https://our.internmc.facebook.com/intern/diff/D58854214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129142
Approved by: https://github.com/chenyang78
2024-06-21 17:18:37 +00:00
bdc39eef3b [inductor] Add --inductor-config benchmark flag (#129034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129034
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #129024, #129033
2024-06-21 16:53:42 +00:00
bb4ab59651 [inductor] Run more test on correct device (#129033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129033
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #129024
2024-06-21 16:53:42 +00:00
feb3f3ad77 [inductor] Refactors for Halide backend (#129024)
Pulling these inductor-related refactors out of the larger Halide
backend PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129024
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-06-21 16:53:35 +00:00
237c4e6163 Improved flexattention bwd perf + added configurations for benchmarks (#129013)
Before:
<img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb">

After:
<img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534">

After fixing strides:
![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #128938
2024-06-21 15:58:53 +00:00
bdd11483ea [3.13] get C dynamo to compile with python callback and custom frame eval (#129171)
Start enabling parts of C Dynamo for 3.13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129171
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-21 15:58:02 +00:00
b0ae0db815 [Inductor][Intel GPU] Support reduction split. (#129120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129120
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
ghstack dependencies: #129124
2024-06-21 15:11:59 +00:00
fb0c51b61c [Inductor UT] Fix UT failure 'test_polar_dynamic_shapes_xpu' introduced by #128722 (#129124)
[Inductor UT] Fix UT failure 'test_polar_dynamic_shapes_xpu' introduced by #128722. Currently, XPU CI does not gate PR merge. So, we have to do some post-CI fixing as some PRs may break XPU CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129124
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-06-21 15:08:17 +00:00
715b09ae2d Revert "Fix DEBUG=1 asserts with NJT ops (#129014)"
This reverts commit 2bb8ee602b264b652a9dbd6877da61018054d313.

Reverted https://github.com/pytorch/pytorch/pull/129014 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129014#issuecomment-2182922009))
2024-06-21 15:03:02 +00:00
cyy
479ce5e2f4 Remove outdated CUDA code from CMake (#128801)
It's possible to simplify some CUDA handling logic in CMake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128801
Approved by: https://github.com/r-barnes, https://github.com/malfet
2024-06-21 15:00:00 +00:00
cyy
2c7c286fa4 [1/N] Fix clang-tidy warnings in torch/csrc/jit/serialization (#129055)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129055
Approved by: https://github.com/r-barnes
2024-06-21 14:56:31 +00:00
53be7ff0e4 Make tl.atomic_add relaxed (#129133)
We don't use any fancy synchronization within our atomic ops; we just want them to be atomic, so it is better to have them be relaxed than the default acquire/release.
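
A sketch of the kind of use this covers, assuming a Triton version whose `tl.atomic_add` accepts a `sem` keyword:

```python
import triton
import triton.language as tl

@triton.jit
def histogram_kernel(idx_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    bins = tl.load(idx_ptr + offs, mask=mask, other=0)
    # Each increment only needs atomicity, not ordering with other memory
    # operations, so a relaxed semantic is sufficient.
    tl.atomic_add(out_ptr + bins, 1, mask=mask, sem="relaxed")
```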

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129133
Approved by: https://github.com/peterbell10
2024-06-21 14:49:58 +00:00
62e5d045c0 [AOTI] Auto-tune Triton kernels in a seperate block (#129057)
Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library.

There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor:
* Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach does add extra overhead for compilation.
* Peak memory size: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effect coming from input mutation. This can raise the high-water mark for memory consumption.
* Missing triton kernel autotuning: Because kernel autotune depends on the kernel being executed in the two-pass approach, some kernels will not be autotuned when a model contains control flow such as torch.if or torch.while.

This PR is the first step towards solving these problems by moving Triton kernel autotuning to the compile time and use random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass, Inductor will generate a separate code just for kernel autotuning, with https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 as an example, and we no longer need to execute the model after the first-pass finishes. After that we rerun a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Truly making the codegen into one-pass will come later once this solution is proven stable and generates as performant kernels as before.

Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057
Approved by: https://github.com/jansel, https://github.com/eellison
2024-06-21 14:34:13 +00:00
9795dba1e0 Optim package docstring fix (#129086)
Fix docstrings in various files in the optim package. This is the last remaining fix for issue #112593.

The fix can be verified by running pydocstyle path-to-file --count

Fixes #112593

Related #128248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129086
Approved by: https://github.com/janeyx99
2024-06-21 14:30:53 +00:00
b697808056 [BE][Easy] eliminate relative import in torchgen (#128872)
Fix generated by:

```bash
ruff check --config 'lint.flake8-tidy-imports.ban-relative-imports="all"' --fix --select=TID $(fd '.pyi?$' torchgen)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128872
Approved by: https://github.com/zou3519
2024-06-21 14:11:46 +00:00
e1c1052829 Backward support for unbind() with NJT (#128032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032
Approved by: https://github.com/soulitzer
2024-06-21 14:05:23 +00:00
27ae1f981d [inductor] fix linear_add_bias for autocast case (#129138)
Previously `linear_add_bias` only supported the case where the added tensor is `bfloat16`.

```
        class M(torch.nn.Module):
            def __init__(self, dtype):
                super().__init__()
                self.linear1 = torch.nn.Linear(10, 64, bias=False)
                self.bias1 = torch.randn(64).bfloat16()  # if the bias is not bf16, we will crash

            def forward(self, x):
                return self.linear1(x) + self.bias1
```
For `Autocast(bf16)` cases, `self.bias1` will not be converted to bf16. We also did not check the dtypes of the weight and bias in the pattern matcher, which leads to an error if the weight is bf16 while the bias is fp32.

We have 2 options to resolve this:
 - Check the bias/weight dtypes, and only fold the bias when they are the same dtype
 - Fold them even when the dtypes differ, by inserting a to_dtype on the bias node to force it to have the same dtype as the weight

This PR chose option 1, since implicitly casting the bias to bf16 here would lose precision.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129138
Approved by: https://github.com/jgong5
2024-06-21 14:04:30 +00:00
5d8e23b49c [custom_op] Support string default values in schema (#129179)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129179
Approved by: https://github.com/albanD
ghstack dependencies: #129177, #129178
2024-06-21 13:31:40 +00:00
08b616281f [custom ops] Switch out references from old landing page to new landing page (#129178)
Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129178
Approved by: https://github.com/albanD
ghstack dependencies: #129177
2024-06-21 13:31:40 +00:00
311fadb1fb [docs] Redirect custom ops landing page to the correct place (#129177)
I'm moving it to pytorch/tutorials
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129177
Approved by: https://github.com/albanD
2024-06-21 13:31:32 +00:00
217aac96d7 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses it) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-21 08:49:11 +00:00
f0443ad174 [compiled autograd] flatten runtime inputs with fast path (#129116)
covered by test_compiled_autograd.py and test_standalone_compile.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129116
Approved by: https://github.com/jansel
ghstack dependencies: #127960, #128905, #128982, #128987, #129181
2024-06-21 08:16:33 +00:00
d97dfe9313 [compiled autograd] move inputs to cuda with non_blocking=True (#129181)
non_blocking=True requires the tensors to be pinned first, which shouldn't be a problem given that they are CPU scalars.
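
A minimal illustration of the pin + async copy pattern on its own (outside of compiled autograd):

```python
import torch

if torch.cuda.is_available():
    cpu_scalar = torch.tensor(1.5).pin_memory()            # pinning enables async H2D copies
    gpu_scalar = cpu_scalar.to("cuda", non_blocking=True)  # overlaps with other GPU work
```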

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129181
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #127960, #128905, #128982, #128987
2024-06-21 08:16:33 +00:00
8f320fd6c6 [compiled autograd] treat input params as static (#128987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128987
Approved by: https://github.com/eellison, https://github.com/BoyuanFeng
ghstack dependencies: #127960, #128905, #128982
2024-06-21 08:16:33 +00:00
fafa1867d1 [compiled autograd] use in_compiled_autograd_region instead of compiled_autograd_enabled_count (#128982)
The current implementation of compiled_autograd_enabled_count affects the entire region under the context manager. So if the context manager wraps torch.compile calls unrelated to the backward, they are affected too:
- no lazy compile for the compiled forward
- no AOT autograd cache for inference graphs

We instead maintain a flag while executing the compiled backward callable, to isolate the special handling to the compiled backward graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128982
Approved by: https://github.com/jansel
ghstack dependencies: #127960, #128905
2024-06-21 08:16:33 +00:00
68b33453f4 [aot autograd] collect static parameter metadata when graphs fallback to inference (#128905)
Same as https://github.com/pytorch/pytorch/pull/126820, but for graphs that have requires_grad inputs but no requires_grad outputs, i.e. inference graphs.

The implementation of the inference graph fallback was throwing away the static parameter information during metadata recomputation.

Also adds a cudagraphs counter to make this easier to test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128905
Approved by: https://github.com/mlazos
ghstack dependencies: #127960
2024-06-21 08:16:33 +00:00
123812790b [compiled autograd] update benchmarks to use cli flags for fullgraph/dynamic (#127960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127960
Approved by: https://github.com/jansel
2024-06-21 08:16:33 +00:00
aee512cc9d [dtensor][op] Fixed stack op strategy (#129018)
**Summary**
The previous stack op strategy was causing the input to be resharded, resulting in a list index out of range error. I delayed the resharding until after the input_specs were created so that the new dimension could be inserted, preventing the error above. I have also run all the other test cases to ensure the changes did not introduce any new bugs.

**Test Plan**
pytest test/distributed/_tensor/test_tensor_ops.py -s -k test_stack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129018
Approved by: https://github.com/XilunWu
2024-06-21 08:10:28 +00:00
6b5fbc544e [dynamo] Use polyfill to trace through the attributes of torch.jit.* and lru_cache_wrapper (#128336)
Earlier we were taking the VT for `obj` and then monkeypatching that `vt.source` to be `obj._torchdynamo_inline`. If one accesses `obj.attr_a`, this would cause problems because Dynamo would then search for it in `obj._torchdynamo_inline.attr_a`. This PR makes it more functional, so that we have different VTs for `obj` and `obj._torchdynamo_inline`.

Fixes https://github.com/pytorch/pytorch/issues/93698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128336
Approved by: https://github.com/jansel, https://github.com/yanboliang
ghstack dependencies: #129117
2024-06-21 07:44:44 +00:00
914d3ca2ba [inductor][cpp] BF16 AMX micro-gemm support (#127195)
This PR adds an intrinsics-based micro-gemm for BF16 using Advanced Matrix Extensions (AMX) instructions available in 4th and 5th generation Intel Xeon processors. A compilation check is added to `codecache.py` to verify compiler support. Also, since AMX requires asking the Linux kernel to enable its extra register state, an initialization function is added to do that, triggered via `codecache.py`.

Performance speedups with >=10% on BF16 AMP, max_autotune vs. no autotune, measured on Intel(R) Xeon(R) Platinum 8488C:
Static shapes
Single-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.54 |
| timm_models | convit_base | 1.53 |
| huggingface | MobileBertForQuestionAnswering | 1.52 |
| torchbench | fastNLP_Bert | 1.44 |
| torchbench | llama | 1.33 |
| timm_models | swin_base_patch4_window7_224 | 1.31 |
| torchbench | dlrm | 1.28 |
| torchbench | timm_vision_transformer_large | 1.28 |
| huggingface | MobileBertForMaskedLM | 1.27 |
| timm_models | vit_base_patch16_224 | 1.26 |
| timm_models | beit_base_patch16_224 | 1.23 |
| timm_models | jx_nest_base | 1.21 |
| torchbench | pyhpc_equation_of_state | 1.18 |
| huggingface | Speech2Text2ForCausalLM | 1.15 |
| timm_models | pit_b_224 | 1.14 |
| timm_models | twins_pcpvt_base | 1.14 |
| torchbench | maml_omniglot | 1.1 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.35 |
| torchbench | lennard_jones | 2.43 |
| torchbench | hf_Albert | 1.35 |
| torchbench | hf_T5 | 1.34 |
| torchbench | soft_actor_critic | 1.34 |
| torchbench | fastNLP_Bert | 1.28 |
| huggingface | LayoutLMForSequenceClassification | 1.26 |
| torchbench | llama | 1.24 |
| huggingface | GPT2ForSequenceClassification | 1.19 |
| torchbench | hf_Bart | 1.17 |
| torchbench | hf_Bert_large | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| timm_models | gmixer_24_224 | 1.16 |
| torchbench | hf_GPT2_large | 1.15 |
| torchbench | maml_omniglot | 1.14 |
| torchbench | hf_Bert | 1.13 |
| torchbench | hf_DistilBert | 1.13 |
| torchbench | hf_T5_large | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.11 |

Dynamic shapes
Single-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| timm_models | mixer_b16_224 | 1.52 |
| timm_models | convit_base | 1.5 |
| huggingface | MobileBertForQuestionAnswering | 1.49 |
| torchbench | fastNLP_Bert | 1.42 |
| torchbench | timm_vision_transformer_large | 1.28 |
| timm_models | swin_base_patch4_window7_224 | 1.27 |
| torchbench | llama | 1.26 |
| huggingface | MobileBertForMaskedLM | 1.25 |
| timm_models | vit_base_patch16_224 | 1.25 |
| timm_models | beit_base_patch16_224 | 1.24 |
| timm_models | jx_nest_base | 1.2 |
| torchbench | dlrm | 1.19 |
| timm_models | pit_b_224 | 1.13 |
| timm_models | twins_pcpvt_base | 1.13 |
| torchbench | hf_Bert_large | 1.12 |
| torchbench | hf_BigBird | 1.11 |
| huggingface | Speech2Text2ForCausalLM | 1.11 |
| timm_models | eca_botnext26ts_256 | 1.11 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| torchbench | BERT_pytorch | 1.18 |
| torchbench | lennard_jones | 2.18 |
| torchbench | hf_Albert | 1.37 |
| torchbench | soft_actor_critic | 1.31 |
| huggingface | GPT2ForSequenceClassification | 1.29 |
| torchbench | hf_T5 | 1.28 |
| torchbench | fastNLP_Bert | 1.27 |
| torchbench | hf_Bart | 1.21 |
| torchbench | hf_Bert_large | 1.19 |
| torchbench | hf_T5_large | 1.19 |
| torchbench | hf_Bert | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| huggingface | CamemBert | 1.16 |
| torchbench | hf_GPT2_large | 1.13 |
| torchbench | functorch_maml_omniglot | 1.12 |
| huggingface | BertForMaskedLM | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.12 |
| torchbench | hf_DistilBert | 1.11 |
| timm_models | mixnet_l | 1.11 |
| timm_models | tf_mixnet_l | 1.11 |

No perf regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127195
Approved by: https://github.com/jansel
2024-06-21 07:21:47 +00:00
632910e2a8 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI-compatible mode, since they can actually run successfully in that mode and only have issues in ABI-compatible mode.

We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-21 07:19:28 +00:00
62e425ab03 Memory Tracker for tracking Module wise memory (#124688)
We present a utility MemTracker, that tracks the module-wise memory for the code executed under its context. The core features that this tool aims to provide are:

1. Capturing 'snapshots' of memory for each module during its execution. Specifically, at up to 8 points: pre-forward, post-forward, pre-backward, 2nd pre-forward (if AC is applied), 2nd post-forward (if AC is applied), and post-backward, plus peak memory snapshots during forward and backward.
2. Each such snapshot provides the per device (cpu, cuda etc) memory breakdown in terms of the global parameters, gradients, activations, optimizer states and temporary memory.
3. A summary for each module (that can be analyzed or processed later), in terms of the memory occupied by its own parameters, buffers, inputs and outputs. The remaining components can be derived from these per module attributes and its corresponding captured snapshots.
4. Record the global peak memory consumption per device and their respective breakdowns.
5. Ability to do all of this under the FakeTensorMode so that all these statistics can be obtained without executing code on real data.
6. Ability to register and track modules, optimizers and any other tensors that are created outside the context of MemTracker.
7. Ability to capture a custom memory snapshot at any point during program execution.
8. Utility functions to display all of these statistics in user-friendly and human readable manner.

These features will enable users to anticipate OOMs, debug and pinpoint where the majority of memory comes from, and experiment with different activation checkpointing policies, batch sizes, mixed precision, model architecture features (e.g. number of layers, hidden dimensions, number of attention heads) and inter-device memory movement (e.g. CPU off-loading), among others. Basically anything and everything related to device memory.

* __->__ #128508

Example:

```python
import torch
import torchvision.models as models
from torch.distributed._tools.mem_tracker import MemTracker
device, dtype = "cuda", torch.float32
with torch.device(device):
    model = models.resnet18().to(dtype=dtype)
optim = torch.optim.Adam(model.parameters(), foreach=True)
mem_tracker = MemTracker()
mem_tracker.track_external(model, optim)
with mem_tracker as mt:
    for i in range(2):
        input_batch = torch.randn(256, 3, 224, 224, device=device, dtype=dtype)
        model(input_batch).sum().backward()
        optim.step()
        optim.zero_grad()
        if i == 0:
            # to account for lazy init of optimizer state
            mt.reset_mod_stats()
mt.display_snapshot("peak", units="MiB", tabulate=True)
mt.display_modulewise_snapshots(depth=2, units="MiB", tabulate=True)
# Check for accuracy of peak memory
tracker_max = mt.get_tracker_snapshot('peak')[device]['Total']
cuda_max = torch.cuda.max_memory_allocated()
accuracy = tracker_max / cuda_max
print(f"Tracker Max: {tracker_max}, CUDA Max: {cuda_max}, Accuracy: {accuracy}")
```

Output

<img width="1197" alt="Screenshot 2024-06-15 at 12 10 12 AM" src="https://github.com/pytorch/pytorch/assets/12934972/83e953db-43dc-4094-90eb-9f1d2ca8e758">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124688
Approved by: https://github.com/awgu
2024-06-21 07:15:32 +00:00
2b1b055a96 [Split Build] Fix libtorch_python RPATH (#129088)
In the split build we end up with an incorrect RPATH for `libtorch_python.so`. This PR fixes said RPATH.

What the rpath should look like:
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/main_so_files/libtorch_python.so | grep "RPATH"                        (pytorch-3.10)
  RPATH                /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```

Before

```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/split_so_files/libtorch_python.so | grep "RPATH"                       (pytorch-3.10)
  RPATH                /home/sahanp/pytorch/torch/lib:/home/sahanp/pytorch/build/lib:
```

After
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p build/lib/libtorch_python.so | grep "RPATH"                              (pytorch-3.10)
  RPATH                /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/pytorch/torch/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```

Testing that this works is done in the above PR. Similarly, after running ciflow/binaries, the output of `objdump -p` should not change: https://www.diffchecker.com/14PRmCNz/ (checked manywheel py3.10 cuda 12.1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129088
Approved by: https://github.com/malfet
2024-06-21 06:49:19 +00:00
c008488b9c [dynamo][guards] Dont run TYPE_MATCH for DICT_LENGTH C++ guard (#129163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129163
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-06-21 06:27:19 +00:00
cyy
5c676bb8b3 Remove Caffe2 handling from onnx_unpack_quantized_weights (#129021)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129021
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-06-21 06:16:44 +00:00
3a2fdbb142 [dynamo] - Add JK killswitch for dynamo compilation. (#128538)
This allows easy disablement of dynamo in emergency situations where env variables are hard to set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128538
Approved by: https://github.com/jansel
2024-06-21 06:14:06 +00:00
f73b451e78 Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013)"
This reverts commit ff89ebc50a738c734496393dc25313cf197fd0b4.

Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/huydhn due to Sorry for reverting your change but one of the test_torchinductor_opinfo test starts to fail after this commit ff89ebc50a, I am reverting to see if it helps trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2182042422))
2024-06-21 05:46:46 +00:00
b542825066 Enable deterministic support for oneDNN (#127277)
This PR is a part of RFC https://github.com/pytorch/pytorch/issues/114848.
As requested for TorchBench models, this PR enables the deterministic attribute for the oneDNN operators on the XPU backend, such as convolution, deconvolution and matmul.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127277
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui
2024-06-21 05:21:24 +00:00
e8dbb45e98 [dynamo][user-defined-object] Check that object is valid (#129117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129117
Approved by: https://github.com/yf225
2024-06-21 04:18:54 +00:00
cyy
e99a24ce7c Remove TensorImpl_test.cpp (#129054)
It's not used because of removal of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129054
Approved by: https://github.com/albanD, https://github.com/malfet
2024-06-21 04:17:36 +00:00
880e894c39 [Brian's PR #128981] fix dynamo isinstance inlining for nn.Parameter + subclasses (#129162)
This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128981, with very small changes to work around numpy related errors.

For discussions, please see Brian's original PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129162
Approved by: https://github.com/bdhirsh
2024-06-21 03:48:10 +00:00
8cd9b10456 Fix exp decomp numerics (#129154)
Our previous implementation would sometimes generate `inf` because we did not apply the same numerics tricks as eager:

See comment / [link](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/TransformationHelper.h#L123-L144) :
```
    # curand_uniform has (0,1] bounds. log(1) is 0 and exponential excludes 0.
    # we need log to be not 0, and not underflow when converted to half
    # fast __logf approximation can underflow, so set log to -epsilon/2 for 1 or close to 1 args
```

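A rough sketch of the trick described in that comment (illustrative only, not the actual decomposition code; the clamping threshold is an assumption):

```python
import torch

def exponential_sketch(u: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    # u is uniform in (0, 1]; an exponential sample is -log(u) / lambd.
    # log(1) == 0 would produce exactly 0, which the exponential distribution
    # excludes, and a fast log can underflow near 1 in low precision, so nudge
    # the log away from 0 for arguments at or near 1.
    eps = torch.finfo(u.dtype).eps
    log_u = torch.where(u >= 1.0 - eps / 2, torch.full_like(u, -eps / 2), torch.log(u))
    return -log_u / lambd
```
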
Fix for https://github.com/pytorch/pytorch/issues/127749.

Added a test for non-inf, but it would be great to have more robust decomp distribution tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129154
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2024-06-21 03:21:30 +00:00
ff89ebc50a Improved flexattention bwd perf + added configurations for benchmarks (#129013)
Before:
<img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb">

After:
<img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534">

After fixing strides:
![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #128938
2024-06-21 03:01:16 +00:00
0acd09aecd [torchrec][pt-d][model store] introduce LocalShardsWrapper for DTensor (#129150)
Summary:
Same as D57688538, recreated because of GH issues

This diff introduces LocalShardsWrapper, which is crucial to migrating from ShardedTensor to DTensor in the TRec state dict representation, as well as any changes needed in PT-D and ModelStore to support this.

It allows us to extend DTensor to support multiple shards on a rank as well as empty shards on a rank as needed by TRec sharding logic.

This diff also extends the support for LocalShardsWrapper to be used in conjunction with DTensor in checkpointing cases (ModelStore and DCP)

See D54375878 for how it is used.

**LocalShardsWrapper supports the following torch ops:**
+ torch.ops._c10d_functional.all_gather_into_tensor.default
+ aten._to_copy.default
+ aten.view.default
+ aten.equal.default
+ aten.detach.default

With extensibility to add more as required by use cases.

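A minimal sketch of the dispatch mechanism such a wrapper relies on (the class name, constructor, and the two ops handled below are illustrative assumptions, not the actual torchrec implementation):

```python
import torch

class ShardsWrapperSketch(torch.Tensor):
    # Wrapper subclass holding a list of local shards; the wrapper's metadata
    # mirrors the logical (global) tensor while storage lives in the shards.
    @staticmethod
    def __new__(cls, local_shards, global_size):
        elem = local_shards[0]
        wrapper = torch.Tensor._make_wrapper_subclass(
            cls, global_size, dtype=elem.dtype, device=elem.device
        )
        wrapper._local_shards = list(local_shards)
        return wrapper

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        self = args[0]
        if func is torch.ops.aten.detach.default:
            return cls([s.detach() for s in self._local_shards], self.shape)
        if func is torch.ops.aten._to_copy.default:
            return cls([func(s, **kwargs) for s in self._local_shards], self.shape)
        raise NotImplementedError(f"{func} is not handled by this sketch")
```

For example, `ShardsWrapperSketch([torch.randn(2, 4), torch.randn(2, 4)], (4, 4)).detach()` routes through the `aten.detach.default` branch and returns a new wrapper over detached shards.
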
See https://docs.google.com/document/d/16Ptl50mGFJW2cljdF2HQ6FwsiA0scwbAbjx_4dhabJw/edit?usp=drivesdk for more info regarding design and approach.

NOTE: This version of LocalShardsWrapper does not support empty shards; that is added in the next diff, which enables CW (D57063512).

Test Plan:
` buck test mode/opt -c python.package_style=inplace aiplatform/modelstore/client/tests_gpu:dist_checkpoint_save_load_with_stateful_tests -- --print-passing-details`

`buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_tensor_configs -- --print-passing-details`

Sandcastle

Reviewed By: XilunWu, wanchaol

Differential Revision: D58570479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129150
Approved by: https://github.com/XilunWu
2024-06-21 01:58:51 +00:00
31c9e3d2f4 [FSDP][Test] Test save model save with FSDP1 and load into FSDP2 applied model (#129028)
A lot of models have already been saving the model state in FULL_STATE_DICT mode with FSDP1 in APF. This unit test is just to demonstrate the FSDP1 -> FSDP2 transition. The use of deprecated APIs in this test is intentional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129028
Approved by: https://github.com/awgu, https://github.com/fegin
2024-06-21 01:40:58 +00:00
8758fedbfc [export] copy sym ops when respecting call module signature (#129153)
Summary:
Export, through AOTAutograd, [deduplicates](11ff5345d2/torch/fx/experimental/proxy_tensor.py (L198)) sym_size calls, which can cause issues during unflattening when the sym_size node is used in multiple submodules.

If preserve_call_module_signature is set, these nodes can't be passed between submodules as placeholders, so the calls (and any downstream un-duplicated nodes) must be copied. This PR adds that to the unflattener.

Test Plan: export unflatten test case

Reviewed By: TroyGarden, angelayi

Differential Revision: D58697231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129153
Approved by: https://github.com/angelayi
2024-06-21 01:40:22 +00:00
5da428d9eb [cpu][flash attention] fix attention mask issue (#128816)
For attention mask in flash attention:

- Fix illegal memory access when the last dimension of the mask has size 1 (a minimal shape sketch follows below).
- Add UT of attention mask for various shapes.

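A minimal shape sketch of the masked case described above (tensor sizes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 2, 16, 32)
k = torch.randn(1, 2, 16, 32)
v = torch.randn(1, 2, 16, 32)
# Additive float mask whose last dimension is 1; it broadcasts over the key
# length, which is the mask shape the fix targets.
mask = torch.zeros(1, 1, 16, 1)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```
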
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128816
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-06-21 01:12:48 +00:00
d4022b4658 Revert "[BE] enable UFMT for torch/nn/modules (#128594)"
This reverts commit 95ac2d648279ebc73feccf6d8eccafa4b2759de8.

Reverted https://github.com/pytorch/pytorch/pull/128594 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128594#issuecomment-2181788935))
2024-06-21 00:50:08 +00:00
cc8193c707 Revert "[BE] enable UFMT for torch/nn/functional.py (#128592)"
This reverts commit f6e6e55fa7d883a89ba99584f8632c260519ba73.

Reverted https://github.com/pytorch/pytorch/pull/128592 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128592#issuecomment-2181783936))
2024-06-21 00:44:16 +00:00
9c929f6ce9 Revert "[BE][Easy] enable UFMT for torch/distributed/ (#128870)"
This reverts commit a0e1e20c4157bb3e537fc784a51d7aef1e754157.

Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))
2024-06-21 00:38:28 +00:00
9dd8f8cf8b [cpuinfo][submodule] bump cpuinfo to the latest to support amx isa check (#127505)
Fix https://github.com/pytorch/pytorch/issues/127368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127505
Approved by: https://github.com/ezyang
2024-06-21 00:17:44 +00:00
c027c8935b [distributed] NCCL result code update (#128777)
The nccl result codes are outdated. This PR fixes #128756.

Fixes #128756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128777
Approved by: https://github.com/Skylion007
2024-06-20 23:51:39 +00:00
43060a1dbc Add shard support to test_inductor (#129160)
I added one more shard for inductor tests earlier in https://github.com/pytorch/pytorch/pull/129108, but didn't realize that the second shard didn't run any inductor tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129160
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-06-20 23:41:00 +00:00
31d5753247 Short-term fix to preserve NJT metadata cache in torch.compile (#122836)
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.

For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.

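A hedged illustration of the `(val, 0)`-shaped-tensor trick (not the actual NJT code): encoding an integer as the size of an empty tensor lets tracing treat it as a (potentially symbolic) shape rather than a baked-in constant.

```python
import torch

def encode_int_as_shape(val: int) -> torch.Tensor:
    # No elements are allocated; only the shape carries information.
    return torch.empty((val, 0))

def decode_int_from_shape(t: torch.Tensor) -> int:
    return t.shape[0]

cached_max_seqlen = encode_int_as_shape(17)
assert decode_int_from_shape(cached_max_seqlen) == 17
```
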
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**

Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
2024-06-20 23:15:53 +00:00
63a724d8e1 Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 8771e3429c3d7327f08c48d547ad73546d5603b3.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2181656181))
2024-06-20 22:31:29 +00:00
5fba5d83f0 add xpu for amp (#127276)
As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to AMP doc.

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127276
Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/malfet
2024-06-20 21:49:35 +00:00
adc14adb88 Fix flakiness with test_binary_op_list_error_cases (#129003)
So how come this PR fixes any flakiness?

Well, following my investigation (read pt 1 in the linked ghstack PR below), I had realized that this test only consistently errors after another test was found flaky.

Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following this PR https://github.com/pytorch/pytorch/pull/119408. And yeah, this test checked for exact error message matching, which would no longer match, since the stacktrace for a foreach function is obviously going to differ from that of a non-foreach one.

So we improve the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129003
Approved by: https://github.com/soulitzer
2024-06-20 21:48:22 +00:00
61fa3de4cb ci: Hardcode runner-determinator (#128985)
Hardcode the runner-determinator script for testing ALI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128985
Approved by: https://github.com/ZainRizvi
2024-06-20 21:14:23 +00:00
aace8ffc00 Revert "[BE] enable UFMT for torch/nn/*.py (#128593)"
This reverts commit a87d82abd746240e7b46b992fa9df7ae6d3e6d4a.

Reverted https://github.com/pytorch/pytorch/pull/128593 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128593#issuecomment-2181562604))
2024-06-20 21:09:44 +00:00
f2f4dde2d3 [dynamo] Remove ID_MATCH for FSDPModuleVariable (#129015)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129015
Approved by: https://github.com/yf225
ghstack dependencies: #129098
2024-06-20 19:23:32 +00:00
e84cf805d2 Revert "Modularize aten parameter parser and checker (#125308)"
This reverts commit 60bbdc0b40656cf70b2b098c7d715e19f031fb0d.

Reverted https://github.com/pytorch/pytorch/pull/125308 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125308#issuecomment-2181327211))
2024-06-20 18:52:05 +00:00
254487f288 Revert "Separate AOTI Eager utils as a single file (#125819)"
This reverts commit 18634048a1f939a961b7c96b0acfe78b474c821e.

Reverted https://github.com/pytorch/pytorch/pull/125819 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125819#issuecomment-2181317332))
2024-06-20 18:49:08 +00:00
73340f0909 Revert "[3/N] Non-Tensor: Support string parameter for aten operations (#125831)"
This reverts commit a52c8ace98afe76dc9e2c330b415972fd1529077.

Reverted https://github.com/pytorch/pytorch/pull/125831 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125831#issuecomment-2181313892))
2024-06-20 18:45:41 +00:00
8c2542623b [Traceable FSDP2] [Dynamo] Add tracing support for out-variant custom ops that return None (#129078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129078
Approved by: https://github.com/yanboliang
2024-06-20 17:46:13 +00:00
734891ac22 Fix export log script (#128967)
Summary: Title

Test Plan: CI

Differential Revision: D58699557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128967
Approved by: https://github.com/jiashenC
2024-06-20 17:01:00 +00:00
ddb95dbb0d Fixing equalize with three things and improving functionality (#124632)
Summary:
(1) Make code work when a first layer does not have a bias.
(2) Make it possible to provide both modules and module names as input
(3) Allow sequences of contiguous layers as input, that then get split into pairs
(4) fix documentation to be more clear on inputs to be provided

Test Plan:
Run this new version of the algorithm on a network and see if it throws errors.

There's also this notebook to run and test N5199827

If you tell me where I can find the tests for this code, I can add some simple unit tests as well.

Differential Revision: D55895862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124632
Approved by: https://github.com/jerryzh168
2024-06-20 16:55:56 +00:00
832fc35211 Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013)"
This reverts commit 6d2b3c90f144d7b77d51da27e6696192b2b97ebd.

Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing a flexattention test to fail on ROCm. Can you please fix that test before remerging this in? See 6d2b3c90f1 for details ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2181133070))
2024-06-20 16:51:41 +00:00
65286883d4 [export] reland "experimental joint graph API." (#129081)
Summary: the previous diff got reverted despite CI being green.

Test Plan: CI

Differential Revision: D58790048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129081
Approved by: https://github.com/tugsbayasgalan
2024-06-20 16:50:53 +00:00
fc5b0ff2d7 [BE][Hackaday] deprecate legacy cuda docker image (#128859)
Fixes https://github.com/pytorch/builder/issues/1795 from the pytorch side specifically for the cuda image

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128859
Approved by: https://github.com/atalman
2024-06-20 16:30:49 +00:00
b2a9b8d485 [CpuInductor] Enable NEON ISA detection on Linux ARM (#129075)
Also, clean up the code a bit to use `x in [y, z]` instead of `x == y or x == z`.

And do not redefine `at_align`, but instead use `alignas(64)` as was suggested in https://github.com/pytorch/pytorch/pull/128686/files#r1639365978

Test plan: `python3 -c "import torch._inductor.codecache as cc; isa = cc.valid_vec_isa_list()[0];print(str(isa), bool(isa))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129075
Approved by: https://github.com/jansel
2024-06-20 16:22:57 +00:00
e0aa992d73 Fix inductor and deploy jobs timing out (#129108)
Some trunk and periodic jobs are timing out at the moment, including:

* `deploy`.  This is because https://github.com/pytorch/pytorch/pull/127952 has removed `deploy` config, but there is one left over in periodic.
    * [periodic / linux-focal-cuda12.4-py3.10-gcc9 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu](https://github.com/pytorch/pytorch/actions/runs/9525590191/job/26260620457).
* `inductor`, including `py3.10`, `py3.12`, and `cuda12.1`, `cuda12.4`.  The increase comes from this change https://github.com/pytorch/pytorch/pull/128343, so I add another GPU shard.
    * [inductor / cuda12.1-py3.12-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9522817887/job/26255069269)
    * [inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9524651902/job/26260009757)
    * [inductor-cu124 / cuda12.4-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9587982228/job/26440205869)
    * [inductor-cu124 / cuda12.4-py3.12-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9587982228/job/26440634200)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129108
Approved by: https://github.com/malfet
2024-06-20 16:03:11 +00:00
2bb8ee602b Fix DEBUG=1 asserts with NJT ops (#129014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129014
Approved by: https://github.com/YuqingJ, https://github.com/soulitzer
2024-06-20 15:15:28 +00:00
7178b4e987 [Dynamo x torch_function] fix incorrect source (#128980)
Fixes https://github.com/pytorch/pytorch/issues/128964

The problem was that we were installing the source for a type
incorrectly.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128980
Approved by: https://github.com/mlazos
2024-06-20 14:54:00 +00:00
ea47d542ca [dynamo][guards] Remove BOOL_FALSE - not needed after C++ guards (#129098)
PyDict_Size is very fast. Earlier, with Python guards, CPython would go through layers of fluff before finally calling PyDict_Size. With C++ guards, that is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129098
Approved by: https://github.com/jansel
2024-06-20 14:40:27 +00:00
54b0006cb2 Evaluate symexprs on load path of cache not write (#128997)
When caching is enabled, an internal model fails with
```
assert_size_stride(bmm_9, (17, s0, 512), (54784, 512, 1))
AssertionError: expected size 17==17, stride 57344==54784 at dim=0
```
Looking at this model, the exact problem is that when the cache is hit on the forward graph, the generated code for backward fails, since the strides of the forward outputs, passed to backward as inputs, are not what we expected.

This PR changes the evaluation logic so that we defer evaluation of the output stride exprs to the load path, as opposed to eagerly doing it on the save path.

I have not been able to come up with a unit test repro for this problem.

Differential Revision: [D58796503](https://our.internmc.facebook.com/intern/diff/D58796503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128997
Approved by: https://github.com/ezyang
2024-06-20 08:55:12 +00:00
799acd31b4 [MPS] Add lu_factor (#99269)
### <samp>🤖 Generated by Copilot at d75cde1</samp>

Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269
Approved by: https://github.com/kulinseth, https://github.com/lezcano
2024-06-20 07:35:29 +00:00
0d25f096c1 [CppInductor] Fix erfinv codegen when non-vectorized isa (#129090)
Fix erfinv codegen when ISA could not be detected

Manual test plan (on MacOS):
 - Modify `valid_vec_isa_list` to return empty list
 - Run `python3 inductor/test_torchinductor_opinfo.py -v -k test_comprehensive_erfinv_cpu_bool`

Before this change, abovementioned test will fail with
```
Output:
/var/folders/rk/fxg20zvx6vvb5bk7cplq4xrc0000gn/T/tmpgic60b6c/ns/cnsp7snp7fyclkm5lsfiyiv3m6c3svevkbhcb3v7pijdfjwlyaij.cpp:11:25: error: use of undeclared identifier 'calc_erfinv'
            auto tmp2 = calc_erfinv(tmp1);
                        ^
1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129090
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-20 06:09:48 +00:00
6d2b3c90f1 Improved flexattention bwd perf + added configurations for benchmarks (#129013)
Before:
<img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb">

After:
<img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534">

After fixing strides:
![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #128938
2024-06-20 05:15:48 +00:00
ad2593cb86 [Animesh's PR #125340] [dynamo][fsdp] Track FSDPNNModuleVariable for mutations (#129045)
This is a copy of Animesh's work in https://github.com/pytorch/pytorch/pull/125340, with very small changes to the unit test. It's needed sooner for the Traceable FSDP2 work, so I copy it here and will work through landing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129045
Approved by: https://github.com/anijain2305
2024-06-20 04:02:36 +00:00
19f3abcde4 [Docs][MPS] Add mps environment variable table (#129008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129008
Approved by: https://github.com/malfet
ghstack dependencies: #129006
2024-06-20 03:30:35 +00:00
609ffaf717 Add more shards for slow CPU and ROCm jobs (#128873)
As they start to time out in trunk (fc2913fb80/1). Adding one more shard for the slow CPU job is trivial. ROCm runners are harder to find, but I assume this is OK because slow jobs only run periodically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128873
Approved by: https://github.com/PaliC
2024-06-20 03:13:19 +00:00
d8db074988 [Traceable FSDP2] [Dynamo] Fix OptimizedModule._initialize to allow tracing into FSDP2 module hooks for module from user-defined module class (#129046)
This is a workaround to allow inplace fully-sharded module to still go into this branch:
3a185778ed/torch/_dynamo/eval_frame.py (L163)
instead of the second branch:
3a185778ed/torch/_dynamo/eval_frame.py (L166)

If we don't do this, `torch.compile(fully_shard(module_from_user_defined_module_class))` will ignore all module hooks which will break FSDP tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129046
Approved by: https://github.com/anijain2305
2024-06-20 00:15:55 +00:00
859fa183fe BE: Use future annotations in inductor scheduler and ir (#128892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128892
Approved by: https://github.com/lezcano
2024-06-20 00:10:43 +00:00
a2b1673dfb [Horace's PR #126446] Prevent partitioner from ever saving views (#129039)
Most work is done by Horace in https://github.com/pytorch/pytorch/issues/126446, this PR just additionally adds the config for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129039
Approved by: https://github.com/Chillee
2024-06-19 23:21:16 +00:00
9d06e3783d [Inductor][CPP] Fix the symbolic size cast issue in GEMM Benchmark (#128824)
**Summary**
The symbolic size generated from the size hint (a Python int) differs from the C `long` type of the kernel args, which may cause the benchmark to fail to run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128824
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-19 23:11:53 +00:00
a6ac6447b5 Re-enable py3.12 nightly wheel builds and add triton dependency for ROCm (#128525)
The llnl-hatchet developers have published the py3.12 binaries on [PyPI](https://pypi.org/project/llnl-hatchet/#files). In fact, looking [here](https://download.pytorch.org/whl/nightly/llnl-hatchet), it seems we already have the py3.12 wheels mirrored. This should allow us to re-enable py3.12 binaries for ROCm.

This PR reverts commit 9d849d4312cd1e62d97b9e9d58979ec78d36c95f.

It also adds the pytorch-triton-rocm dependency for torch wheels on ROCm since pytorch-triton-rocm py3.12 wheels are available now

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128525
Approved by: https://github.com/malfet
2024-06-19 21:56:54 +00:00
571a0db132 [inductor] Fix logging for run_and_get_cpp_code (#128794)
Summary: Found during testing with remote caching: Use the same output logger object between graph.py and codecache.py since it's patched in `run_and_get_cpp_code`. That allows us to capture any logging produced from the codecache path when using `run_and_get_cpp_code`. I'm also fixing a few tests that were passing mistakenly because logging was missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128794
Approved by: https://github.com/oulgen, https://github.com/leslie-fang-intel
2024-06-19 21:32:34 +00:00
cyy
277f2914a5 [9/N] Remove unused functions (#128704)
MKL cannot be enabled on aarch64, and since CI compiles code with `-Werror=unused-function`, the build fails with
```
/usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/third_party/foxi -I/var/lib/jenkins/workspace/build/third_party/foxi -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-2.1.0 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib -I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. -I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/FP16/include -I/var/lib/jenkins/workspace/third_party/tensorpipe -I/var/lib/jenkins/workspace/build/third_party/tensorpipe -I/var/lib/jenkins/workspace/third_party/tensorpipe/third_party/libnop/include -I/var/lib/jenkins/workspace/third_party/fmt/include -I/var/lib/jenkins/workspace/build/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/src/../include -I/var/lib/jenkins/workspace/third_party/flatbuffers/include -isystem /var/lib/jenkins/workspace/build/third_party/gloo -isystem /var/lib/jenkins/workspace/cmake/../third_party/gloo -isystem /var/lib/jenkins/workspace/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/workspace/third_party/protobuf/src -isystem /var/lib/jenkins/workspace/third_party/XNNPACK/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/eigen -isystem /var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/workspace/third_party/ideep/include -isystem /var/lib/jenkins/workspace/build/include -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK 
-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Werror -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -std=gnu++17 -fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -pthread -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/mkldnn/Linear.cpp
/var/lib/jenkins/workspace/aten/src/ATen/native/mkldnn/Linear.cpp:426:15: error: ‘at::Tensor at::native::mkl_linear(const at::Tensor&, const at::Tensor&, const at::Tensor&, const std::optional<at::Tensor>&, int64_t)’ defined but not used [-Werror=unused-function]
  426 | static Tensor mkl_linear(
      |               ^~~~~~~~~~
```

Follows #128499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128704
Approved by: https://github.com/malfet
2024-06-19 20:46:45 +00:00
fca408fa29 s390x vectorization: rework operators (#129066)
Move operators from member functions to free functions. This is needed to fix torch inductor on s390x.

This change fixes tests like
DynamicShapesMiscTests::test_numpy_min_dynamic_shapes from test/dynamo/test_dynamic_shapes.py

This change also fixes a recently introduced build failure on s390x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129066
Approved by: https://github.com/malfet
2024-06-19 20:12:41 +00:00
73f5d2b787 Run ET unit tests on PT CI (#128560)
This is the first PR to add all existing ET unit tests into PT CI. The goal is to improve coverage there and catch breaking changes from PT that could break ET. With this, any future unit tests on ET will automatically be run on PT CI. The duration of the job is now 40+ minutes, not too bad.

This also fixed the failed ET build in https://github.com/pytorch/pytorch/pull/123043.

Adding model coverage is a bit more involved and requires adding new shards, so I will follow up on that in separate PRs.

[T192117506](https://www.internalfb.com/intern/tasks/?t=192117506), with the failed diffs D58295865 and D58394154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128560
Approved by: https://github.com/guangy10, https://github.com/digantdesai
2024-06-19 20:08:58 +00:00
df94d57c0a Revert "[export] experimental joint graph API. (#128847)"
This reverts commit 0707811286d1846209676435f4f86f2b4b3d1a17.

Reverted https://github.com/pytorch/pytorch/pull/128847 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128847#issuecomment-2179326891))
2024-06-19 19:04:36 +00:00
b5d541609d [Memory Snapshot] Add recordAnnotations to capture record_function annotations (#129072)
Summary:
Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.

Test Plan:
CI

Pulled By:
aaronenyeshi

Differential Revision: D55941362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072
Approved by: https://github.com/zdevito
2024-06-19 18:05:41 +00:00
bafd68b4fc [inductor] fix windows python module ext and func export declaration (#129059)
I have run the first Inductor case on Windows, based on the exploration code: https://github.com/pytorch/pytorch/pull/128330
Since some fundamental PRs still need to pass `fb_code` (https://github.com/pytorch/pytorch/pull/128303), this PR lands only part of the exploration code:
1. Fix Windows python module ext type: pyd.
2. Add function export declaration for Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129059
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-19 17:51:32 +00:00
0707811286 [export] experimental joint graph API. (#128847)
Summary:
WARNING: This API is highly unstable and will be subject to change in the future.

Add a protoype to "decompose" an ExportedProgram into a joint graph form, so that we can compute the gradients on this graph.

Test Plan: buck test mode/opt caffe2/torch/fb/export:test_experimental

Differential Revision: D55657917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128847
Approved by: https://github.com/tugsbayasgalan
2024-06-19 16:45:27 +00:00
0fc603ece4 [optim] Fused implementation stability table (#129006)
I'd like to discuss the criteria by which we regard an implementation as stable. If there is no existing standard, my initial proposal would be a 6-month period after the commit to regard it as stable. As a result, Adam and AdamW on CUDA would now be considered stable, while the rest remain in beta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129006
Approved by: https://github.com/malfet
2024-06-19 16:29:49 +00:00
1b92bdd0ea [ALI] [Reland] Use LF runners for Lint (#129071)
Quick experiment with using LF runners for lint jobs.

Picking a set of jobs where infra failures would be obvious to most people (lint)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129071
Approved by: https://github.com/malfet
2024-06-19 16:10:51 +00:00
236fbcbdf4 [Split Build] Test split build in pull CI workflow (#126813)
This PR builds the split build in the pull workflow and runs the appropriate tests against it. A single Linux CPU build and a single GPU build were chosen arbitrarily so as not to add too many tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126813
Approved by: https://github.com/atalman
ghstack dependencies: #127934
2024-06-19 15:57:21 +00:00
7d33ff59ba [Split Build]Use same package (#127934)
This PR removes the second separate package we were using for the libtorch wheel.
To test that this works, we will use the PRs above this one in the stack.

As for sanity checking these are the wheels that are produced by running
```
python setup.py clean && BUILD_LIBTORCH_WHL=1 with-proxy python setup.py bdist_wheel && BUILD_PYTHON_ONLY=1 with-proxy python setup.py bdist_wheel --cmake
```

```
sahanp@devgpu086 ~/pytorch ((5f15e171…))> ls -al dist/                                                        (pytorch-3.10)
total 677236
drwxr-xr-x 1 sahanp users       188 Jun  4 12:19 ./
drwxr-xr-x 1 sahanp users      1696 Jun  4 12:59 ../
-rw-r--r-- 1 sahanp users  81405742 Jun  4 12:19 torch-2.4.0a0+gitca0a73c-cp310-cp310-linux_x86_64.whl
-rw-r--r-- 1 sahanp users 612076919 Jun  4 12:19 libtorch-2.4.0a0+gitca0a73c-py3-none-any.whl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127934
Approved by: https://github.com/atalman
2024-06-19 15:57:21 +00:00
lyb
ffb50fb691 [ONNX] Add onnx::Gelu support for version 20 (#128773)
Fixes https://github.com/pytorch/pytorch/issues/128772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128773
Approved by: https://github.com/justinchuby
2024-06-19 15:39:02 +00:00
3397d5ef90 Revert "[ALI] Use lf runners for Lint" (#129070)
Reverts pytorch/pytorch#128978
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129070
Approved by: https://github.com/atalman
2024-06-19 14:48:16 +00:00
118f9ceb7c [inductor][ci] Fix torchbench dependency issue with numpy (#128968)
For some reason, pip will always upgrade the numpy version even when an older version has been installed.
We have to pin numpy to the older version to make this constraint explicit.

Torchbench commit: 23512dbebd

Second attempt to fix #128845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128968
Approved by: https://github.com/eellison
2024-06-19 12:10:50 +00:00
e49525275d Make TraceUtils.h to be device-agnostic (#126969)
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.

In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
2024-06-19 09:06:49 +00:00
7fac03aee9 [ALI] Use lf runners for Lint (#128978) 2024-06-19 10:59:07 +02:00
50567f7081 Pass device to is_pinned call inside TensorProperties.create_from_tensor (#128896)
Summary:
The default device for the `is_pinned` function is CUDA. This can unnecessarily create a CUDA context for CPU tensors when merely generating TensorProperties, bloating memory usage. Passing the device at the `is_pinned` call site inside `create_from_tensor` solves this issue.

This also fixes Model Store test
https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0
which is currently broken on memory usage assertions.

Test Plan: UT

Differential Revision: D58695006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128896
Approved by: https://github.com/fegin
2024-06-19 08:50:46 +00:00
d3e8b8bf47 Remove cuda check in the CUDAGraph destructor (#127382)
Fixes #125804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127382
Approved by: https://github.com/eqy, https://github.com/eellison
2024-06-19 08:09:31 +00:00
ba92f5277f [inductor][refactor] Unify the use of generate_kernel_call (#128467)
Summary: Refactor TritonTemplateKernel.call_kernel and ForeachKernel.call_kernel to use wrapper.generate_kernel_call to generate kernel calls instead of explicitly composing the kernel call string. This consolidates the entry point of generate_kernel_call and similifies later changes in this PR stack.

Differential Revision: [D58733631](https://our.internmc.facebook.com/intern/diff/D58733631)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128467
Approved by: https://github.com/shunting314
2024-06-19 07:47:25 +00:00
3a185778ed [aotinductor] Add torch.polar fallback op for shim v2 (#128722)
Compilation error:
```
$ TORCHINDUCTOR_C_SHIM_VERSION=2 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_LOGS_FORMAT="%(pathname)s:%(lineno)s: %(message)s" TORCH_LOGS="+output_code" python test/inductor/test_cpu_cpp_wrapper.py -k test_polar

/tmp/tmp2sp128xj/dy/cdypvu3hvgg3mwxydwbiuddsnmuoi37it3mrpjktcnu6vt4hr3ki.cpp:59:33: error: ‘aoti_torch_cpu_polar’ was not declared in this scope; did you mean ‘aoti_torch_cpu_topk’?
```

Steps:
1. Add aten.polar
2. run `python torchgen/gen.py --update-aoti-c-shim`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128722
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-19 05:06:58 +00:00
a584b2a389 Revert "Add test to xfail_list only for abi_compatible (#128506)"
This reverts commit df85f34a14dd30f784418624b05bd52b12ab8b0b.

Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to The failure shows up in trunk df85f34a14 ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2177744578))
2024-06-19 04:59:10 +00:00
fcf2a1378b Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 (#128989)
# Summary
The first PR got reverted and needed a redo.

This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

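A shape-only sketch of the scaling contract above; the `_scaled_mm` call is left commented out because its exact signature is version-dependent and an assumption here:

```python
import torch

M, K, N = 32, 64, 16
x = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# TN format: y is a [K, N] view obtained by transposing a row-major [N, K] tensor.
y = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()

scale_x = torch.rand(M, device="cuda", dtype=torch.float32)  # one scale per row of x
scale_y = torch.rand(N, device="cuda", dtype=torch.float32)  # one scale per column of y

# out = torch._scaled_mm(x, y, scale_a=scale_x, scale_b=scale_y, out_dtype=torch.bfloat16)
```
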
The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do it.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128989
Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo
2024-06-19 04:49:39 +00:00
2f88597aad [inductor] For internal, allow multiple workers if the method is "subprocess" (#129002)
Summary: This does not change the current default behavior in fbcode (the "fork" start method and no worker processes, if unspecified). But it allows us to more easily test the subprocess-based parallel compile when we override the start method to subprocess.

Test Plan: Set `TORCHINDUCTOR_WORKER_START=subprocess` and locally ran all torchbench models listed [here](https://www.internalfb.com/intern/wiki/PyTorch/Teams/PyTorch_Perf_Infra/TorchBench/#torchbench-internal-mode)

Differential Revision: D58755021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129002
Approved by: https://github.com/eellison
2024-06-19 04:28:27 +00:00
1f0a68b572 [ROCm] Fix fp32 atomicAdd for non-MI100 GPUs (#128750)
The current implementation is very specific to MI100, which causes performance degradation on other GPUs.

Fixes #128631

Benchmarking on MI300X:
```
Before:  1918.5126953125 ms
After: 0.8285150527954102 ms
```

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128750
Approved by: https://github.com/xw285cornell
2024-06-19 03:56:20 +00:00
acefc5c016 [torch.compile] Enable bwd compilation metrics (#128973)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128973
Approved by: https://github.com/dshi7
2024-06-19 03:45:41 +00:00
eb9f4da11e Modified template indexing to broadcast indices to out instead of mask and some other flexattention micro-opts (#128938)
For headdim=64 and headdim=128

Old:
<img width="656" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/2c5d1613-96dc-4300-8dc0-dccaef59e73c">

New:
<img width="644" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/730004a8-6d5f-46a5-82a0-2594feb5e192">

Note, this does regress headdim=256. We can unregress it by special casing `headdim=256`, but ehh.... we can do it later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128938
Approved by: https://github.com/drisspg
2024-06-19 03:41:22 +00:00
8771e3429c Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory` object, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses it) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-19 03:38:58 +00:00
ed5b8432cd Enable mixed_mm only if casting from lower-bitwidth type to a higher one (#128899)
This PR changes the behavior of `cuda_and_enabled_mixed_mm` such that mixed_mm is only enabled if we are casting from a lower-bitwidth type to a higher one.
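
For context, a hedged sketch of the kind of mixed-dtype matmul this heuristic targets (shapes, dtypes, and the CUDA requirement are illustrative, not the pattern-matcher code): the weight is stored in a lower-bitwidth type and cast up to the activation dtype inside the compiled region.

```python
import torch

def fn(x: torch.Tensor, w_int8: torch.Tensor) -> torch.Tensor:
    # lower-bitwidth weight cast up to the higher-bitwidth activation dtype
    return torch.mm(x, w_int8.to(x.dtype))

x = torch.randn(16, 32, dtype=torch.float16, device="cuda")
w = torch.randint(-128, 127, (32, 64), dtype=torch.int8, device="cuda")
out = torch.compile(fn)(x, w)
```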

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128899
Approved by: https://github.com/eellison
2024-06-19 03:12:18 +00:00
df85f34a14 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 skips the tests in both ABI-compatible and non-ABI-compatible mode.
They are not expected to be skipped in non-ABI-compatible mode, since they run successfully there and only have issues in ABI-compatible mode.

We leverage the existing `xfail_list` for those that only fail in ABI-compatible mode.

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-19 01:18:37 +00:00
4bc90185fb fix: Print statements causing parse error (#128969)
The print statements in the get_workflow_type script are problematic because the shell script calling it expects the output to be JSON only. This PR resolves this by removing all print statements and converting them to a message field in the JSON return output, so that the output remains pure JSON while still giving us the debug data we are looking for.
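
As a hypothetical sketch of the approach (field and function names are made up, not the actual script): debug text that used to be print()-ed now rides along inside the JSON payload, so stdout stays parseable.

```python
import json

def build_output(label_type: str, debug_notes: str) -> str:
    # Debug text that used to be print()-ed is folded into a "message" field.
    return json.dumps({"label_type": label_type, "message": debug_notes})

# The only thing written to stdout is a single JSON document.
print(build_output("linux", "fell back to default runner pool"))
```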

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128969
Approved by: https://github.com/tylertitsworth, https://github.com/ZainRizvi
2024-06-19 01:17:08 +00:00
eda375a490 [Inductor] Remove min/max from inductor opinfo test (#128925)
**Summary**
Remove `max.binary, min.binary, maximum, minimum` from `inductor_one_sample` op list as we fix the bool vectorization issue in https://github.com/pytorch/pytorch/pull/126841.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_maximum
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_minimum
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_min_binary
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_max_binary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128925
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-06-19 01:14:27 +00:00
2458f79f83 [Inductor UT][Intel GPU] Skip newly added test case test_torchinductor_strided_blocks:test_reduction for Intel GPU (#128881)
Skip the newly added test case test_torchinductor_strided_blocks:test_reduction for Intel GPU because
it has not implemented reduction kernel split yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128881
Approved by: https://github.com/blaine-rister, https://github.com/EikanWang, https://github.com/malfet
2024-06-19 00:44:57 +00:00
b0d2fe6299 Revert "Short-term fix to preserve NJT metadata cache in torch.compile (#122836)"
This reverts commit 2a41fc03903de63270d325bd1886a50faf32d7e4.

Reverted https://github.com/pytorch/pytorch/pull/122836 on behalf of https://github.com/jbschlosser due to internal test failures with DEBUG=1 asserts ([comment](https://github.com/pytorch/pytorch/pull/122836#issuecomment-2177298245))
2024-06-19 00:28:53 +00:00
5ffb032be6 Revert "Backward support for unbind() with NJT (#128032)"
This reverts commit 5dc4f652bc5c068ef15130c955e3f2ffe11f4b74.

Reverted https://github.com/pytorch/pytorch/pull/128032 on behalf of https://github.com/jbschlosser due to reverting to revert parent PR ([comment](https://github.com/pytorch/pytorch/pull/128032#issuecomment-2177296325))
2024-06-19 00:26:40 +00:00
35c78668b4 Improve the debugging message for when foreach mta_called (#128991)
The hope that lives in this PR: I am currently trying to debug why the foreach tests are so flaky. It looks like every flaky test falls under this pattern:
- a test is flaky due to the mta_called assertion, which gathers data from the profiler regarding whether the multi_tensor_apply_kernel has been called.
- then, a later test fails deterministically, usually failing to compare two results.

```
================== 1 failed, 241 deselected, 2 rerun in 1.76s ==================
Got exit code 1
Stopping at first consistent failure
The following tests failed and then succeeded when run in a new process ['test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16']
The following tests failed consistently: ['test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16']
```

So my suspicion is that the first causes the second, but what causes the first? Idk! So it would be nice to have the error message tell us what the profiler actually saw in case it's getting muddled. This change would help mostly because I have not been able to repro this flakiness locally.
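
A hedged sketch of the idea (helper and test plumbing are illustrative, not the actual test code): when asserting that the multi_tensor_apply kernel was launched, include what the profiler actually recorded so a flaky failure is debuggable from the CI log alone.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def assert_mta_called(fn, *args) -> None:
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        fn(*args)
    keys = [evt.key for evt in prof.key_averages()]
    assert any("multi_tensor_apply_kernel" in k for k in keys), (
        f"multi_tensor_apply_kernel not found; profiler saw: {keys}"
    )

if torch.cuda.is_available():
    tensors = [torch.randn(100, device="cuda") for _ in range(10)]
    assert_mta_called(torch._foreach_add_, tensors, 1.0)
```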

Also undo the useless changes in #128220 which are actually redundant as Joel and I realized that we set the seed during the setUp of every test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128991
Approved by: https://github.com/clee2000
2024-06-19 00:25:09 +00:00
99f042d336 Revert "Forward fix to skip ROCm tests for #122836 (#128891)"
This reverts commit 4061b3b8225f522ae0ed6db00111441e7d3cc3d5.

Reverted https://github.com/pytorch/pytorch/pull/128891 on behalf of https://github.com/jbschlosser due to reverting to revert parent PR ([comment](https://github.com/pytorch/pytorch/pull/128891#issuecomment-2177291249))
2024-06-19 00:21:21 +00:00
670b94c9c8 [inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128484
Approved by: https://github.com/mlazos
ghstack dependencies: #128428
2024-06-19 00:06:46 +00:00
c5e0b84484 [dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128428
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-06-19 00:06:46 +00:00
cyy
cb5e9183c6 [Caffe2] [2/N] Remove Caffe2 from tests (#128911)
Follows #128675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128911
Approved by: https://github.com/titaiwangms, https://github.com/r-barnes
2024-06-19 00:05:50 +00:00
ac5f565fa7 [FSDP2] Added set_post_optim_event (#128975)
This PR adds `set_post_optim_event`, which allows power users to provide their own CUDA event, recorded after the optimizer step, for the FSDP root module to have the all-gather streams wait on.
```
def set_post_optim_event(self, event: torch.cuda.Event) -> None:
```
By default, the root would have the all-gather streams wait on the current stream (`wait_stream`), which may introduce false dependencies if there is unrelated computation after the optimizer step and before the wait. For example, this pattern can appear in recommendation models.

To avoid those false dependencies while preserving the correctness guarantee, we provide this API so that users can supply their own CUDA event for the all-gather streams to wait on.

We include both a correctness test (`test_fully_shard_training.py`) and an overlap test (`test_fully_shard_overlap.py`).

---

One possible way to use the API is to register a post-step hook on the optimizer. For example:
12e8d1399b/test/distributed/_composable/fsdp/test_fully_shard_training.py (L546-L552)
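
For illustration, a minimal sketch of that pattern (assuming `model` is the root module wrapped with `fully_shard` and `optim` is its optimizer; the names are placeholders):

```python
import torch

def install_post_optim_event_hook(model, optim: torch.optim.Optimizer) -> None:
    def hook(optimizer, args, kwargs) -> None:
        # Record an event on the current stream right after the step and hand
        # it to the root FSDP module for the all-gather streams to wait on.
        post_optim_event = torch.cuda.current_stream().record_event()
        model.set_post_optim_event(post_optim_event)

    optim.register_step_post_hook(hook)
```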

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128975
Approved by: https://github.com/sanketpurandare, https://github.com/weifengpy
ghstack dependencies: #128884
2024-06-18 22:26:14 +00:00
d9c294c672 [Inductor] Fix arguments passed to triton kernel launch hooks (#128732)
`binary.launch_enter_hook` is treated as an instance method and will add a `self` argument to the hooks.
`CompiledKernel.launch_enter_hook` is a static method, which matches the hook calling convention of profilers (i.e., a single `LazyDict` argument only).
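
As a generic Python illustration of the mismatch (not the actual Inductor or Triton code; class names and hook payloads are made up):

```python
class KernelWithFunctionHook:
    # A plain function attribute becomes a bound method on instances,
    # so calls through an instance inject `self` as the first argument.
    launch_enter_hook = lambda metadata: print("enter:", metadata)

class KernelWithStaticHook:
    # staticmethod preserves the single-argument convention profilers expect.
    launch_enter_hook = staticmethod(lambda metadata: print("enter:", metadata))

KernelWithStaticHook().launch_enter_hook({"name": "triton_kernel"})  # works
try:
    KernelWithFunctionHook().launch_enter_hook({"name": "triton_kernel"})
except TypeError as err:
    print(err)  # called as (self, metadata): too many positional arguments
```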

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128732
Approved by: https://github.com/shunting314, https://github.com/bertmaher
2024-06-18 22:06:55 +00:00
a0e1e20c41 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin
ghstack dependencies: #128868, #128869
2024-06-18 21:49:08 +00:00
3b798df853 [BE][Easy] enable UFMT for torch/distributed/{fsdp,optim,rpc}/ (#128869)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869
Approved by: https://github.com/fegin
ghstack dependencies: #128868
2024-06-18 21:49:08 +00:00
cec31050b4 [BE][Easy] enable UFMT for torch/distributed/{tensor,_tensor}/ (#128868)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128868
Approved by: https://github.com/fegin
2024-06-18 21:49:02 +00:00
e47603a549 Fix weight_norm decomposition behavior (#128956)
By upcasting norm to float32 to align with CUDA and CPU behaviors
e6d4451ae8/aten/src/ATen/native/WeightNorm.cpp (L56-L59)

Discovered this when started running OpInfo tests, see https://github.com/pytorch/pytorch/actions/runs/9552858711/job/26332062502#step:20:1060
```
  File "/var/lib/jenkins/workspace/test/test_decomp.py", line 185, in op_assert_ref
    assert orig.dtype == decomp.dtype, f"{i} Operation:  {op}"
AssertionError: 1 Operation:  aten._weight_norm_interface.default
```
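
A sketch of the upcast pattern (not the actual decomposition code): compute the norm in float32 for reduced-precision inputs, then cast back so the output dtype matches the eager CPU/CUDA kernels.

```python
import torch

def weight_norm_ref(v: torch.Tensor, g: torch.Tensor, dim: int = 0) -> torch.Tensor:
    # Compute the norm in float32, then cast back to the input dtype so the
    # result dtype matches eager for fp16/bf16 inputs.
    reduce_dims = [d for d in range(v.dim()) if d != dim]
    norm = v.float().norm(2, dim=reduce_dims, keepdim=True).to(v.dtype)
    return v * (g / norm)

v = torch.randn(4, 3, dtype=torch.float16)
g = torch.ones(4, 1, dtype=torch.float16)
out = weight_norm_ref(v, g)
assert out.dtype == torch.float16
```
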
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128956
Approved by: https://github.com/albanD
ghstack dependencies: #128955
2024-06-18 21:24:12 +00:00
2227da4431 [Profiler] Clean up use_mtia to follow standard use_device instead (#126284)
Summary:
Instead of use_mtia, callers should set use_device='mtia', similar to cuda, xpu, and privateuseone. This avoids an ever-growing list of use_* arguments.

Since use_mtia is specific to FBCode, we don't need a deprecation warning.

Test Plan: CI.

Differential Revision: D57338005

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126284
Approved by: https://github.com/fenypatel99
2024-06-18 21:01:03 +00:00
4cc3fb5ee2 Bump urllib3 from 2.2.1 to 2.2.2 in /tools/build/bazel (#128908)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.1 to 2.2.2.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.1...2.2.2)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-18 13:38:22 -07:00
5dc4f652bc Backward support for unbind() with NJT (#128032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032
Approved by: https://github.com/soulitzer
2024-06-18 20:29:00 +00:00
44722c6b10 Revert "[dynamo][fsdp] Dont take unspecializedNNModuleVariable path for FSDP modules (#128453)"
This reverts commit 2b28b107dbafeec18d1095a2002e79511aa241df.

Reverted https://github.com/pytorch/pytorch/pull/128453 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))
2024-06-18 20:09:00 +00:00
1babeddbbf Revert "[inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484)"
This reverts commit 1f6e84fa6852805e15ddc9583c5f36c3a7f93df8.

Reverted https://github.com/pytorch/pytorch/pull/128484 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))
2024-06-18 20:09:00 +00:00
5bc9835d64 Revert "[dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428)"
This reverts commit c52eda896eb3ec7f8d04b6321861f4c5614a40bb.

Reverted https://github.com/pytorch/pytorch/pull/128428 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))
2024-06-18 20:09:00 +00:00
9a7e2519d3 [MPS] Fused Adam & AdamW (#127242)
Summary:

This PR adds fused Adam and AdamW implementations.

Benchmark on a MacBook Pro with M1 Max chip and 64GB unified memory:
**Fast math enabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        89
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        90
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        83
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       12      |        94
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       11      |        88
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       12      |        90
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |       100
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       27      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       23      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       27      |       100
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       23      |        98
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       82      |       480
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       72      |       450
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       82      |       450
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       73      |       420
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       91      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       83      |       400
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |       94      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       78      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      170      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      140      |       600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      170      |       600
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      140      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      250      |       890
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      220      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      250      |       830
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      220      |       770
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      270      |       870
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      230      |       840
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      270      |       810
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      240      |       800
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      400      |      1000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      360      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      430      |      2000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      360      |      1300

Times are in milliseconds (ms).
```

**Fast math disabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        79
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       11      |        93
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       10      |        90
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       11      |        91
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |        81
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       34      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       31      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       34      |        95
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       31      |       100
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       94      |       500
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       82      |       430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       92      |       430
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       81      |       390
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       98      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       88      |       430
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |      100      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       88      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      210      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      190      |       610
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      210      |       510
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      190      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      300      |       900
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      260      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      295      |       900
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      260      |       800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      320      |       910
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      280      |       900
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      320      |       900
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      300      |       900
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      500      |      2000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      480      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      540      |      1500
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      480      |      1200

Times are in milliseconds (ms).
```

```python
def profile_fused_adam():
    from torch.optim import adam, adamw
    import torch.utils.benchmark as benchmark

    import itertools

    def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
        fn(
            params,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            foreach=False,
            capturable=False,
            fused=fused,
            amsgrad=amsgrad,
            beta1=0.9,
            beta2=0.99,
            lr=1e-3,
            weight_decay=.0,
            eps=1e-5,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, adamWflag, amsgrad in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]):
        print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
        params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
        max_exp_avg_sqs = [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)] if amsgrad else []
        state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)]
        if adamWflag:
            fn = adamw.adamw
        else:
            fn = adam.adam

        for fused in [True, False]:

            t = benchmark.Timer(
                    stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                    label='Fused Adam',
                    sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                    globals=locals(),
                    description= f"Fused: {fused}",
                ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242
Approved by: https://github.com/kulinseth, https://github.com/janeyx99
2024-06-18 19:59:50 +00:00
fe8558b7aa [DSD] Add unittest to verify HSDP1 + broadcast_from_rank0 (#128755)
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0, so we need a unit test to cover this use case.

This test relies on the fix from https://github.com/pytorch/pytorch/pull/128446.

Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128755
Approved by: https://github.com/Skylion007, https://github.com/wz337
ghstack dependencies: #128685
2024-06-18 19:42:51 +00:00
abde6cab4c Remove compile_threads=1 in test_inductor_collectives.py (#128580)
Summary: I believe https://github.com/pytorch/pytorch/issues/125235 should be fixed after switching to subprocess-based parallel compile.

Test Plan: Ran locally with python-3.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128580
Approved by: https://github.com/eellison
2024-06-18 19:31:13 +00:00
04a5d3228e [ts migration] Support prim::tolist and aten::len (#128894)
Support prim::tolist and aten::len. Add unit tests for prim::min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128894
Approved by: https://github.com/angelayi
2024-06-18 19:11:07 +00:00
44483972bd [EZ] Keep weight_norm var name aligned (#128955)
To keep it aligned with
e6d4451ae8/aten/src/ATen/native/native_functions.yaml (L6484)
I.e.  `x`->`v`, `y`->`g`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128955
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-06-18 18:40:59 +00:00
bdffd9f0c6 [export] Graph break on nn.Parameter construction (#128935)
Fixes https://github.com/pytorch/pytorch/issues/126109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128935
Approved by: https://github.com/angelayi
2024-06-18 18:37:44 +00:00
1a527915a6 [DSD] Correctly handle shared parameters for optimizer state_dict (#128685)
Fixes https://github.com/pytorch/pytorch/issues/128011

See the discussion in https://github.com/pytorch/pytorch/pull/128076

The current implementation of `set_optimizer_state_dict()` assumes that all the FQNs returned by `_get_fqns()` must exist in the optimizer state_dict. This is not true if the model has shared parameters: in that case, only one FQN of the shared parameters will appear in the optimizer state_dict. This PR addresses the issue.
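
A hedged sketch of the scenario (the model is illustrative, not from the PR): `embed.weight` and `decoder.weight` are the same parameter, so two FQNs resolve to one tensor, but the optimizer state_dict holds state under only one of them.

```python
import torch
import torch.nn as nn

class TiedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(10, 4)
        self.decoder = nn.Linear(4, 10, bias=False)
        self.decoder.weight = self.embed.weight  # shared parameter (weight tying)

    def forward(self, x):
        return self.decoder(self.embed(x))

model = TiedModel()
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
model(torch.tensor([1, 2, 3])).sum().backward()
optim.step()
# Two FQNs point at the weight, but only one optimizer state entry exists.
print(len(optim.state))  # 1
```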

Differential Revision: [D58573487](https://our.internmc.facebook.com/intern/diff/D58573487/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128685
Approved by: https://github.com/LucasLLC
2024-06-18 18:34:32 +00:00
d77a1aaa86 DOC: add note about same sized tensors to dist.gather() (#128676)
Fixes #103305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128676
Approved by: https://github.com/wconstab
2024-06-18 18:26:07 +00:00
1877b7896c [checkpoint] Clean up selective activation checkpoint and make public (#125795)
### bc-breaking for existing users of the private API:
- Existing policy functions must now change their return value to be [CheckpointPolicy](c0b40ab42e/torch/utils/checkpoint.py (L1204-L1230))  Enum instead of bool.
   - To restore previous behavior, return `PREFER_RECOMPUTE` instead of `False` and `{PREFER,MUST}_SAVE` instead of `True`, depending on whether you prefer to let the compiler override your policy (a minimal sketch of an updated policy function follows this list).
- Policy function now accepts a `ctx` object instead of `mode` for its first argument.
   - To restore previous behavior, `mode = "recompute" if ctx.is_recompute else "forward"`.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `create_selective_checkpoint_contexts`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).
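
A minimal sketch of a policy function under the new API (the choice of ops and the checkpointed function are illustrative):

```python
from functools import partial
import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

def policy_fn(ctx, op, *args, **kwargs):
    # ctx.is_recompute replaces the old `mode` string argument.
    if op == torch.ops.aten.mm.default:
        return CheckpointPolicy.MUST_SAVE      # previously: return True
    return CheckpointPolicy.PREFER_RECOMPUTE   # previously: return False

def fn(x):
    return torch.sigmoid(torch.mm(x, x))

x = torch.randn(8, 8, requires_grad=True)
out = checkpoint(
    fn,
    x,
    use_reentrant=False,
    context_fn=partial(create_selective_checkpoint_contexts, policy_fn),
)
out.sum().backward()
```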

Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit

Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.

In-place:
- We use version counting to raise an error if any cached tensor has been mutated. In-place operations that do not mutate cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC, where the user cleverly also saves the output of the in-place op).

Randomness, views
- Currently in this PR, we don't do anything special for randomness or views; the author of the policy function is expected to handle them properly. (Would it be beneficial to error? We either want to save all or recompute all random tensors.)

Tensor object preservation
- ~We guarantee that if a tensor does not require grad, and it is saved, then what you get out is the same tensor object.~ UPDATE: We guarantee that if a tensor is of non-differentiable dtype AND it is not a view, and it is saved, then what you get out is the same tensor object. This is a nice guarantee for nested tensors, which care about the object identity of the offsets tensor.

Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred because it seemed clearer that two clashing `MUST`s should error, whereas it is ambiguous whether two stacked `NON_OVERRIDABLE`s should silently ignore or error.
- The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3).
- The number of times we call the policy_fn is something that should be documented as part of the public API. We call the policy function for all ops except ~~detach~~ (UPDATE: the metadata ops listed in `torch.utils.checkpoint.SAC_IGNORED_OPS`), because these ops may be called a different number of times by AC itself between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below).
- Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
2024-06-18 18:18:50 +00:00
77830d509f Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 7a39755da28d5a109bf0c37f72b364d3a83137b1.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2176685232))
2024-06-18 18:11:43 +00:00
84c86e56bd Update tracker issues after successfully cherry-picking a PR (#128924)
This extends the cherry-pick bot so that it automatically updates the tracker issue with the information. For this to work, the tracker issue needs to be an open one with a `release tracker` label, e.g. https://github.com/pytorch/pytorch/issues/128436. The version from the release branch, e.g. `release/2.4`, will be matched against the title of the tracker issue, e.g. `[v.2.4.0] Release Tracker` or `[v.2.4.1] Release Tracker`.

### Testing

`python cherry_pick.py --onto-branch release/2.4 --classification release --fixes "DEBUG DEBUG" --github-actor huydhn 128718`

* On the PR https://github.com/pytorch/pytorch/pull/128718#issuecomment-2174846771
* On the tracker issue https://github.com/pytorch/pytorch/issues/128436#issuecomment-2174846757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128924
Approved by: https://github.com/atalman
2024-06-18 17:48:47 +00:00
eqy
4e03263224 [CUDA][Convolution] Add missing launch bounds to vol2col_kernel (#128740)
Fix "too many resources requested" that can happen with recent toolkits on V100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128740
Approved by: https://github.com/mikaylagawarecki
2024-06-18 17:26:23 +00:00
26e374e3ca [EZ] Fix typos in RELEASE.md (#128769)
This PR fixes typos in `RELEASE.md`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128769
Approved by: https://github.com/yumium, https://github.com/mikaylagawarecki
2024-06-18 17:15:05 +00:00
9818283da1 re-enable jacrev/jacfwd/hessian after #128028 landed (#128622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128622
Approved by: https://github.com/zou3519
2024-06-18 17:08:58 +00:00
eqy
ec616da518 RNN API cleanup for cuDNN 9.1 (#122011)
Can potentially avoid a bit of boilerplate if we move directly to cuDNN 9.1's RNN API...

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122011
Approved by: https://github.com/Skylion007
2024-06-18 16:16:38 +00:00
108318ad10 [BE][JIT] Handle case where codegen object can be unset (#128951)
Summary:
Unblocks a test that's failing.

`codegen` can be unset until `compile` is called. If `codegen` is not set, then just use the kernel name directly.

Test Plan:
```
buck2 run //caffe2/test:tensorexpr -- --regex test_simple_add
```

Differential Revision: D58727391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128951
Approved by: https://github.com/aaronenyeshi
2024-06-18 15:40:45 +00:00
4817180601 make fallback for aten.argsort.stable (#128907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128907
Approved by: https://github.com/lezcano
ghstack dependencies: #128343
2024-06-18 14:56:35 +00:00
22d258427b [BE][Easy] enable UFMT for torch/distributed/_shard/ (#128867)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128867
Approved by: https://github.com/fegin
ghstack dependencies: #128866
2024-06-18 14:39:25 +00:00
e6d4451ae8 [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866
Approved by: https://github.com/fegin
2024-06-18 13:51:53 +00:00
f2805a0408 [FSDP2] Added APIs for explicit fwd/bwd prefetching (#128884)
This PR adds two APIs `set_modules_to_forward_prefetch` and `set_modules_to_backward_prefetch` to enable explicit forward/backward all-gather prefetching, respectively.

```
def set_modules_to_forward_prefetch(self, modules: List[FSDPModule]): -> None
def set_modules_to_backward_prefetch(self, modules: List[FSDPModule]): -> None
```

**Motivation**
FSDP2 implements _reasonable defaults_ for forward and backward prefetching. In forward, it uses implicit prefetching and allows two all-gather output tensors to be alive at once (so that the current all-gather copy-out can overlap with the next all-gather). In backward, it uses explicit prefetching based on the reverse post-forward order.

However, there may be cases where, with expert knowledge, we can reduce communication bubbles by moving all-gathers manually. One way to expose such behavior is to expose _prefetching limits_, i.e. integers that configure how many outstanding all-gathers/all-gather output tensors can be alive at once. IMHO, this leans toward _easy_, not _simple_ (see [PyTorch design principles](https://pytorch.org/docs/stable/community/design.html#principle-2-simple-over-easy)).

The crux of the problem is that there may be special cases where manual intervention can give better performance. Exposing a prefetching limit and allowing users to pass a value >1 just smooths over the problem, since such a limit would generally apply to the entire model even though it possibly should not. Then, expert users may want a specific all-gather to deviate from this limit, and there is little we can do.

Thus, we instead choose to expose the most primitive extension point: namely, every `FSDPModule` gives an opportunity to prefetch other all-gathers in forward and in backward. How to leverage this extension point is fully up to the user. Implementing the prefetch limit can be done using this extension point (e.g. record the post-forward order yourself using forward hooks, iterate over that order, and call the `set_modules_to_forward_prefetch` / `set_modules_to_backward_prefetch` APIs).
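
As a minimal sketch of how the extension point might be used (assuming `layers` is a list of transformer blocks, each wrapped with `fully_shard`; the import path follows the FSDP2 prototype location at the time of this PR):

```python
from typing import List

from torch.distributed._composable.fsdp import FSDPModule

def set_explicit_prefetch(layers: List[FSDPModule], num_to_prefetch: int = 2) -> None:
    # Each layer prefetches the next layers' all-gathers in forward and the
    # previous layers' all-gathers (in reverse order) in backward.
    for i, layer in enumerate(layers):
        layer.set_modules_to_forward_prefetch(layers[i + 1 : i + 1 + num_to_prefetch])
        layer.set_modules_to_backward_prefetch(layers[max(0, i - num_to_prefetch) : i][::-1])
```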

Differential Revision: [D58700346](https://our.internmc.facebook.com/intern/diff/D58700346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128884
Approved by: https://github.com/ckluk2, https://github.com/weifengpy
2024-06-18 13:32:57 +00:00
3dd5f0ecbb Remove circular import (#128875)
Summary: A spurious import is causing circular dependency errors

Test Plan: phabricator signals

Differential Revision: D58685676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128875
Approved by: https://github.com/kit1980
2024-06-18 12:30:13 +00:00
304c934572 Move MKLDNN Specific IR to Separate File (#126504)
**Summary**
Following the discussion in https://github.com/pytorch/pytorch/pull/122593#discussion_r1604144782, Move Inductor MKLDNN specific IRs to a separate file.

Co-authored-by: Isuru Fernando <ifernando@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126504
Approved by: https://github.com/desertfire, https://github.com/jgong5
ghstack dependencies: #126841, #126940
2024-06-18 09:29:13 +00:00
6e43897912 [BE][ptd_fb_test][3/N] Enable TestSlide for MultiThreadedTestCase (#128843)
Enabling testslide for MultiThreadedTestCase, similar to https://github.com/pytorch/pytorch/pull/127512.

Differential Revision: [D58677457](https://our.internmc.facebook.com/intern/diff/D58677457/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128843
Approved by: https://github.com/wz337
2024-06-18 07:05:31 +00:00
60baeee59f [BE] Skip the test if CUDA is not available (#128885)
As title

Differential Revision: [D58690210](https://our.internmc.facebook.com/intern/diff/D58690210/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128885
Approved by: https://github.com/wz337
2024-06-18 07:02:44 +00:00
e3a39d49a0 [Traceable FSDP][Compiled Autograd] Add queue_callback() support (#126366)
Adds support for `Variable._execution_engine.queue_callback()`, which is used in FSDP2.

Important tests:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_callback_graph_break_throws_error`
- `pytest -rA test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_callback_adds_callback`
- `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_callback_adds_callback`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126366
Approved by: https://github.com/xmfan
2024-06-18 06:22:14 +00:00
f7eae27946 Pass params to dump_nccl_trace_pickle (#128781)
Summary
Pass parameters from the request to the dump_nccl_trace_pickle handler.
The supported parameter names and values are all lowercase.
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}

Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true

Test Plan:
unit tests

Differential Revision: [D58640474](https://our.internmc.facebook.com/intern/diff/D58640474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128781
Approved by: https://github.com/d4l3k
2024-06-18 03:46:57 +00:00
d9eaa224f2 Fixes #128429: NaN in triu op on MPS (#128575)
Fixes the triu op when k > 0 and the lower triangle of the input tensor contains inf, which led to NaNs in the complement-based computation. Fixed by using the select API instead.
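
A hedged reproduction sketch (requires an MPS device; the values are illustrative): with k > 0 and inf below the diagonal, the previous implementation produced NaNs.

```python
import torch

x = torch.tensor(
    [[1.0, 2.0, 3.0],
     [float("-inf"), 4.0, 5.0],
     [float("-inf"), float("-inf"), 6.0]],
    device="mps",
)
out = torch.triu(x, diagonal=1)
assert not out.isnan().any()  # produced NaNs on/below the diagonal before the fix
```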

Fixes #128429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128575
Approved by: https://github.com/kulinseth
2024-06-18 03:44:42 +00:00
59b4983dc0 DebugPlane: add dump_traceback handler (#128904)
This adds a `dump_traceback` handler so you can see all running threads for a job. This uses a temporary file as a buffer when calling `faulthandler.dump_traceback` and requires the GIL to be held during dumping.
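
A sketch of the underlying mechanism (not the actual handler code): dump every thread's traceback into a temporary file and read it back as the response body.

```python
import faulthandler
import tempfile

def dump_traceback_text() -> str:
    # faulthandler writes directly to the file descriptor, so buffer through a
    # temporary file and read the result back for the response body.
    with tempfile.NamedTemporaryFile() as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read().decode()

print(dump_traceback_text())
```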

Test plan:

```
python test/distributed/elastic/test_control_plane.py -v -k traceback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128904
Approved by: https://github.com/c-p-i-o
2024-06-18 03:40:16 +00:00
17abbafdfc [inductor] Fix some windows cpp builder issue (#128765)
1. Fix some Windows build args.
2. Fix the C++20 `likely` issue on Windows; reference: https://github.com/pytorch/pytorch/pull/124997.
3. Remove the compiler return value check; different compilers return different values, so check for exceptions to catch errors instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128765
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-18 03:25:20 +00:00
4061b3b822 Forward fix to skip ROCm tests for #122836 (#128891)
Fixes broken ROCm tests from #122836.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128891
Approved by: https://github.com/huydhn
ghstack dependencies: #127007, #128057, #122836
2024-06-18 03:01:19 +00:00
c017c97333 [dynamo][inlining-inbuilt-nn-modules] Update test output (#128880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128880
Approved by: https://github.com/mlazos
ghstack dependencies: #128315, #128748, #128877, #128878
2024-06-18 02:18:09 +00:00
4e97d37fd9 [inlining-inbuilt-nn-modules][pre-grad] Adjust efficient_conv_bn_eval_graph for inlining (#128878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128878
Approved by: https://github.com/mlazos
ghstack dependencies: #128315, #128748, #128877
2024-06-18 02:18:09 +00:00
22f1793c0a [dynamo][easy] Use LazyVariableTracker for UserDefinedObject var_getattr (#128877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128877
Approved by: https://github.com/mlazos
ghstack dependencies: #128315, #128748
2024-06-18 02:17:56 +00:00
43998711a7 [CUDAGraph] add more docs for cudagraph trees (#127963)
This PR adds more documentation for CUDAGraph Trees, including
- Iteration Support
- Input Mutation Support
- Dynamic Shape Support
- NCCL Support
- Reasons for Skipping CUDAGraph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127963
Approved by: https://github.com/eellison
2024-06-18 02:07:07 +00:00
e12fa93b8b add is_big_gpu(0) check to test_select_algorithm tests in tests/inductor/test_cuda_cpp_wrapper.py (#128652)
In NVIDIA internal CI, on Jetson devices we are seeing this failure for `python test/inductor/test_cuda_cpp_wrapper.py -k test_addmm_cuda_cuda_wrapper -k test_linear_relu_cuda_cuda_wrapper`:

```
/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:132: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm mode
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1)]
aot_autograd [('total', 1), ('ok', 1)]
F
======================================================================
FAIL: test_linear_relu_cuda_cuda_wrapper (__main__.TestCudaWrapper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 9818, in new_test
    return value(self)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/pytorch/pytorch/test/inductor/test_cuda_cpp_wrapper.py", line 152, in fn
    _, code = test_torchinductor.run_and_get_cpp_code(
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 356, in run_and_get_cpp_code
    result = fn(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 43, in wrapped
    return fn(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3.10/unittest/mock.py", line 1379, in patched
    return func(*newargs, **newkeywargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 62, in test_linear_relu_cuda
    self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 3642, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not equal!

Expected 1 but got 0.
Absolute difference: 1
Relative difference: 1.0
```
Looking into it, we see the failure comes from https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L62. The warning `W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm` is triggered from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L973. Printing torch.cuda.get_device_properties(0).multi_processor_count returns 16 on the computelab AGX Orin, so it makes sense that this check fails: min_required_sms is 68, which prevents the autotune algorithm from being picked. Looking at the main block of test_select_algorithm.py, we see that these tests should only be run if is_big_gpu(0) is true: https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L344. Thus this PR adds a similar check to the invocation of these tests in test_cuda_cpp_wrapper.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128652
Approved by: https://github.com/soulitzer, https://github.com/eqy
2024-06-18 02:00:04 +00:00
9e8443b56f Remove dtype from gpt-fast micro benchmark experiments model name (#128789)
Per comments on https://github.com/pytorch/test-infra/pull/5344, we already have a dtype column with the same information

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128789
Approved by: https://github.com/yanboliang
2024-06-18 01:26:45 +00:00
fbc7559ceb [custom ops] convert string type annotation to real type (#128809)
Fixes #105157

Bug source: `from __future__ import annotations` converts type annotations to strings to make forward references easier. However, existing custom ops do not consider strings to be valid types.

Fix: We check whether an argument or return type annotation is a string. If so, we try to use `eval` to convert it to a real type.
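
A sketch of the problem and one resolution strategy (the function is illustrative, not the PR's code): with the future import, `__annotations__` holds strings, which must be evaluated back into real types before schema inference.

```python
from __future__ import annotations  # turns all annotations into strings

import typing

import torch

def my_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

print(my_sin.__annotations__)         # {'x': 'torch.Tensor', 'return': 'torch.Tensor'}
print(typing.get_type_hints(my_sin))  # strings resolved back to real types
```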

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128809
Approved by: https://github.com/zou3519
2024-06-18 00:55:50 +00:00
c35ffaf954 [Inductor][CPP] Add ne with VecMask (#126940)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/126824#issuecomment-2125039161, which was missing support for `ne` with `VecMask`.

**Test Plan**
```
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_ne_cpu_bool
```

Co-authored-by: Isuru Fernando <ifernando@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126940
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126841
2024-06-18 00:23:03 +00:00
beb29836cd [Inductor][CPP] Add Min/Max with VecMask (#126841)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/126824, which was missing support for `min/max` with `VecMask`.

**TestPlan**
```
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_max_cpu_bool
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_min_cpu_bool
```

Co-authored-by: Isuru Fernando <ifernando@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126841
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-06-18 00:20:32 +00:00
11ff5345d2 Changed colored logging to only be turned on if printing to interactive terminal (#128874)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128874
Approved by: https://github.com/anijain2305
2024-06-17 23:53:26 +00:00
b70440f0a7 Document the torch.cuda.profiler.profile function (#128216)
Fixes https://github.com/pytorch/pytorch/issues/127901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128216
Approved by: https://github.com/malfet, https://github.com/eqy
2024-06-17 23:42:40 +00:00
95b5ea9cde Add mark_unbacked (#128638)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128638
Approved by: https://github.com/IvanKobzarev
2024-06-17 23:39:48 +00:00
8415a4ba98 Back out "[ROCm] TunableOp for gemm_and_bias (#128143)" (#128815)
Summary:
Original commit changeset: 35083f04fdae

Original Phabricator Diff: D58501726

The original PR introduces a large numerical gap: e.g., for a 256 x 4096 x 4096 GEMM, if we enable tunable op + DISABLE_ADDMM_HIP_LT=0, the results are way off.

Differential Revision: D58660832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128815
Approved by: https://github.com/mxz297, https://github.com/eqy, https://github.com/malfet
2024-06-17 22:52:27 +00:00
3b8c9b8ab1 [Docker Release] Test if pytorch was compiled with CUDA before pushing to repo (#128852)
Related to: https://github.com/pytorch/pytorch/issues/125879
This checks whether PyTorch was compiled with CUDA before publishing the CUDA Docker nightly image.

Test
```
#18 [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo "Is torch compiled with cuda: ${IS_CUDA}";     if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then 	exit 1;     fi
#18 1.656 Is torch compiled with cuda: False
#18 ERROR: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo \"Is torch compiled with cuda: ${IS_CUDA}\";     if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1;     fi" did not complete successfully: exit code: 1
------
 > [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo "Is torch compiled with cuda: ${IS_CUDA}";     if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then 	exit 1;     fi:
1.656 Is torch compiled with cuda: False
------
Dockerfile:80
--------------------
  79 |     RUN /opt/conda/bin/pip install torchelastic
  80 | >>> RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');\
  81 | >>>     echo "Is torch compiled with cuda: ${IS_CUDA}"; \
  82 | >>>     if test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"; then \
  83 | >>> 	exit 1; \
  84 | >>>     fi
  85 |
--------------------
ERROR: failed to solve: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo \"Is torch compiled with cuda: ${IS_CUDA}\";     if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1;     fi" did not complete successfully: exit code: 1
(base) [ec2-user@ip-172-30-2-248 pytorch]$ docker buildx build --progress=plain  --platform="linux/amd64"  --target official -t ghcr.io/pytorch/pytorch:2.5.0.dev20240617-cuda12.4-cudnn9-devel --build-arg BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04 --build-arg PYTHON_VERSION=3.11 --build-arg CUDA_VERSION= --build-arg CUDA_CHANNEL=nvidia --build-arg PYTORCH_VERSION=2.5.0.dev20240617 --build-arg INSTALL_CHANNEL=pytorch --build-arg TRITON_VERSION= --build-arg CMAKE_VARS="" .
#0 building with "default" instance using docker driver
```

Please note: it looks like we are installing from the pytorch channel rather than the nightly channel on this PR, hence CUDA 12.4 is failing since it's not in the pytorch channel yet:
https://github.com/pytorch/pytorch/actions/runs/9555354734/job/26338476741?pr=128852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128852
Approved by: https://github.com/malfet
2024-06-17 22:51:12 +00:00
1835e3beab Fix the inductor ci (#128879)
Fix the torchbench+inductor CI on trunk, broken by the recent upgrade to numpy 2.0.0rc1.
We have to remove the DALLE2_pytorch model, since it depends on embedding-reader, which is not compatible with numpy 2: https://github.com/rom1504/embedding-reader/blob/main/requirements.txt#L3

Fixes #128845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128879
Approved by: https://github.com/eellison
2024-06-17 22:20:33 +00:00
7baf32b5e7 [c10d] fix p2p group commsplit (#128803)
Summary:
For PointToPoint(sendrecv), the deviceId is lower_rank:higher_rank. This means a p2p group cannot be created through commSplit since it cannot find a parent.

Fix this by using the right device key for the current rank.

Differential Revision: D58631639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128803
Approved by: https://github.com/shuqiangzhang
2024-06-17 22:07:40 +00:00
1fd7496ab2 [MTIA] Fix synchronize API (#128714)
Reviewed By: fenypatel99

Differential Revision: D58590313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128714
Approved by: https://github.com/aaronenyeshi
2024-06-17 21:58:46 +00:00
cyy
163847b1bb [1/N] [Caffe2] Remove caffe2_aten_fallback code (#128675)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128675
Approved by: https://github.com/r-barnes
2024-06-17 21:25:59 +00:00
8953725e6d [Inductor][FlexAttention] Tune backwards kernel block sizes (#128853)
This replaces #128767 which somehow closed by mistake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128853
Approved by: https://github.com/angelayi
2024-06-17 21:10:55 +00:00
a489792bb2 [GPT-benchmark] Fix memory bandwidth for MoE (#128783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128783
Approved by: https://github.com/Chillee
ghstack dependencies: #128768
2024-06-17 21:04:57 +00:00
8c06eae17e [GPT-benchmark] Add metric: compilation time for GPT models (#128768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128768
Approved by: https://github.com/Chillee
2024-06-17 21:04:57 +00:00
a59766ee05 replace AT_ERROR(...) with TORCH_CHECK(false, ...) (#128788)
as per title. encountered the old-fashioned by chance

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128788
Approved by: https://github.com/mikaylagawarecki
2024-06-17 20:50:22 +00:00
0f89e66d17 Validate logs are created by default (#128522)
Summary: Make sure that logs are captured with default settings

Test Plan: ci

Differential Revision: D58395812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128522
Approved by: https://github.com/d4l3k
2024-06-17 20:07:13 +00:00
1577328ea4 Set bash shell on Windows (#128854)
Attempt to fix the missing python3 command on the new Windows AMI https://github.com/pytorch/pytorch/actions/runs/9551494945/job/26325922503. I added logic to copy python to python3 to make the command available; it worked with the previous AMI but started failing now, and the cause is not clear (maybe it's not the AMI, but a new GitHub runner version).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128854
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/atalman
2024-06-17 19:24:09 +00:00
b181b58857 Fix Storage.filename to not track the filename when storage was mmap-ed with MAP_PRIVATE (#128725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128725
Approved by: https://github.com/albanD
2024-06-17 18:55:47 +00:00
213eba7d2e Configure mergebot via config (#128840)
Fixes #ISSUE_NUMBER
* Companion to https://github.com/pytorch/test-infra/pull/5312
* See the above for details + possible risks
* Without the above PR, this should have no effects
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128840
Approved by: https://github.com/huydhn
2024-06-17 18:53:56 +00:00
c172b58fe0 Revert "Update DALLE2_pytorch expected accuracy result on CPU (#128718)"
This reverts commit fd27138c4a86bd763a6b8128d940a7c98f951603.

Reverted https://github.com/pytorch/pytorch/pull/128718 on behalf of https://github.com/huydhn due to This has reverted back to the previous expected value for some reason 153362fbc9 ([comment](https://github.com/pytorch/pytorch/pull/128718#issuecomment-2174194219))
2024-06-17 18:49:15 +00:00
5344c41d43 Use forked torchbench branch with pinned numpy (#128856)
Adds a pinned numpy requirement to the yolov3 dependencies on top of the existing pinned torchbench commit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128856
Approved by: https://github.com/huydhn, https://github.com/PaliC
2024-06-17 18:41:42 +00:00
cyy
d35cdee97f [Caffe2] Remove caffe2 onnx tests (#128687)
They are not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128687
Approved by: https://github.com/r-barnes
2024-06-17 18:17:58 +00:00
153362fbc9 Support HSDP + Monolith Checkpointing (#128446)
Fixes #128444. The rank-0 check should use the same process group as the broadcast.
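
A hedged sketch of the corrected pattern (helper and variable names are illustrative, not the code in the PR):

```python
import torch.distributed as dist

# For HSDP + monolithic checkpointing, the "am I the source?" check must use the
# rank within the group the broadcast runs on, not the global rank.
def broadcast_full_state(state_dict, group):
    is_src = dist.get_rank(group=group) == 0
    buf = [state_dict if is_src else None]
    dist.broadcast_object_list(buf, src=dist.get_global_rank(group, 0), group=group)
    return buf[0]
```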

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128446
Approved by: https://github.com/fegin
2024-06-17 16:59:41 +00:00
c6b180a316 Created docs (and example) for cudart function in torch.cuda (#128741)
Fixes #127908

## Description

Created docs to document the torch.cuda.cudart function to solve issue #127908.
I tried to stick to the [guidelines to document a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function), but I was not sure whether there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead, checked from the user's end what the function raises, etc., and documented that (i.e. I am stating what `_lazy_init()` actually raises).

Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake.

### Summary of Changes

- Added docs for torch.cuda.cudart
- Added the cudart function in the autosummary of docs/source/cuda.rst
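
A minimal usage sketch (assumes a CUDA-capable machine; the profiler calls are just one common use of the returned binding):

```python
import torch

# torch.cuda.cudart() lazily initializes CUDA and returns the cudart binding.
rt = torch.cuda.cudart()

rt.cudaProfilerStart()          # bracket a region for the CUDA profiler
y = torch.randn(1024, device="cuda") * 2
torch.cuda.synchronize()
rt.cudaProfilerStop()
```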

## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included in this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741
Approved by: https://github.com/msaroufim
2024-06-17 16:50:37 +00:00
fc2913fb80 Remove amax return from _scaled_mm (#128683)
# Summary
The primary reason for the change was the lack of a current use case and the need to work around two Inductor issues:
- Tensor arguments as kwarg only
- multiple outputs from triton templates

If the need for the amax return type arises we can consider either adding it, more likely creating a separate op.

In principle PyTorch is moving away from ops that bundle lots of functionality into "mega ops". We instead rely upon the compiler to generate appropriate fused kernels.

### Changes:
- This removes the amax return value from scaled_mm. We have found that the common use case is to return in "high precision" (a type with more precision than fp8); the amax return is only relevant when returning in low precision.
- We currently still allow fp8 returns and a scaled result. Perhaps we should ban this as well...

New signature:
```Python
def meta_scaled_mm(
    self: torch.Tensor,
    mat2: torch.Tensor,
    scale_a: torch.Tensor,
    scale_b: torch.Tensor,
    bias: Optional[torch.Tensor] = None,
    scale_result: Optional[torch.Tensor] = None,
    out_dtype: Optional[torch.dtype] = None,
    use_fast_accum: bool = False,
) -> torch.Tensor:
```
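
A hedged usage sketch against the new single-output signature (requires an fp8-capable GPU; the per-tensor scales below are illustrative):

```python
import torch

a = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(64, 32, device="cuda").to(torch.float8_e4m3fn).t()  # column-major operand
scale_a = torch.tensor(1.0, device="cuda")
scale_b = torch.tensor(1.0, device="cuda")

# Single high-precision output; no amax is returned anymore.
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([16, 64])
```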
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128683
Approved by: https://github.com/vkuzo
2024-06-17 16:48:00 +00:00
73b78d1cbe Document the torch.nn.parallel.scatter_gather.gather function (#128566)
Fixes #127899

### Description
Add docstring to `torch/nn/parallel/scatter_gather.py:gather` function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128566
Approved by: https://github.com/kwen2501
2024-06-17 16:44:17 +00:00
316b729677 [Fix] TS converter constant to tensor (#128442)
#### Issue
Tensor constants were previously lifted directly as inputs in the fx graph, which resulted in errors for multiple test cases with tensor constants. This PR introduces a fix to convert tensor constants to `GetAttr` nodes in the fx graph.

This PR also introduces other fixes to maintain a valid `state_dict` for the exported program when there are tensor constants. In short, after tensor constants are converted to `GetAttr`, they are treated as buffers during retracing. The fix converts those back from buffers to constants.

#### Test Plan
Add new test cases that generate tensor constants
* `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128442
Approved by: https://github.com/angelayi
2024-06-17 16:42:43 +00:00
a87d82abd7 [BE] enable UFMT for torch/nn/*.py (#128593)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128593
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #128596, #128594, #128592
2024-06-17 16:29:29 +00:00
f6e6e55fa7 [BE] enable UFMT for torch/nn/functional.py (#128592)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128592
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #128596, #128594
2024-06-17 16:29:29 +00:00
95ac2d6482 [BE] enable UFMT for torch/nn/modules (#128594)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #128596
2024-06-17 16:29:25 +00:00
dff6342a0b [BE][Easy] enable UFMT for torch/nn/parallel (#128596)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128596
Approved by: https://github.com/mikaylagawarecki
2024-06-17 16:29:22 +00:00
bfad0aee44 [export] Preserve requires_grad for export inputs. (#128656)
Summary: Today, meta['val'] on placeholder nodes doesn't preserve requires_grad information consistent with the original inputs. There seems to be no easy way to fix this directly at the proxy tensor layer. This is useful for re-exporting the joint graph.

Test Plan: test_preserve_requires_grad_placeholders

Differential Revision: D58555651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128656
Approved by: https://github.com/tugsbayasgalan
2024-06-17 16:26:08 +00:00
2a41fc0390 Short-term fix to preserve NJT metadata cache in torch.compile (#122836)
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.

For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.

**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**

Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
ghstack dependencies: #127007, #128057
2024-06-17 15:25:09 +00:00
24443fe16a [inductor] parallel compile: Print traceback detail when there's an exception in a sub-process (#128775)
Summary: We lose traceback info when an exception occurs in a subprocess because Python traceback objects don't pickle. In the subprocess-based parallel compile, we _are_ logging an exception in the subprocess, but a) those messages are easy to miss because they're not in the traceback output, and b) it seems that logging in the subproc is swallowed by default in internal builds. This PR captures the traceback in the subprocess and makes it available in the exception thrown in the main process. Users now see failures that look like this:

```
  ...
  File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
SubprocException: An exception occurred in a subprocess:

Traceback (most recent call last):
  File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 270, in do_job
    result = SubprocMain.foo()
  File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 263, in foo
    SubprocMain.bar()
  File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 260, in bar
    SubprocMain.baz()
  File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 257, in baz
    raise Exception("an error occurred")
Exception: an error occurred
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128775
Approved by: https://github.com/jansel
2024-06-17 15:10:47 +00:00
e3093849e5 [Docs] Update links (#128795)
From
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding to
https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

And from
https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag  to
https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html

Fixes https://github.com/pytorch/pytorch/issues/128774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128795
Approved by: https://github.com/atalman
2024-06-17 14:55:32 +00:00
0f81473d7b Update fake tensor error checks for bool tensor subtraction (#128492)
Fixes #127003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128492
Approved by: https://github.com/soulitzer
2024-06-17 13:41:15 +00:00
b0282071c4 [dynamo] override torch.nn.modules.activation._is_make_fx_tracing (#128748)
Discovered while inlining `MultiHeadAttention` nn Module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128748
Approved by: https://github.com/jansel
ghstack dependencies: #128315
2024-06-17 08:49:29 +00:00
b40a033c38 [cpp_extension][inductor] Fix sleef windows depends. (#128770)
# Issue:
While working on enabling inductor on PyTorch Windows, I found a sleef lib dependency issue.
<img width="1011" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/423bd854-3c5f-468f-9a64-a392d9b514e3">

# Analysis:
After we enabled SIMD on PyTorch Windows (https://github.com/pytorch/pytorch/pull/118980), the sleef functions are called from the VEC headers. This brings in sleef as a dependency.

Here is a difference between the Windows and Linux OSes.
## Linux :
Linux exports its functions by default, so libtorch_cpu.so statically links sleef.a and then also exports sleef's functions.
<img width="647" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/00ac536c-33fc-4943-a435-25590508840d">

## Windows:
Windows does not export its functions by default and has many limitations on exporting functions; reference: https://github.com/pytorch/pytorch/issues/80604
We can't expose sleef functions via torch_cpu.dll the way Linux does.

# Solution:
Actually, we already package the sleef static lib as part of the release. We just need to help users link against sleef.lib, and it should be fine.
1. Add sleef to cpp_builder for inductor.
2. Add sleef to cpp_extension for C++ extensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128770
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-17 05:44:34 +00:00
a52c8ace98 [3/N] Non-Tensor: Support string parameter for aten operations (#125831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125831
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-06-17 05:11:29 +00:00
cyy
74e11a4210 Enable clang-tidy on torch/csrc/mps (#128782)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128782
Approved by: https://github.com/Skylion007
2024-06-17 02:19:48 +00:00
cyy
f9dae86222 Concat namespaces in torch/csrc/utils/* (#128787)
Concat namespaces in torch/csrc/utils/*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128787
Approved by: https://github.com/Skylion007
2024-06-16 23:51:14 +00:00
6cbdbb6c3c Remove top lev numpy dependency from fuzzer.py (#128759)
Test CI

This fixes issues like the one below, where I don't even intend to use the fuzzer. This way, numpy is only imported if someone calls functions from the fuzzer; otherwise the import no longer happens at the top of the file:

```
>>> import torchao
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/__init__.py", line 26, in <module>
    from torchao.quantization import (
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/__init__.py", line 7, in <module>
    from .smoothquant import *  # noqa: F403
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/smoothquant.py", line 18, in <module>
    import torchao.quantization.quant_api as quant_api
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/quant_api.py", line 23, in <module>
    from torchao.utils import (
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/utils.py", line 2, in <module>
    import torch.utils.benchmark as benchmark
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/__init__.py", line 4, in <module>
    from torch.utils.benchmark.utils.fuzzer import *  # noqa: F403
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/utils/fuzzer.py", line 5, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```
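
A sketch of the lazy-import pattern being applied (helper names are illustrative, not the exact diff):

```python
# Before: a module-level `import numpy as np` made `import torch.utils.benchmark`
# fail on machines without numpy. After: defer the import until fuzzer code runs.
def _numpy():
    import numpy as np  # only needed when the fuzzer is actually used
    return np

def random_sizes(seed: int, n: int):
    np = _numpy()
    state = np.random.RandomState(seed)
    return state.randint(1, 1024, size=n).tolist()
```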
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128759
Approved by: https://github.com/Skylion007
2024-06-16 16:34:12 +00:00
f8d60e0e0a [Inductor][CPP] Fix Half data type cse cache issue for CPP Backend (#128498)
**Summary**
Fixing issue: https://github.com/pytorch/pytorch/issues/128263. After https://github.com/pytorch/pytorch/issues/115260, we cached the higher precision cse variable to avoid duplicate casting between buffers. However, it failed to check the original data type. This means if we convert `int32` to `bf16` for `store` and then convert `bf16` back to `fp32` for `load`, it would incorrectly hit the cache and reuse the `int32` cse var. This PR fixes the issue.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_issue_128263
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128498
Approved by: https://github.com/jgong5, https://github.com/zhuhaozhe, https://github.com/jerryzh168
2024-06-16 11:27:13 +00:00
979edbbe12 [Traceable FSDP2] Dynamo support FSDP2 use_training_state context manager (#127854)
Improve Dynamo to support the FSDP2 `use_training_state()` context manager.

Test command:
`
pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_dynamo_trace_use_training_state
`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127854
Approved by: https://github.com/yanboliang
2024-06-16 08:48:52 +00:00
e4d8aa4d24 [torchbench] Enable some models with inline_inbuilt_nn_modules (#128315)
For all models, graph breaks/recompiles are reduced.
For drq, they increase, and that increase is a legitimate one.

Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128315
Approved by: https://github.com/jansel
2024-06-16 08:37:23 +00:00
cc518ebd38 [Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 2) (#124147)
Reuse Inductor test case for Intel GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124147
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-06-16 08:07:05 +00:00
f1ee3589a1 [Inductor] Emit strided block pointer from ModularIndexing and FloorDiv (#127342)
**Summary**

Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.

Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be:
  - Powers of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size, or is greater than the block size. In the latter case, `CeilDiv(x, y)` rounds up to 1.
  - Multiples of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size.

Note that a *slice* of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`.
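
For intuition, here is a tiny sketch (plain Python over a hypothetical (32, 16, 8) iteration space, rather than Inductor's symbolic expressions) of the per-element div/mod arithmetic that the N-D block pointer replaces:

```python
# Flat index -> 3D coordinates; Inductor previously emitted this FloorDiv /
# ModularIndexing arithmetic per element, which the block pointer now avoids.
def flat_to_3d(xindex):
    i = xindex // (16 * 8)   # FloorDiv
    j = (xindex // 8) % 16   # ModularIndexing
    k = xindex % 8
    return i, j, k

assert flat_to_3d(0) == (0, 0, 0)
assert flat_to_3d(4095) == (31, 15, 7)
```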

Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077

Example kernel:
```
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```

**Test Plan**

This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories:
  - Can we generate strided block pointers for the appropriate shapes?
     - Powers of 2
     - Non-power of 2, but multiple of the maximum block size
     - Arbitrary leading dimensions, with power of 2 inner dimensions
     - Weird strides and offsets
     - Reductions
     - Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo)
     - Broadcasts (some variables are missing from the indexing expression)
  - Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers?
     - Unsupported static shapes
     - Unsupported symbolic shapes
  - Mixing and matching these cases:
     - Pointwise and reduction in the same kernel
  - Sanity check the test harness
     - Do we raise an exception if the expected number of block pointers and the actual number are different?

**Follow-ups**

There are a few important cases which this PR can't handle. I'm hoping these can be deferred to follow-up PRs:
  - Handle non-divisible shapes
      - Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted.
      - Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton.
 - Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead?

Differential Revision: D56739375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-06-16 07:35:57 +00:00
a61939467a Enable passing dynamo-traced complex test (#128771)
Fixes https://github.com/pytorch/pytorch/issues/118159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128771
Approved by: https://github.com/anijain2305
2024-06-16 07:28:09 +00:00
ab13980424 [ONNX] Update 'person_of_interest.rst', 'CODEOWNERS' and 'merge_rules.yaml' (#126364)
The following are all constrained under the ONNX exporter project scope.

- `personal_of_interest.rst`
  - Moving folks no longer working on the project to emeritus.
  - Adding @justinchuby, @titaiwangms, @shubhambhokare1 and @xadupre,
    who have all made countless contributions to this project.
- `CODEOWNERS`
  - Removing folks no longer working on the project.
  - Updating new owners who will now be notified with PRs related to
    the specific file paths.
- `merge_rules.yaml`
  - Removing folks no longer working on the project.

🫡

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126364
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/albanD
2024-06-16 04:52:16 +00:00
6079c50910 Make config.fx_graph_remote_cache be three-value switch (#128628)
Summary:
We want to allow for three configurations
False: Force off
True: Force on
None: OFF for OSS and JK config for internal
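
A hedged sketch of the resulting resolution logic (the function and flag names are illustrative, not the actual config plumbing):

```python
from typing import Optional

def resolve_fx_graph_remote_cache(config_value: Optional[bool], is_fbcode: bool, jk_enabled: bool) -> bool:
    # True/False force the cache on/off; None defers to the environment:
    # off in OSS, JustKnobs-controlled internally.
    if config_value is not None:
        return config_value
    return jk_enabled if is_fbcode else False
```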

Test Plan: CI

Differential Revision: D58535897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128628
Approved by: https://github.com/masnesral, https://github.com/eellison
2024-06-15 17:52:09 +00:00
94c0dcbe1d [inductor] Parallel compile: handle crashes in subprocesses (#128757)
Summary: If any subprocess in the pool crashes, we get a BrokenProcessPool exception and the whole pool becomes unusable. Handle crashes by recreating the pool.

Test Plan:
* New unit test
* Started a long-running test (`test/inductor/test_torchinductor.py`), periodically killed subprocess manually, made sure the test run recovers and makes progress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128757
Approved by: https://github.com/jansel
2024-06-15 17:35:04 +00:00
f0d68120f4 [subclasses] Handle dynamo inputs that are subclass views with (-1) in the view (#128662)
When handling an input to dynamo that's a view of a subclass, dynamo does some handling to reconstruct the view. Part of this is to construct symints for the input parameters to the view.

Previously, the code would just call `create_symbol()` which by default specifies a _positive_ symint (>= 0); this fails in the case where you have an aten::view that was called with a -1.

Fix: just specify `positive=None` when calling `create_symbol()`, to avoid restricting the symint to >= 0 or <= 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128662
Approved by: https://github.com/jbschlosser
2024-06-15 14:58:18 +00:00
18634048a1 Separate AOTI Eager utils as a single file (#125819)
The key change is code movement. We just moved the AOTI eager-related code from `torch._inductor.utils` to `torch._inductor.aoti_eager`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #125308
2024-06-15 13:42:49 +00:00
7a39755da2 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses it) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-15 10:20:21 +00:00
60bbdc0b40 Modularize aten parameter parser and checker (#125308)
In this PR, we abstract the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.

```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```

With this PR, we can extend support to other parameter types in a more modular way, such as `string`, `int`, `double`, and the other types summarized in the following list. The list is collected from all aten operations and ordered by how often each type is used.

- `Tensor`
- `bool`
- `int64_t`
- `TensorList`
- `Scalar`
- `c10::SymIntArrayRef`
- `::std::optional<Tensor>`
- `IntArrayRef`
- `double`
- `c10::SymInt`
- `::std::optional<ScalarType>`
- `::std::optional<double>`
- `::std::optional<bool>`
- `::std::optional<Layout>`
- `::std::optional<Device>`
- `::std::optional<int64_t>`
- `Dimname`
- `::std::optional<Generator>`
- `c10::string_view`
- `::std::optional<c10::string_view>`
- `OptionalIntArrayRef`
- `::std::optional<Scalar>`
- `OptionalSymIntArrayRef`
- `::std::optional<MemoryFormat>`
- `::std::optional<c10::SymInt>`
- `ScalarType`
- `ArrayRef<Scalar>`
- `DimnameList`
- `::std::optional<ArrayRef<double>>`
- `::std::array<bool,3>`
- `::std::optional<DimnameList>`
- `c10::List<::std::optional<Tensor>>`
- `::std::array<bool,2>`
- `Storage`
- `::std::array<bool,4>`
- `Device`
- `DeviceIndex`
- `ITensorListRef`
- `Stream`
- `Layout`
- `MemoryFormat`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-15 09:18:44 +00:00
de4f379cf2 run mkldnn test with inlining (#128749)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128749
Approved by: https://github.com/anijain2305
2024-06-15 09:04:08 +00:00
b50c0e94c2 TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY (#128739)
This adjusts the settings of the libuv backend to match the older TCPStore.

* DEFAULT_BACKLOG: setting this to -1 enables using the host's somaxconn value instead of a hardcoded 16k value. Going over that limit with `tcp_abort_on_overflow` set results in connections being reset.
* TCP_NODELAY: since TCPStore primarily sends small messages, there's no benefit to using Nagle's algorithm, and it may add extra latency to store operations.

Test plan:

```
python test/distributed/test_store.py -v -k LibUv
```

Benchmark script:
```
import time
import os

import torch.distributed as dist

rank = int(os.environ["RANK"])

store = dist.TCPStore(
    host_name="<server>",
    port=29500,
    world_size=2,
    is_master=(rank == 0),
    use_libuv=True,
)

if rank == 1:
    total_iters = 0
    total_dur = 0
    for iter in range(10):
        iters = 500000
        start = time.perf_counter()
        for i in range(iters):
            store.set(f"key_{i}", f"value_{i}")
        dur = time.perf_counter() - start
        print(f"{iter}. {iters} set, qps = {iters/dur}")
        total_iters += iters
        total_dur += dur

    print(f"overall qps = {total_iters/total_dur}")
else:
    print("sleeping")
    time.sleep(1000000000)
```

The performance difference between enabling TCP_NODELAY and not seems to be negligible for a single host.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128739
Approved by: https://github.com/rsdcastro, https://github.com/kurman, https://github.com/c-p-i-o
2024-06-15 07:40:18 +00:00
cyy
e4c32d14a8 [3/N] Remove inclusion of c10/util/string_utils.h (#128504)
Follows #128372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128504
Approved by: https://github.com/malfet
2024-06-15 06:38:40 +00:00
472211c97a Make assert_size_stride to return all errors (#128764)
This will help debug some problems I'm encountering, but in general, it is best to show the entire error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128764
Approved by: https://github.com/jansel
2024-06-15 06:32:40 +00:00
4ccbf711e2 Learning Rate Scheduler docstring fix (#128679)
Fix docstrings in Learning Rate Scheduler.

The fix can be verified by running pydocstyle path-to-file --count

Related #112593

**BEFORE the PR:**
pydocstyle torch/optim/lr_scheduler.py --count

92


**AFTER the PR:**
pydocstyle torch/optim/lr_scheduler.py --count

0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128679
Approved by: https://github.com/janeyx99
2024-06-15 05:30:35 +00:00
108adbc726 [dynamo][side effects] Raise assertion error if the object is already tracked for mutation (#128590)
This issue was pointed out by @tombousso here - https://github.com/pytorch/pytorch/pull/128269#issuecomment-2163755792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128590
Approved by: https://github.com/mlazos
ghstack dependencies: #128715, #128269
2024-06-15 05:07:49 +00:00
9ebf77b13b Fix windows inductor defination issue (#128686)
Changes:
1. Add memory-alignment macro support on Windows.
2. Fix `#pragma unroll` not being supported by the MSVC cl compiler.
`#pragma unroll` raises an error with the MSVC `cl` compiler, but it is supported by Windows `clang`.
We'd better disable it only for the `__msvc_cl__` compiler, so we still get the better performance when building with `clang`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128686
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-15 03:02:00 +00:00
7e092a62e6 [dynamo] Support weakref objects (#128533)
Fixes https://github.com/pytorch/pytorch/issues/125720

I was earlier worried that DELETE_* or STORE_* on referent values should result in a graph break, because they could invalidate the weak ref. But then @zou3519 pointed out that weakref invalidation will happen EVENTUALLY; CPython provides no guarantees about when the weakref will be invalidated (even when the user calls del x and x is the last reference).

So any code that relies on `del x` to invalidate the weakref of x right away is BAD code; CPython provides no guarantees. Therefore we can (ab)use this nuance and just ignore DELETE_* or STORE_* on the referent objects.

The only corner case is when Dynamo is reconstructing the weakref object. Dynamo will have a hard time being correct here, so we just SKIP_FRAME in such a case. This is rare.
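
A minimal sketch of the now-traceable pattern (class and function names are illustrative; corner cases may still fall back):

```python
import weakref
import torch

class Config:
    scale = 2.0

cfg = Config()
cfg_ref = weakref.ref(cfg)  # weakref captured by the compiled function

@torch.compile
def scale_input(x):
    c = cfg_ref()  # dereferencing a weakref no longer forces a graph break
    if c is not None:
        return x * c.scale
    return x

print(scale_input(torch.ones(3)))
```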

CPython notes
1) https://docs.python.org/3/library/weakref.html
2) https://docs.python.org/3/reference/datamodel.html#index-2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128533
Approved by: https://github.com/jansel
2024-06-15 02:16:25 +00:00
62a0e39ced [dynamo][inlining-nn-modules] Update tests with new expected counts (#128463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128463
Approved by: https://github.com/yanboliang
2024-06-15 02:08:02 +00:00
2d01f87737 Enable torch.empty for float8 dtypes + deterministic mode + cpu (#128744)
Summary:

Enables creating empty float8 tensors for:
* cuda when `torch.use_deterministic_algorithms` is set to True
* cpu for all settings of `torch.use_deterministic_algorithms`

Context for NaN values of float8_e4m3fn and float8_e5m2: https://arxiv.org/pdf/2209.05433, Section 3, Table 1

Context for NaN values of float8_e4m3fnuz and float8_e5m2fnuz: https://arxiv.org/pdf/2206.02915, Section 3.2, "instead of reserving one exponent field to represent Inf and NaN, we reserve only a single codeword (corresponding to negative zero)"
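
A quick sketch of what this enables (CPU shown; the dtype choice is illustrative):

```python
import torch

torch.use_deterministic_algorithms(True)

# Previously this combination raised; now uninitialized memory can be filled
# deterministically for float8 dtypes as well.
t = torch.empty(4, 4, dtype=torch.float8_e4m3fn, device="cpu")
print(t.shape, t.dtype)
```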

Test Plan:

```
python test/test_quantization.py -k test_empty
```

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes https://github.com/pytorch/pytorch/issues/128733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128744
Approved by: https://github.com/malfet, https://github.com/drisspg
2024-06-15 02:05:30 +00:00
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314d8d63bb347becb8309f5ac7761c6b5.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
5efe71f134 Revert "[export] Add print_readable to unflattener (#128617)"
This reverts commit 5d9a609b4f6c94fb930188e4d7c99f53d989c022.

Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/huydhn due to Sorry for reverting your change but another failed test shows up in trunk inductor/test_flex_attention.py where it needs to be updated 5d9a609b4f.  I guess it is easier to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2169030779))
2024-06-15 01:46:23 +00:00
f37121bb74 Add model name, quantization and device to gpt_fast micro benchmark output (#128091)
A small enhancement to https://hud.pytorch.org/benchmark/llms with these columns in the output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128091
Approved by: https://github.com/yanboliang
2024-06-15 01:39:48 +00:00
3f47c72268 add multiprocessing checks in test_dataloader.py (#128244)
Add multiprocessing checks in test_dataloader.py for tests requiring multiprocessing similar to test_multiprocessing.py: https://github.com/pytorch/pytorch/blob/main/test/test_multiprocessing.py#L41-L52. Change all Jetson skips to TEST_CUDA_IPC checks since that is the root cause of the failures on Jetson in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128244
Approved by: https://github.com/eqy, https://github.com/malfet
2024-06-15 01:32:55 +00:00
73ba432d32 [custom_op]Fix None return schema (#128667)
Fixes #125044

If users define a schema that returns `None`, it is parsed to a `torch.NoneType`. Auto functionalization supports `()` as an empty return but not `None`, so a `None` return fails the check in [`can_auto_functionalize`](https://github.com/pytorch/pytorch/blob/findhao/fix_none_return_functionalize/torch/_higher_order_ops/auto_functionalize.py#L71) even though we could treat it as a `()` return. This PR is a fix that skips the check for a `None` return.

I hope it can be fixed at a [deeper level](31e44c72ca), but that fix breaks a lot of existing schemas, so it's better to address the issue in auto_functionalize.py for now.
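
A hedged sketch of the unblocked pattern, written with the Python custom-op API and a `None` return annotation (names are illustrative; the registration path in the original report may differ):

```python
import torch

# A custom op whose schema has no return value; auto functionalization can now
# treat it like an empty () return instead of rejecting it.
@torch.library.custom_op("mylib::scale_inplace", mutates_args=["x"])
def scale_inplace(x: torch.Tensor, factor: float) -> None:
    x.mul_(factor)

@torch.compile
def f(x):
    scale_inplace(x, 2.0)
    return x + 1

print(f(torch.ones(3)))
```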

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128667
Approved by: https://github.com/zou3519
2024-06-15 00:41:37 +00:00
6616ad030f [Inductor] Fix the High Order Op layout issue (#128275)
Fix the issue: https://github.com/pytorch/pytorch/issues/127995

- In the current implementation of creating `FallbackKernel`, the `device` of the `NoneLayout` is set to `None` when the `example_output` returned from `cls.process_kernel` is `None`. 921aa194c7/torch/_inductor/ir.py (L5632-L5649)
- If an `ExternalKernel` scheduler node has a `None` device, the previous buffer is not flushed before codegen-ing this `ExternalKernel` scheduler node, which causes wrong generated code.
ef2b5ed500/torch/_inductor/scheduler.py (L2701-L2709)

**Test Plan**
```
python -u -m pytest -s -v test/higher_order_ops/test_with_effects.py -k test_compile_inductor_external_op_return_none
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128275
Approved by: https://github.com/eellison
2024-06-15 00:33:21 +00:00
5d9a609b4f [export] Add print_readable to unflattener (#128617)
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.

Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam

        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam);  x = rootparam = None

        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul);  mul = None
        bar: "f32[2, 3]" = self.bar(foo);  foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul);  mul = None

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param);  nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul);  mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer);  add = child2buffer = None
            return sub
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
2024-06-15 00:26:04 +00:00
d67923b955 Adding kwargs to composable AC API to enable full capabilities (#128516)
Summary:
Firstly, this does not change any existing behaviour, since all the
default values for kwargs were hardcoded into the ``_checkpoint_without_reentrant_generator`` call.

Secondly, this is needed for unlocking the full potential of composable
checkpointing making it equivalent to ``torch.utils.checkpoint.checkpoint(use_reentrant=False)``.

Finally, an added benefit is that composable checkpointing can now be used under ``FakeTensorMode`` by
passing ``preserve_rng_state=False``.
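
A hedged usage sketch (assuming the composable `checkpoint` API from `torch.distributed._composable`; the kwarg shown is the one called out above):

```python
import torch
from torch.distributed._composable import checkpoint

module = torch.nn.Linear(8, 8)
# Kwargs are now forwarded to _checkpoint_without_reentrant_generator, e.g.
# disabling RNG-state preservation so this also composes with FakeTensorMode.
checkpoint(module, preserve_rng_state=False)

out = module(torch.randn(2, 8)).sum()
out.backward()
```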

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128516
Approved by: https://github.com/awgu
2024-06-15 00:23:48 +00:00
271852aa7e inductor: pre-grad bmm pass shouldn't match if output is mutated (#128570)
This PR is enough to get this test to pass when using `TORCHDYNAMO_INLINE_INBUILT_NN_MODULES`:
```
TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1  python test/inductor/test_group_batch_fusion.py -k TestPostGradBatchLinearFusion.test_batch_linear_post_grad_fusion
```

Inductor has a pre-grad pass to swap out multiple `linear` layers with `addbmm`, but it also needs to insert an `unbind()` at the end. If that unbind is then followed by a mutation (like `add_()`), the autograd engine will complain (autograd does not let you mutate the output of multi-output-view ops like unbind).

I made a tweak to the pattern matching logic to avoid matching if the output of the linear is used in an op that mutates its input. My hope is that:
(1) this situation is rare enough that it won't materially impact pattern matching in real world code
(2) I had to use a heuristic for "is an op a mutable op", since the graph we get is from dynamo, so it can contain code like `operator.iadd` in it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128570
Approved by: https://github.com/eellison, https://github.com/mlazos
ghstack dependencies: #127927
2024-06-15 00:08:44 +00:00
ba19ed9a1a FunctionalTensor: dispatch metadata directly to inner tensor (#127927)
Fixes https://github.com/pytorch/pytorch/issues/127374

The error in the linked repro is:
```
AssertionError: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.sym_storage_offset.default(_to_functional_tensor(FakeTensor(..., device='cuda:0', size=(16, 4), dtype=torch.uint8),
       device='cuda:0'))
```

Where we hit FakeTensor.__torch_dispatch__, but our input is a C++ `FunctionalTensorWrapper`.

What should actually have happened is that the call to `aten.sym_storage_offset` hits the `Functionalize` dispatch key, which should remove the `FunctionalTensorWrapper`  and redispatch. I spent some time debugging and haven't actually figured out why this isn't happening. Instead, this PR just skips that step completely, and asks `FunctionalTensor` to directly unwrap the C++ `FunctionalTensorWrapper` when querying tensor metadata.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127927
Approved by: https://github.com/tugsbayasgalan
2024-06-15 00:08:44 +00:00
574a2cbcb7 Enable UFMT on common_device_type.py and common_dtype.py (#128490)
Part of: https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
> torch/testing/_internal/common_device_type.py
> torch/testing/_internal/common_dtype.py

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128490
Approved by: https://github.com/ezyang, https://github.com/XuehaiPan
2024-06-15 00:07:42 +00:00
0492ec460a [BE] Remove external testing of torch::deploy (#127952)
As we don't expect external users of torch::deploy as the library is no longer supported, we will remove external testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127952
Approved by: https://github.com/malfet
2024-06-14 23:32:02 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
52d4442a00 [c10d] Socket, TCPStore: add better logging (#128673)
This adds better logging of errors to the socket and TCPStore classes.

All socket operations should now include the local and remote addresses, and we now log errors from TCPStoreBackend::run as well as TCPStoreBackendUV, which were previously INFO messages and not actually logged.

It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky.

Test plan:

```
python test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673
Approved by: https://github.com/c-p-i-o
2024-06-14 23:08:29 +00:00
4abecd7102 [AOTI] fixed performance issue for AOTI_TORCH_CHECK (#128402)
We introduced AOTI_TORCH_CHECK in #119220 to resolve slow-compilation-time issues. Unfortunately, it caused perf regressions for CPU, as described in issue #126665. After some investigation, it turned out the slow performance was caused by the use of the builtin function __builtin_expect provided by gcc/clang. Moreover, nuking __builtin_expect doesn't seem to cause any performance penalty, even though its purpose is to improve performance by providing the compiler with branch-prediction information.

Abs latency numbers using the script shared by #126665:

| Model | Before the fix | After the fix |
|---|---|---|
| T5Small | 1019.055694 | 917.875027 |
| T5ForConditionalGeneration | 1009.825196 | 916.369239 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128402
Approved by: https://github.com/desertfire
2024-06-14 23:03:17 +00:00
fd27138c4a Update DALLE2_pytorch expected accuracy result on CPU (#128718)
I suspect that the issue shows up because of the new version of https://pypi.org/project/pyarrow/16.1.0/#history released yesterday.  The package is a dependency of DALLE2_pytorch https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/DALLE2_pytorch/install.py#L22.

I'll just update the expected accuracy result on CPU benchmark because the model fails to run there anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128718
Approved by: https://github.com/malfet
2024-06-14 22:54:21 +00:00
d3a4d9e4fe Update cu124 dynamo benchmark expected values (#128737)
Missed one in https://github.com/pytorch/pytorch/pull/128589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128737
Approved by: https://github.com/Skylion007
2024-06-14 22:23:00 +00:00
bca2cf00ed [ONNX] Add dynamic axes support to torchscript exporter with dynamo=True (#128371)
This PR enables specific axes to be dynamic by calling torch.export.export and torch.export.Dim.

Features:
(1) Turns dynamic_axes into dynamic_shapes
(2) Dim constraints remain the same (see the test case that hits constraints). This might give a different user experience, since we didn't have any constraints in torchscript-onnx exporting.
(3) If input_names is used in dynamic_axes, ValueError will be raised, as input_names is currently not supported.
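
A hedged sketch of the user-facing call (the model and axis names are illustrative):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

# dynamic_axes entries are translated into torch.export dynamic_shapes/Dim under
# the hood; combining input_names with dynamic_axes currently raises ValueError.
torch.onnx.export(
    M(),
    (torch.randn(2, 3),),
    "model.onnx",
    dynamic_axes={"x": {0: "batch"}},
    dynamo=True,
)
```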
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128371
Approved by: https://github.com/justinchuby
2024-06-14 21:56:51 +00:00
f103247a14 Run all samples for torchinductor tests (#128343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128343
Approved by: https://github.com/lezcano
2024-06-14 21:52:12 +00:00
e9c6e8369c Torchbind call method + effects support (#128397)
Adds effect token support to torchbind method calls by allowing `with_effects` to take in `torch.ops._higher_order_ops.call_torchbind` as an input.

Here is the print from `TORCH_LOGS="aot" python test/export/test_torchbind.py -k test_compile_obj_torchbind_op`:
```python
def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2]", arg2_1):
    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1266 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.cos())
    cos: "f32[2]" = torch.ops.aten.cos.default(arg1_1)
    with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, cos);  arg0_1 = cos = None
    getitem: "f32[0]" = with_effects[0];  with_effects = None

    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1267 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.cos() + 1)
    cos_1: "f32[2]" = torch.ops.aten.cos.default(arg1_1)
    add: "f32[2]" = torch.ops.aten.add.Tensor(cos_1, 1);  cos_1 = None
    with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, add);  getitem = add = None
    getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None

    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1268 in f, code: torch.ops._TorchScriptTesting.queue_pop(tq)
    with_effects_2 = torch._higher_order_ops.effects.with_effects(getitem_2, torch.ops._TorchScriptTesting.queue_pop.default, arg2_1);  getitem_2 = None
    getitem_4: "f32[0]" = with_effects_2[0];  with_effects_2 = None

    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1269 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.sin())
    sin: "f32[2]" = torch.ops.aten.sin.default(arg1_1);  arg1_1 = None
    with_effects_3 = torch._higher_order_ops.effects.with_effects(getitem_4, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, sin);  getitem_4 = sin = None
    getitem_6: "f32[0]" = with_effects_3[0];  with_effects_3 = None

    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1270 in f, code: return tq.pop(), tq.pop() + tq.size(), tq
    with_effects_4 = torch._higher_order_ops.effects.with_effects(getitem_6, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'pop');  getitem_6 = None
    getitem_8: "f32[0]" = with_effects_4[0]
    getitem_9: "f32[2]" = with_effects_4[1];  with_effects_4 = None
    with_effects_5 = torch._higher_order_ops.effects.with_effects(getitem_8, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'pop');  getitem_8 = None
    getitem_10: "f32[0]" = with_effects_5[0]
    getitem_11: "f32[2]" = with_effects_5[1];  with_effects_5 = None
    with_effects_6 = torch._higher_order_ops.effects.with_effects(getitem_10, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'size');  getitem_10 = arg2_1 = None
    getitem_12: "f32[0]" = with_effects_6[0];  with_effects_6 = None
    add_1: "f32[2]" = torch.ops.aten.add.Tensor(getitem_11, 0);  getitem_11 = None
    return (getitem_12, getitem_9, add_1)
```

In order to support this, this PR makes the following changes:
* Adds `FakeScriptObject` to `CustomObjArgument`, which will be put on the `meta["val"]` of nodes representing torchbind objects.
* Adds pickle/deepcopy support to FunctionSchema.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128397
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-06-14 21:28:17 +00:00
65d3ddcb8b Add GLIBC requirements for libtorch to solve #113124 (#128135)
Fixes #113124.

## Description

I modified the installing.rst file to address the system requirements and troubleshooting steps for using LibTorch with different GLIBC versions.

### Summary of Changes

- Added system requirements specifying the GLIBC version needed for both the cxx11 ABI version and the pre-cxx11 ABI version of LibTorch.
- Included a troubleshooting section with instructions on how to check the dependencies of the LibTorch libraries and identify the required GLIBC version using the `ldd lib/libtorch.so` command.

## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included in this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128135
Approved by: https://github.com/jbschlosser
2024-06-14 21:24:53 +00:00
e9a29aaa4a [ONNX] Add upsample trilinear to skip decomp (#128259)
(1) Add upsample trilinear vec to skip decomposition
(2) Add tests to make sure that torch.export.export still decomposes them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128259
Approved by: https://github.com/justinchuby
2024-06-14 21:20:44 +00:00
e6e102cf85 Dynamo testing: add some skips (#128734)
The following tests are failing consistently for me locally, so we're
going to skip them. They're disabled in CI but it looks like they're
just always failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128734
Approved by: https://github.com/williamwen42
ghstack dependencies: #128731
2024-06-14 20:53:30 +00:00
11de50f17c [Dynamo] skip some TorchScript tests (#128731)
We don't care about the Dynamo x TorchScript composition, so I'm
disabling these tests (so they don't get reported as flaky). Not
disabling all of the TorchScript tests yet because they have been useful
to catch random bugs.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128731
Approved by: https://github.com/williamwen42
2024-06-14 20:53:30 +00:00
4b96575a09 [dynamo][aot autograd] Silently disable default saved tensor hooks during tracing (#123196)
FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently nop default saved tensor hooks, in order to support user-code that use torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.

For compiled autograd, we're firing pack hooks once and unpack hooks twice right now; I'll look into this separately from this issue.
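
For reference, here is a small sketch of the existing Python-level APIs involved (the default saved-tensor hooks and the context manager that disables them), not the new C API this PR adds:

```python
import torch

def pack(t):
    return t.detach()          # called when autograd saves a tensor for backward

def unpack(saved):
    return saved               # called when backward needs the saved tensor again

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    x = torch.randn(3, requires_grad=True)
    (x * x).sum().backward()   # pack/unpack fire around the saved `x`

with torch.autograd.graph.disable_saved_tensors_hooks("hooks not allowed here"):
    pass                       # entering saved_tensors_hooks() in this region raises
```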

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
2024-06-14 20:28:08 +00:00
1aafb9eb90 [dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)
Fixes https://github.com/pytorch/pytorch/issues/101168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269
Approved by: https://github.com/jansel
ghstack dependencies: #128715
2024-06-14 20:17:03 +00:00
9c77332116 [torch.compile][ci] Flaky models in CI (similar to DISABLED_TEST) (#128715)
These models are really flaky. I went into the CI machine and ran the model many times; sometimes it fails, sometimes it passes. Even PyTorch eager results change from run to run, so the accuracy comparison is fundamentally broken/non-deterministic. I am hitting these issues more frequently in the inlining work. There is nothing wrong with inlining; I think these models are on the edge of an already-broken accuracy measurement, and inlining just pushes them further in that broken direction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128715
Approved by: https://github.com/eellison
2024-06-14 20:17:03 +00:00
2e5366fbc0 Extended Module Tracker (#128508)
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.

1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``.
2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because a module's pre-fw and post-fw hooks get called in the backward pass during AC, and we do not want to register multi-grad hooks in that case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
2024-06-14 19:48:46 +00:00
d50712e5e3 [PT2] add inductor log for unbind_stack_pass (#128684)
Summary: Currently, we do not log this pass. To better enable inspection of pattern hits, we now log it.

Test Plan: see signal

Differential Revision: D58571992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128684
Approved by: https://github.com/dshi7
2024-06-14 19:45:55 +00:00
9035fff2de [BE] Do not test deprecated torch.nn.utils.weight_norm (#128727)
Test `torch.nn.utils.parametrizations.weight_norm` instead
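
For illustration, a minimal sketch (not from the PR) of the two APIs in question, the deprecated wrapper versus the parametrization-based replacement the tests now exercise:

```python
import torch
import torch.nn as nn

lin_old = torch.nn.utils.weight_norm(nn.Linear(4, 4))                   # deprecated
lin_new = torch.nn.utils.parametrizations.weight_norm(nn.Linear(4, 4))  # replacement

x = torch.randn(2, 4)
print(lin_old(x).shape, lin_new(x).shape)  # both apply weight normalization to the Linear weight
```
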
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128727
Approved by: https://github.com/kit1980
ghstack dependencies: #128726
2024-06-14 19:14:44 +00:00
27458cc097 [BE] Refactor repeated code in test_weight_norm (#128726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128726
Approved by: https://github.com/kit1980
2024-06-14 19:14:44 +00:00
a6bd154a42 [inductor] Support mm decomps for matrices with unbacked sizes (#128655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128655
Approved by: https://github.com/jansel
2024-06-14 18:35:42 +00:00
b94c52dd29 [GHF] Refuse merge to non-default branch (#128710)
Unless PR is ghstack one

Test plan:
```
% GITHUB_TOKEN=$(gh auth token)  python3 -c "from trymerge import GitHubPR; pr=GitHubPR('pytorch', 'pytorch', 128591); print(pr.base_ref(), pr.default_branch())"
release/2.4 main
```
Fixes: https://github.com/pytorch/test-infra/issues/5339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128710
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-06-14 18:23:25 +00:00
be0eec9031 [export] Improve static typing in tracer. (#128552)
Summary: as title.

Test Plan: CI

Differential Revision: D58485487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128552
Approved by: https://github.com/angelayi
2024-06-14 17:57:37 +00:00
2367161e4b Revert "[ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966)"
This reverts commit c339efaf023b4af056dad4cb2f11c07930ed8af6.

Reverted https://github.com/pytorch/pytorch/pull/127966 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/127966#issuecomment-2168505985))
2024-06-14 17:57:23 +00:00
d7fc871175 [inductor] Improve superfluous mask handling in triton codegen (#128518)
This takes the logic from `filter_masks` and factors it out into
`_has_constant_mask`. I also improve support for `persistent_reduction` kernels
by making use of the static RBLOCK value and potentially XBLOCK too in the
`no_x_dim` case.

I then use this helper when generating `xmask` and `rmask`, so they can be
emitted as constants, which lets Triton optimize them away even when they are
included.

e.g. `compiled_sum(torch.randn(1024, 512, device="cuda"), dim=-1)`
before:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), rmask & xmask, other=0.0)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = tl.where(rmask & xmask, tmp1, 0)
    tmp4 = triton_helpers.promote_to_tensor(tl.sum(tmp3, 0))
    tl.store(out_ptr0 + (x0), tmp4, xmask)
```

after:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = tl.full([RBLOCK], True, tl.int1)
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = tl.full([RBLOCK], True, tl.int1)
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), None)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = triton_helpers.promote_to_tensor(tl.sum(tmp1, 0))
    tl.store(out_ptr0 + (x0), tmp3, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128518
Approved by: https://github.com/lezcano
2024-06-14 17:52:55 +00:00
2357490524 [PT2] Enable shape_padding multiplier adjustment (#128346)
Summary:
Our experiments demonstrate that the current default value of 1.1 may not be the best multiplier, so we enable adjustment of this value to further improve QPS.

context: https://docs.google.com/document/d/10VjpOJkTv5A4sNX7dD6qT7PyhBxn6LSeLAuaqYtoOto/edit

Test Plan:
# IG_CTR

{F1682138315}

Differential Revision: D58373261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128346
Approved by: https://github.com/jackiexu1992
2024-06-14 17:49:24 +00:00
cyy
d4807da802 Various fixes of torch/csrc files (#127252)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127252
Approved by: https://github.com/r-barnes
2024-06-14 17:31:24 +00:00
089e76cca3 [traced-graph][sparse] remove redundant assert in sparse prop test (#128523)
The assertEqualMeta() method already tests that the first argument is a FakeTensor

https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523
Approved by: https://github.com/huydhn
2024-06-14 17:05:17 +00:00
1fb4effe7a [GPT-fast benchmark] Add MLP, gather + gemv, gemv micro benchmark (#128002)
Output example:
```
| name                         | metric                    | target  | actual  |
|------------------------------|---------------------------|---------|---------|
| layer_norm_bfloat16          | memory_bandwidth(GB/s)    | 1017    | 1000.01 |
| mlp_layer_norm_gelu_bfloat16 | flops_utilization         | 0.71    | 0.71    |
| gemv_int8                    | memory_bandwidth(GB/s)    | 990     | 984.06  |
| gemv_bfloat16                | memory_bandwidth(GB/s)    | 1137    | 1137.92 |
| gather_gemv_int8             | memory_bandwidth(GB/s)    | 1113    | 1111.09 |
| gather_gemv_bfloat16         | memory_bandwidth(GB/s)    | 1249    | 1248.15 |

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128002
Approved by: https://github.com/Chillee
2024-06-14 17:03:22 +00:00
4c84af0f5d Fix indexing and slicing of ranges in dynamo (#128567)
Fix https://github.com/pytorch/pytorch/issues/128520
Dynamo does not handle range()[binary subscript] or range()[ternary subscript] correctly. Right now it calls
the get_item function, which simply applies the subscript operation to the list [start, end, step]; that is completely unrelated to what is expected.

In Python, range()[complex subscript] is another range. For example,
range(1, 10, 2)[1:4:1] is range(3, 9, 2),
and range(1, 10, 2)[::-1] is range(9, -1, -2).

This diff fixes index and slice applications on range,
mimicking the implementation in https://github.com/python/cpython/blob/main/Objects/rangeobject.c.
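
As a minimal Python sketch (not the Dynamo code itself) of how a slice of a range reduces to another range, following the same arithmetic as CPython's rangeobject.c:

```python
def slice_range(r: range, s: slice) -> range:
    # Normalize the slice against len(r), then map the resulting index window
    # back through the original start/step arithmetic.
    start, stop, step = s.indices(len(r))
    return range(r.start + start * r.step,
                 r.start + stop * r.step,
                 r.step * step)

assert list(slice_range(range(1, 10, 2), slice(1, 4, 1))) == list(range(1, 10, 2)[1:4:1])
assert list(slice_range(range(1, 10, 2), slice(None, None, -1))) == list(range(1, 10, 2)[::-1])
```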

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128567
Approved by: https://github.com/anijain2305
2024-06-14 16:49:49 +00:00
f75f5987aa Revert "Extended Module Tracker (#128508)"
This reverts commit 1f46284f9ed5b60981174e689d750b358b19e4c4.

Reverted https://github.com/pytorch/pytorch/pull/128508 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/9515753429/job/26230639980 ([comment](https://github.com/pytorch/pytorch/pull/128508#issuecomment-2168405784))
2024-06-14 16:46:03 +00:00
732b4e9074 Fix generated vararg types (#128648)
In the generated stub files, torchgen incorrectly generates the types of variadic arguments.

The changes all look like this (changing `*size: _int` to `*size: Union[_int, SymInt]`):
```
--- ./torch/_VF.pyi.sav	2024-06-13 20:36:49.189664629 -0700
+++ ./torch/_VF.pyi	2024-06-13 20:36:57.208894614 -0700
@@ -168,17 +168,17 @@
 @overload
 def _efficientzerotensor(size: Sequence[Union[_int, SymInt]], *, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
 @overload
-def _efficientzerotensor(*size: _int, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
+def _efficientzerotensor(*size: Union[_int, SymInt], dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
 def _embedding_bag(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ...
 def _embedding_bag_forward_only(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ...
 @overload
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128648
Approved by: https://github.com/jamesjwu
2024-06-14 16:04:37 +00:00
8629939a51 [torch/c10] Add C10_UBSAN_ENABLED macro and use it to disable SymInt_… (#127967)
Adds `C10_UBSAN_ENABLED` macro and use it to disable `SymIntTest::Overflows` (fails under `signed-integer-overflow` UBSAN check).

Also cleans up UBSAN guard in `jit/test_misc.cpp` to use `C10_UBSAN_ENABLED`  and the existing `C10_ASAN_ENABLED` instead of locally defining `HAS_ASANUBSAN`.

> NOTE: This should fix `SymIntTest::Overflows` failing under ubsan in fbcode too...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127967
Approved by: https://github.com/atalman, https://github.com/d4l3k, https://github.com/malfet
2024-06-14 16:01:12 +00:00
ee140a198f Revert "[Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591)"
This reverts commit 03e8a4cf45ee45611de77b55b515a8936f60ce31.

Reverted https://github.com/pytorch/pytorch/pull/128591 on behalf of https://github.com/atalman due to Contains release only changes should not be landed ([comment](https://github.com/pytorch/pytorch/pull/128591#issuecomment-2168308233))
2024-06-14 15:51:00 +00:00
c187593418 Prevent expansion of cat indexing to avoid int64 intermediate (#127815)
Fix for https://github.com/pytorch/pytorch/issues/127652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815
Approved by: https://github.com/shunting314, https://github.com/peterbell10
2024-06-14 15:42:08 +00:00
c339efaf02 [ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966)
The needle has moved quite a bit on the ROCm backend front. This PR is intended to examine the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560

This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069

It unskips the next batch of tests referenced by the aforementioned issue. No explicit source changes were needed, as the tests passed immediately after unskipping.

The tests previously marked with xfail have been modified to no longer expect a failure when running on ROCm, as they now pass. Behavior is unchanged for them on other architectures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966
Approved by: https://github.com/pruthvistony, https://github.com/zou3519
2024-06-14 15:24:28 +00:00
c76a9d13cb Revert D56709309 (#128481)
Summary: potential fw compatibility issue raised from D58397323

Test Plan: Sandcastle

Reviewed By: houseroad

Differential Revision: D58443190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128481
Approved by: https://github.com/desertfire
2024-06-14 14:57:17 +00:00
9972e5f447 Rename impl_abstract to register_fake, part 2/2 (#123938)
This PR renames the implementation details of register_fake to align
more with the new name. It is in its own PR because this is risky
(torch.package sometimes depends on private library functions and
implementation details).
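
For context, a minimal sketch of what the public entry point looks like under the new name (using `torch.library.custom_op` for the example op; this is illustrative, not code from the PR):

```python
import torch

@torch.library.custom_op("mylib::double", mutates_args=())
def double(x: torch.Tensor) -> torch.Tensor:
    return x * 2

@torch.library.register_fake("mylib::double")
def _(x):
    # Under FakeTensor/meta, only shape, dtype, and device matter.
    return torch.empty_like(x)
```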

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123938
Approved by: https://github.com/williamwen42
2024-06-14 14:37:24 +00:00
a2d9c430b4 Adding a note for Getting Started with PyTorch on Intel GPUs (#127872)
Adding a note for Getting Started with PyTorch on Intel GPUs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127872
Approved by: https://github.com/svekars
2024-06-14 14:24:28 +00:00
dfc4b608e1 Remove leftover warning causing log spew (#128688)
This warning was left in by mistake; it is uninformative (the user is doing nothing wrong) and causes log spew in training runs. See https://github.com/pytorch/pytorch/pull/120750#discussion_r1638430500
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128688
Approved by: https://github.com/drisspg
2024-06-14 14:08:11 +00:00
e1dfc61250 Document CI/CD security philosophy (#128316)
Namely:
- When use of non-ephemeral runners is OK, vs. when it is not
- Why binary build pipelines should not use distributed caching
- Why temporary CI artifacts should not be considered safe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128316
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-06-14 13:47:25 +00:00
cyy
bfd5ea93e0 Enable clang-tidy on c10/util/Float8*.h (#120573)
This PR clears warnings and enables clang-tidy on c10/util/Float8*.h.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120573
Approved by: https://github.com/drisspg
2024-06-14 13:47:07 +00:00
1f46284f9e Extended Module Tracker (#128508)
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.

1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``.
2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because a module's pre-fw and post-fw hooks get called in the backward pass during AC, and we do not want to register multi-grad hooks in that case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
2024-06-14 12:01:53 +00:00
e397ad6883 Improve codegen for ops.masked in triton (#128054)
Fixes https://github.com/pytorch/pytorch/issues/127930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128054
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-06-14 11:52:56 +00:00
7e734e2d08 [inductor] Fix nested indirect indexing case for index_propagation (#128378)
Tries to fix #127677.

# Context

Just as @peterbell10 pointed out, we have the following scenario:
```
a = ops.indirect_indexing(...)
b = ops.index_expr(a, ...)
c = ops.indirect_indexing(b, ...)
```

We can repro this as:
```
def forward(self, arg0_1, arg1_1, arg2_1):
    iota = torch.ops.prims.iota.default(arg0_1, start = 0, step = 1, index=0)
    repeat_interleave = torch.ops.aten.repeat_interleave.Tensor(arg1_1);
    index = torch.ops.aten.index.Tensor(iota, [repeat_interleave]);
    index_1 = torch.ops.aten.index.Tensor(arg2_1, [index]);
    return (index_1,)
```

which should generate a JIT py file like this:
```
def triton_poi_fused_index_select_0(in_ptr0, in_ptr1, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
    ...
    tmp0 = tl.load(in_ptr0 + (x1), xmask, eviction_policy='evict_last')
    tmp1 = ks0
    tmp2 = tmp0 + tmp1
    tmp3 = tmp0 < 0
    tmp4 = tl.where(tmp3, tmp2, tmp0)
    # check_bounds()
    tl.device_assert(((0 <= tmp4) & (tmp4 < ks0)) | ~(xmask), "index out of bounds: 0 <= tmp4 < ks0")

def call():
  arg0_1, arg1_1, arg2_1 = args
  buf1 = aten.repeat_interleave.Tensor(arg1_1)
  buf4 = empty_strided_cuda((u0, 64), (64, 1))
  triton_poi_fused_index_select_0.run(
    buf1, arg2_1, buf4, s0,
    triton_poi_fused_index_select_0_xnumel,
    grid=grid(triton_poi_fused_index_select_0_xnumel),
    stream=stream0)
```

# Issue
In our `IndexPropagation.indirect_indexing()` call we have `expr=indirect0` which is spawned in `LoopBodyBlock.indirect_indexing()`.
3b555ba477/torch/_inductor/ir.py (L8154-L8160)

When we try to see if we can prove its bounds, we fail because `indirect0` isn't in `var_ranges`.

# Approach
When creating `indirect` symbols from the fallback, specify their range as `[-size, size - 1]` to avoid a lookup error for `indirectX`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128378
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-14 10:07:06 +00:00
99988be423 [halide-backend] Add test shard (#127308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127308
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #128266
2024-06-14 10:02:57 +00:00
03e8a4cf45 [Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591)
Port #127592 from main to release/2.4

------
Fixes #127402

- Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py
- Add checks of mutation for QLinearPointwiseBinaryPT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592
Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128591
Approved by: https://github.com/jgong5, https://github.com/Chillee
2024-06-14 09:31:38 +00:00
43ae3073f9 Revert "[traced-graph][sparse] remove redundant assert in sparse prop test (#128523)"
This reverts commit ba3726d02b25dff92762c59d4dffe96a7babfa75.

Reverted https://github.com/pytorch/pytorch/pull/128523 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Looks like your changes broke the inductor tests: linux-jammy-cpu-py3.8-gcc11-inductor. [Here you can find more details](ba3726d02b). ([comment](https://github.com/pytorch/pytorch/pull/128523#issuecomment-2167518145))
2024-06-14 08:27:05 +00:00
0344f95c2e Add missing #include <array> to thread_name.cpp (#128664)
I got local compile errors (using clang 14.0.6) due to this missing include after pulling the
latest pytorch main.  It's totally puzzling why CI appears to pass
without this fix.  Hopefully someone else will have an idea if we are
missing some CI coverage or if I am using a strange build setup locally.

The PR introducing the compile errors was https://github.com/pytorch/pytorch/pull/128448.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128664
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/d4l3k
2024-06-14 07:49:09 +00:00
03725a0512 [dtensor][example] added MLPStacked example for printing sharding (#128461)
**Summary**
Currently, comm_mode_feature_examples does not have an example of printing sharding information for a model with nested modules. While adding the new example to the suite, I recognized a way to refactor the existing examples to make them more readable for users. The expected output can be found below:
<img width="354" alt="Screenshot 2024-06-11 at 5 41 14 PM" src="https://github.com/pytorch/pytorch/assets/50644008/68cef7c7-cb1b-4e51-8b60-85123d96ca92">

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128461
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369, #128451
2024-06-14 07:30:31 +00:00
dd3b79a08f [dtensor][be] improving readability of comm_mode.py and comm_mode_features_example.py (#128451)
**Summary**
I have added comments to address previous readability concerns in comm_mode.py and comm_mode_features_example.py. I also renamed files and test cases to better reflect what they are about, and removed a non-distributed test case and other lines of code that do not contribute to the example of how comm_mode can be used. Finally, I've added the expected output for each example function so users are not forced to run the code.

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128451
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369
2024-06-14 07:30:31 +00:00
e886122e98 [dtensor][debug] add module level tracing and readable display (#128369)
**Summary**
Currently, CommDebugMode only allows displaying collective tracing at the model level, whereas a user may need a more detailed breakdown. To make this possible, I changed the ModuleParamaterShardingTracker by adding a string variable that tracks the current sub-module and a dictionary that tracks the depths of the submodules in the model tree. The CommDebugMode class was changed by adding a new dictionary that tracks per-module collective counts, as well as a function that displays the counts in a way that is easy for the user to read. Two examples using MLPModule and Transformer have been added to showcase the new changes. The expected output of the simpler MLPModule example is:

<img width="255" alt="Screenshot 2024-06-10 at 4 58 50 PM" src="https://github.com/pytorch/pytorch/assets/50644008/cf2161ef-2663-49c1-a8d5-9f97e96a1791">

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128369
Approved by: https://github.com/XilunWu
2024-06-14 07:30:31 +00:00
4669c6d3ae [quant][pt2e][quantizer] Support set_module_name_qconfig in X86InductorQuantizer (#126044)
Summary:
Added `set_module_name_qconfig` support to allow users to set configurations based on module name in `X86InductorQuantizer`.

For example, only quantize the `sub`:

```python
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

# `Sub` is a minimal placeholder submodule assumed by the original snippet.
class Sub(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)

    def forward(self, x):
        return self.linear(x)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
        self.sub = Sub()

    def forward(self, x):
        x = self.linear(x)
        x = self.sub(x)
        return x

m = M().eval()
example_inputs = (torch.randn(3, 5),)
# Set the config for a specific submodule only.
quantizer = X86InductorQuantizer()
quantizer.set_module_name_qconfig("sub", xiq.get_default_x86_inductor_quantization_config())
```
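
A hedged sketch of how such a quantizer is then typically consumed in the PT2E flow (continuing the example above; the capture API shown is an assumption and may vary across versions):

```python
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

exported = capture_pre_autograd_graph(m, example_inputs)
prepared = prepare_pt2e(exported, quantizer)   # observers are inserted only inside `sub`
prepared(*example_inputs)                      # calibration
quantized = convert_pt2e(prepared)
```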

- Added `set_module_name_qconfig` to allow users to set the configuration at the `module_name` level.
- Unified the annotation process to follow this order:  `module_name_qconfig`, `operator_type_qconfig`, and `global_config`.
- Added `config_checker` to validate all user configurations and prevent mixing of static/dynamic or QAT/non-QAT configs.
- Moved `_get_module_name_filter` from `xnnpack_quantizer.py` into `utils.py`, as it is common to all quantizers.

Test Plan

```bash
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_set_module_name
```

@Xia-Weiwen @leslie-fang-intel  @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126044
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2024-06-14 07:13:10 +00:00
674be9d3be Update cu124 dynamo benchmark expected values (#128589)
I believe this corresponds to changes in https://github.com/pytorch/pytorch/pull/127780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128589
Approved by: https://github.com/nWEIdia, https://github.com/DanilBaibak
2024-06-14 07:04:34 +00:00
18f35d9e12 Revert "Run all samples for torchinductor tests (#128343)"
This reverts commit 41df20c07caecddb6d21d69a125f2998ae9313e8.

Reverted https://github.com/pytorch/pytorch/pull/128343 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_avg_pool3d_cuda_float16 and other tests 41df20c07c https://github.com/pytorch/pytorch/actions/runs/9509191526/job/26213490266. I think this might be a landrace ([comment](https://github.com/pytorch/pytorch/pull/128343#issuecomment-2167275337))
2024-06-14 06:08:17 +00:00
f48f7615dc [easy][subclasses] dynamo.reset() in test_subclass_views (#128659)
When we don't call dynamo.reset(), we don't recompile on different dynamic shapes.
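
A hedged mini-example of that point (illustrative only, not code from the test):

```python
import torch

def f(x):
    return x.view(-1) * 2

compiled = torch.compile(f)
compiled(torch.randn(4, 4))     # first call populates Dynamo's compilation cache

torch._dynamo.reset()           # clear the caches so the next call compiles fresh
compiled(torch.randn(8, 8))     # without the reset, cached entries could be reused
```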

Also, some of the returned views were tuples, so when we `* 2`, we actually just duplicate all the inputs inside the tuple. I changed it to return just one of the values from the returned tuple.

Additionally, this exposes a bug that surfaces with the slice operation, so I skipped it when testing with dynamic shapes:

```
  File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3996, in produce_guards
    sexpr = ShapeGuardPrinter(symbol_to_source, source_ref, self.var_to_sources).doprint(expr)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 292, in doprint
    return self._str(self._print(expr))
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
    return printmethod(expr, **kwargs)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 56, in _print_Add
    t = self._print(term)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
    return printmethod(expr, **kwargs)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in _print_Mul
    a_str = [self.parenthesize(x, prec, strict=False) for x in a]
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in <listcomp>
    a_str = [self.parenthesize(x, prec, strict=False) for x in a]
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 37, in parenthesize
    return self._print(item)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
    return printmethod(expr, **kwargs)
  File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1494, in _print_Symbol
    assert self.symbol_to_source.get(expr), (
AssertionError: s3 (could be from ['<ephemeral: symint_visitor_fn>', '<ephemeral: symint_visitor_fn>']) not in {s0: ["L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]"], s1: ["L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]"], s2: ["L['x'].a.storage_offset()", "L['x'].b.storage_offset()", "L['x'].a.storage_offset()", "L['x'].b.storage_offset()"]}.  If this assert is failing, it could be due to the issue described in https://github.com/pytorch/pytorch/pull/90665
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128659
Approved by: https://github.com/YuqingJ
2024-06-14 05:18:07 +00:00
9ac08dab1f Updates diskspace-cleanup for ROCm CI (#127947)
Gets the location of the docker directory and outputs how much disk space is being used by docker.

This is required since the new Cirrascale CI nodes for ROCm have the docker root directory in a different partition.

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127947
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-06-14 04:32:38 +00:00
eff01bce21 Only run inductor A100 perf benchmark smoke test periodically (#128677)
Attempt to mitigate the long queue on A100 as reported in https://github.com/pytorch/pytorch/issues/128627.

From what I see, this change 03467b3fed/1 doubles the job duration from 20+ to 40+ minutes. This, together with https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-cu124.yml and perhaps an increased number of PRs with `ciflow/inductor`, all contribute to the long queue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128677
Approved by: https://github.com/atalman, https://github.com/desertfire
2024-06-14 02:39:33 +00:00
ba3726d02b [traced-graph][sparse] remove redundant assert in sparse prop test (#128523)
The assertEqualMeta() method already tests that the first argument is a FakeTensor

https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523
Approved by: https://github.com/soulitzer
2024-06-14 02:34:51 +00:00
685fcfb40d Fix docstring in autograd (#128657)
Fix docstrings in autograd files.

The fix can be verified by running pydocstyle path-to-file --count

Related #112593

**BEFORE the PR:**

pydocstyle torch/autograd/anomaly_mode.py --count
8
pydocstyle torch/autograd/__init__.py --count
9

**AFTER the PR:**

pydocstyle torch/autograd/anomaly_mode.py --count
0
pydocstyle torch/autograd/__init__.py --count
0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128657
Approved by: https://github.com/soulitzer
2024-06-14 02:18:59 +00:00
0186b386cd Revert "[ONNX] Add upsample trilinear to skip decomp (#128259)"
This reverts commit b72989a2b5ac4637612e31e325d7c8233fcbd7a1.

Reverted https://github.com/pytorch/pytorch/pull/128259 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its ONNX job is failing in trunk b72989a2b5 ([comment](https://github.com/pytorch/pytorch/pull/128259#issuecomment-2167058937))
2024-06-14 01:44:26 +00:00
f48ca2561d Document torch.cuda.profiler.start (#128098)
Documents the `start` function of `cuda/profiler.py`, as requested in https://github.com/pytorch/pytorch/issues/127917.

Fixes #127917
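
A hedged usage sketch of the documented function (the calls simply delimit the region an external CUDA profiler, e.g. Nsight launched with capture initially off, should record; a CUDA-enabled build and device are assumed):

```python
import torch

torch.cuda.profiler.start()                   # begin the profiled region
x = torch.randn(1024, 1024, device="cuda")
y = x @ x                                     # the work we want captured
torch.cuda.profiler.stop()                    # end the profiled region
```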

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128098
Approved by: https://github.com/aaronenyeshi
2024-06-14 01:44:18 +00:00
41df20c07c Run all samples for torchinductor tests (#128343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128343
Approved by: https://github.com/lezcano
2024-06-14 01:28:32 +00:00
6895a5804c Revert "[checkpoint] Clean up selective activation checkpoint and make public (#125795)"
This reverts commit c472cec5656b9ffb668af97a02d711bdbdf5ebec.

Reverted https://github.com/pytorch/pytorch/pull/125795 on behalf of https://github.com/soulitzer due to breaking torchtitan CI ([comment](https://github.com/pytorch/pytorch/pull/125795#issuecomment-2167036157))
2024-06-14 01:14:59 +00:00
6564d63e69 Use mv kernel for small M (#128632)
Previously we were using:
* mv kernel for M == 1
* mm kernel for 1 < M < 4
* llama.cpp inspired mm kernel for M >= 4

This PR consolidates these into only 2 kernels, using the same mv kernel for
M < 12.

Benchmarked on https://github.com/malfet/llm_experiments/blob/main/metal-perf/int8mm.mm

Mac M1 Max, input size M x 4128 x 4096

![llama cpp shader and ATen shader (2)](https://github.com/pytorch/pytorch/assets/8188269/9e2e3024-c5ea-4303-88bf-ff3646296396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128632
Approved by: https://github.com/malfet
2024-06-14 01:06:53 +00:00
ae2359638b Save DOT file of graph instead of SVG for GraphTranformObserver (#128634)
Summary:
GraphTransformObserver saves an SVG file of the input/output graph in each inductor pass. In my test with the CMF model, when the graph is large, Graphviz takes forever to convert DOT to SVG. That is NOT acceptable.

This diff saves the DOT file instead of the SVG file to speed things up. The DOT file is also an order of magnitude smaller than the SVG.

To view these graphs, users can run `dot -Txxx input.dot` to convert DOT to any other format they want, and can control how many iterations are used to lay out the graph properly. Refer to https://web.archive.org/web/20170507095019/http://graphviz.org/content/attrs#dnslimit for details.

Test Plan: buck2 test mode/dev-sand caffe2/test:fx --  fx.test_fx_xform_observer.TestGraphTransformObserver

Differential Revision: D58539182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128634
Approved by: https://github.com/mengluy0125
2024-06-14 00:54:22 +00:00
6f181756dc Use by-column algorithm for fp16/bf16 CPUBlas gemm_transb kernels (#127318)
Summary: #96074 (D44340826) changed the algorithm for 16-bit types in gemm_notrans_ and gemm_transb_ for the sake of precision. In this diff, we go back to the old algorithm for gemm_transb_, maintaining precision by allocating a temporary fp32 buffer with the same number of elements as `c` (so double the byte size, since we accumulate the 16-bit types into fp32) to accumulate into.
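
As a rough PyTorch-level sketch of the idea (not the actual ATen kernel; the shapes are arbitrary and chosen for the trans_b layout):

```python
import torch

a = torch.randn(64, 128, dtype=torch.bfloat16)
b = torch.randn(48, 128, dtype=torch.bfloat16)   # used transposed (trans_b)
acc = torch.zeros(64, 48, dtype=torch.float32)   # temporary buffer, same shape as c

for k in range(a.shape[1]):                      # accumulate column by column in fp32
    acc += torch.outer(a[:, k].float(), b[:, k].float())

c = acc.to(torch.bfloat16)                       # round to bf16 once, at the end
ref = (a.double() @ b.double().t()).float()
torch.testing.assert_close(acc, ref, rtol=1e-2, atol=1e-2)
```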

Test Plan: Used https://github.com/malfet/llm_experiments (benchmarks/benchmark_torch_mm.py) to benchmark before and after:

before:
```
mv_nt    torch.float32    5.47 usec
mv_nt    torch.float16    8.45 usec
mv_nt   torch.bfloat16  183.43 usec
mv_ta    torch.float32    5.70 usec
mv_ta    torch.float16   24.17 usec
mv_ta   torch.bfloat16   97.27 usec
notrans  torch.float32    5.58 usec
notrans  torch.float16   25.18 usec
notrans torch.bfloat16   63.11 usec
trans_a  torch.float32    5.59 usec
trans_a  torch.float16   68.94 usec
trans_a torch.bfloat16  311.60 usec
trans_b  torch.float32    5.63 usec
trans_b  torch.float16    8.76 usec
trans_b torch.bfloat16   29.17 usec
```

after:
```
mv_nt    torch.float32    5.53 usec
mv_nt    torch.float16    8.57 usec
mv_nt   torch.bfloat16  188.17 usec
mv_ta    torch.float32    5.78 usec
mv_ta    torch.float16   28.59 usec
mv_ta   torch.bfloat16   98.45 usec
notrans  torch.float32    5.71 usec
notrans  torch.float16   26.08 usec
notrans torch.bfloat16   64.06 usec
trans_a  torch.float32    5.72 usec
trans_a  torch.float16   32.21 usec
trans_a torch.bfloat16   32.10 usec
trans_b  torch.float32    5.83 usec
trans_b  torch.float16    9.05 usec
trans_b torch.bfloat16   29.66 usec
```

Also expanded coverage to a range of larger matrix-vector and matrix-matrix sizes.

before:
```
Matrix-vector:
m=1024, n=1024, k=1
====================
notrans  torch.float32   24.75 usec
notrans  torch.float16  258.04 usec
notrans torch.bfloat16  245.64 usec
trans_a  torch.float32   26.94 usec
trans_a  torch.float16  692.09 usec
trans_a torch.bfloat16 1709.53 usec
m=4100, n=4100, k=1
====================
notrans  torch.float32 2811.48 usec
notrans  torch.float16 4192.06 usec
notrans torch.bfloat16 4041.01 usec
trans_a  torch.float32 2778.38 usec
trans_a  torch.float16 17218.41 usec
trans_a torch.bfloat16 27561.21 usec
m=16384, n=16384, k=1
====================
notrans  torch.float32 60157.66 usec
notrans  torch.float16 64121.38 usec
notrans torch.bfloat16 65714.65 usec
trans_a  torch.float32 84975.39 usec
trans_a  torch.float16 1024223.33 usec
trans_a torch.bfloat16 1078683.21 usec

Matrix-matrix:
m=1024, n=1024, k=256
====================
notrans  torch.float32  302.55 usec
notrans  torch.float16 172869.06 usec
notrans torch.bfloat16 172837.81 usec
trans_a  torch.float32  250.03 usec
trans_a  torch.float16 333373.38 usec
trans_a torch.bfloat16 432760.00 usec
m=4100, n=4100, k=128
====================
notrans  torch.float32 5278.56 usec
notrans  torch.float16 1426335.29 usec
notrans torch.bfloat16 1404249.37 usec
trans_a  torch.float32 4818.63 usec
trans_a  torch.float16 2969936.17 usec
trans_a torch.bfloat16 3432565.96 usec
m=16384, n=16384, k=16
====================
notrans  torch.float32 72225.71 usec
notrans  torch.float16 1439875.54 usec
notrans torch.bfloat16 1443716.33 usec
trans_a  torch.float32 221130.21 usec
trans_a  torch.float16 16910654.17 usec
trans_a torch.bfloat16 21447377.63 usec
```

after:
```
Matrix-vector:
m=1024, n=1024, k=1
====================
notrans  torch.float32   25.11 usec
notrans  torch.float16  252.76 usec
notrans torch.bfloat16  238.58 usec
trans_a  torch.float32   26.62 usec
trans_a  torch.float16  167.40 usec
trans_a torch.bfloat16  174.08 usec
m=4100, n=4100, k=1
====================
notrans  torch.float32 2774.28 usec
notrans  torch.float16 3991.70 usec
notrans torch.bfloat16 3945.44 usec
trans_a  torch.float32 3011.25 usec
trans_a  torch.float16 2666.85 usec
trans_a torch.bfloat16 2686.95 usec
m=16384, n=16384, k=1
====================
notrans  torch.float32 58682.15 usec
notrans  torch.float16 63077.52 usec
notrans torch.bfloat16 63319.33 usec
trans_a  torch.float32 70549.57 usec
trans_a  torch.float16 42145.45 usec
trans_a torch.bfloat16 42270.13 usec

Matrix-matrix:
m=1024, n=1024, k=256
====================
notrans  torch.float32  289.37 usec
notrans  torch.float16 179704.87 usec
notrans torch.bfloat16 173490.33 usec
trans_a  torch.float32  330.89 usec
trans_a  torch.float16 42466.26 usec
trans_a torch.bfloat16 42811.19 usec
m=4100, n=4100, k=128
====================
notrans  torch.float32 4793.33 usec
notrans  torch.float16 1407557.04 usec
notrans torch.bfloat16 1388212.17 usec
trans_a  torch.float32 4714.20 usec
trans_a  torch.float16 359406.58 usec
trans_a torch.bfloat16 350419.42 usec
m=16384, n=16384, k=16
====================
notrans  torch.float32 65757.08 usec
notrans  torch.float16 1427715.71 usec
notrans torch.bfloat16 1440883.00 usec
trans_a  torch.float32 202263.44 usec
trans_a  torch.float16 1387522.33 usec
trans_a torch.bfloat16 1762253.92 usec
```

We are improving, but still have a lot of room for improvement compared to float32 BLAS. Full disclosure: applying this same method to gemm_notrans (which does correspond to notrans in the benchmark's nomenclature) does not improve performance across the board; the 16384 x 16384 x 16 matmul regresses and I haven't figured out why yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127318
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-06-14 00:39:18 +00:00
18f5357f4f Introduce heuristic for mixed_mm on A100 (#128232)
This PR introduces a heuristic for tuned_mixed_mm. The heuristic is only enabled on an A100, because it has only been tested on an A100, and it is only enabled if force_mixed_mm="heuristic".

I compared the heuristic to the aten fallback implementation and triton+autotune:
 Geometric mean speedup: 2.51
 ```
 m     n     k  triton + autotune (GB/s)  aten (GB/s)  heuristic (GB/s)  used_heuristic  speedup (heuristic/aten)
  1  4096  4096                    456.95       134.59            459.37            True                      3.41
  1  4096  8192                    523.93       138.29            553.50            True                      4.00
  1  4096 16394                    233.70       161.62            234.14            True                      1.45
  1  8192  4096                    633.25       140.64            574.86            True                      4.09
  1  8192  8192                    737.54       147.41            690.26            True                      4.68
  1  8192 16394                    413.67       175.88            408.68            True                      2.32
  1 16394  4096                    717.22       167.22            665.36            True                      3.98
  1 16394  8192                    812.69       177.17            815.90            True                      4.61
  1 16394 16394                    473.17       178.58            435.11            True                      2.44
  4  4096  4096                    479.46       134.80            486.74            True                      3.61
  4  4096  6333                    174.27       106.74            171.64            True                      1.61
  4  4096  8192                    567.14       138.32            571.09            True                      4.13
  4  4096 12313                    179.65       105.91            180.03            True                      1.70
  4  4096 16394                    222.96       145.54            222.81            True                      1.53
  4  6333  4096                    491.78       126.37            473.20            True                      3.74
  4  6333  6333                    268.79       143.40            269.75            True                      1.88
  4  6333  8192                    783.80       135.12            796.23            True                      5.89
  4  6333 12313                    286.35       142.37            287.30            True                      2.02
  4  6333 16394                    362.47       139.66            361.47            True                      2.59
  4  8192  4096                    642.73       140.53            641.88            True                      4.57
  4  8192  6333                    287.65       137.63            287.38            True                      2.09
  4  8192  8192                    738.42       150.16            721.59            True                      4.81
  4  8192 12313                    301.27       146.18            302.31            True                      2.07
  4  8192 16394                    415.37       167.66            393.41            True                      2.35
  4 12313  4096                    823.66       141.81            745.40            True                      5.26
  4 12313  6333                    433.92       148.17            429.83            True                      2.90
  4 12313  8192                    984.60       149.30            988.95            True                      6.62
  4 12313 12313                    452.00       150.87            452.50            True                      3.00
  4 12313 16394                    609.88       159.20            609.71            True                      3.83
  4 16394  4096                    779.44       157.46            777.10            True                      4.94
  4 16394  6333                    402.93       139.50            309.47            True                      2.22
  4 16394  8192                    950.38       175.49            949.67            True                      5.41
  4 16394 12313                    414.62       153.99            315.95            True                      2.05
  4 16394 16394                    497.56       174.97            461.77            True                      2.64
16  4096  4096                    475.92       134.45            478.57            True                      3.56
16  4096  6333                    146.36       112.50            145.35            True                      1.29
16  4096  8192                    560.00       138.22            557.19            True                      4.03
16  4096 12313                    152.02       105.06            151.27            True                      1.44
16  4096 16394                    222.48       156.72            222.88            True                      1.42
16  6333  4096                    692.41       122.14            696.88            True                      5.71
16  6333  6333                    220.74       140.90            225.41            True                      1.60
16  6333  8192                    813.56       140.21            820.28            True                      5.85
16  6333 12313                    232.48       131.19            232.55            True                      1.77
16  6333 16394                    367.39       134.93            361.87            True                      2.68
16  8192  4096                    665.54       140.29            266.24            True                      1.90
16  8192  6333                    254.77       136.65            240.12            True                      1.76
16  8192  8192                    750.63       146.26            736.93            True                      5.04
16  8192 12313                    266.61       127.13            251.81            True                      1.98
16  8192 16394                    397.25       160.42            390.76            True                      2.44
16 12313  4096                    857.48       141.36            851.36            True                      6.02
16 12313  6333                    423.21       132.40            357.55            True                      2.70
16 12313  8192                   1021.24       145.68           1024.60            True                      7.03
16 12313 12313                    370.12       143.94            383.52            True                      2.66
16 12313 16394                    608.52       141.03            608.48            True                      4.31
16 16394  4096                    826.48       155.94            826.74            True                      5.30
16 16394  6333                    420.38       144.09            265.23            True                      1.84
16 16394  8192                    988.07       156.21            984.63            True                      6.30
16 16394 12313                    431.40       146.92            265.49            True                      1.81
16 16394 16394                    497.39       167.86            461.79            True                      2.75
23  4096  4096                    344.43       132.84            338.64            True                      2.55
23  4096  6333                    195.34       118.48            195.31            True                      1.65
23  4096  8192                    389.83       140.02            376.62            True                      2.69
23  4096 12313                    204.49       137.96            204.80            True                      1.48
23  4096 16394                    242.48       148.99            242.74            True                      1.63
23  6333  4096                    429.25       126.52            517.75            True                      4.09
23  6333  6333                    295.56       133.51            296.14            True                      2.22
23  6333  8192                    594.88       137.05            581.78            True                      4.25
23  6333 12313                    315.18       131.67            314.64            True                      2.39
23  6333 16394                    386.46       141.45            386.54            True                      2.73
23  8192  4096                    553.52       142.05            568.35            True                      4.00
23  8192  6333                    215.58       139.01            210.86            True                      1.52
23  8192  8192                    609.21       154.85            528.76            True                      3.41
23  8192 12313                    220.38       142.93            233.54            True                      1.63
23  8192 16394                    402.63       158.39            403.21            True                      2.55
23 12313  4096                    723.54       131.58            581.94            True                      4.42
23 12313  6333                    307.90       131.58            307.90            True                      2.34
23 12313  8192                    893.36       129.97            623.72            True                      4.80
23 12313 12313                    322.40       134.84            317.80            True                      2.36
23 12313 16394                    512.97       142.31            409.45            True                      2.88
23 16394  4096                    703.66       154.54            643.53            True                      4.16
23 16394  6333                    305.55       127.55            293.17            True                      2.30
23 16394  8192                    768.12       154.60            681.53            True                      4.41
23 16394 12313                    311.61       140.92            307.01            True                      2.18
23 16394 16394                    467.24       171.07            467.29            True                      2.73
32  4096  4096                    344.71       132.30            338.62            True                      2.56
32  4096  6333                    206.48       107.59            205.55            True                      1.91
32  4096  8192                    387.24       137.82            353.12            True                      2.56
32  4096 12313                    216.35       120.61            214.50            True                      1.78
32  4096 16394                    242.05       149.92            241.94            True                      1.61
32  6333  4096                    525.50       127.12            518.02            True                      4.08
32  6333  6333                    300.50       118.41            296.55            True                      2.50
32  6333  8192                    600.92       136.99            601.94            True                      4.39
32  6333 12313                    316.13       136.45            316.03            True                      2.32
32  6333 16394                    386.11       141.34            386.10            True                      2.73
32  8192  4096                    546.18       140.18            341.14            True                      2.43
32  8192  6333                    218.40       130.65            263.42            True                      2.02
32  8192  8192                    608.29       147.16            542.12            True                      3.68
32  8192 12313                    225.60       135.04            225.23            True                      1.67
32  8192 16394                    434.75       160.42            401.28            True                      2.50
32 12313  4096                    787.80       136.28            583.60            True                      4.28
32 12313  6333                    316.66       125.76            323.35            True                      2.57
32 12313  8192                    891.38       128.88            639.50            True                      4.96
32 12313 12313                    326.11       132.37            325.88            True                      2.46
32 12313 16394                    521.64       139.47            395.69            True                      2.84
32 16394  4096                    625.55       158.46            651.16            True                      4.11
32 16394  6333                    304.14       131.13            284.55            True                      2.17
32 16394  8192                    767.79       162.95            704.34            True                      4.32
32 16394 12313                    310.74       137.68            303.39            True                      2.20
32 16394 16394                    465.92       171.43            465.37            True                      2.71
43  4096  4096                    345.05       133.87            196.47            True                      1.47
43  4096  6333                    148.64        99.92            148.97            True                      1.49
43  4096  8192                    386.50       135.39            214.00            True                      1.58
43  4096 12313                    190.39       109.36            156.27            True                      1.43
43  4096 16394                    203.63       150.24            204.05            True                      1.36
43  6333  4096                    421.35       106.04            132.25            True                      1.25
43  6333  6333                    224.75       113.01            224.97            True                      1.99
43  6333  8192                    471.11       117.61            327.39            True                      2.78
43  6333 12313                    234.55       115.61            234.74            True                      2.03
43  6333 16394                    311.56       132.24            312.01            True                      2.36
43  8192  4096                    400.73       140.12            269.11            True                      1.92
43  8192  6333                    167.32       119.13            168.84            True                      1.42
43  8192  8192                    435.45       146.98            286.21            True                      1.95
43  8192 12313                    161.05       127.82            162.78            True                      1.27
43  8192 16394                    207.16       156.40            208.90            True                      1.34
43 12313  4096                    484.01       120.10            313.35            True                      2.61
43 12313  6333                    234.54       106.63            232.85            True                      2.18
43 12313  8192                    515.34       130.23            411.70            True                      3.16
43 12313 12313                    239.39       130.04            239.03            True                      1.84
43 12313 16394                    316.02       137.39            316.29            True                      2.30
43 16394  4096                    475.60       152.57            340.97            True                      2.23
43 16394  6333                    241.21       132.49            208.59            True                      1.57
43 16394  8192                    499.34       157.43            361.61            True                      2.30
43 16394 12313                    246.25       132.31            211.68            True                      1.60
43 16394 16394                    302.90       158.56            277.05            True                      1.75
64  4096  4096                    280.48       126.82            195.97            True                      1.55
64  4096  6333                    150.94       101.63            150.48            True                      1.48
64  4096  8192                    305.47       135.06            211.03            True                      1.56
64  4096 12313                    158.12       110.06            158.15            True                      1.44
64  4096 16394                    206.68       136.21            201.28            True                      1.48
64  6333  4096                    409.11       105.10            296.07            True                      2.82
64  6333  6333                    229.98       108.46            230.59            True                      2.13
64  6333  8192                    469.32       112.24            330.58            True                      2.95
64  6333 12313                    245.02       117.16            244.84            True                      2.09
64  6333 16394                    317.78       125.80            318.37            True                      2.53
64  8192  4096                    323.42       139.92            267.31            True                      1.91
64  8192  6333                    167.51       118.45            167.56            True                      1.41
64  8192  8192                    341.13       146.71            284.88            True                      1.94
64  8192 12313                    172.21       123.42            171.97            True                      1.39
64  8192 16394                    217.22       153.18            216.99            True                      1.42
64 12313  4096                    482.19       123.32            311.82            True                      2.53
64 12313  6333                    238.73       123.88            238.66            True                      1.93
64 12313  8192                    516.32       122.11            330.50            True                      2.71
64 12313 12313                    248.73       125.32            296.82            True                      2.37
64 12313 16394                    314.98       134.06            320.31            True                      2.39
64 16394  4096                    476.59       154.58            340.84            True                      2.20
64 16394  6333                    240.54       119.60            214.82            True                      1.80
64 16394  8192                    501.36       149.02            359.45            True                      2.41
64 16394 12313                    244.65       126.01            222.47            True                      1.77
64 16394 16394                    302.48       160.36            283.66            True                      1.77
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128232
Approved by: https://github.com/Chillee
2024-06-14 00:31:22 +00:00
cyy
9ebec1f345 Enable Wunused-function in torch_cpu (#128576)
Follows #128499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128576
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-14 00:12:58 +00:00
6767e38267 Fix manual licensing (#128630)
It has come to my attention that some of our licenses are incorrect, so I attempted to rectify a few of them based on given recommendations for:
clog - BSD-3
eigen - MPL-2.0
ffnvcodec - LGPL-2.1
-> **hungarian - Permissive (free to use)**
irrlicht - The Irrlicht Engine License (zlib/libpng)
-> **pdcurses - Public Domain for core**
-> **sigslot - Public Domain**
test - BSD-3
Vulkan - Apache-2.0 or MIT
fb-only: more context is here https://fb.workplace.com/groups/osssupport/posts/26333256012962998/?comment_id=26333622989592967

This PR addresses the manual licensing mismatches mentioned above (two of the bolded entries; the third is being addressed in #128085). Since everything else is generated by pulling from other files, I did not address those. It is unclear what needs to be updated for the remaining entries to be accurate, or whether they are inaccurate today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128630
Approved by: https://github.com/malfet
2024-06-14 00:12:09 +00:00
afdaa7fc95 [while_loop] expose it as torch.while_loop (#128562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128562
Approved by: https://github.com/zou3519
2024-06-13 23:44:10 +00:00
c486e2ab64 Add coloring to fx graph print out (#128476)
Note: this won't land immediately; at a minimum I'll need to add a color option to the field. But I'm curious whether any tests fail.

Old:
<img width="1294" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/c3a750ed-5e54-4621-b2e4-be5481be15b6">

New:
<img width="1303" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/3a1f1adc-6f3a-413e-8b87-ee53da9bf4ed">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128476
Approved by: https://github.com/ezyang
2024-06-13 23:39:04 +00:00
61421c42c0 [custom_op] don't invoke autograd.Function when unnecessary (#127976)
This matches our autograd logic for pytorch native operators. There's no
need to invoke an autograd.Function if we're under a torch.no_grad() or
if none of the inputs have requires_grad=True (invoking an
autograd.Function results in (noticeable) overhead).
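
A rough sketch of the kind of check this describes (illustrative only, not the actual custom_op dispatch code; `plain_forward` and `MyOpFunction` are hypothetical stand-ins):

```python
import torch

def call_op(plain_forward, MyOpFunction, *args):
    # Only go through autograd.Function when it can actually matter:
    # grad mode must be on and at least one tensor input must require grad.
    needs_autograd = torch.is_grad_enabled() and any(
        isinstance(a, torch.Tensor) and a.requires_grad for a in args
    )
    if needs_autograd:
        return MyOpFunction.apply(*args)  # records a node in the autograd graph
    return plain_forward(*args)           # skips autograd.Function overhead
```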

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127976
Approved by: https://github.com/williamwen42
2024-06-13 23:38:23 +00:00
b72989a2b5 [ONNX] Add upsample trilinear to skip decomp (#128259)
(1) Add upsample trilinear vec to skip decomposition
(2) Add tests to make sure that torch.export.export still decomposes them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128259
Approved by: https://github.com/justinchuby
2024-06-13 23:31:34 +00:00
8c20f53a5e Try seeding individual foreach tests (#128220)
A first easy attempt to deflake foreach

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128220
Approved by: https://github.com/ZainRizvi, https://github.com/crcrpar, https://github.com/huydhn
2024-06-13 22:42:16 +00:00
865d7b3424 [Reland][dynamo] Enable some inlining inbuilt nn module tests (#128440)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-06-13 22:39:22 +00:00
3a0006ef22 Remove global variable SIZE, and fix linter warning (#128559)
- Resolve a TODO by removing global variable `SIZE`.
- Fix a linter warning in `test/test_nestedtensor.py`.

`pytest pytorch/test/test_sort_and_select.py` and `pytest test/test_nestedtensor.py` pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128559
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-06-13 22:09:51 +00:00
6211e67e49 Document torch.jit.frontend.get_default_args (#128408)
Fixes #127896

### Description
Add docstring to `torch/jit/frontend.py:get_default_args` function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128408
Approved by: https://github.com/malfet
2024-06-13 21:49:16 +00:00
bf8a05f483 [FSDP2] Included module FQN in FSDPParamGroup record_functions (#128624)
This PR adds the module FQN into the `FSDPParamGroup` `record_function`s for improved clarity in profiler traces.
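
For context, an annotation carrying a module FQN might look roughly like the sketch below (illustrative; `module_fqn` and the label format are assumptions, not the exact strings FSDP2 emits):

```python
import torch

module_fqn = "model.layers.0.mlp"  # hypothetical FQN
with torch.profiler.record_function(f"FSDP::pre_forward ({module_fqn})"):
    pass  # work done here shows up under this labeled range in the trace
```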

Differential Revision: [D58544809](https://our.internmc.facebook.com/intern/diff/D58544809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128624
Approved by: https://github.com/ckluk2
2024-06-13 21:35:33 +00:00
c8e9656a12 Revert "Add test to xfail_list only for abi_compatible (#128506)"
This reverts commit 49366b2640df1cba5a3b40bedd31b57b08529612.

Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes an inductor test to fail in trunk 49366b2640 ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2166824714))
2024-06-13 21:30:07 +00:00
8763d44bf1 add xpu to torch.compile (#127279)
As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.compile doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127279
Approved by: https://github.com/dvrogozh, https://github.com/svekars
2024-06-13 21:15:09 +00:00
790138fdc7 Add profiler annotation for fused_all_gather_matmul and fused_matmul_reduce_scatter (#127556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127556
Approved by: https://github.com/awgu
ghstack dependencies: #127454, #127455
2024-06-13 20:52:46 +00:00
3b28dc6c9d Improve the scheduling for fused_matmul_reduce_scatter (#127455)
In fused_all_gather_matmul, each rank copies its shard into its
local p2p buffer, performs a barrier, then performs (copy -> matmul) for
each remote shard. The (copy -> matmul)s for remote shards run on two
streams without synchronization. This not only allows for
computation/communication overlapping, but also computation/computation
overlapping, which alleviates the wave quantization effect caused by
computation decomposition.

However, the synchronization-free approach doesn't work well with
fused_matmul_reduce_scatter, in which there's a barrier in every step.
Without synchronization between the two streams, a matmul in one stream
can delay a barrier in the other stream, further delaying the copy
waiting for the barrier.

This PR addresses the issue by adding synchronization between the two
streams such that the matmul of step i can only start after the barrier
of step i-1 completes. With this approach, we lose the
computation/computation overlapping, but avoid slowdown due to a delayed
barrier.
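
A minimal sketch of that scheduling change, assuming two CUDA streams and per-step events (`run_barrier_and_copy`, `run_matmul`, and `num_steps` are hypothetical stand-ins, not the actual implementation):

```python
import torch

def run_barrier_and_copy(step):  # hypothetical: barrier + copy for this step
    ...

def run_matmul(step):            # hypothetical: matmul for this step
    ...

num_steps = 8  # assumed number of decomposition steps; requires a CUDA device
comm_stream, compute_stream = torch.cuda.Stream(), torch.cuda.Stream()
barrier_done = [torch.cuda.Event() for _ in range(num_steps)]

for i in range(num_steps):
    with torch.cuda.stream(comm_stream):
        run_barrier_and_copy(i)
        barrier_done[i].record(comm_stream)   # mark barrier of step i as complete
    with torch.cuda.stream(compute_stream):
        if i > 0:
            # matmul of step i may only start after the barrier of step i-1 completes
            compute_stream.wait_event(barrier_done[i - 1])
        run_matmul(i)
```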

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127455
Approved by: https://github.com/Chillee
ghstack dependencies: #127454
2024-06-13 20:52:46 +00:00
c0b40ab42e doc string for torch.jit.frontend.get_jit_class_def method (#128391)
Fixes #127904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128391
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-06-13 19:51:02 +00:00
a3af32c2fb Add functionality to make ViewAndMutationData (slightly more) cache safe (#127618)
This PR changes the traced_tangents field of ViewAndMutationMeta to be cache safe. Specifically, at runtime, the only time we need the fw_metadata's traced_tangents field is for Tensor subclass metadata from __tensor_flatten__. So instead of storing an entire FakeTensor, which has many fields that can be unserializable, we only store the result of __tensor_flatten__() on any FakeTensors representing subclasses.

That said, there's no guarantee that `__tensor_flatten__` is actually serializable: if we fail to pickle the result of __tensor_flatten__ we won't save to the cache.

To do this, we also make a small change to `__coerce_same_metadata_as_tangent__`, so that it takes in the return value of tensor_flatten() instead of an entire FakeTensor. Let me know if we should change the name of the function.

By doing this, we can now run the dynamic shapes cache test with autograd turned on.
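
For reference, a minimal sketch of the `__tensor_flatten__` protocol this relies on (`TwoTensor` here is a hypothetical subclass for illustration, not the one used in the tests):

```python
import torch

class TwoTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, a, b):
        return torch.Tensor._make_wrapper_subclass(cls, a.shape, dtype=a.dtype, device=a.device)

    def __init__(self, a, b):
        self.a, self.b = a, b

    def __tensor_flatten__(self):
        # (names of inner tensor attrs, opaque metadata): a small, picklable
        # description of the subclass, which is what gets stored in the cache entry.
        return ["a", "b"], {"flavor": "two_tensor"}

    @staticmethod
    def __tensor_unflatten__(inner_tensors, meta, outer_size, outer_stride):
        return TwoTensor(inner_tensors["a"], inner_tensors["b"])
```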

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127618
Approved by: https://github.com/bdhirsh
2024-06-13 19:45:33 +00:00
39193b10e8 [inductor] fx graph cache: memoize devices to make cache key calculation more predictable (#128366)
Summary: I've seen this issue once in the wild and oulgen was able to repro in a unit test. The problem is this:
- We're using pickle to turn everything related to the FX graph cache key into a byte stream, then hashing the bytes to compute the cache key.
- Pickle is optimized to avoid serializing the same ID more than once; it instead drops a reference to a previously-pickled object if it encounters the same ID.
- That pickle behavior means we can compute different cache keys depending on whether a given object appears more than once (by identity) in the hashed objects or as functionally equivalent but distinct objects.

The cases I've investigated only involve the torch.device objects in the tensor graph args. That is, we may compile a graph with two tensor args, each referencing `torch.device('cpu')`. In one run, those devices may reference the same object; in another, they may reference distinct (but equivalent) objects. In practice, my observation is that the compiler is largely deterministic and this situation is rare. I've seen cache misses on a real benchmark only when enabling/disabling FakeTensor caching in order to introduce different code paths that otherwise produce the same fx graph. But the failing unit test seems to be enough motivation for a remediation?

I don't really love this solution, but I've failed to find another way to make the pickling phase robust to these kinds of changes, e.g., by changing the protocol version or by overriding internal methods (which would also be gross). But I'm definitely open to other creative ideas.
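
A small, self-contained illustration of the pickle behavior described above (using a stand-in class rather than torch.device):

```python
import pickle

class Dev:  # stand-in for an object like torch.device (illustration only)
    def __init__(self, name):
        self.name = name

d = Dev("cpu")
same_object_twice = pickle.dumps((d, d))                      # second slot becomes a memo back-reference
equal_but_distinct = pickle.dumps((Dev("cpu"), Dev("cpu")))   # both objects serialized in full

print(len(same_object_twice), len(equal_but_distinct))  # different lengths
print(same_object_twice == equal_but_distinct)          # False -> different hash -> different cache key
```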

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128366
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-06-13 19:25:14 +00:00
c54e358bdb enable comprehensive padding internally (#128555)
Summary: The feature was previously disabled in fbcode due to breaking the deterministic NE unit tests. Now that it has been on in OSS for quite a while and we have verified that it has no NE impact on CMF, we want to update the unit test and enable the feature.

Test Plan:
```
time buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests -- --exact 'aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests - aps_models.ads.icvr.tests.ne.e2e_deterministic_tests.icvr_fm_test.ICVR_FM_DeterministicTest: test_icvr_fm_pt2_fsdp_multi_gpus'

```

Differential Revision: D58425432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128555
Approved by: https://github.com/eellison
2024-06-13 19:20:00 +00:00
cdc37e4bff Add a shape property to IR nodes (#127818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127818
Approved by: https://github.com/peterbell10
2024-06-13 19:11:52 +00:00
5a80d2df84 [BE] enable UFMT for torch/nn/utils (#128595)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128595
Approved by: https://github.com/Skylion007
2024-06-13 18:34:57 +00:00
9f55c80a9f [AOTI] Fix a minimal_arrayref_interface test failure (#128613)
Summary: When calling a fallback op in the minimal_arrayref_interface mode with an optional tensor, a temporary RAIIAtenTensorHandle needs to be explicitly created in order to pass a pointer to the tensor as the optional tensor parameter.

Test Plan: CI

Differential Revision: D58528575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128613
Approved by: https://github.com/hl475
2024-06-13 18:25:04 +00:00
a265556362 inductor fusion logs: make it easier to attribute to aten graph (#127159)
Summary:

I want to be able to look at inductor fusion logs and reason about which parts of the aot_autograd aten graph were fused / not fused.

This PR adds a short description of each buffer to the fusion logs. Example for forward of `Float8Linear`:

```
torch._inductor.scheduler.__fusion: ===== attempting fusion (1/10): 13 nodes =====
torch._inductor.scheduler.__fusion: fuse_nodes_once, candidates:
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf0'), Reduction(['[254201]', 'max', 'origins={abs_1, max_1}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf3'), Reduction(['[114688]', 'max', 'origins={abs_2, max_2}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf6'), Pointwise(['[]', 'origins={reciprocal_1, convert_element_type_6, clamp_min_2, mul_2, copy_1, reciprocal_3, convert_element_type_5}'])
torch._inductor.scheduler.__fusion:   ExternKernelSchedulerNode(name='buf10')
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf2'), Pointwise(['[]', 'origins={full_default}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf8'), Pointwise(['[8192, 7168]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_type
_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf4'), Reduction(['[512]', 'max', 'origins={abs_2, max_2}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf13'), Pointwise(['[8192, 7168]', 'origins={clone_2}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf7'), Pointwise(['[16384, 8192]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_typ
e_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}'])
torch._inductor.scheduler.__fusion:   ExternKernelSchedulerNode(name='buf9')
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf1'), Reduction(['[528]', 'max', 'origins={abs_1, max_1}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf5'), Pointwise(['[]', 'origins={convert_element_type, clamp_min, convert_element_type_1, copy, reciprocal_2, mul, reciprocal}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf12'), Pointwise(['[8192, 16384]', 'origins={clone_1}'])
torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf7: no shared data
torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf12: no shared data
torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf1: numel/rnumel mismatch (reduce) (528, 1), (254201, 528)
torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf1: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf12 with buf1: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf7: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf12: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf8: intermediate nodes between node1 & node2
torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf13: no shared data
torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf4: numel/rnumel mismatch (reduce) (512, 1), (114688, 512)
torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf4: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf13 with buf4: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf8: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf13: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf9 with buf10: node1 is extern or nop
torch._inductor.scheduler.__fusion: found 4 possible fusions
torch._inductor.scheduler.__fusion: fusing buf7 with buf12
torch._inductor.scheduler.__fusion: fusing buf8 with buf13
torch._inductor.scheduler.__fusion: fusing buf4 with buf6
torch._inductor.scheduler.__fusion: fusing buf1 with buf5
torch._inductor.scheduler.__fusion: completed fusion round (1/10): fused 13 nodes into 9 nodes
```

Test Plan: will add tests after we align on a version of this that can land


Pull Request resolved: https://github.com/pytorch/pytorch/pull/127159
Approved by: https://github.com/mlazos
2024-06-13 18:22:02 +00:00
de9a072ac4 Updating the sigslot license to Public Domain (#128085)
It seems that Sigslot's license is Public Domain, not Apache 2. https://sigslot.sourceforge.net

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128085
Approved by: https://github.com/janeyx99
2024-06-13 18:13:54 +00:00
8733c4f4be docs: Add link to test-infra issue (#128608)
It's not immediately obvious from this file that the issue being referred to is in another repo. Add that detail and link to make it easier for folks reading this code to jump to the correct issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128608
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt, https://github.com/ZainRizvi
2024-06-13 18:00:53 +00:00
dd19c9150c Revert "[aota] compiled forward outputs requires_grad alignment with eager (#128016)"
This reverts commit b459713ca75f6ab7c8a59acec0258e0f77904ada.

Reverted https://github.com/pytorch/pytorch/pull/128016 on behalf of https://github.com/bdhirsh due to fix torchbench regression ([comment](https://github.com/pytorch/pytorch/pull/128016#issuecomment-2166446841))
2024-06-13 17:56:42 +00:00
52f529105d force_stride_order on fused_all_gather_matmul/fused_matmul_reduce_scatter's operands to avoid a copy due to layout transformation (#127454)
When performing fused_all_gather_matmul/fused_matmul_reduce_scatter and gather_dim/scatter_dim != 0, a copy of the lhs operand (A_shard/A) is needed for layout transformation.
This copy can be avoided if the lhs operand already has the following stride order:

    lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()

In `micro_pipeline_tp` passes, we enforce the lhs operand to have such stride order via `inductor_prims.force_stride_order`. This way if the lhs operand has a flexible layout, the copy is avoided.
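
A small illustration of the stride order in question, using plain tensor ops (the tensor shape and `gather_dim` here are arbitrary examples, not taken from the pass):

```python
import torch

A = torch.randn(4, 6, 8)  # hypothetical lhs operand
gather_dim = 1

# The stride order the pass enforces: contiguous as if gather_dim were the
# leading dimension, then moved back into place.
desired_strides = A.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()

# If A already has these strides, the layout transformation needs no copy.
print(A.stride(), desired_strides)
```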

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127454
Approved by: https://github.com/Chillee
2024-06-13 17:52:37 +00:00
d5780396c7 Skip debug asserts for mixed dense, subclass views in autograd_not_implemented_fallback (#128057)
Fixes #125503
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128057
Approved by: https://github.com/albanD, https://github.com/soulitzer
ghstack dependencies: #127007
2024-06-13 17:13:02 +00:00
9a8917fdbd Naive CPU kernels for jagged <-> padded dense conversions (#127007)
This PR introduces naive CPU impls for:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`

On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007
Approved by: https://github.com/davidberard98
2024-06-13 17:13:02 +00:00
a0604193a2 handle call_function with Parameter args in DDPOptimizer splitting (#128034)
When nn module inlining is enabled, modules are replaced with the underlying function calls in the output fx graph. For example:
example:
```
class GraphModule(torch.nn.Module):
  def forward(self, L_x_: "f32[1024, 1024]"):
      l_x_ = L_x_

      # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_structured_trace.py:284 in forward, code: return self.layers(x)
      l__self___layers_0: "f32[1024, 1024]" = self.L__self___layers_0(l_x_);  l_x_ = None
      l__self___layers_1: "f32[1024, 1024]" = self.L__self___layers_1(l__self___layers_0);  l__self___layers_0 = None
      return (l__self___layers_1,)
```

will be
```
class GraphModule(torch.nn.Module):
    def forward(self, L_self_layers_0_weight: "f32[1024, 1024]", L_self_layers_0_bias: "f32[1024]", L_x_: "f32[1024, 1024]", L_self_layers_1_weight: "f32[1024, 1024]", L_self_layers_1_bias: "f32[1024]"):
        l_self_layers_0_weight = L_self_layers_0_weight
        l_self_layers_0_bias = L_self_layers_0_bias
        l_x_ = L_x_
        l_self_layers_1_weight = L_self_layers_1_weight
        l_self_layers_1_bias = L_self_layers_1_bias

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias)
        input_1: "f32[1024, 1024]" = torch._C._nn.linear(l_x_, l_self_layers_0_weight, l_self_layers_0_bias);  l_x_ = l_self_layers_0_weight = l_self_layers_0_bias = None
        input_2: "f32[1024, 1024]" = torch._C._nn.linear(input_1, l_self_layers_1_weight, l_self_layers_1_bias);  input_1 = l_self_layers_1_weight = l_self_layers_1_bias = None
        return (input_2,)
```
When performing splitting, the DDP optimizer did not handle the inlined graph: it did not handle function calls, since previously we did not have function calls with parameters as inputs (only calls to modules).

This diff addresses that: it uses the example_value in the arguments to determine the Parameter arguments of a function call
and their properties.
This addresses https://github.com/pytorch/pytorch/issues/127552

Running the optimizer on the code above with inlining yields the following splitting:
```
---submod_0 graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_]
    %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_weight]
    %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_bias]
    %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {})
    return linear

---submod_1 graph---
graph():
    %input_1 : [num_users=1] = placeholder[target=input_1]
    %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_weight]
    %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_bias]
    %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%input_1, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {})
    return linear

---final graph---
graph():
    %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_weight]
    %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_bias]
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
    %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_weight]
    %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_bias]
    %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {})
    %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {})
    return (submod_1,)
---------------

```
whereas without inlining it used to be
```
---submod_0 graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_]
    %l__self___layers_0 : [num_users=1] = call_module[target=L__self___layers_0](args = (%l_x_,), kwargs = {})
    return l__self___layers_0
/data/users/lsakka/pytorch/pytorch/torch/_inductor/compile_fx.py:133: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(

---submod_1 graph---
graph():
    %l__self___layers_0 : [num_users=1] = placeholder[target=l__self___layers_0]
    %l__self___layers_1 : [num_users=1] = call_module[target=L__self___layers_1](args = (%l__self___layers_0,), kwargs = {})
    return l__self___layers_1

---final graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
    %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_,), kwargs = {})
    %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0,), kwargs = {})
    return (submod_1,)
---------------
```

TESTING:

(1) running
``` TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1   pytest test/distributed/test_dynamo_distributed.py -k ```
results in a reduction in failures from 6 to 2 with this PR.

The two remaining failures are FSDP-related; they do not look trivial and involve many details, so I will leave them for future work.

Co-authored-by: Animesh Jain <anijain@umich.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128034
Approved by: https://github.com/anijain2305, https://github.com/wconstab
2024-06-13 17:07:27 +00:00
3e3435678c Remove some implications from the static_eval pattern matcher (#128500)
We should be able to remove this as, with the new canonicalisation, we
have that `a < b` and `-a > -b` should be canonicalised to the same
expression (if SymPy does not interfere too much).
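
A toy illustration of why those two forms coincide under such a canonicalisation (this is not inductor's actual pattern matcher, just a sketch of the idea):

```python
import sympy

a, b = sympy.symbols("a b", integer=True)

def canonicalise(rel):
    # Toy rule: rewrite every strict inequality as `expr > 0`.
    if isinstance(rel, sympy.StrictLessThan):       # lhs < rhs  ->  rhs - lhs > 0
        expr = rel.rhs - rel.lhs
    elif isinstance(rel, sympy.StrictGreaterThan):  # lhs > rhs  ->  lhs - rhs > 0
        expr = rel.lhs - rel.rhs
    else:
        raise NotImplementedError(type(rel))
    return sympy.expand(expr) > 0

print(canonicalise(a < b))    # b - a > 0
print(canonicalise(-a > -b))  # b - a > 0  (same canonical form)
```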

nb. I thought this would further cut compilation time, but I was running
the benchmarks wrong (I was not removing Triton's cache, oops). It turns out that
after the first PR in this stack, https://github.com/pytorch/pytorch/issues/128398 is fully fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128500
Approved by: https://github.com/ezyang
ghstack dependencies: #128410, #128411
2024-06-13 16:50:00 +00:00
0fdd8d84fa Do not generate -1* in SymPy expressions when canonicalising (#128411)
Partially addresses https://github.com/pytorch/pytorch/issues/128150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128411
Approved by: https://github.com/ezyang
ghstack dependencies: #128410
2024-06-13 16:49:59 +00:00
bdeb9225b0 Do not call get_implications unnecessarily (#128410)
This should improve compilation times. With this PR and the patch in
the original issue, I get a compilation time of `Compilation time: 307.30 second`.

Fixes https://github.com/pytorch/pytorch/issues/128398
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128410
Approved by: https://github.com/Chillee
2024-06-13 16:49:55 +00:00
cyy
e2a72313e8 Concat namespaces of torch/csrc/profiler code and other fixes (#128606)
Improve namespaces and modernize codebase of torch/csrc/profiler code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128606
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-06-13 16:46:34 +00:00
7c370d2fb0 expose set_thread_name to Python and set thread names (#128448)
This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process.

Threads named:

* torchrun/elastic
* PyTorch dataloader worker processes + pin memory thread
* TCPStore
* ProcessGroupNCCL background threads
* WorkerServer httpserver thread

Test plan:

```
$ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_'
3264281 3264281 pts/45   00:00:02 pt_elastic
3264281 3267950 pts/45   00:00:00 pt_elastic
```

dataloading

```py
import torch
import time

from torch.utils.data import (
    DataLoader,
    Dataset,
)

class NoopDataset(Dataset):
    def __getitem__(self, index):
        return index

    def __len__(self):
        return 10

dataloader = DataLoader(NoopDataset(), num_workers=2)

for i, x in enumerate(dataloader):
    print(i, x)
    time.sleep(10000)
```

```
$ python3 ~/scripts/dataloader_test.py
$ ps -eL | grep pt_
1228312 1228312 pts/45   00:00:02 pt_main_thread
1228312 1230058 pts/45   00:00:00 pt_main_thread
1228312 1230059 pts/45   00:00:00 pt_main_thread
1230052 1230052 pts/45   00:00:00 pt_data_worker
1230052 1230198 pts/45   00:00:00 pt_data_worker
1230052 1230740 pts/45   00:00:00 pt_data_worker
1230055 1230055 pts/45   00:00:00 pt_data_worker
1230055 1230296 pts/45   00:00:00 pt_data_worker
1230055 1230759 pts/45   00:00:00 pt_data_worker
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448
Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro
2024-06-13 16:38:23 +00:00
b05b8d3989 [EZ][ALI Migration] Add logging for workflow type determination (#128619)
To help figure out what went wrong when the wrong label appears to have been set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128619
Approved by: https://github.com/zxiiro, https://github.com/clee2000
2024-06-13 16:37:07 +00:00
e9b81e4edf Fakify torch bind input by default (#128454)
Summary: Try a reland of https://github.com/pytorch/pytorch/pull/127116 after some fixes landed

Differential Revision: D58418251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128454
Approved by: https://github.com/angelayi
2024-06-13 16:25:11 +00:00
c63ccead5e Revert "[dynamo] Enable some inlining inbuilt nn module tests (#128440)"
This reverts commit 1602c7d0c861a4382746ccb18c76d8703a636f4e.

Reverted https://github.com/pytorch/pytorch/pull/128440 on behalf of https://github.com/clee2000 due to new test broke internally D58501220 ([comment](https://github.com/pytorch/pytorch/pull/128440#issuecomment-2166127531))
2024-06-13 16:14:37 +00:00
17b45e905a Fix get output code when caching is enabled (#128445)
Summary: Improve output code retrieval mechanism so that it works in the presence of cache hits.

Test Plan: ci

Differential Revision: D58429602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128445
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/masnesral
2024-06-13 16:00:30 +00:00
93a14aba6e [BE]: Update mypy to 1.10.0 (#127717)
Updates mypy to the latest and greatest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717
Approved by: https://github.com/ezyang
2024-06-13 15:57:13 +00:00
49366b2640 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.

We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-13 15:32:15 +00:00
cf7adc2fa1 [Inductor] Update Intel GPU Triton commit pin. (#124842)
Update Intel Triton for the PyTorch 2.4 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124842
Approved by: https://github.com/EikanWang
2024-06-13 14:34:37 +00:00
edb45dce85 Add OpInfo entry for as_strided_copy (#127231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231
Approved by: https://github.com/lezcano
2024-06-13 13:58:47 +00:00
7cc07a3eb1 [custom_op] stop using nonlocals to store information (#128547)
Fixes https://github.com/pytorch/pytorch/issues/128544
Fixes https://github.com/pytorch/pytorch/issues/128535

We had a problem with multithreading where the nonlocals were being
clobbered. In the first place, we stored these nonlocals because we
wanted to ferry information from an autograd.Function.apply to
autograd.Function.forward.

Our new approach is:
- pass the information directly as an input to the
  autograd.Function.apply. This means that the autograd.Function.forward
  will receive the information too.
- this messes up ctx.needs_input_grad, which has an element per input to
  forward. The user should not see the additional information we passed.
  We fix this by temporarily overriding ctx.needs_input_grad to the
  right thing.
- this exposed a bug in that ctx.needs_input_grad wasn't correct for
  TensorList inputs. This PR fixes that too.
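
A rough sketch of the "pass the information as an input" pattern described above (hypothetical names; the real change lives in the custom_op machinery and also temporarily adjusts ctx.needs_input_grad, which is omitted here):

```python
import torch

class CustomOpApply(torch.autograd.Function):  # illustrative only
    @staticmethod
    def forward(ctx, extra_info, *tensor_args):
        # `extra_info` is the data previously ferried via nonlocals; it now
        # arrives as a regular input, so forward() sees it directly.
        ctx.extra_info = extra_info
        # The user-facing ctx.needs_input_grad should ignore the extra leading
        # input, i.e. only the entries for `tensor_args` are meaningful.
        return tensor_args[0] * 2

    @staticmethod
    def backward(ctx, grad):
        # None for the extra_info slot, gradients for the real tensor inputs.
        return None, grad * 2

x = torch.randn(3, requires_grad=True)
y = CustomOpApply.apply({"some": "info"}, x)
y.sum().backward()
```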

Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128547
Approved by: https://github.com/williamwen42, https://github.com/soulitzer
2024-06-13 13:36:39 +00:00
2b9465d62a [aota] Allow some mutations in backward (#128409)
https://github.com/pytorch/pytorch/issues/127572

Allow mutations in backward on forward inputs, if:
1/ not mutating metadata.
Enforced at compilation time.

2/ if create_graph=True: the mutated input does not require_grad.
Enforced at runtime, where create_graph mode can be detected by checking torch.is_grad_enabled().

Adding input_joint_info to track mutations of inputs during joint.
Created a separate field in ViewAndMutationMeta as it is filled only after joint fn tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409
Approved by: https://github.com/bdhirsh
2024-06-13 12:09:08 +00:00
d0c08926d1 allow inlining functions in _python_dispatch and _is_make_fx_tracing (#128485)
This fixes graph breaks in the torch_multimodal_clip benchmark.

Co-authored-by: Animesh Jain <anijain@umich.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128485
Approved by: https://github.com/anijain2305
ghstack dependencies: #128428
2024-06-13 09:56:39 +00:00
1fd2cd26a0 [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
2024-06-13 09:46:22 +00:00
c897651392 [inductor] Add BackendFeature gating (#128266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266
Approved by: https://github.com/shunting314
2024-06-13 07:31:51 +00:00
88974fedd0 Clean up xpu ut to make CI happy (#128383)
# Motivation
Before #127611 was merged, the XPU-specific UT `test/test_xpu.py` was skipped temporarily. This PR aims to fix the UT bug introduced by #127741.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128383
Approved by: https://github.com/EikanWang
2024-06-13 07:06:41 +00:00
ce79b09415 [CUDA][Sparse] Change comparison function of test_sparse_semi_structured.py and bump tolerances for sp24_matmuls (#128553)
Minor tweak of comparison as using `assert` on `torch.allclose` prevents the mismatches from being logged. Also bump a few tolerances that seem to be causing failures on sm86/sm90
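
As a generic reminder of the difference (not the test's actual tensors or tolerances):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.0, 2.0, 3.1])

# assert torch.allclose(a, b)  # fails with a bare AssertionError, no mismatch details
torch.testing.assert_close(a, b, rtol=0, atol=0.2)  # on failure, reports which elements mismatch and by how much
```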

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128553
Approved by: https://github.com/jcaip
2024-06-13 06:58:07 +00:00
0678742924 [MPS] Add Metal implementation of exp op (#128421)
To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for complex flavor)
Reuse `test_exp1` to check that accuracy of `exp` ops is sometimes closer to CPU

Fix bug in non-contiguous tensors handling

Fixes https://github.com/pytorch/pytorch/issues/84936
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128421
Approved by: https://github.com/kulinseth
ghstack dependencies: #128373, #128375
2024-06-13 06:53:17 +00:00
14c9eb5ed2 Add XPU code owners (#128486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128486
Approved by: https://github.com/atalman, https://github.com/malfet
2024-06-13 06:33:45 +00:00
518c9e6455 Forward fix lint (#128587)
merge at will
After https://github.com/pytorch/pytorch/pull/125968
and https://github.com/pytorch/pytorch/pull/127693
landrace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128587
Approved by: https://github.com/huydhn
2024-06-13 06:19:03 +00:00
c52eda896e [dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128428
Approved by: https://github.com/yanboliang, https://github.com/mlazos
ghstack dependencies: #126578, #128440, #128470, #128453, #128484
2024-06-13 06:08:56 +00:00
1f6e84fa68 [inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128484
Approved by: https://github.com/mlazos
ghstack dependencies: #126578, #128440, #128470, #128453
2024-06-13 06:08:56 +00:00
ea541dd965 SymIntify cross_entropy_loss_prob_target numel call (#128141)
This PR replaces call to ```numel``` with ```sym_numel``` in cross_entropy_loss_prob_target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128141
Approved by: https://github.com/ezyang
2024-06-13 05:37:17 +00:00
ade3d07483 GGML inspired int8 MM Metal shader (#127646)
## Context

This PR ported GGML int8 per channel matrix multiplication and matrix vector multiplication metal shaders into ATen library.
llama.cpp LICENSE: https://github.com/ggerganov/llama.cpp/blob/master/LICENSE

## Key Changes

Made the following changes to the original code:

* Memory layout of weight and scales is different than llama.cpp.
* Weight dequantization (scales multiplication) is done after MM is finished.
* Following PyTorch naming convention (M, K, N and assuming row major).

## Benchmark

When M = 1, mv shader improves existing ATen int8mm by 40%.
When M > 4, the mm shader outperforms the existing ATen int8mm by up to 10x for large M, as shown below.
![image](https://github.com/pytorch/pytorch/assets/8188269/fd9eff71-c538-4263-a7b5-f96fe479ae9d)

Hence the kernel chooses different shaders based on M.

## Test Plan

Tests are passing:
```
❯ python test/test_mps.py -v -k _int8_
/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'dlopen(/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN3c1017RegisterOperatorsD1Ev
  Referenced from: <A770339A-37C9-36B2-84FE-4125FBE26FD6> /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so
  Expected in:     <5749F98A-0A0C-3F89-9CBF-277B3C8EA00A> /Users/larryliu/CLionProjects/pytorch/torch/lib/libtorch_cpu.dylib'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
test__int8_mm_m_1_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok

----------------------------------------------------------------------
Ran 12 tests in 1.180s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127646
Approved by: https://github.com/malfet
2024-06-13 05:23:56 +00:00
b86b4ace88 Invalidate eager params when inlining and freezing nn modules (#128543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128543
Approved by: https://github.com/anijain2305
2024-06-13 04:50:17 +00:00
83bb9b7c53 [BE] explicitly export subpackage torch.utils (#128342)
Resolves #126401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128342
Approved by: https://github.com/Skylion007
ghstack dependencies: #127707
2024-06-13 04:39:16 +00:00
2229884102 Introduce int_oo (#127693)
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.

After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.

But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
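
A tiny illustration of the motivation (not the actual implementation in torch/utils/_sympy/numbers.py):

```python
import sys
import sympy

# With a finite sentinel, size computations on the upper bound immediately
# spill into arbitrary-precision integer arithmetic:
upper = sys.maxsize - 1
print(upper * 2)            # 18446744073709551612

# sympy.oo stays symbolic, but reports the wrong type for integer ranges:
print(sympy.oo.is_integer)  # False

# int_oo is, conceptually, an infinity-like singleton that reports
# is_integer=True, so integer ValueRanges stay correctly typed.
```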

The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.

Fixes https://github.com/pytorch/pytorch/issues/127396

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
2024-06-13 04:08:20 +00:00
d3b8230639 Fix profiler_kineto Clang errors (#128464)
Summary: There are clang errors in profiler_kineto. It would probably be a good idea to fix them as the file is already quite dense.

Test Plan: Make sure all tests under static_tests/lint_root on Phabricator pass

Differential Revision: D58431005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128464
Approved by: https://github.com/aaronenyeshi
2024-06-13 03:10:50 +00:00
d630e1e838 Revert "[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)"
This reverts commit f2d7f235a684c593f5a1ff2ca0b47b47274bfe85.

Reverted https://github.com/pytorch/pytorch/pull/128269 on behalf of https://github.com/anijain2305 due to incorrect ([comment](https://github.com/pytorch/pytorch/pull/128269#issuecomment-2164267320))
2024-06-13 03:04:26 +00:00
7fe9ab9ccc update amp example to device-agnostic (#127278)
As support for Intel GPU has been upstreamed, this PR is to make the AMP example doc device-agnostic.

Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127278
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/svekars
2024-06-13 02:01:16 +00:00
cyy
3f9b8446cf [8/N] Remove unused functions (#128499)
Follows #128407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128499
Approved by: https://github.com/malfet
2024-06-13 01:15:11 +00:00
ede74940a1 optimize vec isa check dispatch logical. (#128320)
Optimize the CPU vec ISA check dispatch by architecture; this makes the code easier to read and maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128320
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-13 01:06:34 +00:00
c1cd946818 [cond] add a set_ and data mutation expected failure test (#128457)
A follow up of the discussion in https://github.com/pytorch/pytorch/pull/126936.

Cond errors out early because of a graph break triggered by DelayGraphBreakVariable, which is created due to `aten.set_` [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/tensor.py#L366-L376).

We might need to see what happens to this test if we allow graph breaks in higher-order ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128457
Approved by: https://github.com/zou3519
2024-06-13 00:16:59 +00:00
c472cec565 [checkpoint] Clean up selective activation checkpoint and make public (#125795)
Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit

Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.

In-place:
- We use version counting to detect whether any cached tensor has been mutated, and error if so. In-place operations that do not mutate cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC, where the user cleverly also saves the output of the in-place op)

Randomness, views
- Currently in this PR, we don't do anything special for randomness or views; the author of the policy function is expected to handle them properly. (Would it be beneficial to error? We either want to save all or recompute all random tensors.)

Tensor object preservation
- We guarantee that if a tensor does not require grad and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors, which care about the object identity of the offsets tensor.

Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error.
- The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3).
- The number of times we call the policy_fn is documented as part of the public API. We call the policy function for all ops except detach, because detach is itself called a different number of times by AC between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute; the user is expected to handle that via is_recompute, see below).
Tensors guaranteed to be the same tensor as-is
- Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.
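
A sketch of what a policy function of this shape could look like (the `Policy` enum below is a stand-in for the real `{MUST,PREFER}_{SAVE,RECOMPUTE}` enum described above, and the op choice is arbitrary):

```python
from enum import Enum, auto

import torch

class Policy(Enum):  # stand-in for the actual SAC enum
    MUST_SAVE = auto()
    PREFER_SAVE = auto()
    MUST_RECOMPUTE = auto()
    PREFER_RECOMPUTE = auto()

def policy_fn(ctx, op, *args, **kwargs):
    # ctx currently only carries `is_recompute`; a stateful policy object could
    # use it to keep forward vs. recompute bookkeeping separate.
    if op is torch.ops.aten.mm.default:
        return Policy.MUST_SAVE        # save expensive matmuls...
    return Policy.PREFER_RECOMPUTE     # ...and recompute everything else
```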

"bc-breaking" for existing users of the private API:
- Existing policy functions must now change their return value to use the Enum.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
2024-06-12 23:57:33 +00:00
25b7537a27 doc comment typo fixes and improvements (#128512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128512
Approved by: https://github.com/LucasLLC
2024-06-12 23:55:09 +00:00
eb1db6702f [2nd try][AOTI] Switch to use shim v2 (#128521)
Test Plan: Sandcastle

Differential Revision: D58470269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128521
Approved by: https://github.com/desertfire
2024-06-12 23:44:24 +00:00
4423e1bbdc [release] Increase version 2.4.0->2.5.0 (#128514)
Same as https://github.com/pytorch/pytorch/pull/121974
Branch cut for 2.4.0 completed hence advance main version to 2.5.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128514
Approved by: https://github.com/malfet
2024-06-12 23:40:01 +00:00
3bc2004f91 [ts_converter] Fix prim::dtype (#128517)
Summary: prim::dtype has the signature `(Tensor a) -> int`: it gets the dtype of the tensor and returns the integer corresponding to this dtype, based on the enum in ScalarType.h. Previously we were converting prim::dtype by returning the actual dtype of the tensor (e.g. torch.float32). This caused incorrect control flow behavior, specifically where it checks `prim::dtype(tensor) in [3, 5, 7]`, where [3, 5, 7] correspond to torch.int32, torch.float16, torch.float64. This control flow would always return False because we would be comparing torch.float32 against the integers [3, 5, 7], which is a type mismatch.
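
A tiny illustration of the mismatch (the integer codes follow ScalarType.h as cited above; the mapping dict is only for illustration):

```python
import torch

t = torch.randn(2, 2, dtype=torch.float32)

# Old (incorrect) conversion: compare the dtype object itself.
print(t.dtype in [3, 5, 7])  # False for every dtype -- comparing a dtype to ints is a type mismatch

# Correct conversion: use the ScalarType integer code (3=int32, 5=float16, 6=float32, 7=float64).
dtype_to_code = {torch.int32: 3, torch.float16: 5, torch.float32: 6, torch.float64: 7}
print(dtype_to_code[t.dtype] in [3, 5, 7])  # False for float32, but now for the right reason
```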

Test Plan: 7/22 internal models now are convertable and runnable in eager and sigmoid! P1410243909

Reviewed By: jiashenC

Differential Revision: D58469232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128517
Approved by: https://github.com/jiashenC
2024-06-12 23:02:50 +00:00
2fa6f80b13 Perform reciprocal optimization with foreach_div (#128433)
Fixes https://github.com/pytorch/pytorch/issues/114165

Internal xref
https://fb.workplace.com/groups/1144215345733672/posts/2801223606699496/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128433
Approved by: https://github.com/awgu
2024-06-12 22:57:03 +00:00
8db4a41973 Use computeStorageNbytesContiguous if possible (#128515)
```at::detail::computeStorageNbytesContiguous``` does fewer data-dependent tests than ```at::detail::computeStorageNbytes```. Therefore, use of the former is more likely to succeed with dynamic shapes. This PR detects is_contiguous and dispatches to the appropriate function. This should be helpful in unblocking aot_eager for torchrec. As an aside, this is an alternative to the unsound solution I had first proposed in another [PR](#128141).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128515
Approved by: https://github.com/ezyang
2024-06-12 22:53:06 +00:00
e2610240f9 [ROCm] Enable several inductor UTs (#127761)

Needs https://github.com/pytorch/pytorch/pull/125396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127761
Approved by: https://github.com/peterbell10, https://github.com/pruthvistony
2024-06-12 22:47:45 +00:00
bb3cf8a339 Lift inductor lowerings for jagged <-> padded dense kernels (#125968)
This PR lifts internal lowerings written for FBGEMM kernels that do jagged <-> padded dense conversions. In particular, this PR provides lowerings and meta registrations for the following ATen ops:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
    * NB: if `total_L` is not provided, the output shape is data-dependent. An unbacked SymInt is used for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125968
Approved by: https://github.com/davidberard98
2024-06-12 22:46:09 +00:00
b4a7b543e5 Add targeted unit tests for guards-related functions used in the codecache (#128482)
Summary: Add a few unit tests that exercise `produce_guards_expression` and `evaluate_guards_expression` (and specifically "ToFloat" "FloatTrueDiv" added in https://github.com/pytorch/pytorch/pull/128418)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128482
Approved by: https://github.com/ezyang
ghstack dependencies: #128418
2024-06-12 22:41:50 +00:00
1f302d6885 Support aten operations with out tensor (#124926)
This PR intends to support the aten operations with the `out` tensor.

Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.

However, for aten operations it is common for the `out` tensor to be an input parameter that needs to be mutated. This PR adds a `keep_inference_input_mutations` flag (`aot_inductor.keep_inference_input_mutations`) to support this. The flag gives the callee the flexibility to decide whether the AOT compile needs to keep input tensor mutations in the graph.

Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```

W/O this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    return (clamp_max, clamp_max)
```

W/ this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max);  arg3_1 = clamp_max = None
    return (copy_,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
2024-06-12 22:31:59 +00:00
f4edd67fe7 [c10d] fix OSS commSplit bug (#128459)
Summary:
D56907877 modified OSS commSplit. However, commSplit requires being called on every rank, even ranks with no color. ncclCommSplit will not create a communicator for no-color ranks, hence this line of code can throw an error like `NCCL WARN CommUserRank : comm argument is NULL`

Revert this change from D56907877

Test Plan: CI

Differential Revision: D58436088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128459
Approved by: https://github.com/shuqiangzhang
2024-06-12 22:29:01 +00:00
f39ab8a0fe Fix side effect pruning (#128028)
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.

This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
   involved in a return from the function or intermediate variable
   during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
   NestedUserFunctionVariable to a global list

The new algorithm reflects this, but please let me know if there are
more cases to handle.
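
As a concrete (toy) example of a dead cell variable, i.e. one not reachable from either of the two cases above:

```python
import torch

@torch.compile
def f(x):
    y = x + 1

    def g():  # creates a cell for y (a NestedUserFunctionVariable in Dynamo)
        return y

    # g is never returned, never left on the stack at a graph break, and never
    # written into a pre-existing object, so its cell is dead and should be pruned
    # rather than forcing y to become an extra graph output.
    return x * 2
```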

Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
  SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
  -- the functorch dynamo graphs no longer return dead cellvars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
2024-06-12 22:25:37 +00:00
cyy
3008644297 [Caffe2] Remove remaining unused perfkernels (#128477)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128477
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-12 22:19:36 +00:00
55a6b38f52 [inductor] enable fx graph cache on torchbench (#128239)
Summary: We've already enabled for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable for torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239
Approved by: https://github.com/oulgen
2024-06-12 22:15:02 +00:00
6206da55ef Fix lint after #119459 (#128558)
TSIA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128558
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
2024-06-12 22:11:37 +00:00
2b28b107db [dynamo][fsdp] Dont take unspecializedNNModuleVariable path for FSDP modules (#128453)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128453
Approved by: https://github.com/yf225
ghstack dependencies: #126578, #128440, #128470
2024-06-12 22:03:45 +00:00
6aef2052ea Save backward graphs lazily to cache (#126999)
This PR makes it so we lazily save to the cache on backward call instead of saving ahead of time always. We have to pass a closure to post_compile to prevent cyclic dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126791
2024-06-12 21:58:34 +00:00
87072dcfdb Change Dynamo's custom ops warning message to be less spammy (#128456)
This is a short-term fix (for 2.4). In the longer term we should
fix https://github.com/pytorch/pytorch/issues/128430

The problem is that warnings.warn calls inside Dynamo print
every time. Python warnings are supposed to print once, unless their
cache is reset: Dynamo ends up resetting that cache every time it runs.

As a workaround we provide our own warn_once cache that is keyed on the
warning msg. I am not worried about this increasing memory usage because
that's effectively what python's warnings.warn cache does.
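
A minimal sketch of such a warn-once cache keyed on the message (names here are illustrative, not the actual Dynamo helper):

```python
import warnings

_warned_messages: set = set()

def warn_once(msg: str, category: type = UserWarning) -> None:
    # Unlike Python's per-module warning registry (which Dynamo resets each run),
    # this cache lives for the whole process and is keyed on the message text.
    if msg in _warned_messages:
        return
    _warned_messages.add(msg)
    warnings.warn(msg, category)
```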

Test Plan:
- fix tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128456
Approved by: https://github.com/anijain2305
2024-06-12 21:57:12 +00:00
c53d65b3d3 [inductor] fix linear add bias pattern (#128473)
Fix https://github.com/pytorch/pytorch/issues/128287.
Previously the assertions in `linear_add_bias` were pretty fragile
```
assert packed_weight_node.name == "_reorder_linear_weight"
assert transpose_weight_node.name == "permute_default"
```
because the `name` can be changed to `_reorder_linear_weight_id`, `permute_default_id` if we have more than one reorder/permute.

Checking `target` instead of `name` solves this issue.

The UT is also updated to match more than one `linear_add_bias` pattern to cover this case.
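
For illustration, matching on `target` rather than `name` looks roughly like this (a generic FX sketch, not the exact pattern-matcher code):

```python
import torch

def is_permute_node(node) -> bool:
    # `node.name` is uniquified per graph ("permute_default", "permute_default_1", ...),
    # so asserting on it breaks once the pattern occurs more than once.
    # `node.target` is the op being called and stays stable across copies.
    return node.op == "call_function" and node.target is torch.ops.aten.permute.default
```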

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128473
Approved by: https://github.com/jgong5
2024-06-12 21:55:35 +00:00
bb13fad7aa Share TCPStore by default when using c10d rdzv handler (#128096)
Summary:
A number of features rely on the TCP store as a control plane. By default, the TCPStore server is started on the rank0 trainer, which can create a race condition where rank0 exits (error or graceful exit) and any other rank reading/writing will fail.

Solution: the TCPStore server should outlive all the trainer processes. Moving ownership of the TCPStore to the torchelastic agent naturally fixes the lifecycle of the server.

Static rendezvous in torchelastic does already support sharing of the TCPStore server. We are extending this to more commonly used c10d rendezvous handler.

Any handler that would like to manage the TCP store has to:
- Return true from the `use_agent_store` property
- Ensure the `RendezvousInfo.RendezvousStoreInfo` `master_addr`/`master_port` values refer to the managed TCPStore (those are returned by the `next_rendezvous` call)

Note: in some instances users may want to use non-TCPStore based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to TCPStore (as done in this change)

Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import torch
import logging
import sys
import torch.distributed as dist
import torch

import os
import time

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)

def _run_test(store):

    if dist.get_rank() == 1:
        logger.info("Rank %s is sleeping", dist.get_rank())
        time.sleep(5)
        key = "lookup_key"
        logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
        store.check([key])
    else:
        logger.info("rank %s done", dist.get_rank())

def main() -> None:
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    dist.barrier()

    logger.info(f"Hello World from rank {dist.get_rank()}")

    host = os.environ['MASTER_ADDR']
    port = os.environ['MASTER_PORT']
    world_size = os.environ['WORLD_SIZE']

    logger.info("testing TCPStore")
    store = dist.TCPStore(
        host_name=host, port=int(port), world_size=int(world_size),
    )
    _run_test(store)

if __name__ == "__main__":
    main()
~~~

With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option)
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~

TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]:     main()
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]:     _run_test(store)
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]:     store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
  File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
    main()
  File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
    run(args)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-05_17:40:22
  host      : devgpu011.cln5.facebook.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2279237)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~

Differential Revision: D58180193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
2024-06-12 21:49:42 +00:00
c0ea8fc3a3 Disable inlining nn modules on static inputs tests (#128529)
With inlining NN modules, these tests no longer raise runtime errors because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the test with inlining disabled.
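
A sketch of what pinning the old behavior looks like, assuming the relevant config flag is named `inline_inbuilt_nn_modules` (treat the flag name as an assumption here, not a confirmed API):

```python
import torch._dynamo.config as dynamo_config

# Hypothetical test wrapper: disable NN-module inlining so the static-input
# runtime error path is exercised as before.
@dynamo_config.patch(inline_inbuilt_nn_modules=False)
def run_static_input_test():
    ...  # body of the affected test, unchanged
```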

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128529
Approved by: https://github.com/anijain2305
ghstack dependencies: #128528
2024-06-12 21:40:29 +00:00
ff3ba99320 Disable inline nn modules on unstable ptr test (#128528)
With inlining NN modules, these tests no longer raise runtime errors because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the test with inlining disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128528
Approved by: https://github.com/anijain2305
2024-06-12 21:40:29 +00:00
1026b7cfbe Add docstring for the torch.typename function (#128129)
Fixes: #127885

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128129
Approved by: https://github.com/malfet
2024-06-12 21:34:20 +00:00
cba840fde9 Fix accidental variable shadow (#128460)
Fixes #128322

We should probably crank up clang's warning levels...

Test:
```
import torch

def addmv_slice(input, mat, vec, slice_op):
    vec = vec[slice_op]
    res = torch.addmv(input, mat, vec)  # traced line: 25
    return res

torch._dynamo.reset()
model_opt = torch.compile(addmv_slice)

input = torch.empty(size=[11]).uniform_(-1, 1)
mat = torch.empty([11, 128]).uniform_(-10.0, 20.0)

vec = torch.empty([256]).uniform_(-10.0, 20.0)
slice_op = slice(None, None, 2)
out = model_opt(input, mat, vec, slice_op)

vec = torch.empty([384]).uniform_(-10.0, 20.0)
slice_op = slice(None, None, 3)
out = model_opt(input, mat, vec, slice_op)
```
before this change the test fails with:
```
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function getitem>(*(FakeTensor(..., size=(s0,)), slice(None, None, s1)), **{}):
slice step cannot be zero
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128460
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 21:14:04 +00:00
0444e89931 [export] Remove replace_sym_size_ops_pass (#128443)
Summary: Not needed anymore.

Test Plan: CI

Differential Revision: D58429458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128443
Approved by: https://github.com/angelayi
2024-06-12 21:03:06 +00:00
67e6c76a18 Support apply_(callable) sugar for CPU NJTs (#125416)
Example:
```python
nt.apply_(lambda x: x * 2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125416
Approved by: https://github.com/soulitzer
2024-06-12 20:30:57 +00:00
dd143d44cc [BE] enable UFMT for top-level files torch/*.py (#127707)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127707
Approved by: https://github.com/ezyang
2024-06-12 20:15:05 +00:00
cc231a8e2b First version of AOTAutogradCache (#126791)
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.

CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.
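
Roughly, an entry looks like the sketch below (field names are illustrative, not the exact ones in the PR):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompiledForward:
    fx_graph_cache_key: str  # key into FXGraphCache; the compiled graph itself is not stored

@dataclass
class CompiledBackward:
    fx_graph_cache_key: str

@dataclass
class AOTAutogradCacheEntry:
    compiled_fw: CompiledForward
    compiled_bw: Optional[CompiledBackward]
    runtime_metadata: dict  # whatever is needed to rewrap the callable on a cache hit
```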

On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this PR we *always* compile the backward ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object

On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.

For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.

V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to it by inductor.

We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.

Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
2024-06-12 20:04:44 +00:00
7775fee10f [tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431)
As titled, this PR refactors the PrepareModuleInput style to have a common
method, prepare_input_arg, allowing both args and kwargs to reuse this logic.

This also fixes https://github.com/pytorch/pytorch/issues/128365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431
Approved by: https://github.com/awgu
2024-06-12 19:16:33 +00:00
ec1fdda196 Fix jagged NT softmax semantics (#119459)
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
2024-06-12 19:12:03 +00:00
817ce6835b Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit 4c971932e839fc5da2b91906ad028d4654932bca.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))
2024-06-12 18:47:52 +00:00
6d1b1ddd3e Select Runner Label Dynamically (#127287)
Updated `get_workflow_type.py` logic to dynamically select a prefix for the runner label.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127287
Approved by: https://github.com/ZainRizvi
2024-06-12 18:47:47 +00:00
7db501ba2b Revert "[cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350)"
This reverts commit 45dccfddcd8fce804f50075484421ade27f1f021.

Reverted https://github.com/pytorch/pytorch/pull/128350 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128350#issuecomment-2163669538))
2024-06-12 18:35:18 +00:00
d71f92213c [DSD] keep 'exp_avg' as DTensor after torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#128004)
Fixes #126950
`ptd_state_dict` with `broadcast_from_rank0=False` might miss 2 condition checks in `set_optimizer_state_dict`.
Here we add another condition, `full_state_dict=True`, with the corresponding tensor distribution done without broadcasting when `broadcast_from_rank0=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128004
Approved by: https://github.com/fegin
2024-06-12 18:14:56 +00:00
624e8ae491 Documentation for is_dependent function (#128197)
Docstring for torch.distributions.constraints.is_dependent

Fixes #127900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128197
Approved by: https://github.com/fritzo, https://github.com/malfet
2024-06-12 17:50:41 +00:00
a70a7337d2 Update torch.nanmean() docstring to mention input dtype requirement (#128155)
Fixes #120570

## Description
Update torch.nanmean() docstring to mention input dtype requirement as either floating point type or complex.
Previously, the torch.mean() docstring had been updated in #120208 in a similar manner, but the torch.nanmean() docstring was not updated.

## Checklist

- [X] The issue that is being fixed is referred in the description.
- [X] Only one issue is addressed in this pull request.
- [x] Labels from the issue that this PR is fixing are added to this pull request.
- [X] No unnecessary issues are included into this pull request.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128155
Approved by: https://github.com/malfet
2024-06-12 17:46:36 +00:00
0f52dc7e51 Document torch.cuda.profiler.stop (#128196)
Fixes #127918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128196
Approved by: https://github.com/malfet, https://github.com/eqy
2024-06-12 17:39:43 +00:00
5001f41b90 Revert "Make TraceUtils.h to be device-agnostic (#126969)"
This reverts commit 648625b230e8e6e7478fb219ff4f0aa6a45070f5.

Reverted https://github.com/pytorch/pytorch/pull/126969 on behalf of https://github.com/clee2000 due to failing internal builds D58443769 ([comment](https://github.com/pytorch/pytorch/pull/126969#issuecomment-2163462600))
2024-06-12 16:32:57 +00:00
f89574fa23 Revert "Pass params to dump_nccl_trace_pickle (#128307)"
This reverts commit eb567b1f40233667b982f81e3a75deec0fdfd9ca.

Reverted https://github.com/pytorch/pytorch/pull/128307 on behalf of https://github.com/clee2000 due to sorry need to revert this in order to revert 126969 ([comment](https://github.com/pytorch/pytorch/pull/128307#issuecomment-2163459399))
2024-06-12 16:29:51 +00:00
81e4e12f02 Revert "Support aten operations with out tensor (#124926)"
This reverts commit cba195c8edd6c7149036ef0767772d11fff5390e.

Reverted https://github.com/pytorch/pytorch/pull/124926 on behalf of https://github.com/clee2000 due to newly added test broke in internal D58444103.  Test passed in OSS CI though ([comment](https://github.com/pytorch/pytorch/pull/124926#issuecomment-2163441547))
2024-06-12 16:20:04 +00:00
c5172b8de8 Revert "[AOTI] Switch to use shim v2 (#127674)"
This reverts commit 9a38cae299e5ffd8143182bec878c28f96cfd72a.

Reverted https://github.com/pytorch/pytorch/pull/127674 on behalf of https://github.com/clee2000 due to tests failed internally D56709309 ([comment](https://github.com/pytorch/pytorch/pull/127674#issuecomment-2163436728))
2024-06-12 16:17:07 +00:00
9e39c62908 correct avx512_vnni isa name. (#128318)
x86 currently has two VNNI ISAs: `avx2_vnni` and `avx512_vnni`.
This PR corrects the function name to `avx512_vnni`.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128318
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-06-12 16:12:49 +00:00
f2dcbe89d6 Revert "Prevent expansion of cat indexing to avoid int64 intermediate (#127815)"
This reverts commit 793df7b7cb1473004837f5867f4c1c4b2b0f751d.

Reverted https://github.com/pytorch/pytorch/pull/127815 on behalf of https://github.com/clee2000 due to the newly added test is failing internally D58444153.  Test exists in opensource and passed in OSS CI, maybe env difference? ([comment](https://github.com/pytorch/pytorch/pull/127815#issuecomment-2163421968))
2024-06-12 16:09:22 +00:00
8df56afc20 Add support in Python API for the recommended max working set size. (#128289)
Adds a way for users to request the recommended max working set size for Metal on Mac. It plumbs through
https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc

Can be used like
```
        max_memory = torch.mps.recommended_max_memory()
        print ("Recommended Max Memory : ", (max_memory/(1024*1024*1024)), "GB")
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289
Approved by: https://github.com/malfet
2024-06-12 16:03:57 +00:00
b19c2319e4 [ROCm] TunableOp for gemm_and_bias (#128143)
Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm.  gemm_and_bias was notably missing.  This PR closes that gap.

This PR also fixes a regression after #124362 disabled the numerical check by default. The env var to enable it no longer worked.

CC @xw285cornell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128143
Approved by: https://github.com/Skylion007
2024-06-12 15:53:39 +00:00
3c971d2ef3 Flip default value for mypy disallow_untyped_defs [final] (#127836)
Not requiring all functions to have types allows a lot of 'Any' types to slip in - which poison types and make mypy unable to properly typecheck the code.  I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types.

The preceding stack of PRs (cut up simply to keep the number of file changes per PR reasonable) adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped.  Due to changing files and merge conflicts it will probably be necessary to have several passes through before landing this final PR which turns the option on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 15:28:42 +00:00
15ab636007 Revert "Fix side effect pruning (#128028)"
This reverts commit a55d0d9718c11eb2897423c78eff18b168dd0a06.

Reverted https://github.com/pytorch/pytorch/pull/128028 on behalf of https://github.com/clee2000 due to broke test in internal D58443816.  Test exists in external too though ([comment](https://github.com/pytorch/pytorch/pull/128028#issuecomment-2163249251))
2024-06-12 14:55:57 +00:00
5ef70faaa7 Revert "Make torch_geometric models compatible with export (#123403)" (#128377)
This reverts commit d78991a7381adb3df5e9b63c365db4506643edce.

This PR reverts https://github.com/pytorch/pytorch/pull/123403 to fix the performance regression as discussed in https://github.com/pytorch/pytorch/issues/127513#issuecomment-2158835653.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128377
Approved by: https://github.com/jgong5, https://github.com/angelayi, https://github.com/desertfire
2024-06-12 14:53:01 +00:00
71f491554c Revert "First version of AOTAutogradCache (#126791)"
This reverts commit abc3eec22d38079bee855fbcb75da62a9558284c.

Reverted https://github.com/pytorch/pytorch/pull/126791 on behalf of https://github.com/DanilBaibak due to The changes broke a number of linux jobs ([comment](https://github.com/pytorch/pytorch/pull/126791#issuecomment-2163081643))
2024-06-12 13:59:29 +00:00
abc3eec22d First version of AOTAutogradCache (#126791)
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.

CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.

On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this PR we *always* compile the backward ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object

On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.

For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.

V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to it by inductor.

We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.

Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
2024-06-12 13:44:30 +00:00
2e065f2486 [Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#127592)
Fixes #127402

- Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py
- Add checks of mutation for QLinearPointwiseBinaryPT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592
Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee
2024-06-12 10:49:16 +00:00
46a35a1ed4 [BE] enable UFMT for torch/__init__.py (#127710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127710
Approved by: https://github.com/ezyang
ghstack dependencies: #127703, #127708, #127709
2024-06-12 10:40:23 +00:00
26433b86de [BE][Easy] sort __all__ in torch/__init__.py (#127709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127709
Approved by: https://github.com/ezyang
ghstack dependencies: #127703, #127708
2024-06-12 10:21:36 +00:00
2386045e4f Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-12 09:39:58 +00:00
1edcb31d34 [RELAND][inductor][cpp] bf16/fp16 gemm template computed with fp32 (#128472)
reland for https://github.com/pytorch/pytorch/pull/126068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128472
Approved by: https://github.com/desertfire
2024-06-12 08:37:16 +00:00
ebb00a92bd [dynamo] Skip freezing expect failure for inlining inbuilt nn modules (#128470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128470
Approved by: https://github.com/mlazos
ghstack dependencies: #126578, #128440
2024-06-12 08:21:50 +00:00
1602c7d0c8 [dynamo] Enable some inlining inbuilt nn module tests (#128440)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #126578
2024-06-12 08:21:50 +00:00
04037f3d22 [BE] sort imports in torch/__init__.py (#127708)
----

- Sort import via `usort`
- Change relative import `from . import xxx` to absolute import `from torch import xxx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127708
Approved by: https://github.com/ezyang
ghstack dependencies: #127703
2024-06-12 08:03:54 +00:00
0b331fd5d7 [CUDA] Abate SoftMax.cu compiler warning spam (#128468)
Avoids excessively spammy warnings such as
```
pytorch/aten/src/ATen/native/cuda/SoftMax.cu(844): warning #191-D: type qualifier is meaningless on cast type
        [&] { const auto& the_type = input.scalar_type(); constexpr const char* at_dispatch_name = "host_softmax"; at::ScalarType _st = ::detail::scalar_type(the_type); ; switch (_st) { case at::ScalarType::Double: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Double)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Double), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Double>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = 
(at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Float: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Float)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Float), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Float>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto 
output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Half: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Half)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  
If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Half), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Half>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, 
smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::BFloat16: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::BFloat16)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::BFloat16), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::BFloat16>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); 
size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } default: do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str('"', at_dispatch_name, "\" not implemented for '", toString(_st), "'")))); }; } while (false); } }()

```
and
```
SoftMax.cu:844: warning: comparison of integer expressions of different signedness: ‘int64_t’ {aka ‘long int’} and ‘long unsigned int’ [-Wsign-compare]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128468
Approved by: https://github.com/valentinandrei
2024-06-12 07:47:14 +00:00
8b3daf1768 Add FloatTrueDiv and ToFloat to SYMPY_INTERP (#128418)
Summary: I admit I'm not 100% sure what I'm doing here. I'm hitting a bug in the FX graph cache when we try to evaluate a guards expression. We're creating guards that look like this:
```
Ne(CeilToInt(FloatTrueDiv(ToFloat(8*L['t0']) - 4.0, 8.0))*CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0)), CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0))) and ...
```
It looks like we have a facility to define these operators in the SYMPY_INTERP map and we're just missing FloatTrueDiv and ToFloat. What's surprising to me is that we're only hitting this problem with the FX graph cache enabled. We can create such guards, but we've never actually evaluated any?
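
For reference, a rough sketch of the kind of mapping being extended (the entries below are illustrative, not the exact definitions):

```python
import math

# Guard strings reference these symbolic ops by name, so evaluating a guards
# expression requires each name to resolve to a concrete Python callable.
SYMPY_INTERP_SKETCH = {
    "ToFloat": float,
    "FloatTrueDiv": lambda a, b: float(a) / float(b),
    "CeilToInt": math.ceil,
}

print(SYMPY_INTERP_SKETCH["CeilToInt"](SYMPY_INTERP_SKETCH["FloatTrueDiv"](7, 2)))  # 4
```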

Test Plan:
`TORCHINDUCTOR_FX_GRAPH_CACHE=1 python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --only detectron2_fcos_r_50_fpn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128418
Approved by: https://github.com/ezyang
2024-06-12 06:26:43 +00:00
a421699998 Revert "[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431)"
This reverts commit 089f9a116ac8b2c14d6351b52614b529caba126b.

Reverted https://github.com/pytorch/pytorch/pull/128431 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Your changes broke the linter. Here you can find more details - 089f9a116a ([comment](https://github.com/pytorch/pytorch/pull/128431#issuecomment-2162197858))
2024-06-12 06:25:53 +00:00
dcc0093dba [BE][Easy] export explicitly imported public submodules (#127703)
Add top-level submodules `torch.{storage,serialization,functional,amp,overrides,types}`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127703
Approved by: https://github.com/ezyang
2024-06-12 05:52:18 +00:00
62311257ad Add 1 test case for Convtranspose1D in op microbenchmark (#127216)
Operator ConvTranspose1d suffers a performance regression with a specific shape, #120982. We'd like to have this shape included in the op-level benchmark in this PR.

I reproduced the regression for ConvTranspose1d with shape [2016, 1026, 1024, 256, 1, 224]. Here is the summary:

Hardware info: Intel SPR 8480, 56 cores per socket, frequency = 2.1 GHz.
Performance comparison between torch 1.13 and torch 2.2:
Benchmarking **PyTorch1.13**: ConvTranspose1d Mode: Eager
Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu
Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu
Forward Execution Time (s) : **0.96s**

Benchmarking **PyTorch2.2:** ConvTranspose1d
Mode: Eager
Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu
Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu
Forward Execution Time (s) : **7.988s**

Also benchmarking for 7 rounds to check the variance.

  | Round1 | Round2 | Round3 | Round4 | Round5 | Round6 | Round7 | Normalized   Variance
-- | -- | -- | -- | -- | -- | -- | -- | --
Pytorch1.13 | 0.971 | 0.972 | 0.969 | 0.970 | 0.972 | 0.970 | 0.971 | 0.0002%
Pytorch 2.2 | 8.064 | 8.053 | 8.027 | 7.927 | 7.971 | 7.929 | 7.902 | 0.0059%
Ratio v2.2 vs.   v1.13(Lower is better) | 8.31 | 8.28 | 8.29 | 8.18 | 8.20 | 8.18 | 8.14 |  

Reproduce script:
numactl -N 0 python -m pt.conv_test
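
For reference, a self-contained version of the regressing shape (a plain eager timing sketch, not the actual op_bench entry):

```python
import time
import torch

m = torch.nn.ConvTranspose1d(in_channels=2016, out_channels=1026,
                             kernel_size=1024, stride=256)
x = torch.randn(1, 2016, 224)  # N=1, IC=2016, L=224

with torch.no_grad():
    m(x)  # warm-up
    start = time.time()
    m(x)
    print(f"forward time: {time.time() - start:.3f}s")
```
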
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127216
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
2024-06-12 05:33:54 +00:00
089f9a116a [tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431)
As titled, this PR refactors the PrepareModuleInput style to have a common
method, prepare_input_arg, allowing both args and kwargs to reuse this logic.

This also fixes https://github.com/pytorch/pytorch/issues/128365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431
Approved by: https://github.com/awgu
2024-06-12 05:22:24 +00:00
77a0ca66e4 Add threadfence to 2-stage reduction for correct writes visibility (#128455)
The final block accumulating the 2-stage reduction result has to complete the acquire pattern to make sure the writes of all other blocks are visible to it; see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=atom#release-and-acquire-patterns
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128455
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-06-12 04:13:36 +00:00
c0b87afcad [RELAND2][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)
Tracing through `__init__` is important because it initializes members (via STORE_ATTR). By doing that, we kick in mutation tracking for these objects, so things like mutating `_modules` etc. are tracked automatically.
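
For example, even a toy module's `__init__` performs attribute stores that need tracking:

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Each assignment is a STORE_ATTR that registers into self._modules /
        # self._parameters; tracing __init__ lets Dynamo observe these mutations.
        self.linear = torch.nn.Linear(4, 4)
        self.scale = torch.nn.Parameter(torch.ones(4))
```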

Fixes https://github.com/pytorch/pytorch/issues/111837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
2024-06-12 04:09:23 +00:00
02e7519ac3 DOC: strip inaccurate either float32 or float64 statement from set_default_type (#128192)
Fixes #126647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128192
Approved by: https://github.com/malfet
2024-06-12 03:57:48 +00:00
cyy
8cf302dce4 [5/N] Change static functions in headers to inline (#128406)
Follows #128286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128406
Approved by: https://github.com/ezyang
2024-06-12 03:25:54 +00:00
86b5df3e71 Documenting the torch.fx.annotate.annotate function (#128337)
Fixes #127903

This PR adds docstring to the `torch.fx.annotate.annotate` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128337
Approved by: https://github.com/malfet
2024-06-12 03:06:32 +00:00
7c2058338a Improve convert fp32 to fp16 fx pass (#127829)
Summary: Improve the convert fp32 to fp16 fx pass to use to_dtype node and const folding instead of inplace conversion.

Test Plan:
```
buck2 test @//mode/{opt,inplace} //glow/fb/fx/fba/tests:test_fba_pass_manager_builder
```

Differential Revision: D57803843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127829
Approved by: https://github.com/Skylion007
2024-06-12 02:50:37 +00:00
3ddec713b8 Revert "[cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177)"
This reverts commit cac7a22b92478d897488688010e562b7bd36b97f.

Reverted https://github.com/pytorch/pytorch/pull/128177 on behalf of https://github.com/clee2000 due to broke test/test_quantization.py::TestQuantizedLinear::test_qlinear_cudnn on sm86 tests cac7a22b92 https://github.com/pytorch/pytorch/actions/runs/9470648757/job/26100448913.  Probably a landrace, test ran on the PR and succeed ([comment](https://github.com/pytorch/pytorch/pull/128177#issuecomment-2161977110))
2024-06-12 02:20:15 +00:00
85eeb90d2c [dynamo] Fix graph breaks related to HF ModelOutput (#127780)
Fixes https://github.com/pytorch/pytorch/issues/126028 and https://github.com/pytorch/pytorch/issues/126027.

Changes:
- Support building `CustomizedDictVariable` in` VariableBuilder` (but only for HF `ModelOutput` subclasses)
- Remove `DataClassVariable` since it's not really being used anywhere (`CustomizedDictVariable` can be used instead)
- Support side effects for `CustomizedDictVariable`
- Allow `NO_HASATTR` leaf guard on `DictSubclassGuardManager`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127780
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-06-12 02:16:24 +00:00
7f6daf289b [inductor] parallel compile: set LD_LIBRARY_PATH for sub-processes in internal (#128376)
Test Plan: `TORCHINDUCTOR_WORKER_START=subprocess TORCHINDUCTOR_COMPILE_THREADS=16 buck run mode/opt scripts/slarsen/torch_compile:run`

Differential Revision: D58371264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128376
Approved by: https://github.com/eellison
2024-06-12 01:55:53 +00:00
3d55d84ec2 [Fix] Check tensor dtype before using torch.allclose in _trace log (#128438)
#### Issue
`torch.allclose` errors out during logging due to different dtypes.
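
A small illustration of the guard (the actual fix lives in the JIT `_trace` logging path; this snippet is just the idea): only hand same-dtype tensors to `torch.allclose`, otherwise promote first.

```python
import torch

a = torch.ones(3, dtype=torch.float32)
b = torch.ones(3, dtype=torch.float64)

if a.dtype == b.dtype:
    same = torch.allclose(a, b)
else:
    # Promote both sides to a common dtype before comparing.
    common = torch.promote_types(a.dtype, b.dtype)
    same = torch.allclose(a.to(common), b.to(common))
print(same)
```
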

#### Test
* `pytest test/test_jit.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128438
Approved by: https://github.com/angelayi
2024-06-12 01:52:09 +00:00
bb2a995529 Back out "[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)" (#128432)
Summary:
Original commit changeset: c7d2e6b13922

Original Phabricator Diff: D57618942

Differential Revision: D58383241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128432
Approved by: https://github.com/ezyang, https://github.com/Yuzhen11
2024-06-12 01:34:32 +00:00
cyy
9538bf4e7c [2/N] Remove inclusion of c10/util/string_utils.h (#128372)
Follows  #128300.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128372
Approved by: https://github.com/aaronenyeshi
2024-06-12 01:18:20 +00:00
cyy
219da29dfd [7/N] Remove unused functions (#128407)
Follows  #128309
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128407
Approved by: https://github.com/ezyang
2024-06-12 01:10:33 +00:00
cyy
fb013ecb24 Remove unused private List::ptr_to_first_element (#128405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128405
Approved by: https://github.com/ezyang
2024-06-12 01:07:14 +00:00
6af4c6acad Migrate test to internal base class, fixes (#128367)
Summary:
## Remove etcd deps
Converted tests to a non-etcd-based rendezvous (rdzv) handler so that the tests don't depend on an etcd server.

## Adopt PyTorch test conventions
- test file name starts with `test_` (`test_TESTS.py`)
- test base class is torch.testing._internal.common_utils.TestCase
- include a `__main__` handler

## Reduce test timing (used to take > 300 seconds):

3.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_env_with_torchelastic
2.59s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_tcp_with_torchelastic
2.33s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_worker_raise_exception
2.33s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_run_path
2.30s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_auto_configurations
2.24s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched_with_logs_spec_defined
2.24s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched
2.17s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_multiple_agents
2.12s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic
2.08s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_gpu_launch_configurations
1.32s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_standalone
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_number_configurations
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_with_env_vars
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python_caffe2_bc
1.04s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_bash
1.03s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_default_nproc
0.04s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_logs_logs_spec_entrypoint_must_be_defined
0.01s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_agent_raise_exception
0.01s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_shutdown

Test Plan: pytest --durations=0  test/distributed/launcher/run_test.py

Differential Revision: D58388182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128367
Approved by: https://github.com/d4l3k
2024-06-12 01:03:40 +00:00
786c24a4cd [inductor] Always realize sigmoid for CPU (#128339)
Summary: Currently the CPU backend prefers to always realize exp because it's a heavy op on CPU. For the same reason, we need to realize sigmoid as well. This solves a problem in llama2 inference where exp was recomputed many times in an inner loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128339
Approved by: https://github.com/eellison, https://github.com/helloguo, https://github.com/jansel, https://github.com/jgong5, https://github.com/peterbell10
2024-06-12 00:46:33 +00:00
5d8c7f39d4 Revert "Introduce int_oo (#127693)"
This reverts commit 9cab5987bdeb66df8efbc581b3469bfe300e168c.

Reverted https://github.com/pytorch/pytorch/pull/127693 on behalf of https://github.com/clee2000 due to sorry executorch CI is a bit weird regarding pins, I'll make a chat with mergen with the choices of what to do and how it'll affect executorch CI, reverting for now to prevent more divergences in the meantime ([comment](https://github.com/pytorch/pytorch/pull/127693#issuecomment-2161775400))
2024-06-11 23:36:08 +00:00
c9c1fed065 Revert "Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)"
This reverts commit c13e03c87428b986972a48d8fc78dbffc2579f63.

Reverted https://github.com/pytorch/pytorch/pull/128374 on behalf of https://github.com/clee2000 due to sorry I need to revert this in order to revert something else, to remerge, just rebase and fix the merge conflict ([comment](https://github.com/pytorch/pytorch/pull/128374#issuecomment-2161772864))
2024-06-11 23:34:03 +00:00
94fea82d66 init sub comment (#128082)
Fixes #127905

### Description

Add docstring to the torch/onnx/symbolic_opset9.py:sub function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082
Approved by: https://github.com/titaiwangms
2024-06-11 22:42:35 +00:00
447173198b Add docstring for the torch.fx.operator_schemas.create_type_hint func… (#128139)
Fixes: #127916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128139
Approved by: https://github.com/SherlockNoMad
2024-06-11 22:42:11 +00:00
b79d056e76 [export] FIx unflattener for preserving modules containing unused inputs (#128260)
Currently the unflattener fails if the module whose signature it is preserving contains unused inputs/outputs.

This also fixes unflattener issues in D57829276.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260
Approved by: https://github.com/pianpwk
2024-06-11 22:32:08 +00:00
eb567b1f40 Pass params to dump_nccl_trace_pickle (#128307)
Summary:
Pass parameters from request to dump_nccl_trace_pickle handler.
The supported parameter names and values are all lowercase.
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}

Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true

Test Plan:
unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307
Approved by: https://github.com/d4l3k
ghstack dependencies: #128191
2024-06-11 22:28:53 +00:00
1dd2431f86 [Test] Add test for only_active flag (#128191)
Summary:
Add a unit test for the only_active flag to _dump_nccl_trace API call.
With this flag, we only expect active records to be returned.

Test Plan:
Unit test.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191
Approved by: https://github.com/d4l3k
2024-06-11 22:26:01 +00:00
5fcb5f0c8b init reshape_from_tensor_shape comment (#128171)
Fixes #127897

### Description
Add docstring to the torch/onnx/symbolic_opset9.py:reshape_from_tensor_shape function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171
Approved by: https://github.com/titaiwangms
2024-06-11 21:56:33 +00:00
a55d0d9718 Fix side effect pruning (#128028)
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.

This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
   involved in a return from the function or are an intermediate variable
   during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
   NestedUserFunctionVariable to a global list

The new algorithm reflects this, but please let me know if there are
more cases to handle.
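
A minimal illustration of the pattern (assumed, not copied from issue #125078): `helper` closes over `y` but is never called, so `y` is a dead cell variable that should not become an extra graph output.

```python
import torch

def f(x):
    y = x.sin()

    def helper():          # closes over `y`, creating a new cell variable
        return y + 1       # never called, so the cell is dead

    return x.cos()         # the graph should have exactly this one output

compiled = torch.compile(f, backend="eager")
print(compiled(torch.randn(4)))
```
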

Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
  SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
  -- the functorch dynamo graphs no longer return dead cellvars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
2024-06-11 21:40:48 +00:00
8c1247cffb [BE] Fixed CPU autocast warning (#127774)
This PR fixes
```
/data/users/andgu/pytorch/torch/utils/checkpoint.py:1398: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
```
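
The replacement spelling, for reference:

```python
import torch

# Deprecated: torch.cpu.amp.autocast(...)
# Replacement used by the fix:
with torch.amp.autocast("cpu"):
    x = torch.randn(8, 8)
    y = x @ x
```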

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127774
Approved by: https://github.com/soulitzer, https://github.com/Skylion007, https://github.com/tianyu-l
2024-06-11 21:33:35 +00:00
70a1e85718 [Traceable FSDP2] Use custom ops for AllGather copy-in / copy-out and ReduceScatter copy-in (#127856)
Making these operations into custom ops helps Inductor identify these ops and enforce the FSDP communication op ordering.
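
As a hedged sketch of the general mechanism (the namespace and op name below are hypothetical, not the ops the PR registers): defining a copy-in-style op via `torch.library` gives Inductor a single opaque node it can order relative to the communication ops.

```python
import torch
from torch.library import Library

lib = Library("demo_fsdp", "DEF")
lib.define("all_gather_copy_in(Tensor[] shards) -> Tensor")

def _all_gather_copy_in(shards):
    # Flatten and concatenate the local shards into one contiguous buffer,
    # mimicking the copy-in step that precedes an all-gather.
    return torch.cat([s.reshape(-1) for s in shards])

lib.impl("all_gather_copy_in", _all_gather_copy_in, "CompositeExplicitAutograd")

out = torch.ops.demo_fsdp.all_gather_copy_in([torch.randn(4), torch.randn(4)])
```
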

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127856
Approved by: https://github.com/awgu
2024-06-11 20:15:03 +00:00
adb699189b Revert "[RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)"
This reverts commit b2d602306a9eb19e30328cbaee941c874f8148a9.

Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/clee2000 due to failed internal test D58394084.  Author has forward fix but includes external changes so reverting is a bit easier to coordinate ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2161481839))
2024-06-11 19:41:41 +00:00
eqy
45dccfddcd [cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350)
CC @vedaanta-nvidia @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128350
Approved by: https://github.com/Skylion007
2024-06-11 19:22:21 +00:00
3e09123797 Enable UFMT on test_nestedtensor.py (#128359)
split it into two PRs since it is more than 2k lines of change

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128359
Approved by: https://github.com/davidberard98
2024-06-11 19:14:04 +00:00
61f922c2ca Fix 'get_real_value' on placeholder nodes (#127698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698
Approved by: https://github.com/jansel
ghstack dependencies: #127695, #127696
2024-06-11 18:57:25 +00:00
984b1a8c35 Fix 'get_attr' call in dynamo 'run_node' (#127696)
Fixes #124858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696
Approved by: https://github.com/jansel
ghstack dependencies: #127695
2024-06-11 18:57:25 +00:00
205410cb44 add xpu to torch.tensors (#127280)
As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.tensors doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127280
Approved by: https://github.com/svekars
2024-06-11 18:13:01 +00:00
cac7a22b92 [cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177)
Similar in spirit to #125790, hopefully addresses failures seen for cuDNN 9.1 upgrade: #https://github.com/pytorch/pytorch/pull/128166

CC @nWEIdia @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128177
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2024-06-11 18:09:25 +00:00
8a09940a54 [inductor] fix compile time regression by caching get_gpu_type (#128363)
We recently observed a significant compile time regression in torchtitan when
turning on 2D parallel + torch.compile, so I decided to get a deeper
understanding of why.

It turns out this is affecting **all the trainings** that have functional collectives
captured in the graph, not only 2D parallel (2D parallel was just the
job that happened to have collectives captured in the TP region).

The root cause is that during inductor lowering we call the comm analysis
pass to get an estimated collective time for each collective node in the
graph, and for each such node we call `get_gpu_type()`, which under the hood
calls `torch.utils.collect_env.run` to get the GPU info. However, this call
is super expensive: it effectively spawns a new process and calls
`nvidia-smi`, so the cost is **linear** in the number of collective nodes in
the graph.

see https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75

The fix is to add an lru cache to the function, so that we only call it
once and reuse the cached result afterwards.

torchtitan benchmark shows:
* before this fix: 2D parallel + fp8 compile time: 6min +
* after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement)

There is more room to improve the compile time, but this PR fixes the biggest regression I have found so far.
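
A hedged sketch of the caching pattern (the real helper lives in Inductor and goes through `torch.utils.collect_env`; the body below is illustrative):

```python
import functools
import subprocess

@functools.lru_cache(None)
def get_gpu_type() -> str:
    # Spawning nvidia-smi is what made the original call expensive; caching
    # means we pay for it once per process instead of once per collective node.
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=gpu_name", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        return out.stdout.strip() or "unknown"
    except OSError:
        return "unknown"
```
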

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363
Approved by: https://github.com/yf225
2024-06-11 18:02:13 +00:00
1d233b8f50 Revert "Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704)"
This reverts commit c38b3381a12a0ec033dd417827c530c4474b8165.

Reverted https://github.com/pytorch/pytorch/pull/126704 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))
2024-06-11 17:45:20 +00:00
491c4a5dcb Revert "Make sure #126704 is BC for torch.save-ed nn.Module (#128344)"
This reverts commit 841d87177a900c2bbd59b6589165189141c4e8bb.

Reverted https://github.com/pytorch/pytorch/pull/128344 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))
2024-06-11 17:45:20 +00:00
4345d98663 [dynamo] Fix for #127696 (#128358)
Test Plan:
`buck2 test @//mode/dev-nosan //executorch/exir/backend/...`
https://www.internalfb.com/intern/testinfra/testrun/12666373989243932

Differential Revision: D58384518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128358
Approved by: https://github.com/ydwu4
2024-06-11 16:43:15 +00:00
a838e90964 Add Intel Gaudi device/HPU to auto load in instantiate_device_type_tests (#126970)
### Motivation
The Intel Gaudi accelerator (device name hpu) has a good pass rate with the PyTorch framework UTs; however, being an out-of-tree device, we face challenges in adapting it to natively run the existing PyTorch UTs under pytorch/test. The UTs are nevertheless a good indicator of device stack health, so we run them regularly with adaptations.
Although we can add the Gaudi/HPU device to generate device-specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on a lot of features such as executing for specific dtypes and skipping or overriding OpInfo entries. With the significant changes introduced every PyTorch release, maintaining these adaptations becomes difficult and time consuming.
Hence this PR introduces the Gaudi device in the common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded.
The eventual goal is to make Gaudi out-of-tree support equivalent to in-tree devices.

### Changes
Add HPUTestBase, a DeviceTypeTestBase subclass specifying the appropriate attributes for Gaudi/HPU.
Include code to check whether the Intel Gaudi software library is loaded and, if so, add the device to the list of devices considered for instantiation of device-type tests.
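
A hedged sketch of the shape of the change; the attribute names follow the existing conventions in `torch.testing._internal.common_device_type`, but the body is illustrative rather than the PR's code.

```python
from torch.testing._internal.common_device_type import DeviceTypeTestBase

class HPUTestBase(DeviceTypeTestBase):
    device_type = "hpu"

    @classmethod
    def get_primary_device(cls):
        # Assume a single logical Gaudi device per process for this sketch.
        return "hpu:0"
```
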

### Additional Context
please refer the following RFC : https://github.com/pytorch/rfcs/pull/63/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970
Approved by: https://github.com/albanD
2024-06-11 16:35:17 +00:00
29081059b6 [Static Runtime] Fix & run gen_static_runtime_ops (#128299)
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.

I added a number of ops to the blocklist:
```
+        "_nested_tensor_storage_offsets",
+        "_nested_get_values",  # no CPU backend
+        "_nested_get_values_copy",  # no CPU backend
+        "_nested_view_from_jagged",  # testing needs to be patched
+        "_nested_view_from_jagged_copy",  # testing needs to be patched
+        "_nested_view_from_buffer",  # testing needs to be patched
+        "_nested_view_from_buffer_copy",  # testing needs to be patched
+        "_int_mm",  # testing needs to be patched
+        "_to_sparse_csc",  # testing needs to be patched
+        "_to_sparse_csr",  # testing needs to be patched
+        "segment_reduce",  # testing needs to be patched
```

Most of these are added just because testing doesn't work right now.

Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.

Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
2024-06-11 16:27:39 +00:00
f8c45996d5 [MPS] Make erfinv compilable for bfloat16 (#128375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128375
Approved by: https://github.com/Skylion007
ghstack dependencies: #128373
2024-06-11 16:04:11 +00:00
c13e03c874 Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128374
Approved by: https://github.com/Skylion007
2024-06-11 15:58:28 +00:00
053930e194 [MPS][BE] Remove code duplication (#128373)
Use `scalarToMetalTypeString` instead of `getMetalType`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128373
Approved by: https://github.com/Skylion007
2024-06-11 15:58:04 +00:00
9a38cae299 [AOTI] Switch to use shim v2 (#127674)
Differential Revision: D56709309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127674
Approved by: https://github.com/desertfire
2024-06-11 15:01:25 +00:00
55901fb3da [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
The partitioner generates a different graph on each recompilation run.
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/ezyang
2024-06-11 14:04:52 +00:00
fc77fdca6f [guard_size_oblivious] Add gso ExpandUtils:_sym_to (#128224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128224
Approved by: https://github.com/ezyang
2024-06-11 14:01:34 +00:00
648625b230 Make TraceUtils.h to be device-agnostic (#126969)
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.

In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
2024-06-11 08:38:07 +00:00
207c2248a8 [inductor] Fix lowering full with SymBool value (#128213)
Fixes #128161, fixes #128095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128213
Approved by: https://github.com/lezcano
2024-06-11 08:33:35 +00:00
a206dcc79e fb_memcache: Move to fbcode from thirdparty (#128174)
Summary: The fb_memcache injections location and path is changing.

Test Plan: Existing tests should pass.

Reviewed By: bertmaher, oulgen

Differential Revision: D57973772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128174
Approved by: https://github.com/oulgen
2024-06-11 07:46:12 +00:00
f2d7f235a6 [dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)
Fixes https://github.com/pytorch/pytorch/issues/101168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269
Approved by: https://github.com/jansel
ghstack dependencies: #128295, #126578, #128268, #128254
2024-06-11 07:09:04 +00:00
402b289f3b Properly register parameter for binary folding test (#128356)
This PR properly registers the tensor used in the module compute as a parameter. This bug was hidden previously because all tensors on the nn modules were considered constant by Dynamo; with inlining of NN modules, this is no longer the case.
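
For reference, the difference between a plain tensor attribute and a properly registered parameter (illustrative module, not the test's code):

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain tensor attribute: Dynamo used to treat this as a constant,
        # but with inlined nn modules it no longer does.
        self.scale_tensor = torch.ones(1)
        # Proper registration, as the binary folding test now does.
        self.scale_param = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return x * self.scale_param
```
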

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356
Approved by: https://github.com/anijain2305
ghstack dependencies: #128355
2024-06-11 06:48:26 +00:00
a32157c67c Mark params static if inlining modules and freezing (#128355)
Today, inlining built-in nn modules is not compatible with parameter freezing. Freezing parameters and then constant-folding them through the graph relies on the assumption that they will not be inputs and will be static across calls to the same graph. When inlining built-in nn modules this assumption is broken, and we reuse the same graph for different instances of the same nn module. There are three options: 1) abandon constant folding, 2) create a dispatcher layer (like cudagraphs) which dispatches to the correct constant-folded graph for each distinct set of parameters, or 3) recompile.

This PR implements option 3 by introducing guards on the parameter pointers, since freezing is relatively rare and performance sensitive. Option 2 had many more unknowns, and option 1 is not viable due to the drop in performance.
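
An illustrative sketch (assumed, not the PR's test) of the behavior: with inlined nn modules plus freezing, each distinct instance of the same module class gets its own parameter-pointer guard, so a new instance recompiles rather than reusing another instance's constant-folded graph.

```python
import torch

torch._dynamo.config.inline_inbuilt_nn_modules = True
torch._inductor.config.freezing = True

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(4, 4))

    def forward(self, x):
        return x @ self.w

opt = torch.compile(lambda m, x: m(x))
x = torch.randn(2, 4)
with torch.no_grad():
    opt(M(), x)  # first instance: parameters constant-folded into the graph
    opt(M(), x)  # second instance: parameter-pointer guard fails, so recompile
```
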

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128355
Approved by: https://github.com/anijain2305
2024-06-11 06:48:26 +00:00
24e7f29099 Lowering for avg_pool_3d_backward (Fixes:#127101) (#127722)
We implemented a lowering for the avg_pool3d_backward operation and created tests for it.
We ran some benchmarks and achieved the following results:

```
[-------------- avgpool_3d_backwards --------------]
                             |  Decomposed  |  Eager
16 threads: ----------------------------------------
      (3, 5, 400, 200, 200)  |     6061     |  11160
      (3, 5, 300, 200, 200)  |     4547     |   8372
      (3, 5, 200, 200, 200)  |     3032     |   5585
      (3, 5, 300, 300, 300)  |    10100     |  18840
      (3, 5, 100, 100, 100)  |      381     |    703
      (3, 5, 100, 300, 200)  |     2270     |   4190
      (8, 8, 128, 128, 128)  |     3397     |   6253
      (2, 3, 150, 150, 150)  |      520     |    947
      (1, 3, 128, 128, 128)  |      161     |    299
      (8, 16, 64, 64, 64)    |      851     |   1569
      (1, 1, 50, 50, 50)     |       17     |     11
      (3, 5, 20, 40, 40)     |       17     |     30
      (3, 5, 10, 20, 20)     |       17     |     11
      (1, 1, 10, 10, 10)     |       16     |     11
      (3, 5, 5, 10, 10)      |       17     |     11
      (3, 5, 2, 5, 5)        |       17     |     11
```
These were run on an RTX 3050, so we were not able to allocate larger tensors due to memory limitations.
We believe it would be beneficial to benchmark this on more recent hardware, just to check if the performance holds up with larger sizes.
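
A rough sketch of how the decomposed (compiled) backward can be timed against eager; the pooling parameters and device handling below are illustrative choices, not the benchmark harness used above.

```python
import torch
from torch.utils import benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, 5, 100, 100, 100, device=device, requires_grad=True)
pool = torch.nn.AvgPool3d(kernel_size=3, stride=2)
compiled_pool = torch.compile(pool)

def run(fn):
    out = fn(x)
    out.backward(torch.ones_like(out))  # exercises avg_pool3d_backward

for name, fn in [("eager", pool), ("decomposed", compiled_pool)]:
    timer = benchmark.Timer(stmt="run(fn)", globals={"run": run, "fn": fn})
    print(name, timer.timeit(5))
```
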

Furthermore, we also refactored code from adaptive_avg_pool2d and adaptive_max_pool2d, to reduce code duplication.
We diffed the kernels and they are identical.

Fixes #127101

Co-authored-by: Martim Mendes <martimccmendes@tecnico.ulisboa.pt>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127722
Approved by: https://github.com/jansel
2024-06-11 06:39:04 +00:00
5b5d269d34 Speed up fx graph iteration by implementing it in C++ (#128288)
Before this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 19.5s (5132266 nodes/s)
```

After this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 3.4s (29114001 nodes/s)
```

5.7x improvement
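
A minimal version of the micro-benchmark idea (the PR's `fx_microbenchmarks.py` uses a much larger synthetic graph):

```python
import time
import torch
import torch.fx as fx

# Build a synthetic graph with many nodes and time a pass over graph.nodes.
g = fx.Graph()
node = g.placeholder("x")
for _ in range(10_000):
    node = g.call_function(torch.relu, (node,))
g.output(node)

start = time.perf_counter()
for _ in range(100):
    for n in g.nodes:
        pass
elapsed = time.perf_counter() - start
print(f"iterated {100 * len(g.nodes)} nodes in {elapsed:.3f}s")
```
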

Differential Revision: [D58343997](https://our.internmc.facebook.com/intern/diff/D58343997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128288
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-11 05:48:31 +00:00
fa88f390a0 Revert "[inductor] enable fx graph cache on torchbench (#128239)"
This reverts commit 734e8f6ad7e7f0fa0341fb658f1f986225173f5f.

Reverted https://github.com/pytorch/pytorch/pull/128239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to surface a bunch of inductor failures in trunk 734e8f6ad7 ([comment](https://github.com/pytorch/pytorch/pull/128239#issuecomment-2159789242))
2024-06-11 04:53:38 +00:00
2686 changed files with 101707 additions and 68696 deletions

View File

@ -1,5 +1,5 @@
0.6b
manylinux_2_17
rocm6
04b5df8c8123f90cba3ede7e971e6fbc6040d506
3db6ecbc915893ff967abd6e1b43bd5f54949868873be60dc802086c3863e648
rocm6.1
7f07e8a1cb1f99627eb6d77f5c0e9295c775f3c7
77c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1

View File

@ -373,6 +373,13 @@ case "$image" in
CONDA_CMAKE=yes
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.4
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
@ -490,6 +497,7 @@ docker build \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "HALIDE=${HALIDE}" \
--build-arg "XPU_VERSION=${XPU_VERSION}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \

View File

@ -1 +1 @@
d4b3e5cc607e97afdba79dc90f8ef968142f347c
c572f9e509b5ec5d56f4d218271e36269bba244f

View File

@ -0,0 +1 @@
340136fec6d3ebc73e7a19eba1663e9b0ba8ab2d

View File

@ -1 +1 @@
01cbe5045a6898c9a925f01435c8277b2fe6afcc
21eae954efa5bf584da70324b640288c3ee7aede

View File

@ -1 +1 @@
b8c64f64c18d8cac598b3adb355c21e7439c21de
1b2f15840e0d70eec50d84c7a0575cb835524def

View File

@ -1 +1 @@
45fff310c891f5a92d55445adf8cc9d29df5841e
dedb7bdf339a3546896d4820366ca562c586bfa0

View File

@ -9,7 +9,7 @@ TARBALL='aotriton.tar.bz2'
read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true
ARCH=$(uname -m)
AOTRITON_INSTALL_PREFIX="$1"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}.tar.bz2"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.bz2"
cd "${AOTRITON_INSTALL_PREFIX}"
# Must use -L to follow redirects

View File

@ -85,7 +85,7 @@ fi
else
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then
conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}

View File

@ -37,6 +37,9 @@ install_conda_dependencies() {
install_pip_dependencies() {
pushd executorch/.ci/docker
# Install PyTorch CPU build beforehand to avoid installing the much bigger CUDA
# binaries later, ExecuTorch only needs CPU
pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# Install all Python dependencies
pip_install -r requirements-ci.txt
popd
@ -44,13 +47,14 @@ install_pip_dependencies() {
setup_executorch() {
pushd executorch
source .ci/scripts/utils.sh
# Setup swiftshader and Vulkan SDK which are required to build the Vulkan delegate
as_jenkins bash .ci/scripts/setup-vulkan-linux-deps.sh
install_flatc_from_source
pip_install .
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
# Make sure that all the newly generate files are owned by Jenkins
chown -R jenkins .
as_jenkins .ci/scripts/setup-linux.sh cmake
popd
}

View File

@ -0,0 +1,46 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
COMMIT=$(get_pinned_commit halide)
test -n "$COMMIT"
# activate conda to populate CONDA_PREFIX
test -n "$ANACONDA_PYTHON_VERSION"
eval "$(conda shell.bash hook)"
conda activate py_$ANACONDA_PYTHON_VERSION
if [ -n "${UBUNTU_VERSION}" ];then
apt update
apt-get install -y lld liblld-15-dev libpng-dev libjpeg-dev libgl-dev \
libopenblas-dev libeigen3-dev libatlas-base-dev libzstd-dev
fi
conda_install numpy scipy imageio cmake ninja
git clone --depth 1 --branch release/16.x --recursive https://github.com/llvm/llvm-project.git
cmake -DCMAKE_BUILD_TYPE=Release \
-DLLVM_ENABLE_PROJECTS="clang" \
-DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
-DLLVM_ENABLE_TERMINFO=OFF -DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_EH=ON -DLLVM_ENABLE_RTTI=ON -DLLVM_BUILD_32_BITS=OFF \
-S llvm-project/llvm -B llvm-build -G Ninja
cmake --build llvm-build
cmake --install llvm-build --prefix llvm-install
export LLVM_ROOT=`pwd`/llvm-install
export LLVM_CONFIG=$LLVM_ROOT/bin/llvm-config
git clone https://github.com/halide/Halide.git
pushd Halide
git checkout ${COMMIT} && git submodule update --init --recursive
pip_install -r requirements.txt
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build
cmake --build build
test -e ${CONDA_PREFIX}/lib/python3 || ln -s python${ANACONDA_PYTHON_VERSION} ${CONDA_PREFIX}/lib/python3
cmake --install build --prefix ${CONDA_PREFIX}
chown -R jenkins ${CONDA_PREFIX}
popd
rm -rf Halide llvm-build llvm-project llvm-install
python -c "import halide" # check for errors

View File

@ -33,7 +33,9 @@ pip_install coloredlogs packaging
pip_install onnxruntime==1.18
pip_install onnx==1.16.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240523 --no-deps
pip_install onnxscript==0.1.0.dev20240613 --no-deps
# required by onnxscript
pip_install ml_dtypes
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

View File

@ -85,10 +85,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.9.0
mypy==1.10.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.9.0
#Pinned versions: 1.10.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -306,7 +306,7 @@ pywavelets==1.5.0 ; python_version >= "3.12"
#Pinned versions: 1.4.1
#test that import:
lxml==5.0.0.
lxml==5.0.0
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries

View File

@ -103,6 +103,14 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
ARG HALIDE
# Build and install halide
COPY ./common/install_halide.sh install_halide.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/halide.txt halide.txt
RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi
RUN rm install_halide.sh common_utils.sh halide.txt
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -155,6 +155,14 @@ COPY ci_commit_pins/executorch.txt executorch.txt
RUN if [ -n "${EXECUTORCH}" ]; then bash ./install_executorch.sh; fi
RUN rm install_executorch.sh common_utils.sh executorch.txt
ARG HALIDE
# Build and install halide
COPY ./common/install_halide.sh install_halide.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/halide.txt halide.txt
RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi
RUN rm install_halide.sh common_utils.sh halide.txt
ARG ONNX
# Install ONNX dependencies
COPY ./common/install_onnx.sh ./common/common_utils.sh ./

View File

@ -230,6 +230,10 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
export CMAKE_BUILD_TYPE=RelWithAssert
fi
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
@ -284,12 +288,26 @@ else
# Which should be backward compatible with Numpy-1.X
python -mpip install --pre numpy==2.0.0rc1
fi
WERROR=1 python setup.py bdist_wheel
WERROR=1 python setup.py clean
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 python setup.py bdist_wheel --cmake
else
WERROR=1 python setup.py bdist_wheel
fi
else
python setup.py clean
if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then
source .ci/pytorch/install_cache_xla.sh
fi
python setup.py bdist_wheel
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "USE_SPLIT_BUILD cannot be used with xla or rocm"
exit 1
else
python setup.py bdist_wheel
fi
fi
pip_install_whl "$(echo dist/*.whl)"
@ -328,9 +346,10 @@ else
CUSTOM_OP_TEST="$PWD/test/custom_operator"
python --version
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
mkdir -p "$CUSTOM_OP_BUILD"
pushd "$CUSTOM_OP_BUILD"
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -343,7 +362,7 @@ else
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
mkdir -p "$JIT_HOOK_BUILD"
pushd "$JIT_HOOK_BUILD"
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -355,7 +374,7 @@ else
python --version
mkdir -p "$CUSTOM_BACKEND_BUILD"
pushd "$CUSTOM_BACKEND_BUILD"
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd

View File

@ -56,9 +56,29 @@ function assert_git_not_dirty() {
function pip_install_whl() {
# This is used to install PyTorch and other build artifacts wheel locally
# without using any network connection
python3 -mpip install --no-index --no-deps "$@"
# Convert the input arguments into an array
local args=("$@")
# Check if the first argument contains multiple paths separated by spaces
if [[ "${args[0]}" == *" "* ]]; then
# Split the string by spaces into an array
IFS=' ' read -r -a paths <<< "${args[0]}"
# Loop through each path and install individually
for path in "${paths[@]}"; do
echo "Installing $path"
python3 -mpip install --no-index --no-deps "$path"
done
else
# Loop through each argument and install individually
for path in "${args[@]}"; do
echo "Installing $path"
python3 -mpip install --no-index --no-deps "$path"
done
fi
}
function pip_install() {
# retry 3 times
# old versions of pip don't have the "--progress-bar" flag
@ -188,28 +208,6 @@ function clone_pytorch_xla() {
fi
}
function checkout_install_torchdeploy() {
local commit
commit=$(get_pinned_commit multipy)
pushd ..
git clone --recurse-submodules https://github.com/pytorch/multipy.git
pushd multipy
git checkout "${commit}"
python multipy/runtime/example/generate_examples.py
BUILD_CUDA_TESTS=1 pip install -e .
popd
popd
}
function test_torch_deploy(){
pushd ..
pushd multipy
./multipy/runtime/build/test_deploy
./multipy/runtime/build/test_deploy_gpu
popd
popd
}
function checkout_install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
@ -224,6 +222,8 @@ function checkout_install_torchbench() {
# to install and test other models
python install.py --continue_on_fail
fi
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
}

View File

@ -18,8 +18,9 @@ time python test/run_test.py --verbose -i distributed/test_c10d_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_nccl
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_cuda_p2p
time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_symmetric_memory
time python test/run_test.py --verbose -i distributed/test_pg_wrapper
time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent
# FSDP tests

View File

@ -249,9 +249,7 @@ fi
# This tests that the debug asserts are working correctly.
if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
echo "We are in debug mode: $BUILD_ENVIRONMENT. Expect the python assertion to fail"
# TODO: Enable the check after we setup the build to run debug asserts without having
# to do a full (and slow) debug build
# (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")
(cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")
elif [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
# Noop when debug is disabled. Skip bazel jobs because torch isn't available there yet.
echo "We are not in debug mode: $BUILD_ENVIRONMENT. Expect the assertion to pass"
@ -264,18 +262,6 @@ elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then
export ATEN_CPU_CAPABILITY=avx2
fi
# temp workarounds for https://github.com/pytorch/pytorch/issues/126692, remove when fixed
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
pushd test
CUDA_VERSION=$(python -c "import torch; print(torch.version.cuda)")
if [ "$CUDA_VERSION" == "12.4" ]; then
ISCUDA124="cu124"
else
ISCUDA124=""
fi
popd
fi
test_python_legacy_jit() {
time python test/run_test.py --include test_jit_legacy test_jit_fuser_legacy --verbose
assert_git_not_dirty
@ -289,6 +275,9 @@ test_python_shard() {
# Bare --include flag is not supported and quoting for lint ends up with flag not being interpreted correctly
# shellcheck disable=SC2086
# modify LD_LIBRARY_PATH to ensure it has the conda env.
# This set of tests has been shown to be buggy without it for the split-build
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
@ -347,17 +336,31 @@ test_inductor_distributed() {
assert_git_not_dirty
}
test_inductor() {
python tools/dynamo/verify_dynamo.py
python test/run_test.py --inductor --include test_modules test_ops test_ops_gradients test_torch --verbose
# Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state
python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor --verbose
test_inductor_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
exit 1
fi
python tools/dynamo/verify_dynamo.py
python test/run_test.py --inductor \
--include test_modules test_ops test_ops_gradients test_torch \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
# Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state
python test/run_test.py \
--include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
}
test_inductor_aoti() {
# docker build uses bdist_wheel which does not work with test_aot_inductor
# TODO: need a faster way to build
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
fi
}
@ -376,7 +379,7 @@ test_inductor_cpp_wrapper_abi_compatible() {
--output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_timm_training.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -401,7 +404,7 @@ if [[ "${TEST_CONFIG}" == *dynamic* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--dynamic-shapes --dynamic-batch-only)
fi
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--device cpu)
else
DYNAMO_BENCHMARK_FLAGS+=(--device cuda)
@ -526,9 +529,10 @@ test_single_dynamo_benchmark() {
test_perf_for_dashboard "$suite" \
"${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"
else
if [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *aot_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then
# Test AOTInductor with the ABI-compatible mode on CI
# This can be removed once the ABI-compatible mode becomes default.
# For CPU device, we perfer non ABI-compatible mode on CI when testing AOTInductor.
export TORCHINDUCTOR_ABI_COMPATIBLE=1
fi
python "benchmarks/dynamo/$suite.py" \
@ -538,10 +542,10 @@ test_single_dynamo_benchmark() {
--output "$TEST_REPORTS_DIR/${name}_${suite}.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/${TEST_CONFIG}_${name}.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
python benchmarks/dynamo/check_graph_breaks.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/${TEST_CONFIG}_${name}.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
fi
}
@ -550,6 +554,11 @@ test_inductor_micro_benchmark() {
python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"
}
test_inductor_halide() {
python test/run_test.py --include inductor/test_halide.py --verbose
assert_git_not_dirty
}
test_dynamo_benchmark() {
# Usage: test_dynamo_benchmark huggingface 0
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -564,11 +573,15 @@ test_dynamo_benchmark() {
elif [[ "${TEST_CONFIG}" == *perf* ]]; then
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
else
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then
local dt="float32"
if [[ "${TEST_CONFIG}" == *amp* ]]; then
dt="amp"
fi
if [[ "${TEST_CONFIG}" == *freezing* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 --freezing "$@"
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --"$dt" --freezing "$@"
else
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --"$dt" "$@"
fi
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
@ -592,7 +605,7 @@ test_inductor_torchbench_smoketest_perf() {
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_torchbench_inference.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
@ -607,13 +620,8 @@ test_inductor_torchbench_smoketest_perf() {
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we switch to use some other model.
# Use 4.7 for cuda 12.4, change back to 4.9 after fixing https://github.com/pytorch/pytorch/issues/126692
if [ "$CUDA_VERSION" == "12.4" ]; then
THRESHOLD=4.7
else
THRESHOLD=4.9
fi
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t $THRESHOLD
# lowering threshold from 4.9 to 4.7 for cu124. Will bump it up after cuda 12.4.0->12.4.1 update
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.7
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
@ -632,7 +640,7 @@ test_inductor_torchbench_smoketest_perf() {
--only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_huggingface_training.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_huggingface_training.csv"
done
}
@ -1169,15 +1177,21 @@ test_executorch() {
pushd /executorch
# NB: We need to build ExecuTorch runner here and not inside the Docker image
# because it depends on PyTorch
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
# NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch
# from the PR
# shellcheck disable=SC1091
source .ci/scripts/utils.sh
build_executorch_runner "cmake"
source .ci/scripts/setup-linux.sh cmake
echo "Run ExecuTorch unit tests"
pytest -v -n auto
# shellcheck disable=SC1091
LLVM_PROFDATA=llvm-profdata-12 LLVM_COV=llvm-cov-12 bash test/run_oss_cpp_tests.sh
echo "Run ExecuTorch regression tests for some models"
# NB: This is a sample model, more can be added here
export PYTHON_EXECUTABLE=python
# TODO(huydhn): Add more coverage here using ExecuTorch's gather models script
# shellcheck disable=SC1091
source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''
@ -1237,11 +1251,10 @@ elif [[ "$TEST_CONFIG" == distributed ]]; then
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_rpc
fi
elif [[ "$TEST_CONFIG" == deploy ]]; then
checkout_install_torchdeploy
test_torch_deploy
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
test_inductor_halide
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
@ -1253,13 +1266,14 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then
install_torchaudio cpu
else
install_torchaudio cuda
fi
install_torchtext
install_torchvision
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install git+https://github.com/pytorch/ao.git
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
@ -1278,7 +1292,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
# nightlies that torchbench may pull in
if [[ "${TEST_CONFIG}" != *cpu_inductor* ]]; then
if [[ "${TEST_CONFIG}" != *cpu_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then
install_torchrec_and_fbgemm
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
@ -1286,17 +1300,19 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then
install_torchvision
test_inductor_cpp_wrapper_abi_compatible
elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
test_dynamo_shard 1
test_aten
elif [[ "${TEST_CONFIG}" == *dynamo* && $SHARD_NUMBER -gt 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_inductor_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_inductor_aoti
test_inductor_distributed
fi
elif [[ "${TEST_CONFIG}" == *dynamo* ]]; then
install_torchvision
test_dynamo_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_aten
fi
elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
install_torchvision
test_python_shard "$SHARD_NUMBER"

View File

@ -97,8 +97,16 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
)
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"
retry pip install -q numpy protobuf typing-extensions
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pkg_no_python="$(ls -1 /final_pkgs/torch_no_python* | sort |tail -1)"
pkg_torch="$(ls -1 /final_pkgs/torch-* | sort |tail -1)"
# todo: after folder is populated use the pypi_pkg channel instead
pip install "\$pkg_no_python" "\$pkg_torch" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}_pypi_pkg"
retry pip install -q numpy protobuf typing-extensions
else
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"
retry pip install -q numpy protobuf typing-extensions
fi
else
pip install "\$pkg"
retry pip install -q numpy protobuf typing-extensions
@ -110,6 +118,12 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then
cd /tmp/libtorch
fi
if [[ "$GPU_ARCH_TYPE" == xpu ]]; then
# Workaround for __mkl_tmp_MOD unbound variable issue, refer https://github.com/pytorch/pytorch/issues/130543
set +u
source /opt/intel/oneapi/pytorch-gpu-dev-0.5/oneapi-vars.sh
fi
# Test the package
/builder/check_binary.sh

View File

@ -33,9 +33,9 @@ if [[ -z "$DOCKER_IMAGE" ]]; then
if [[ "$PACKAGE_TYPE" == conda ]]; then
export DOCKER_IMAGE="pytorch/conda-cuda"
elif [[ "$DESIRED_CUDA" == cpu ]]; then
export DOCKER_IMAGE="pytorch/manylinux-cpu"
export DOCKER_IMAGE="pytorch/manylinux:cpu"
else
export DOCKER_IMAGE="pytorch/manylinux-cuda${DESIRED_CUDA:2}"
export DOCKER_IMAGE="pytorch/manylinux-builder:${DESIRED_CUDA:2}"
fi
fi
@ -75,9 +75,9 @@ export PYTORCH_BUILD_NUMBER=1
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
# Triton wheels are only supported on Linux with Python < 3.13
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
@ -87,11 +87,11 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:
fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton rocm package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" && "$DESIRED_PYTHON" != "3.12" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}"
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
@ -100,30 +100,18 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B
fi
fi
JAVA_HOME=
BUILD_JNI=OFF
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
POSSIBLE_JAVA_HOMES=()
POSSIBLE_JAVA_HOMES+=(/usr/local)
POSSIBLE_JAVA_HOMES+=(/usr/lib/jvm/java-8-openjdk-amd64)
POSSIBLE_JAVA_HOMES+=(/Library/Java/JavaVirtualMachines/*.jdk/Contents/Home)
# Add the Windows-specific JNI path
POSSIBLE_JAVA_HOMES+=("$PWD/pytorch/.circleci/windows-jni/")
for JH in "${POSSIBLE_JAVA_HOMES[@]}" ; do
if [[ -e "$JH/include/jni.h" ]] ; then
# Skip if we're not on Windows but haven't found a JAVA_HOME
if [[ "$JH" == "$PWD/pytorch/.circleci/windows-jni/" && "$OSTYPE" != "msys" ]] ; then
break
fi
echo "Found jni.h under $JH"
JAVA_HOME="$JH"
BUILD_JNI=ON
break
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
else
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"
fi
done
if [ -z "$JAVA_HOME" ]; then
echo "Did not find jni.h"
fi
fi
cat >"$envfile" <<EOL
@ -136,6 +124,7 @@ export DESIRED_PYTHON="${DESIRED_PYTHON:-}"
export DESIRED_CUDA="$DESIRED_CUDA"
export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"
export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"
export USE_SPLIT_BUILD="${USE_SPLIT_BUILD:-}"
if [[ "${OSTYPE}" == "msys" ]]; then
export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"
if [[ "${LIBTORCH_CONFIG:-}" == 'debug' ]]; then
@ -159,8 +148,6 @@ export TORCH_CONDA_BUILD_FOLDER='pytorch-nightly'
export ANACONDA_USER='pytorch'
export USE_FBGEMM=1
export JAVA_HOME=$JAVA_HOME
export BUILD_JNI=$BUILD_JNI
export PIP_UPLOAD_FOLDER="$PIP_UPLOAD_FOLDER"
export DOCKER_IMAGE="$DOCKER_IMAGE"
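The requirement strings above follow one pattern: pick the per-device Triton package, append +<shorthash> for dev builds, and attach the Linux/x86_64/Python<3.13 constraint (currently on the ROCm requirement but not the XPU one). A small Python sketch of the ROCm/XPU cases shown in the hunks; the helper itself is illustrative:

def triton_requirement(device: str, version: str, shorthash: str = "") -> str:
    constraint = (
        "platform_system == 'Linux' and platform_machine == 'x86_64' "
        "and python_version < '3.13'"
    )
    pkg = "pytorch-triton-rocm" if device == "rocm" else "pytorch-triton-xpu"
    pinned = f"{version}+{shorthash}" if shorthash else version
    req = f"{pkg}=={pinned}"
    if device == "rocm":  # the ROCm requirement carries the platform constraint
        req += f"; {constraint}"
    return req            # the XPU requirement is currently unconstrained

# e.g. triton_requirement("rocm", "3.0.0", shorthash="abcdef1234")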

View File

@ -25,6 +25,15 @@ if [[ "${DRY_RUN}" = "disabled" ]]; then
AWS_S3_CP="aws s3 cp"
fi
if [[ "${USE_SPLIT_BUILD:-false}" == "true" ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_pkg"
fi
# This is a special build with all dependencies packaged
if [[ ${BUILD_NAME} == *-full* ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"
fi
# Sleep 5 minutes between retries for conda upload
retry () {
"$@" || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@")

View File

@ -40,3 +40,7 @@ e6ec0efaf87703c5f889cfc20b29be455885d58d
a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
# 2024-01-02 clangformat: fused adam #116583
9dc68d1aa9e554d09344a10fff69f7b50b2d23a0
# 2024-06-28 enable UFMT in `torch/storage.py`
d80939e5e9337e8078f11489afefec59fd42f93b
# 2024-06-28 enable UFMT in `torch.utils.data`
7cf0b90e49689d45be91aa539fdf54cf2ea8a9a3

View File

@ -47,3 +47,5 @@ self-hosted-runner:
- macos-latest-xlarge
- macos-13-xlarge
- macos-14-xlarge
# Organization-wide Intel hosted XPU runners
- linux.idc.xpu

View File

@ -14,12 +14,14 @@ runs:
- name: Cleans up diskspace
shell: bash
run: |
set -ex
diskspace_cutoff=${{ inputs.diskspace-cutoff }}
diskspace=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')
docker_root_dir=$(docker info -f '{{.DockerRootDir}}')
diskspace=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')
msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
if [[ "$diskspace" -ge "$diskspace_cutoff" ]] ; then
docker system prune -af
diskspace_new=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')
diskspace_new=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')
if [[ "$diskspace_new" -gt "$diskspace_cutoff" ]] ; then
echo "Error: Available diskspace is less than $diskspace_cutoff percent. Not enough diskspace."
echo "$msg"

View File

@ -52,6 +52,13 @@ inputs:
description: Hugging Face Hub token
required: false
default: ""
use_split_build:
description: |
[Experimental] Build a libtorch-only wheel and build PyTorch such that
the PyTorch wheel is built from the libtorch wheel.
required: false
type: boolean
default: false
outputs:
docker-image:
value: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -144,6 +151,7 @@ runs:
DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
shell: bash
run: |
# detached container should get cleaned up by teardown_ec2_linux
@ -163,6 +171,7 @@ runs:
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
-e USE_SPLIT_BUILD \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
@ -183,7 +192,7 @@ runs:
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped' && inputs.use_split_build != 'true'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
@ -191,6 +200,16 @@ runs:
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Store PyTorch Build Artifacts on S3 for split build
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped' && inputs.use_split_build == 'true'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
uses: seemethere/upload-artifact-s3@v5

View File

@ -26,6 +26,7 @@ runs:
-e PYTORCH_FINAL_PACKAGE_DIR \
-e PYTORCH_ROOT \
-e SKIP_ALL_TESTS \
-e USE_SPLIT_BUILD \
--tty \
--detach \
-v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \
@ -35,7 +36,8 @@ runs:
"${DOCKER_IMAGE}"
)
if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" ]]; then
echo "CONTAINER_NAME=${container_name}" >> "$GITHUB_ENV"
if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" && "${GPU_ARCH_TYPE}" != "xpu" ]]; then
# Propagate the download.pytorch.org IP to the container. This is only needed on Linux non-aarch64 runners
grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" bash -c "/bin/cat >> /etc/hosts"
fi
@ -46,10 +48,9 @@ runs:
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh"
- name: Cleanup docker
if: always() && env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel'
if: always() && (env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel' || env.GPU_ARCH_TYPE == 'xpu')
shell: bash
run: |
# on s390x stop the container for clean worker stop
# ignore expansion of "docker ps -q" since it could be empty
# on s390x or xpu stop the container for clean worker stop
# shellcheck disable=SC2046
docker stop $(docker ps -q) || true
docker stop "${{ env.CONTAINER_NAME }}" || true

View File

@ -1 +1 @@
b829e936f7cc61b48149f5f957a451a38bf2a178
69b2a0adc2ec03ab99990d7e8be3d4510438c148

View File

@ -1 +1 @@
d6015d42d9a1834bc7595c4bd6852562fb80b30b
23512dbebd44a11eb84afbf53c3c071dd105297e

View File

@ -27,11 +27,9 @@
- third_party/onnx
- caffe2/python/onnx/**
approved_by:
- BowenBao
- justinchuby
- liqunfu
- shubhambhokare1
- thiagocrepaldi
- titaiwangms
- wschin
- xadupre
@ -244,6 +242,7 @@
- torch/csrc/xpu/**
- torch/xpu/**
- test/xpu/**
- test/test_xpu.py
- third_party/xpu.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
approved_by:
@ -287,6 +286,7 @@
- test/cpp/dist_autograd/**
- test/cpp/rpc/**
approved_by:
- wconstab
- mrshenli
- pritamdamania87
- zhaojuanmao
@ -313,6 +313,25 @@
- Lint
- pull
- name: DCP
patterns:
- torch/distributed/checkpoint/**
approved_by:
- LucasLLC
- fegin
- wz337
- saumishr
- daulet-askarov
- pradeepdfb
- kirtiteja
- mhorowitz
- saiteja64
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: IDEEP
patterns:
- third_party/ideep
@ -376,13 +395,21 @@
- name: CPU inductor
patterns:
- torch/_inductor/mkldnn_ir.py
- torch/_inductor/mkldnn_lowerings.py
- torch/_inductor/fx_passes/mkldnn_fusion.py
- torch/_inductor/fx_passes/quantization.py
- torch/_inductor/codegen/cpp_prefix.h
- torch/_inductor/codegen/cpp.py
- torch/_inductor/codegen/cpp_utils.py
- torch/_inductor/codegen/cpp_micro_gemm.py
- torch/_inductor/codegen/cpp_template_kernel.py
- torch/_inductor/codegen/cpp_template.py
- torch/_inductor/codegen/cpp_gemm_template.py
- test/inductor/test_mkldnn_pattern_matcher.py
- test/inductor/test_cpu_repo.py
- test/inductor/test_cpu_repro.py
- test/inductor/test_cpu_cpp_wrapper.py
- test/inductor/test_cpu_select_algorithm.py
- aten/src/ATen/cpu/**
- aten/src/ATen/native/quantized/cpu/**
- test/quantization/core/test_quantized_op.py

View File

@ -26,3 +26,4 @@ retryable_workflows:
- windows-binary
labeler_config: labeler.yml
label_to_label_config: label_to_label.yml
mergebot: True

View File

@ -93,6 +93,8 @@ done
# Copy Include Files
cp -r $ROCM_HOME/include/hip $TRITON_ROCM_DIR/include
cp -r $ROCM_HOME/include/roctracer $TRITON_ROCM_DIR/include
cp -r $ROCM_HOME/include/hsa $TRITON_ROCM_DIR/include
# Copy linker
mkdir -p $TRITON_ROCM_DIR/llvm/bin

View File

@ -11,8 +11,12 @@ SCRIPT_DIR = Path(__file__).parent
REPO_DIR = SCRIPT_DIR.parent.parent
def read_triton_pin(rocm_hash: bool = False) -> str:
triton_file = "triton.txt" if not rocm_hash else "triton-rocm.txt"
def read_triton_pin(device: str = "cuda") -> str:
triton_file = "triton.txt"
if device == "rocm":
triton_file = "triton-rocm.txt"
elif device == "xpu":
triton_file = "triton-xpu.txt"
with open(REPO_DIR / ".ci" / "docker" / "ci_commit_pins" / triton_file) as f:
return f.read().strip()
@ -49,7 +53,7 @@ def build_triton(
version: str,
commit_hash: str,
build_conda: bool = False,
build_rocm: bool = False,
device: str = "cuda",
py_version: Optional[str] = None,
release: bool = False,
) -> Path:
@ -69,11 +73,14 @@ def build_triton(
triton_basedir = Path(tmpdir) / "triton"
triton_pythondir = triton_basedir / "python"
triton_repo = "https://github.com/openai/triton"
if build_rocm:
if device == "rocm":
triton_pkg_name = "pytorch-triton-rocm"
elif device == "xpu":
triton_pkg_name = "pytorch-triton-xpu"
triton_repo = "https://github.com/intel/intel-xpu-backend-for-triton"
else:
triton_pkg_name = "pytorch-triton"
check_call(["git", "clone", triton_repo], cwd=tmpdir)
check_call(["git", "clone", triton_repo, "triton"], cwd=tmpdir)
if release:
ver, rev, patch = version.split(".")
check_call(
@ -140,7 +147,7 @@ def build_triton(
expected_version=None,
)
if build_rocm:
if device == "rocm":
check_call(
[f"{SCRIPT_DIR}/amd/package_triton_wheel.sh"],
cwd=triton_basedir,
@ -155,7 +162,7 @@ def build_triton(
whl_path = next(iter((triton_pythondir / "dist").glob("*.whl")))
shutil.copy(whl_path, Path.cwd())
if build_rocm:
if device == "rocm":
check_call(
[f"{SCRIPT_DIR}/amd/patch_triton_wheel.sh", Path.cwd()],
cwd=triton_basedir,
@ -170,17 +177,19 @@ def main() -> None:
parser = ArgumentParser("Build Triton binaries")
parser.add_argument("--release", action="store_true")
parser.add_argument("--build-conda", action="store_true")
parser.add_argument("--build-rocm", action="store_true")
parser.add_argument(
"--device", type=str, default="cuda", choices=["cuda", "rocm", "xpu"]
)
parser.add_argument("--py-version", type=str)
parser.add_argument("--commit-hash", type=str)
parser.add_argument("--triton-version", type=str, default=read_triton_version())
args = parser.parse_args()
build_triton(
build_rocm=args.build_rocm,
device=args.device,
commit_hash=args.commit_hash
if args.commit_hash
else read_triton_pin(args.build_rocm),
else read_triton_pin(args.device),
version=args.triton_version,
build_conda=args.build_conda,
py_version=args.py_version,
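The --build-rocm flag above becomes a --device choice that selects both the commit-pin file and the wheel package name (and, for XPU, a different Triton fork). A hedged sketch of that mapping; the helper is illustrative and does not read the pin files:

def triton_pin_file_and_package(device: str) -> tuple[str, str]:
    if device == "rocm":
        return "triton-rocm.txt", "pytorch-triton-rocm"
    if device == "xpu":
        # the XPU wheel is built from intel/intel-xpu-backend-for-triton
        return "triton-xpu.txt", "pytorch-triton-xpu"
    return "triton.txt", "pytorch-triton"

assert triton_pin_file_and_package("xpu") == ("triton-xpu.txt", "pytorch-triton-xpu")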

View File

@ -3,11 +3,11 @@
import json
import os
import re
from typing import Any, Optional
from typing import Any, cast, Dict, List, Optional
from urllib.error import HTTPError
from github_utils import gh_fetch_url, gh_post_pr_comment
from github_utils import gh_fetch_url, gh_post_pr_comment, gh_query_issues_by_labels
from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo
from trymerge import get_pr_commit_sha, GitHubPR
@ -19,6 +19,7 @@ REQUIRES_ISSUE = {
"critical",
"fixnewfeature",
}
RELEASE_BRANCH_REGEX = re.compile(r"release/(?P<version>.+)")
def parse_args() -> Any:
@ -58,6 +59,33 @@ def get_merge_commit_sha(repo: GitRepo, pr: GitHubPR) -> Optional[str]:
return commit_sha if pr.is_closed() else None
def get_release_version(onto_branch: str) -> Optional[str]:
"""
Return the release version if the target branch is a release branch
"""
m = re.match(RELEASE_BRANCH_REGEX, onto_branch)
return m.group("version") if m else ""
def get_tracker_issues(
org: str, project: str, onto_branch: str
) -> List[Dict[str, Any]]:
"""
Find the tracker issue from the repo. The tracker issue needs to have the title
like [VERSION] Release Tracker following the convention on PyTorch
"""
version = get_release_version(onto_branch)
if not version:
return []
tracker_issues = gh_query_issues_by_labels(org, project, labels=["release tracker"])
if not tracker_issues:
return []
# Figure out the tracker issue from the list by looking at the title
return [issue for issue in tracker_issues if version in issue.get("title", "")]
def cherry_pick(
github_actor: str,
repo: GitRepo,
@ -77,17 +105,49 @@ def cherry_pick(
)
try:
org, project = repo.gh_owner_and_name()
cherry_pick_pr = ""
if not dry_run:
org, project = repo.gh_owner_and_name()
cherry_pick_pr = submit_pr(repo, pr, cherry_pick_branch, onto_branch)
msg = f"The cherry pick PR is at {cherry_pick_pr}"
if fixes:
msg += f" and it is linked with issue {fixes}"
elif classification in REQUIRES_ISSUE:
msg += f" and it is recommended to link a {classification} cherry pick PR with an issue"
tracker_issues_comments = []
tracker_issues = get_tracker_issues(org, project, onto_branch)
for issue in tracker_issues:
issue_number = int(str(issue.get("number", "0")))
if not issue_number:
continue
post_comment(org, project, pr.pr_num, msg)
res = cast(
Dict[str, Any],
post_tracker_issue_comment(
org,
project,
issue_number,
pr.pr_num,
cherry_pick_pr,
classification,
fixes,
dry_run,
),
)
comment_url = res.get("html_url", "")
if comment_url:
tracker_issues_comments.append(comment_url)
msg = f"The cherry pick PR is at {cherry_pick_pr}"
if fixes:
msg += f" and it is linked with issue {fixes}."
elif classification in REQUIRES_ISSUE:
msg += f" and it is recommended to link a {classification} cherry pick PR with an issue."
if tracker_issues_comments:
msg += " The following tracker issues are updated:\n"
for tracker_issues_comment in tracker_issues_comments:
msg += f"* {tracker_issues_comment}\n"
post_pr_comment(org, project, pr.pr_num, msg, dry_run)
finally:
if current_branch:
@ -159,7 +219,9 @@ def submit_pr(
raise RuntimeError(msg) from error
def post_comment(org: str, project: str, pr_num: int, msg: str) -> None:
def post_pr_comment(
org: str, project: str, pr_num: int, msg: str, dry_run: bool = False
) -> List[Dict[str, Any]]:
"""
Post a comment on the PR itself to point to the cherry picking PR when success
or print the error when failure
@ -182,7 +244,35 @@ def post_comment(org: str, project: str, pr_num: int, msg: str) -> None:
comment = "\n".join(
(f"### Cherry picking #{pr_num}", f"{msg}", "", f"{internal_debugging}")
)
gh_post_pr_comment(org, project, pr_num, comment)
return gh_post_pr_comment(org, project, pr_num, comment, dry_run)
def post_tracker_issue_comment(
org: str,
project: str,
issue_num: int,
pr_num: int,
cherry_pick_pr: str,
classification: str,
fixes: str,
dry_run: bool = False,
) -> List[Dict[str, Any]]:
"""
Post a comment on the tracker issue (if any) to record the cherry pick
"""
comment = "\n".join(
(
"Link to landed trunk PR (if applicable):",
f"* https://github.com/{org}/{project}/pull/{pr_num}",
"",
"Link to release branch PR:",
f"* {cherry_pick_pr}",
"",
"Criteria Category:",
" - ".join((classification.capitalize(), fixes.capitalize())),
)
)
return gh_post_pr_comment(org, project, issue_num, comment, dry_run)
def main() -> None:
@ -214,7 +304,7 @@ def main() -> None:
except RuntimeError as error:
if not args.dry_run:
post_comment(org, project, pr_num, str(error))
post_pr_comment(org, project, pr_num, str(error))
else:
raise error
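The new tracker-issue flow above extracts the release version from the target branch and keeps only the open "release tracker" issues whose title mentions that version. A small, self-contained Python sketch with made-up issue data:

import re

RELEASE_BRANCH_REGEX = re.compile(r"release/(?P<version>.+)")

def get_release_version(onto_branch: str) -> str:
    m = re.match(RELEASE_BRANCH_REGEX, onto_branch)
    return m.group("version") if m else ""

issues = [{"number": 1, "title": "[2.4] Release Tracker"},
          {"number": 2, "title": "[2.3] Release Tracker"}]
version = get_release_version("release/2.4")
matching = [i for i in issues if version and version in i.get("title", "")]
assert [i["number"] for i in matching] == [1]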

Binary file not shown.

View File

@ -8,6 +8,7 @@ architectures:
* CPU
* Latest CUDA
* Latest ROCM
* Latest XPU
"""
import os
@ -24,6 +25,7 @@ CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}
ROCM_ARCHES = ["6.0", "6.1"]
XPU_ARCHES = ["xpu"]
CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]
@ -48,7 +50,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.1": (
@ -61,7 +63,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.4": (
@ -74,7 +76,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
@ -132,6 +134,8 @@ def arch_type(arch_version: str) -> str:
return "cuda"
elif arch_version in ROCM_ARCHES:
return "rocm"
elif arch_version in XPU_ARCHES:
return "xpu"
elif arch_version in CPU_CXX11_ABI_ARCH:
return "cpu-cxx11-abi"
elif arch_version in CPU_AARCH64_ARCH:
@ -156,6 +160,7 @@ WHEEL_CONTAINER_IMAGES = {
gpu_arch: f"pytorch/manylinux-builder:rocm{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in ROCM_ARCHES
},
"xpu": f"pytorch/manylinux2_28-builder:xpu-{DEFAULT_TAG}",
"cpu": f"pytorch/manylinux-builder:cpu-{DEFAULT_TAG}",
"cpu-cxx11-abi": f"pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-{DEFAULT_TAG}",
"cpu-aarch64": f"pytorch/manylinuxaarch64-builder:cpu-aarch64-{DEFAULT_TAG}",
@ -221,6 +226,7 @@ def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
"cuda": f"cu{gpu_arch_version.replace('.', '')}",
"cuda-aarch64": "cu124",
"rocm": f"rocm{gpu_arch_version}",
"xpu": "xpu",
}.get(gpu_arch_type, gpu_arch_version)
@ -325,13 +331,13 @@ def generate_wheels_matrix(
package_type = "manywheel"
if python_versions is None:
python_versions = FULL_PYTHON_VERSIONS
python_versions = FULL_PYTHON_VERSIONS + ["3.13"]
if arches is None:
# Define default compute architectures
arches = ["cpu"]
if os == "linux":
arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES
arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES
elif os == "windows":
arches += CUDA_ARCHES
elif os == "linux-aarch64":
@ -347,10 +353,6 @@ def generate_wheels_matrix(
for python_version in python_versions:
for arch_version in arches:
gpu_arch_type = arch_type(arch_version)
# Disable py3.12 builds for ROCm because of triton dependency
# on llnl-hatchet, which doesn't have py3.12 wheels available
if gpu_arch_type == "rocm" and python_version == "3.12":
continue
gpu_arch_version = (
""
if arch_version == "cpu"
@ -358,9 +360,16 @@ def generate_wheels_matrix(
or arch_version == "cpu-aarch64"
or arch_version == "cpu-s390x"
or arch_version == "cuda-aarch64"
or arch_version == "xpu"
else arch_version
)
# TODO: Enable python 3.13 on rocm, xpu, aarch64, windows
if (
gpu_arch_type in ["rocm", "xpu"] or os != "linux"
) and python_version == "3.13":
continue
# 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if (
arch_version in ["12.4", "12.1", "11.8"]
@ -390,6 +399,49 @@ def generate_wheels_matrix(
),
}
)
if arch_version != "cuda-aarch64":
ret.append(
{
"python_version": python_version,
"gpu_arch_type": gpu_arch_type,
"gpu_arch_version": gpu_arch_version,
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True",
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version] # fmt: skip
if os != "linux-aarch64"
else ""
),
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-split".replace( # noqa: B950
".", "_"
),
}
)
# Special build for use on Colab: Python 3.10 with CUDA 12.1
if python_version == "3.10" and arch_version == "12.1":
ret.append(
{
"python_version": python_version,
"gpu_arch_type": gpu_arch_type,
"gpu_arch_version": gpu_arch_version,
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "False",
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": "",
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace( # noqa: B950
".", "_"
),
}
)
else:
ret.append(
{
@ -400,7 +452,9 @@ def generate_wheels_matrix(
gpu_arch_type, gpu_arch_version
),
"devtoolset": (
"cxx11-abi" if arch_version == "cpu-cxx11-abi" else ""
"cxx11-abi"
if arch_version in ["cpu-cxx11-abi", "xpu"]
else ""
),
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,

View File

@ -1,99 +0,0 @@
import json
from argparse import ArgumentParser
from typing import Any
from github import Auth, Github
from github.Issue import Issue
WORKFLOW_TYPE_LABEL = "label"
WORKFLOW_TYPE_RG = "rg"
WORKFLOW_TYPE_BOTH = "both"
def parse_args() -> Any:
parser = ArgumentParser("Get dynamic rollout settings")
parser.add_argument("--github-token", type=str, required=True, help="GitHub token")
parser.add_argument(
"--github-repo",
type=str,
required=False,
default="pytorch/test-infra",
help="GitHub repo to get the issue",
)
parser.add_argument(
"--github-issue", type=int, required=True, help="GitHub issue umber"
)
parser.add_argument(
"--github-user", type=str, required=True, help="GitHub username"
)
parser.add_argument(
"--github-branch", type=str, required=True, help="Current GitHub branch"
)
return parser.parse_args()
def get_gh_client(github_token: str) -> Github:
auth = Auth.Token(github_token)
return Github(auth=auth)
def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:
repo = gh.get_repo(repo)
return repo.get_issue(number=issue_num)
def is_exception_branch(branch: str) -> bool:
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
def get_workflow_type(issue: Issue, username: str) -> str:
user_list = issue.get_comments()[0].body.split("\r\n")
try:
run_option = issue.get_comments()[1].body.split("\r\n")[0]
except Exception as e:
run_option = "single"
if user_list[0] == "!":
# Use old runners for everyone
return WORKFLOW_TYPE_LABEL
elif user_list[1] == "*":
if run_option == WORKFLOW_TYPE_BOTH:
# Use ARC runners and old runners for everyone
return WORKFLOW_TYPE_BOTH
else:
# Use only ARC runners for everyone
return WORKFLOW_TYPE_RG
elif username in user_list:
if run_option == WORKFLOW_TYPE_BOTH:
# Use ARC runners and old runners for a specific user
return WORKFLOW_TYPE_BOTH
else:
# Use only ARC runners for a specific user
return WORKFLOW_TYPE_RG
else:
# Use old runners by default
return WORKFLOW_TYPE_LABEL
def main() -> None:
args = parse_args()
if is_exception_branch(args.github_branch):
output = {"workflow_type": WORKFLOW_TYPE_LABEL}
else:
try:
gh = get_gh_client(args.github_token)
issue = get_issue(gh, args.github_repo, args.github_issue)
output = {"workflow_type": get_workflow_type(issue, args.github_user)}
except Exception as e:
output = {"workflow_type": WORKFLOW_TYPE_LABEL}
json_output = json.dumps(output)
print(json_output)
if __name__ == "__main__":
main()

View File

@ -202,3 +202,12 @@ def gh_update_pr_state(org: str, repo: str, pr_num: int, state: str = "open") ->
)
else:
raise
def gh_query_issues_by_labels(
org: str, repo: str, labels: List[str], state: str = "open"
) -> List[Dict[str, Any]]:
url = f"{GITHUB_API_URL}/repos/{org}/{repo}/issues"
return gh_fetch_json(
url, method="GET", params={"labels": ",".join(labels), "state": state}
)
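The new helper above issues a plain GitHub REST query with comma-joined labels. A hedged stand-alone sketch of the same request using urllib instead of the repo's gh_fetch_json wrapper (unauthenticated, so subject to rate limits):

import json
import urllib.parse
import urllib.request

def query_issues_by_labels(org: str, repo: str, labels: list[str], state: str = "open") -> list:
    params = urllib.parse.urlencode({"labels": ",".join(labels), "state": state})
    url = f"https://api.github.com/repos/{org}/{repo}/issues?{params}"
    with urllib.request.urlopen(url) as resp:   # GET /repos/{org}/{repo}/issues
        return json.loads(resp.read().decode())

# e.g. query_issues_by_labels("pytorch", "pytorch", ["release tracker"])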

Binary file not shown.

View File

@ -29,6 +29,7 @@ python3 -m tools.pyi.gen_pyi \
--native-functions-path aten/src/ATen/native/native_functions.yaml \
--tags-path aten/src/ATen/native/tags.yaml \
--deprecated-functions-path "tools/autograd/deprecated.yaml"
python3 torch/utils/data/datapipes/gen_pyi.py
RC=0
# Run lintrunner on all files

.github/scripts/runner_determinator.py (vendored, new file, 210 lines)
View File

@ -0,0 +1,210 @@
# flake8: noqa: G004
import logging
import os
from argparse import ArgumentParser
from logging import LogRecord
from typing import Any, Iterable
from github import Auth, Github
from github.Issue import Issue
WORKFLOW_LABEL_META = "" # use meta runners
WORKFLOW_LABEL_LF = "lf." # use runners from the linux foundation
GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")
GH_OUTPUT_KEY_LABEL_TYPE = "label-type"
class ColorFormatter(logging.Formatter):
"""Color codes the log messages based on the log level"""
COLORS = {
"WARNING": "\033[33m", # Yellow
"ERROR": "\033[31m", # Red
"CRITICAL": "\033[31m", # Red
"INFO": "\033[0m", # Reset
"DEBUG": "\033[0m", # Reset
}
def format(self, record: LogRecord) -> str:
log_color = self.COLORS.get(record.levelname, "\033[0m") # Default to reset
record.msg = f"{log_color}{record.msg}\033[0m"
return super().format(record)
handler = logging.StreamHandler()
handler.setFormatter(ColorFormatter(fmt="%(levelname)-8s: %(message)s"))
log = logging.getLogger(os.path.basename(__file__))
log.addHandler(handler)
log.setLevel(logging.INFO)
def set_github_output(key: str, value: str) -> None:
"""
Defines outputs of the github action that invokes this script
"""
if not GITHUB_OUTPUT:
# See https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/ for deprecation notice
log.warning(
"No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method."
)
print(f"::set-output name={key}::{value}")
return
with open(GITHUB_OUTPUT, "a") as f:
log.info(f"Setting output: {key}='{value}'")
f.write(f"{key}={value}\n")
def parse_args() -> Any:
parser = ArgumentParser("Get dynamic rollout settings")
parser.add_argument("--github-token", type=str, required=True, help="GitHub token")
parser.add_argument(
"--github-issue-repo",
type=str,
required=False,
default="pytorch/test-infra",
help="GitHub repo to get the issue",
)
parser.add_argument(
"--github-repo",
type=str,
required=True,
help="GitHub repo where CI is running",
)
parser.add_argument(
"--github-issue", type=int, required=True, help="GitHub issue number"
)
parser.add_argument(
"--github-actor", type=str, required=True, help="GitHub triggering_actor"
)
parser.add_argument(
"--github-issue-owner", type=str, required=True, help="GitHub issue owner"
)
parser.add_argument(
"--github-branch", type=str, required=True, help="Current GitHub branch or tag"
)
parser.add_argument(
"--github-ref-type",
type=str,
required=True,
help="Current GitHub ref type, branch or tag",
)
return parser.parse_args()
def get_gh_client(github_token: str) -> Github:
auth = Auth.Token(github_token)
return Github(auth=auth)
def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:
repo = gh.get_repo(repo)
return repo.get_issue(number=issue_num)
def get_potential_pr_author(
gh: Github, repo: str, username: str, ref_type: str, ref_name: str
) -> str:
# If the trigger was a new tag added by a bot, this is a ciflow case
# Fetch the actual username from the original PR. The PR number is
# embedded in the tag name: ciflow/<name>/<pr-number>
if username == "pytorch-bot[bot]" and ref_type == "tag":
split_tag = ref_name.split("/")
if (
len(split_tag) == 3
and split_tag[0] == "ciflow"
and split_tag[2].isnumeric()
):
pr_number = split_tag[2]
try:
repository = gh.get_repo(repo)
pull = repository.get_pull(number=int(pr_number))
except Exception as e:
raise Exception( # noqa: TRY002
f"issue with pull request {pr_number} from repo {repository}"
) from e
return pull.user.login
# In all other cases, return the original input username
return username
def is_exception_branch(branch: str) -> bool:
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
def get_workflow_type(issue: Issue, workflow_requestors: Iterable[str]) -> str:
try:
first_comment = issue.get_comments()[0].body.strip("\n\t ")
if first_comment[0] == "!":
log.info("LF Workflows are disabled for everyone. Using meta runners.")
return WORKFLOW_LABEL_META
elif first_comment[0] == "*":
log.info("LF Workflows are enabled for everyone. Using LF runners.")
return WORKFLOW_LABEL_LF
else:
all_opted_in_users = {
usr_raw.strip("\n\t@ ") for usr_raw in first_comment.split()
}
opted_in_requestors = {
usr for usr in workflow_requestors if usr in all_opted_in_users
}
if opted_in_requestors:
log.info(
f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."
)
return WORKFLOW_LABEL_LF
else:
log.info(
f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."
)
return WORKFLOW_LABEL_META
except Exception as e:
log.error(
f"Failed to get determine workflow type. Falling back to meta runners. Exception: {e}"
)
return WORKFLOW_LABEL_META
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(f"Exception branch: '{args.github_branch}', using meta runners")
label_type = WORKFLOW_LABEL_META
else:
try:
gh = get_gh_client(args.github_token)
# The default issue we use - https://github.com/pytorch/test-infra/issues/5132
issue = get_issue(gh, args.github_issue_repo, args.github_issue)
username = get_potential_pr_author(
gh,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
label_type = get_workflow_type(
issue,
(
args.github_issue_owner,
username,
),
)
except Exception as e:
log.error(
f"Failed to get issue. Falling back to meta runners. Exception: {e}"
)
label_type = WORKFLOW_LABEL_META
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)
if __name__ == "__main__":
main()
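The core of get_workflow_type above is the opt-in comment format: a leading "!" disables LF runners for everyone, a leading "*" enables them for everyone, and otherwise the comment is treated as a whitespace-separated list of opted-in usernames. A compact sketch of just that parsing, with made-up comment text:

WORKFLOW_LABEL_META = ""    # meta runners
WORKFLOW_LABEL_LF = "lf."   # Linux Foundation runners

def label_for(first_comment: str, requestors: list[str]) -> str:
    text = first_comment.strip("\n\t ")
    if text[0] == "!":
        return WORKFLOW_LABEL_META
    if text[0] == "*":
        return WORKFLOW_LABEL_LF
    opted_in = {u.strip("\n\t@ ") for u in text.split()}
    return WORKFLOW_LABEL_LF if any(r in opted_in for r in requestors) else WORKFLOW_LABEL_META

assert label_for("@alice @bob", ["bob"]) == "lf."
assert label_for("!", ["bob"]) == ""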

View File

@ -2,7 +2,7 @@
set -eoux pipefail
SYNC_BRANCH=fbcode/pytorch-stable-prototype
SYNC_BRANCH=pytorch-stable-prototype
git config user.email "fake@example.com"
git config user.name "PyTorch Stable Bot"
@ -11,7 +11,9 @@ git fetch origin main
git fetch origin "$SYNC_BRANCH"
git checkout "$SYNC_BRANCH"
for SHA in $(git log 4333e122d4b74cdf84351ed2907045c6a767b4cd..origin/main --pretty="%h" --reverse -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
# Using a hardcoded SHA here is a massive speedup as we can skip the entire history of the pytorch GitHub repo.
# This specific SHA was chosen as it was before the "branch point" of the stable branch
for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" --reverse -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
do
# `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise
if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]
@ -20,7 +22,12 @@ do
continue
fi
echo "Copying $SHA"
git cherry-pick -x "$SHA"
git cherry-pick -x "$SHA" -X theirs
git reset --soft HEAD~1
git add torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed
git checkout .
git commit --reuse-message=HEAD@{1}
git clean -f
done
if [[ "${WITH_PUSH}" == true ]]; then

View File

@ -41,7 +41,7 @@ def main() -> None:
)
options = parser.parse_args()
tagged_images: Dict[str, bool] = dict()
tagged_images: Dict[str, bool] = {}
platform_images = [
generate_binary_build_matrix.WHEEL_CONTAINER_IMAGES,
generate_binary_build_matrix.LIBTORCH_CONTAINER_IMAGES,

View File

@ -7,6 +7,7 @@ cd llm-target-determinator
pip install -q -r requirements.txt
cd ../codellama
pip install -e .
pip install numpy==1.26.0
# Run indexer
cd ../llm-target-determinator

View File

@ -180,6 +180,9 @@ def mock_gh_get_info() -> Any:
return {
"closed": False,
"isCrossRepository": False,
"headRefName": "foo",
"baseRefName": "bar",
"baseRepository": {"defaultBranchRef": {"name": "bar"}},
"files": {"nodes": [], "pageInfo": {"hasNextPage": False}},
"changedFiles": 0,
}
@ -394,6 +397,7 @@ class TestTryMerge(TestCase):
# self.assertGreater(len(pr.get_checkrun_conclusions()), 3)
self.assertGreater(pr.get_commit_count(), 60)
@skip("GitHub doesn't keep this data anymore")
def test_gql_retrieve_checksuites(self, *args: Any) -> None:
"Fetch comments and conclusions for PR with 60 commits"
pr = GitHubPR("pytorch", "pytorch", 94787)
@ -891,6 +895,24 @@ class TestBypassFailures(TestCase):
self.assertTrue(len(ignorable["FLAKY"]) == 1)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 0)
def test_ignore_failures_older_run_same_workflow(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 129013)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, ignorable = categorize_checks(
checks,
list(checks.keys()),
)
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["FLAKY"]) == 2)
self.assertTrue(len(ignorable["UNSTABLE"]) == 13)
@mock.patch("trymerge.read_merge_rules", side_effect=xla_merge_rules)
def test_dont_ignore_flaky_failures(self, *args: Any) -> None:
"""
@ -1019,7 +1041,7 @@ class TestGitHubPRGhstackDependencies(TestCase):
)
@skip(
reason="This test is run against a mutalbe PR that has changed, so it no longer works. The test should be changed"
reason="This test is run against a mutable PR that has changed, so it no longer works. The test should be changed"
)
@mock.patch("trymerge.read_merge_rules")
@mock.patch("trymerge.GitRepo")

View File

@ -81,9 +81,10 @@ JobNameToStateDict = Dict[str, JobCheckState]
class WorkflowCheckState:
def __init__(self, name: str, url: str, status: Optional[str]):
def __init__(self, name: str, url: str, run_id: int, status: Optional[str]):
self.name: str = name
self.url: str = url
self.run_id: int = run_id
self.status: Optional[str] = status
self.jobs: JobNameToStateDict = {}
@ -122,6 +123,7 @@ fragment PRCheckSuites on CheckSuiteConnection {
workflowRun {
workflow {
name
databaseId
}
databaseId
url
@ -512,7 +514,7 @@ def add_workflow_conclusions(
workflows: Dict[str, WorkflowCheckState] = {}
# for the jobs that don't have a workflow
no_workflow_obj: WorkflowCheckState = WorkflowCheckState("", "", None)
no_workflow_obj: WorkflowCheckState = WorkflowCheckState("", "", 0, None)
def add_conclusions(edges: Any) -> None:
for edge_idx, edge in enumerate(edges):
@ -523,18 +525,30 @@ def add_workflow_conclusions(
workflow_obj: WorkflowCheckState = no_workflow_obj
if workflow_run is not None:
# This is the usual workflow run ID we see on GitHub
workflow_run_id = workflow_run["databaseId"]
# While this is the metadata name and ID of the workflow itself
workflow_name = workflow_run["workflow"]["name"]
workflow_id = workflow_run["workflow"]["databaseId"]
workflow_conclusion = node["conclusion"]
# Do not override existing status with cancelled
if workflow_conclusion == "CANCELLED" and workflow_name in workflows:
continue
if workflow_name not in workflows:
workflows[workflow_name] = WorkflowCheckState(
# Only keep the latest workflow run for each workflow, heuristically,
# it's the run with the largest run ID
if (
workflow_id not in workflows
or workflows[workflow_id].run_id < workflow_run_id
):
workflows[workflow_id] = WorkflowCheckState(
name=workflow_name,
status=workflow_conclusion,
url=workflow_run["url"],
run_id=workflow_run_id,
)
workflow_obj = workflows[workflow_name]
workflow_obj = workflows[workflow_id]
while checkruns is not None:
for checkrun_node in checkruns["nodes"]:
@ -572,12 +586,12 @@ def add_workflow_conclusions(
# the jobs in but don't put the workflow in. We care more about the jobs in
# the workflow that ran than the container workflow.
res: JobNameToStateDict = {}
for workflow_name, workflow in workflows.items():
for workflow in workflows.values():
if len(workflow.jobs) > 0:
for job_name, job in workflow.jobs.items():
res[job_name] = job
else:
res[workflow_name] = JobCheckState(
res[workflow.name] = JobCheckState(
workflow.name,
workflow.url,
workflow.status,
@ -1163,7 +1177,6 @@ class GitHubPR:
# Finally, upload the record to Rockset. The list of pending and failed
# checks are at the time of the merge
save_merge_record(
collection=ROCKSET_MERGES_COLLECTION,
comment_id=comment_id,
pr_num=self.pr_num,
owner=self.org,
@ -1179,10 +1192,8 @@ class GitHubPR:
merge_base_sha=self.get_merge_base(),
merge_commit_sha=merge_commit_sha,
is_failed=False,
dry_run=dry_run,
skip_mandatory_checks=skip_mandatory_checks,
ignore_current=bool(ignore_current_checks),
workspace=ROCKSET_MERGES_WORKSPACE,
)
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
@ -1489,7 +1500,6 @@ def checks_to_markdown_bullets(
@retries_decorator()
def save_merge_record(
collection: str,
comment_id: int,
pr_num: int,
owner: str,
@ -1505,59 +1515,44 @@ def save_merge_record(
merge_base_sha: str,
merge_commit_sha: str = "",
is_failed: bool = False,
dry_run: bool = False,
skip_mandatory_checks: bool = False,
ignore_current: bool = False,
error: str = "",
workspace: str = "commons",
) -> None:
"""
This saves the merge records into Rockset, so we can query them (for fun and profit)
This saves the merge records as a JSON file, which can later be uploaded to S3
"""
if dry_run:
# Decide not to save the record to Rockset if dry-run is set to not pollute
# the collection
return
try:
import rockset # type: ignore[import]
# Prepare the record to be written into Rockset
data = [
{
"comment_id": comment_id,
"pr_num": pr_num,
"owner": owner,
"project": project,
"author": author,
"pending_checks": pending_checks,
"failed_checks": failed_checks,
"ignore_current_checks": ignore_current_checks,
"broken_trunk_checks": broken_trunk_checks,
"flaky_checks": flaky_checks,
"unstable_checks": unstable_checks,
"last_commit_sha": last_commit_sha,
"merge_base_sha": merge_base_sha,
"merge_commit_sha": merge_commit_sha,
"is_failed": is_failed,
"skip_mandatory_checks": skip_mandatory_checks,
"ignore_current": ignore_current,
"error": error,
# This is a unique identifier for the record for deduping purposes
# in rockset. Any unique string would work
"_id": f"{project}-{pr_num}-{comment_id}-{os.environ.get('GITHUB_RUN_ID')}",
}
]
repo_root = Path(__file__).resolve().parent.parent.parent
# Prepare the record to be written into Rockset
data = [
{
"comment_id": comment_id,
"pr_num": pr_num,
"owner": owner,
"project": project,
"author": author,
"pending_checks": pending_checks,
"failed_checks": failed_checks,
"ignore_current_checks": ignore_current_checks,
"broken_trunk_checks": broken_trunk_checks,
"flaky_checks": flaky_checks,
"unstable_checks": unstable_checks,
"last_commit_sha": last_commit_sha,
"merge_base_sha": merge_base_sha,
"merge_commit_sha": merge_commit_sha,
"is_failed": is_failed,
"skip_mandatory_checks": skip_mandatory_checks,
"ignore_current": ignore_current,
"error": error,
}
]
client = rockset.RocksetClient(
host="api.usw2a1.rockset.com", api_key=os.environ["ROCKSET_API_KEY"]
)
client.Documents.add_documents(
collection=collection,
data=data,
workspace=workspace,
)
except ModuleNotFoundError:
print("Rockset is missing, no record will be saved")
return
with open(repo_root / "merge_record.json", "w") as f:
json.dump(data, f)
@retries_decorator(rc=[])
@ -2330,6 +2325,15 @@ def main() -> None:
dry_run=args.dry_run,
)
return
if not pr.is_ghstack_pr() and pr.base_ref() != pr.default_branch():
gh_post_pr_comment(
org,
project,
args.pr_num,
f"PR targets {pr.base_ref()} rather than {pr.default_branch()}, refusing merge request",
dry_run=args.dry_run,
)
return
if args.check_mergeability:
if pr.is_ghstack_pr():
@ -2365,7 +2369,6 @@ def main() -> None:
# list of pending and failed checks here, but they are not really
# needed at the moment
save_merge_record(
collection=ROCKSET_MERGES_COLLECTION,
comment_id=args.comment_id,
pr_num=args.pr_num,
owner=org,
@ -2380,11 +2383,9 @@ def main() -> None:
last_commit_sha=pr.last_commit().get("oid", ""),
merge_base_sha=pr.get_merge_base(),
is_failed=True,
dry_run=args.dry_run,
skip_mandatory_checks=args.force,
ignore_current=args.ignore_current,
error=str(e),
workspace=ROCKSET_MERGES_WORKSPACE,
)
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")

View File

@ -81,7 +81,7 @@ jobs:
!{{ config["build_name"] }}-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: !{{ config["build_name"] }}-build
{%- if config["gpu_arch_type"] != "rocm" %}
{%- if config["gpu_arch_type"] not in ["rocm", "xpu"] %}
uses: ./.github/workflows/_binary-test-linux.yml
with:!{{ upload.binary_env_as_input(config) }}
build_name: !{{ config["build_name"] }}
@ -101,6 +101,40 @@ jobs:
{%- endif %}
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
{%- elif config["gpu_arch_type"] == "xpu" %}
runs-on: linux.idc.xpu
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config) }}
permissions:
id-token: write
contents: read
steps:
- name: Setup XPU
uses: ./.github/actions/setup-xpu
- name: configure aws credentials
id: aws_creds
uses: aws-actions/configure-aws-credentials@v1.7.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- uses: !{{ common.download_artifact_action }}
name: Download Build Artifacts
with:
name: !{{ config["build_name"] }}
path: "${{ runner.temp }}/artifacts/"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: !{{ config["container_image"] }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown XPU
uses: ./.github/actions/teardown-xpu
{%- else %}
runs-on: linux.rocm.gpu
timeout-minutes: !{{ common.timeout_minutes }}

View File

@ -30,6 +30,9 @@
{%- if config["devtoolset"] %}
DESIRED_DEVTOOLSET: !{{ config["devtoolset"] }}
{%- endif %}
{%- if config.use_split_build is defined %}
use_split_build: !{{ config["use_split_build"] }}
{%- endif %}
{%- endif %}
{%- if config["package_type"] == "libtorch" %}
{%- if config["libtorch_config"] %}
@ -44,6 +47,7 @@
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
{%- endif %}
{%- else %}
DESIRED_PYTHON: "!{{ config["python_version"] }}"
{%- endif %}

View File

@ -27,6 +27,11 @@ on:
type: string
description: |
A JSON description of what configs to run later on.
runner:
required: false
type: string
default: "linux.large"
description: Runner type
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
@ -34,7 +39,7 @@ env:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
runs-on: ${{ inputs.runner }}
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}

View File

@ -21,6 +21,13 @@ on:
default: 210
type: number
description: timeout for the job
use_split_build:
description: |
[Experimental] Build a libtorch-only wheel and build PyTorch such that
the PyTorch wheel is built from the libtorch wheel.
required: false
type: boolean
default: false
ALPINE_IMAGE:
required: false
type: string
@ -110,6 +117,7 @@ jobs:
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
steps:
- name: Make the env permanent during this workflow (but not the secrets)
shell: bash
@ -137,6 +145,7 @@ jobs:
echo "PR_NUMBER=${{ env.PR_NUMBER }}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
echo "SHA1=${{ env.SHA1 }}"
echo "USE_SPLIT_BUILD=${{ env.use_split_build }}"
} >> "${GITHUB_ENV} }}"
- name: List the env
@ -246,6 +255,7 @@ jobs:
-e PYTORCH_ROOT \
-e SKIP_ALL_TESTS \
-e PYTORCH_EXTRA_INSTALL_REQUIREMENTS \
-e USE_SPLIT_BUILD \
--tty \
--detach \
-v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \

View File

@ -63,6 +63,13 @@ on:
required: true
type: string
description: Hardware to run this job on. Valid values are linux.4xlarge, linux.4xlarge.nvidia.gpu, linux.arm64.2xlarge, and linux.rocm.gpu
use_split_build:
description: |
[Experimental] Build a libtorch-only wheel and build PyTorch such that
the PyTorch wheel is built from the libtorch wheel.
required: false
type: boolean
default: false
secrets:
github-token:
required: true
@ -97,6 +104,7 @@ jobs:
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
steps:
- name: Make the env permanent during this workflow (but not the secrets)
shell: bash
@ -124,6 +132,7 @@ jobs:
echo "PR_NUMBER=${{ env.PR_NUMBER }}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
echo "SHA1=${{ env.SHA1 }}"
echo "USE_SPLIT_BUILD=${{ env.USE_SPLIT_BUILD }}"
} >> "${GITHUB_ENV} }}"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"

View File

@ -55,6 +55,13 @@ on:
required: false
type: string
description: Desired python version
use_split_build:
description: |
[Experimental] Build a libtorch-only wheel and build PyTorch such that
the PyTorch wheel is built from the libtorch wheel.
required: false
type: boolean
default: false
secrets:
github-token:
required: true
@ -93,6 +100,7 @@ jobs:
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

View File

@ -56,6 +56,13 @@ on:
required: false
type: string
default: ""
use_split_build:
description: |
[Experimental] Build a libtorch-only wheel and build PyTorch such that
the PyTorch wheel is built from the libtorch wheel.
required: false
type: boolean
default: false
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
@ -107,3 +114,4 @@ jobs:
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
use_split_build: ${{ inputs.use_split_build }}

View File

@ -1,105 +0,0 @@
name: linux-build-rg
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
type: string
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
type: boolean
default: true
description: If set, upload generated build artifacts.
build-with-debug:
required: false
type: boolean
default: false
description: If set, build in debug mode.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
type: string
default: "5.2"
description: |
List of CUDA architectures CI build should target.
runner-group:
required: false
type: string
default: "arc-lf-linux.2xlarge"
description: Runner group to select group type
test-matrix:
required: false
type: string
description: |
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
outputs:
docker-image:
value: ${{ jobs.build.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ jobs.build.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
jobs:
build:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on:
group: ${{ inputs.runner-group }}
timeout-minutes: 240
outputs:
docker-image: ${{ steps.linux-build.outputs.docker-image }}
test-matrix: ${{ steps.linux-build.outputs.test-matrix }}
steps:
# [pytorch repo ref]
# Use a pytorch/pytorch reference instead of a reference to the local
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Build
id: linux-build
uses: ./.github/actions/linux-build
with:
build-environment: ${{ inputs.build-environment }}
docker-image-name: ${{ inputs.docker-image-name }}
build-generates-artifacts: ${{ inputs.build-generates-artifacts }}
build-with-debug: ${{ inputs.build-with-debug }}
sync-tag: ${{ inputs.sync-tag }}
cuda-arch-list: ${{ inputs.cuda-arch-list }}
test-matrix: ${{ inputs.test-matrix }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

View File

@ -39,7 +39,7 @@ on:
type: string
default: "linux.2xlarge"
description: |
List of CUDA architectures CI build should target.
Label of the runner this job should run on.
test-matrix:
required: false
type: string
@ -64,6 +64,14 @@ on:
required: false
type: string
default: ""
use_split_build:
description: |
[Experimental] Build a libtorch-only wheel and build PyTorch such that
the PyTorch wheel is built from the libtorch wheel.
required: false
type: boolean
default: false
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
@ -181,6 +189,7 @@ jobs:
DEBUG: ${{ inputs.build-with-debug && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
run: |
# detached container should get cleaned up by teardown_ec2_linux
container_name=$(docker run \
@ -199,6 +208,7 @@ jobs:
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
-e USE_SPLIT_BUILD \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
@ -218,7 +228,7 @@ jobs:
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped'
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build != 'true'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
@ -226,6 +236,16 @@ jobs:
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build == 'true'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
uses: seemethere/upload-artifact-s3@v5

View File

@ -1,85 +0,0 @@
name: linux-test-rg
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
test-matrix:
required: true
type: string
description: JSON description of what test configs to run.
docker-image:
required: true
type: string
description: Docker image to run in.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
timeout-minutes:
required: false
type: number
default: 240
description: |
Set the maximum (in minutes) how long the workflow should take to finish
use-gha:
required: false
type: string
default: ""
description: If set to any value, upload to GHA. Otherwise upload to S3.
dashboard-tag:
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
jobs:
test:
# Don't run on forked repos or empty test matrix
if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
runs-on: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Test
id: linux-test
uses: ./.github/actions/linux-test
with:
build-environment: ${{ inputs.build-environment }}
test-matrix: ${{ inputs.test-matrix }}
docker-image: ${{ inputs.docker-image }}
sync-tag: ${{ inputs.sync-tag }}
use-gha: ${{ inputs.use-gha }}
dashboard-tag: ${{ inputs.dashboard-tag }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


@ -1,86 +0,0 @@
name: linux-test-label
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
test-matrix:
required: true
type: string
description: JSON description of what test configs to run.
docker-image:
required: true
type: string
description: Docker image to run in.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
timeout-minutes:
required: false
type: number
default: 240
description: |
Set the maximum (in minutes) how long the workflow should take to finish
use-gha:
required: false
type: string
default: ""
description: If set to any value, upload to GHA. Otherwise upload to S3.
dashboard-tag:
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
jobs:
test:
# Don't run on forked repos or empty test matrix
if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
runs-on:
group: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Test
id: linux-test
uses: ./.github/actions/linux-test
with:
build-environment: ${{ inputs.build-environment }}
test-matrix: ${{ inputs.test-matrix }}
docker-image: ${{ inputs.docker-image }}
sync-tag: ${{ inputs.sync-tag }}
use-gha: ${{ inputs.use-gha }}
dashboard-tag: ${{ inputs.dashboard-tag }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


@ -3,39 +3,272 @@ name: Check whether the workflow owner can use ARC runners
on:
workflow_call:
inputs:
user_name:
triggering_actor:
required: true
type: string
description: The name of the workflow owner.
description: The triggering_actor for the workflow. Use github.triggering_actor
issue_owner:
required: true
type: string
description: The owner of the issue. Use github.event.pull_request.user.login || github.event.issue.user.login
curr_branch:
required: true
type: string
description: Current branch.
description: Current branch or tag.
curr_ref_type:
required: false
type: string
default: branch
description: The value of "github.ref_type", "branch" or "tag"
issue_number:
required: false
type: string
default: "5132"
description: |
Fetches the GitHub issue from pytorch/test-infra
Example: https://github.com/pytorch/test-infra/issues/5132
outputs:
workflow-type:
label-type:
description: Type of runners to use
value: ${{ jobs.runner-determinator.outputs.workflow-type }}
value: ${{ jobs.runner-determinator.outputs.label-type }}
jobs:
runner-determinator:
runs-on: linux.4xlarge
runs-on: ubuntu-latest
outputs:
workflow-type: ${{ steps.set-condition.outputs.workflow-type }}
label-type: ${{ steps.set-condition.outputs.label-type }}
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
ISSUE_NUMBER: ${{ inputs.issue_number }}
USERNAME: ${{ inputs.user_name }}
TRIGGERING_ACTOR: ${{ inputs.triggering_actor }}
ISSUE_OWNER: ${{ inputs.issue_owner }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
fetch-depth: 1
submodules: true
# - name: Checkout PyTorch
# uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
# with:
# fetch-depth: 1
# submodules: true
# TODO: Remove the hardcoded step below
# Hardcoding below is temporary for testing ALI runners
# This file below should match the script found in .github/scripts/runner_determinator.py
- name: Hardcode runner-determinator script
run: |
cat <<EOF > runner_determinator.py
# flake8: noqa: G004
import logging
import os
from argparse import ArgumentParser
from logging import LogRecord
from typing import Any, Iterable
from github import Auth, Github
from github.Issue import Issue
WORKFLOW_LABEL_META = "" # use meta runners
WORKFLOW_LABEL_LF = "lf." # use runners from the linux foundation
GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")
GH_OUTPUT_KEY_LABEL_TYPE = "label-type"
class ColorFormatter(logging.Formatter):
"""Color codes the log messages based on the log level"""
COLORS = {
"WARNING": "\033[33m", # Yellow
"ERROR": "\033[31m", # Red
"CRITICAL": "\033[31m", # Red
"INFO": "\033[0m", # Reset
"DEBUG": "\033[0m", # Reset
}
def format(self, record: LogRecord) -> str:
log_color = self.COLORS.get(record.levelname, "\033[0m") # Default to reset
record.msg = f"{log_color}{record.msg}\033[0m"
return super().format(record)
handler = logging.StreamHandler()
handler.setFormatter(ColorFormatter(fmt="%(levelname)-8s: %(message)s"))
log = logging.getLogger(os.path.basename(__file__))
log.addHandler(handler)
log.setLevel(logging.INFO)
def set_github_output(key: str, value: str) -> None:
"""
Defines outputs of the github action that invokes this script
"""
if not GITHUB_OUTPUT:
# See https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/ for deprecation notice
log.warning(
"No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method."
)
print(f"::set-output name={key}::{value}")
return
with open(GITHUB_OUTPUT, "a") as f:
log.info(f"Setting output: {key}='{value}'")
f.write(f"{key}={value}\n")
def parse_args() -> Any:
parser = ArgumentParser("Get dynamic rollout settings")
parser.add_argument("--github-token", type=str, required=True, help="GitHub token")
parser.add_argument(
"--github-issue-repo",
type=str,
required=False,
default="pytorch/test-infra",
help="GitHub repo to get the issue",
)
parser.add_argument(
"--github-repo",
type=str,
required=True,
help="GitHub repo where CI is running",
)
parser.add_argument(
"--github-issue", type=int, required=True, help="GitHub issue number"
)
parser.add_argument(
"--github-actor", type=str, required=True, help="GitHub triggering_actor"
)
parser.add_argument(
"--github-issue-owner", type=str, required=True, help="GitHub issue owner"
)
parser.add_argument(
"--github-branch", type=str, required=True, help="Current GitHub branch or tag"
)
parser.add_argument(
"--github-ref-type",
type=str,
required=True,
help="Current GitHub ref type, branch or tag",
)
return parser.parse_args()
def get_gh_client(github_token: str) -> Github:
auth = Auth.Token(github_token)
return Github(auth=auth)
def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:
repo = gh.get_repo(repo)
return repo.get_issue(number=issue_num)
def get_potential_pr_author(
gh: Github, repo: str, username: str, ref_type: str, ref_name: str
) -> str:
# If the trigger was a new tag added by a bot, this is a ciflow case
# Fetch the actual username from the original PR. The PR number is
# embedded in the tag name: ciflow/<name>/<pr-number>
if username == "pytorch-bot[bot]" and ref_type == "tag":
split_tag = ref_name.split("/")
if (
len(split_tag) == 3
and split_tag[0] == "ciflow"
and split_tag[2].isnumeric()
):
pr_number = split_tag[2]
try:
repository = gh.get_repo(repo)
pull = repository.get_pull(number=int(pr_number))
except Exception as e:
raise Exception( # noqa: TRY002
f"issue with pull request {pr_number} from repo {repository}"
) from e
return pull.user.login
# In all other cases, return the original input username
return username
def is_exception_branch(branch: str) -> bool:
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
def get_workflow_type(issue: Issue, workflow_requestors: Iterable[str]) -> str:
try:
first_comment = issue.get_comments()[0].body.strip("\n\t ")
if first_comment[0] == "!":
log.info("LF Workflows are disabled for everyone. Using meta runners.")
return WORKFLOW_LABEL_META
elif first_comment[0] == "*":
log.info("LF Workflows are enabled for everyone. Using LF runners.")
return WORKFLOW_LABEL_LF
else:
all_opted_in_users = {
usr_raw.strip("\n\t@ ") for usr_raw in first_comment.split()
}
opted_in_requestors = {
usr for usr in workflow_requestors if usr in all_opted_in_users
}
if opted_in_requestors:
log.info(
f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."
)
return WORKFLOW_LABEL_LF
else:
log.info(
f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."
)
return WORKFLOW_LABEL_META
except Exception as e:
log.error(
f"Failed to get determine workflow type. Falling back to meta runners. Exception: {e}"
)
return WORKFLOW_LABEL_META
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(f"Exception branch: '{args.github_branch}', using meta runners")
label_type = WORKFLOW_LABEL_META
else:
try:
gh = get_gh_client(args.github_token)
# The default issue we use - https://github.com/pytorch/test-infra/issues/5132
issue = get_issue(gh, args.github_issue_repo, args.github_issue)
username = get_potential_pr_author(
gh,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
label_type = get_workflow_type(
issue,
(
args.github_issue_owner,
username,
),
)
except Exception as e:
log.error(
f"Failed to get issue. Falling back to meta runners. Exception: {e}"
)
label_type = WORKFLOW_LABEL_META
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)
if __name__ == "__main__":
main()
EOF
cat runner_determinator.py
- name: Install dependencies
run: python3 -m pip install urllib3==1.26.18 PyGithub==2.3.0
@ -44,15 +277,14 @@ jobs:
id: set-condition
run: |
curr_branch="${{ inputs.curr_branch }}"
curr_ref_type="${{ inputs.curr_ref_type }}"
echo "Current branch is '$curr_branch'"
output="$(python3 .github/scripts/get_workflow_type.py \
python3 runner_determinator.py \
--github-token "$GITHUB_TOKEN" \
--github-issue "$ISSUE_NUMBER" \
--github-branch "$curr_branch" \
--github-user "$USERNAME")"
echo "Output: '${output}'"
WORKFLOW_TYPE=$(echo "${output}" | jq -r '.workflow_type')
echo "workflow-type=$WORKFLOW_TYPE" >> "$GITHUB_OUTPUT"
--github-actor "$TRIGGERING_ACTOR" \
--github-issue-owner "$ISSUE_OWNER" \
--github-ref-type "$curr_ref_type" \
--github-repo "$GITHUB_REPOSITORY"
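
The rollout mechanism here is driven by the first comment on the referenced pytorch/test-infra issue: a leading "!" disables LF runners for everyone, a leading "*" enables them for everyone, and otherwise the comment is treated as a list of opted-in @usernames matched against the triggering actor and the issue/PR author. A downstream job then consumes the `label-type` output as a prefix on its runner label. The sketch below is illustrative only; the reusable workflow filename and the runner label are assumptions, not taken from this diff.

```yaml
jobs:
  get-label-type:
    # Hypothetical caller of the runner-determinator workflow patched above.
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      triggering_actor: ${{ github.triggering_actor }}
      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
      curr_branch: ${{ github.head_ref || github.ref_name }}
      curr_ref_type: ${{ github.ref_type }}

  build:
    needs: get-label-type
    # label-type is "" for meta runners or "lf." for Linux Foundation runners,
    # so it can be spliced in as a prefix on the runner label.
    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
    steps:
      - run: echo "Running on ${{ runner.os }}"
```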


@ -30,6 +30,12 @@ on:
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
runner:
required: false
type: string
default: "windows.4xlarge.nonephemeral"
description: |
Label of the runner this job should run on.
outputs:
test-matrix:
@ -43,10 +49,13 @@ jobs:
build:
# Don't run on forked repos.
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, windows.4xlarge.nonephemeral]
runs-on: ${{ inputs.runner }}
timeout-minutes: 240
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
defaults:
run:
shell: bash
steps:
# Duplicated in win-test because this MUST go before a checkout
- name: Enable git symlinks on Windows and disable fsmonitor daemon
@ -89,6 +98,7 @@ jobs:
- name: Parse ref
id: parse-ref
shell: bash
run: python3 .github/scripts/parse_ref.py
- name: Get workflow job id
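
Since the Windows build job now takes its runner label from the new `runner` input (defaulting to `windows.4xlarge.nonephemeral`), a caller can move the build to a different pool without touching this file. A minimal sketch, with an assumed workflow path and with inputs other than `runner` used as placeholders:

```yaml
jobs:
  win-cpu-build:
    # Hypothetical caller; only the `runner` input is taken from this diff.
    uses: ./.github/workflows/_win-build.yml
    with:
      build-environment: win-vs2019-cpu-py3
      cuda-version: cpu
      runner: windows.8xlarge.nonephemeral   # overrides the windows.4xlarge.nonephemeral default
```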


@ -41,6 +41,9 @@ jobs:
fail-fast: false
runs-on: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
defaults:
run:
shell: bash
steps:
# Duplicated in win-build because this MUST go before a checkout
- name: Enable git symlinks on Windows and disable fsmonitor daemon
@ -224,6 +227,7 @@ jobs:
- name: Parse ref
id: parse-ref
shell: bash
run: python3 .github/scripts/parse_ref.py
- name: Uninstall PyTorch


@ -14,6 +14,7 @@ on:
- .github/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton-rocm.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
pull_request:
paths:
- .github/workflows/build-triton-wheel.yml
@ -21,6 +22,7 @@ on:
- .github/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton-rocm.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
@ -34,7 +36,7 @@ jobs:
fail-fast: false
matrix:
py_vers: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
device: ["cuda", "rocm"]
device: ["cuda", "rocm", "xpu"]
include:
- device: "rocm"
rocm_version: "6.1"
@ -102,11 +104,6 @@ jobs:
;;
esac
BUILD_ROCM=""
if [[ "$BUILD_DEVICE" == "rocm" ]]; then
BUILD_ROCM="--build-rocm"
fi
RELEASE=""
if [[ "${IS_RELEASE_TAG}" == true ]]; then
RELEASE="--release"
@ -114,7 +111,13 @@ jobs:
docker exec -t "${container_name}" yum install -y zlib-devel zip
docker exec -t "${container_name}" "${PYTHON_EXECUTABLE}" -m pip install -U setuptools==67.4.0
docker exec -t "${container_name}" "${PYTHON_EXECUTABLE}" /pytorch/.github/scripts/build_triton_wheel.py $BUILD_ROCM $RELEASE
# Triton xpu build uses GCC 11
if [[ "${BUILD_DEVICE}" == xpu ]]; then
docker exec -t "${container_name}" yum install -y devtoolset-11-gcc-c++
docker exec -t "${container_name}" bash -c "source /opt/rh/devtoolset-11/enable && ${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE"
else
docker exec -t "${container_name}" bash -c "${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE"
fi
docker exec -t "${container_name}" chown -R 1000.1000 /artifacts
- uses: actions/upload-artifact@v3


@ -5,6 +5,11 @@ on:
branches:
- main
- release/*
tags:
# Final Release tags look like: v1.11.0
- v[0-9]+.[0-9]+.[0-9]+
# Release candidate tags look like: v1.11.0-rc1
- v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+
release:
types: [published]
pull_request:
@ -18,6 +23,8 @@ jobs:
# https://github.com/softprops/action-gh-release?tab=readme-ov-file#permissions
permissions:
contents: write
outputs:
pt_release_name: ${{ steps.release_name.outputs.pt_release_name }}
steps:
- uses: malfet/checkout@silent-checkout
with:
@ -49,11 +56,44 @@ jobs:
# Create archive
tar -czf "$PT_RELEASE_FILE" "$PT_RELEASE_NAME"
echo "Created source archive $PT_RELEASE_FILE with content: $(ls -a "$PT_RELEASE_NAME")"
- name: Upload source distribution
- name: Upload source distribution for release
if: ${{ github.event_name == 'release' }}
uses: softprops/action-gh-release@v1
with:
files: ${{env.PT_RELEASE_FILE}}
- name: Upload source distribution to GHA artifacts for release tags
if: ${{ github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}
uses: actions/upload-artifact@v2
with:
name: ${{ env.PT_RELEASE_FILE }}
path: ${{ env.PT_RELEASE_FILE }}
- name: Set output
id: release_name
run: echo "::set-output name=pt_release_name::${{ env.PT_RELEASE_NAME }}.tar.gz"
upload_source_code_to_s3:
if: ${{ github.repository == 'pytorch/pytorch' && github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}
runs-on: linux.2xlarge
environment: sourcecode-upload
name: Upload source code to S3 for release tags
permissions:
id-token: write
needs: release
steps:
- uses: actions/download-artifact@v2
with:
name: ${{ needs.release.outputs.pt_release_name }}
- name: Configure AWS credentials(PyTorch account)
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: arn:aws:iam::749337293305:role/gha_pytorch_source_code_upload_role
aws-region: us-east-1
- uses: seemethere/upload-artifact-s3@v5
with:
s3-bucket: pytorch
s3-prefix: source_code/test
if-no-files-found: warn
path: ${{ needs.release.outputs.pt_release_name }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name }}
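
The new S3 upload job picks up the tarball name through a job output: the release job exposes `pt_release_name`, and `upload_source_code_to_s3` reads it via the `needs` context after downloading the artifact. Below is a stripped-down sketch of that output-passing pattern, using the newer `$GITHUB_OUTPUT` file rather than the deprecated `::set-output` command that appears in the diff; the echoed value is a placeholder.

```yaml
jobs:
  release:
    runs-on: ubuntu-latest
    outputs:
      pt_release_name: ${{ steps.release_name.outputs.pt_release_name }}
    steps:
      - id: release_name
        run: echo "pt_release_name=pytorch-v0.0.0.tar.gz" >> "$GITHUB_OUTPUT"

  upload_source_code_to_s3:
    needs: release
    runs-on: ubuntu-latest
    steps:
      - run: echo "Uploading ${{ needs.release.outputs.pt_release_name }}"
```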


@ -54,6 +54,7 @@ jobs:
pytorch-linux-focal-py3-clang9-android-ndk-r21e,
pytorch-linux-jammy-py3.8-gcc11,
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks,
pytorch-linux-jammy-py3.12-halide,
pytorch-linux-jammy-xpu-2024.0-py3,
pytorch-linux-jammy-py3-clang15-asan,
pytorch-linux-focal-py3-clang10-onnx,


@ -54,7 +54,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_8-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-aarch64-test: # Testing
@ -162,7 +162,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-aarch64-test: # Testing
@ -270,7 +270,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-aarch64-test: # Testing
@ -378,7 +378,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-aarch64-test: # Testing
@ -486,7 +486,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-aarch64-test: # Testing


@ -48,7 +48,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-test: # Testing
@ -72,6 +72,48 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-split-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8-split
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-split-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda11_8-split-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8-split
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -88,7 +130,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-test: # Testing
@ -112,6 +154,48 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-split-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1-split
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-split-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda12_1-split-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1-split
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -128,7 +212,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-test: # Testing
@ -151,3 +235,45 @@ jobs:
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-split-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4-split
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-split-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda12_4-split-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4-split
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

File diff suppressed because it is too large.


@ -54,7 +54,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_8-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-s390x-test: # Testing
@ -117,7 +117,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_9-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-s390x-test: # Testing
@ -180,7 +180,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_10-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-s390x-test: # Testing
@ -243,7 +243,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_11-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-s390x-test: # Testing
@ -306,7 +306,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_12-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-s390x-test: # Testing


@ -46,7 +46,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@@ -165,7 +165,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@@ -284,7 +284,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@@ -403,7 +403,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@@ -522,7 +522,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}


@@ -46,7 +46,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -290,7 +290,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -536,7 +536,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -782,7 +782,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1027,7 +1027,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1271,7 +1271,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1517,7 +1517,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1763,7 +1763,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2008,7 +2008,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2252,7 +2252,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2498,7 +2498,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2744,7 +2744,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2989,7 +2989,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3233,7 +3233,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3479,7 +3479,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3725,7 +3725,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3970,7 +3970,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -4214,7 +4214,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -4460,7 +4460,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -4706,7 +4706,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash


@@ -28,7 +28,8 @@ jobs:
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
{ config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
@@ -95,7 +96,8 @@ jobs:
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_12-gcc9-inductor-test:

View File

@@ -56,3 +56,29 @@ jobs:
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-periodic-dynamo-benchmarks-build.outputs.test-matrix }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_torchbench_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-test-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

View File

@@ -24,7 +24,8 @@ jobs:
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.2" },
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.2" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.2" },
]}
linux-focal-rocm6_1-py3_8-inductor-test:
@@ -48,7 +49,8 @@ jobs:
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
{ config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
@@ -81,32 +83,6 @@ jobs:
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_torchbench_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-test-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_12-gcc9-inductor-build:
name: cuda12.1-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
@@ -116,7 +92,8 @@ jobs:
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_12-gcc9-inductor-test:
@@ -128,6 +105,26 @@ jobs:
docker-image: ${{ needs.linux-focal-cuda12_1-py3_12-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_12-gcc9-inductor-build.outputs.test-matrix }}
linux-jammy-cpu-py3_12-inductor-halide-build:
name: linux-jammy-cpu-py3.12-gcc11-inductor-halide
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-py3.12-gcc11
docker-image-name: pytorch-linux-jammy-py3.12-halide
test-matrix: |
{ include: [
{ config: "inductor-halide", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
]}
linux-jammy-cpu-py3_12-inductor-halide-test:
name: linux-jammy-cpu-py3.12-gcc11-inductor-halide
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cpu-py3_12-inductor-halide-build
with:
build-environment: linux-jammy-py3.12-gcc11
docker-image: ${{ needs.linux-jammy-cpu-py3_12-inductor-halide-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_12-inductor-halide-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor-periodic.yml but this only runs inductor_timm
name: cuda12.4-py3.10-gcc9-sm86
@@ -175,11 +172,21 @@ jobs:
{ config: "cpu_inductor_timm_freezing", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench_freezing", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench_freezing", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_huggingface_amp_freezing", shard: 1, num_shards: 1, runner: "linux.16xlarge.spr" },
{ config: "cpu_inductor_timm_amp_freezing", shard: 1, num_shards: 2, runner: "linux.16xlarge.spr" },
{ config: "cpu_inductor_timm_amp_freezing", shard: 2, num_shards: 2, runner: "linux.16xlarge.spr" },
{ config: "cpu_inductor_torchbench_amp_freezing", shard: 1, num_shards: 2, runner: "linux.16xlarge.spr" },
{ config: "cpu_inductor_torchbench_amp_freezing", shard: 2, num_shards: 2, runner: "linux.16xlarge.spr" },
{ config: "dynamic_cpu_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_huggingface_freezing", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_timm_freezing", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_timm_freezing", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_torchbench_freezing", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_torchbench_freezing", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "inductor_torchbench_cpu_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.24xl.spr-metal" },
]}
secrets:

View File

@@ -36,33 +36,24 @@ jobs:
ref: v0.0.2
path: llm-target-determinator
- name: Setup Conda
uses: conda-incubator/setup-miniconda@v2.1.1
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
with:
miniconda-version: "py39_4.12.0"
python-version: 3.9
python-version: "3.9"
- name: Install Requirements
- name: Install requirements
shell: bash -l {0}
run: |
set -euxo pipefail
conda create \
--yes \
--quiet \
--name "tdenv" \
"python=3.9"
conda activate tdenv
cd "${GITHUB_WORKSPACE}/llm-target-determinator"
pip install -r requirements.txt
cd ../codellama
pip install -e .
${CONDA_RUN} pip install -r llm-target-determinator/requirements.txt
cd "${GITHUB_WORKSPACE}/codellama"
${CONDA_RUN} pip install -e .
- name: Fetch CodeLlama Checkpoint
shell: bash -l {0}
run: |
set -euxo pipefail
conda activate tdenv
cd codellama/
cd "${GITHUB_WORKSPACE}/codellama"
mkdir "CodeLlama-7b-Python"
aws s3 cp "s3://target-determinator-assets/CodeLlama-7b-Python" "CodeLlama-7b-Python" --recursive --no-progress
@@ -75,7 +66,7 @@ jobs:
shell: bash
command: |
set -euxo pipefail
python3 -m pip install awscli==1.29.40
${CONDA_RUN} python -m pip install awscli==1.29.40
cd "${GITHUB_WORKSPACE}"/llm-target-determinator/assets
aws s3 cp "s3://target-determinator-assets/indexes/latest" . --recursive
@@ -88,9 +79,8 @@ jobs:
shell: bash -l {0}
run: |
set -euxo pipefail
conda activate tdenv
cd "${GITHUB_WORKSPACE}"/llm-target-determinator
torchrun \
${CONDA_RUN} torchrun \
--standalone \
--nnodes=1 \
--nproc-per-node=1 \

View File

@@ -73,7 +73,6 @@ jobs:
{ config: "default", shard: 3, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
@@ -295,3 +294,53 @@ jobs:
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}
linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build:
name: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build-label.yml
with:
use_split_build: true
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build-test:
name: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build
- target-determination
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build.outputs.test-matrix }}
linux-focal-cuda11_8-py3_9-gcc9-experimental-split-build:
name: linux-focal-cuda11.8-py3.9-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build-label.yml
with:
use_split_build: true
build-environment: linux-focal-cuda11.8-py3.9-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "multigpu", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
]}
build-with-debug: false
linux-focal-cuda11_8-py3_9-gcc9-experimental-split-build-test:
name: linux-focal-cuda11.8-py3.9-gcc9-experimental-split-build-test
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda11_8-py3_9-gcc9-experimental-split-build
- target-determination
with:
build-environment: linux-focal-cuda11.8-py3.9-gcc9-experimental-split-build
docker-image: ${{ needs.linux-focal-cuda11_8-py3_9-gcc9-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda11_8-py3_9-gcc9-experimental-split-build.outputs.test-matrix }}

View File

@@ -35,22 +35,33 @@ jobs:
id-token: write
contents: read
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
linux-jammy-py3_8-gcc11-build:
name: linux-jammy-py3.8-gcc11
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-jammy-py3.8-gcc11
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "docs_test", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "backwards_compat", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "distributed", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "docs_test", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "backwards_compat", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "distributed", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
linux-jammy-py3_8-gcc11-test:
@@ -75,7 +86,9 @@ jobs:
linux-jammy-py3_8-gcc11-no-ops:
name: linux-jammy-py3.8-gcc11-no-ops
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-jammy-py3.8-gcc11-no-ops
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
@@ -86,7 +99,9 @@ jobs:
linux-jammy-py3_8-gcc11-pch:
name: linux-jammy-py3.8-gcc11-pch
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-jammy-py3.8-gcc11-pch
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
@@ -98,17 +113,19 @@ jobs:
linux-jammy-py3_10-clang15-asan-build:
name: linux-jammy-py3.10-clang15-asan
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-jammy-py3.10-clang15-asan
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 5, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 6, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 1, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 5, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 6, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
]}
sync-tag: asan-build
@@ -128,13 +145,15 @@ jobs:
linux-focal-py3_8-clang10-onnx-build:
name: linux-focal-py3.8-clang10-onnx
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-py3.8-clang10-onnx
docker-image-name: pytorch-linux-focal-py3-clang10-onnx
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
linux-focal-py3_8-clang10-onnx-test:
@@ -151,19 +170,22 @@ jobs:
linux-focal-py3_8-clang10-build:
name: linux-focal-py3.8-clang10
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-py3.8-clang10
docker-image-name: pytorch-linux-focal-py3.8-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
linux-focal-py3_8-clang10-test:
name: linux-focal-py3.8-clang10
@@ -179,22 +201,24 @@ jobs:
linux-focal-py3_11-clang10-build:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-py3.11-clang10
docker-image-name: pytorch-linux-focal-py3.11-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
linux-focal-py3_11-clang10-test:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-test.yml
@@ -209,17 +233,20 @@ jobs:
linux-focal-py3_12-clang10-build:
name: linux-focal-py3.12-clang10
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-py3.12-clang10
docker-image-name: pytorch-linux-focal-py3.12-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
linux-focal-py3_12-clang10-test:
@@ -235,14 +262,16 @@ jobs:
linux-focal-cuda11_8-py3_10-gcc9-build:
name: linux-focal-cuda11.8-py3.10-gcc9
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-cuda11.8-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.8xlarge.nvidia.gpu" },
]}
linux-focal-cuda11_8-py3_10-gcc9-test:
@@ -260,17 +289,18 @@ jobs:
linux-focal-cuda12_1-py3_10-gcc9-build:
name: linux-focal-cuda12.1-py3.10-gcc9
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-test:
@@ -288,7 +318,9 @@ jobs:
linux-jammy-py3-clang12-mobile-build:
name: linux-jammy-py3-clang12-mobile-build
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-jammy-py3-clang12-mobile-build
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
build-generates-artifacts: false
@@ -300,7 +332,9 @@ jobs:
linux-jammy-cuda-11_8-cudnn9-py3_8-clang12-build:
name: linux-jammy-cuda11.8-cudnn9-py3.8-clang12
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-jammy-cuda11.8-cudnn9-py3.8-clang12
docker-image-name: pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12
test-matrix: |
@@ -311,7 +345,9 @@ jobs:
linux-focal-py3-clang9-mobile-custom-build-static:
name: linux-focal-py3-clang9-mobile-custom-build-static
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-py3-clang9-mobile-custom-build-static
docker-image-name: pytorch-linux-focal-py3-clang9-android-ndk-r21e
build-generates-artifacts: false
@@ -323,12 +359,14 @@ jobs:
linux-focal-py3_8-clang9-xla-build:
name: linux-focal-py3_8-clang9-xla
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-py3.8-clang9-xla
docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base:v1.1-lite
test-matrix: |
{ include: [
{ config: "xla", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "xla", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
]}
linux-focal-py3_8-clang9-xla-test:
@@ -345,51 +383,59 @@ jobs:
if: github.event_name == 'pull_request'
name: win-vs2019-cpu-py3
uses: ./.github/workflows/_win-build.yml
needs: get-label-type
with:
build-environment: win-vs2019-cpu-py3
cuda-version: cpu
sync-tag: win-cpu-build
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 3, runner: "windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 3, runner: "windows.4xlarge.nonephemeral" },
{ config: "default", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
]}
linux-focal-cpu-py3_10-gcc9-bazel-test:
name: linux-focal-cpu-py3.10-gcc9-bazel-test
uses: ./.github/workflows/_bazel-build-test.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.large"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-version: cpu
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "linux.4xlarge" },
{ config: "default", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
]}
linux-focal-cuda12_1-py3_10-gcc9-bazel-test:
name: linux-focal-cuda12.1-py3.10-gcc9-bazel-test
uses: ./.github/workflows/_bazel-build-test.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.large"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-version: "12.1"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-bazel-test:
name: linux-focal-cuda12.4-py3.10-gcc9-bazel-test
uses: ./.github/workflows/_bazel-build-test.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.large"
build-environment: linux-focal-cuda12.4-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
cuda-version: "12.4"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
linux-focal-py3-clang9-android-ndk-r21e-gradle-custom-build-single:
@@ -417,7 +463,9 @@ jobs:
linux-jammy-py3_8-gcc11-mobile-lightweight-dispatch-build:
name: linux-jammy-py3.8-gcc11-mobile-lightweight-dispatch-build
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-jammy-py3.8-gcc111-mobile-lightweight-dispatch-build
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
build-generates-artifacts: false
@@ -431,7 +479,9 @@ jobs:
if: github.event_name == 'pull_request'
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
@@ -445,17 +495,19 @@ jobs:
linux-focal-cuda12_1-py3_10-gcc9-sm86-build:
name: linux-focal-cuda12.1-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-sm86-test:
@@ -472,12 +524,14 @@ jobs:
linux-jammy-py3-clang12-executorch-build:
name: linux-jammy-py3-clang12-executorch
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-jammy-py3-clang12-executorch
docker-image-name: pytorch-linux-jammy-py3-clang12-executorch
test-matrix: |
{ include: [
{ config: "executorch", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "executorch", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
linux-jammy-py3-clang12-executorch-test:
@@ -488,3 +542,59 @@ jobs:
build-environment: linux-jammy-py3-clang12-executorch
docker-image: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.test-matrix }}
linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build:
name: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
use_split_build: true
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build-test:
name: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build
- target-determination
with:
timeout-minutes: 360
build-environment: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build.outputs.test-matrix }}
linux-focal-py3_12-clang10-experimental-split-build:
name: linux-focal-py3.12-clang10-experimental-split-build
uses: ./.github/workflows/_linux-build-label.yml
with:
use_split_build: True
build-environment: linux-focal-py3.12-clang10
docker-image-name: pytorch-linux-focal-py3.12-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
]}
linux-focal-py3_12-clang10-experimental-split-build-test:
name: linux-focal-py3.12-clang10-experimental-split-build
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-py3_12-clang10-experimental-split-build
with:
build-environment: linux-focal-py3.12-clang10-experimental-split-build
docker-image: ${{ needs.linux-focal-py3_12-clang10-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_12-clang10-experimental-split-build.outputs.test-matrix }}
timeout-minutes: 600

View File

@@ -36,6 +36,15 @@ jobs:
id-token: write
contents: read
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_1-py3-gcc9-slow-gradcheck-build:
name: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck
uses: ./.github/workflows/_linux-build.yml
@@ -97,7 +106,8 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.8-clang10
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "slow", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "slow", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
]}
linux-focal-py3_8-clang10-test:
@@ -119,7 +129,8 @@ jobs:
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 1, runner: "linux.rocm.gpu" },
{ config: "slow", shard: 1, num_shards: 2, runner: "linux.rocm.gpu" },
{ config: "slow", shard: 2, num_shards: 2, runner: "linux.rocm.gpu" },
]}
linux-focal-rocm6_1-py3_8-test:
@@ -139,14 +150,16 @@ jobs:
linux-jammy-py3_10-clang15-asan-build:
name: linux-jammy-py3.10-clang15-asan
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-jammy-py3.10-clang15-asan
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 3, runner: "linux.4xlarge" },
{ config: "slow", shard: 2, num_shards: 3, runner: "linux.4xlarge" },
{ config: "slow", shard: 3, num_shards: 3, runner: "linux.4xlarge" },
{ config: "slow", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "slow", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "slow", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
]}
sync-tag: asan-build

View File

@@ -34,6 +34,15 @@ jobs:
id-token: write
contents: read
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_4-py3_10-gcc9-sm86-build:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build-label.yml
@@ -170,15 +179,17 @@ jobs:
win-vs2019-cpu-py3-build:
name: win-vs2019-cpu-py3
uses: ./.github/workflows/_win-build.yml
needs: get-label-type
with:
build-environment: win-vs2019-cpu-py3
cuda-version: cpu
sync-tag: win-cpu-build
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 3, runner: "windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 3, runner: "windows.4xlarge.nonephemeral" },
{ config: "default", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
]}
win-vs2019-cpu-py3-test:
@@ -192,28 +203,21 @@ jobs:
cuda-version: cpu
test-matrix: ${{ needs.win-vs2019-cpu-py3-build.outputs.test-matrix }}
win-vs2019-cuda11_8-py3-build:
name: win-vs2019-cuda11.8-py3
win-vs2019-cuda12_1-py3-build:
name: win-vs2019-cuda12.1-py3
uses: ./.github/workflows/_win-build.yml
needs: get-label-type
with:
build-environment: win-vs2019-cuda11.8-py3
cuda-version: "11.8"
sync-tag: win-cuda-build
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 6, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 6, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 6, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 6, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 6, num_shards: 6, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge.nonephemeral" },
]}
build-environment: win-vs2019-cuda12.1-py3
cuda-version: "12.1"
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
linux-focal-rocm6_1-py3_8-build:
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_linux-build-label.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
@@ -238,3 +242,59 @@ jobs:
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor distributed/test_c10d_common distributed/test_c10d_nccl"
linux-focal-cuda12_4-py3_10-gcc9-experimental-split-build:
name: linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build-label.yml
with:
use_split_build: true
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-experimental-split-build-test:
name: linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_4-py3_10-gcc9-experimental-split-build
- target-determination
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-experimental-split-build.outputs.test-matrix }}
linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build:
name: linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build-label.yml
with:
use_split_build: true
build-environment: linux-focal-cuda11.8-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
]}
linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build-test:
name: linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build-test
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build
- target-determination
with:
timeout-minutes: 360
build-environment: linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build
docker-image: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build.outputs.test-matrix }}

View File

@@ -9,6 +9,8 @@ jobs:
name: try_merge_pr_${{ github.event.client_payload.pr_num }}
runs-on: linux.20_04.4x
environment: mergebot
permissions:
id-token: write
env:
GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
steps:
@@ -43,6 +45,7 @@ jobs:
IGNORE_CURRENT: ${{ github.event.client_payload.ignore_current }}
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
DRCI_BOT_KEY: ${{ secrets.DRCI_BOT_KEY }}
GITHUB_RUN_ID: ${{ github.run_id }}
run: |
set -x
if [ -n "${REBASE}" ]; then
@@ -84,6 +87,22 @@ jobs:
set -x
python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "merge"
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
continue-on-error: true
with:
role-to-assume: arn:aws:iam::308535385114:role/upload_to_ossci_raw_job_status
aws-region: us-east-1
- name: Upload merge record to s3
if: always()
continue-on-error: true
uses: seemethere/upload-artifact-s3@v5
with:
s3-bucket: ossci-raw-job-status
s3-prefix: merges/${{ github.repository }}/${{ github.event.client_payload.pr_num }}/${{ github.event.client_payload.comment_id }}/${{ github.run_id }}
path: merge_record.json
# We want newer merge commands to supercede old ones
concurrency:
group: try-merge-${{ github.event.client_payload.pr_num }}

View File

@@ -25,12 +25,11 @@ jobs:
upload-test-stats:
needs: get_workflow_conclusion
if:
github.repository_owner == 'pytorch' &&
(github.event.workflow_run.conclusion == 'success' || github.event.workflow_run.conclusion == 'failure' ||
needs.get_workflow_conclusion.outputs.conclusion == 'success' || needs.get_workflow_conclusion.outputs.conclusion == 'failure')
if: github.repository_owner == 'pytorch'
runs-on: ubuntu-22.04
environment: upload-stats
permissions:
id-token: write
name: Upload test stats for ${{ github.event.workflow_run.id }}, attempt ${{ github.event.workflow_run.run_attempt }}
steps:
- name: Print workflow information
@@ -41,6 +40,13 @@ jobs:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
continue-on-error: true
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_upload-torch-test-stats
aws-region: us-east-1
- uses: actions/setup-python@v4
with:
python-version: '3.11'
@@ -52,8 +58,6 @@ jobs:
- name: Upload test artifacts
id: upload-s3
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
WORKFLOW_ARTIFACTS_URL: ${{ github.event.workflow_run.artifacts_url }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
@@ -69,8 +73,6 @@ jobs:
- name: Upload test stats
env:
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
WORKFLOW_RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }}
@@ -84,8 +86,6 @@ jobs:
- name: Analyze disabled tests rerun
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
WORKFLOW_ARTIFACTS_URL: ${{ github.event.workflow_run.artifacts_url }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
@@ -99,14 +99,12 @@ jobs:
if: steps.upload-s3.outcome && steps.upload-s3.outcome == 'success' && github.event.workflow_run.name == 'inductor-micro-benchmark'
env:
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
WORKFLOW_RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }}
REPO_FULLNAME: ${{ github.event.workflow_run.repository.full_name }}
HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
run: |
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}" --head-branch "${HEAD_BRANCH}" --rockset-collection oss_ci_benchmark --rockset-workspace benchmarks --match-filename "^gpt_fast_benchmark"
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}" --head-branch "${HEAD_BRANCH}" --rockset-collection oss_ci_benchmark --rockset-workspace benchmarks --dynamodb-table torchci-oss-ci-benchmark --match-filename "^gpt_fast_benchmark"
check-api-rate:
if: ${{ always() && github.repository_owner == 'pytorch' }}

View File

@@ -26,6 +26,8 @@ jobs:
github.event.workflow_run.conclusion == 'failure' || needs.get-conclusion.outputs.conclusion == 'failure'
runs-on: ubuntu-22.04
environment: upload-stats
permissions:
id-token: write
name: Upload dynamo performance stats for ${{ github.event.workflow_run.id }}, attempt ${{ github.event.workflow_run.run_attempt }}
steps:
- name: Checkout PyTorch
@@ -34,6 +36,13 @@ jobs:
submodules: false
fetch-depth: 1
- name: Configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
continue-on-error: true
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_upload-torch-test-stats
aws-region: us-east-1
- uses: actions/setup-python@v4
with:
python-version: '3.11'
@@ -45,8 +54,6 @@ jobs:
- name: Upload torch dynamo performance stats to S3
id: upload-s3
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
WORKFLOW_ARTIFACTS_URL: ${{ github.event.workflow_run.artifacts_url }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
@@ -61,11 +68,9 @@ jobs:
if: steps.upload-s3.outcome && steps.upload-s3.outcome == 'success'
env:
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
WORKFLOW_RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }}
REPO_FULLNAME: ${{ github.event.workflow_run.repository.full_name }}
HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
run: |
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}" --head-branch "${HEAD_BRANCH}" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --match-filename "^inductor_"
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}" --head-branch "${HEAD_BRANCH}" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --dynamodb-table torchci-dynamo-perf-stats --match-filename "^inductor_"

.gitignore
View File

@@ -129,6 +129,7 @@ env
scripts/release_notes/*.json
sccache-stats*.json
lint.json
merge_record.json
# These files get copied over on invoking setup.py
torchgen/packaged/*

View File

@@ -68,6 +68,8 @@ include_patterns = [
'aten/src/ATen/native/cudnn/*.cpp',
'c10/**/*.h',
'c10/**/*.cpp',
'distributed/c10d/*DMAConnectivity.*',
'distributed/c10d/*SymmetricMemory.*',
'torch/csrc/**/*.h',
'torch/csrc/**/*.hpp',
'torch/csrc/**/*.cpp',
@@ -136,7 +138,7 @@ init_command = [
'numpy==1.24.3 ; python_version == "3.8"',
'numpy==1.26.0 ; python_version >= "3.9"',
'expecttest==0.1.6',
'mypy==1.9.0',
'mypy==1.10.0',
'sympy==1.11.1',
'types-requests==2.27.25',
'types-PyYAML==6.0.7',
@@ -202,6 +204,8 @@ include_patterns = [
'torch/csrc/*.cpp',
'torch/csrc/**/*.h',
'torch/csrc/**/*.cpp',
'torch/csrc/jit/serialization/*.h',
'torch/csrc/jit/serialization/*.cpp',
]
exclude_patterns = [
# The negative filters below are to exclude files that include onnx_pb.h or
@@ -216,7 +220,6 @@ exclude_patterns = [
'c10/util/complex_math.h',
'c10/util/complex_utils.h',
'c10/util/flat_hash_map.h',
'c10/util/Float8*.h',
'c10/util/logging*.h',
'c10/util/hash.h',
'c10/util/strong_type.h',
@@ -224,7 +227,6 @@ exclude_patterns = [
'c10/util/win32-headers.h',
'c10/util/*inl.h',
'c10/test/**/*.h',
'aten/src/ATen/core/TensorImpl_test.cpp',
'third_party/**/*',
'torch/csrc/api/**',
'torch/csrc/autograd/generated/**',
@@ -232,10 +234,8 @@ exclude_patterns = [
'torch/csrc/dynamo/eval_frame.h',
'torch/csrc/inductor/**/*',
'torch/csrc/jit/**/*',
'torch/csrc/jit/serialization/import_legacy.cpp',
'torch/csrc/jit/serialization/export.cpp',
'torch/csrc/jit/serialization/mobile_bytecode_generated.h',
'torch/csrc/lazy/**/*',
'torch/csrc/mps/**/*',
]
init_command = [
'python3',
@@ -999,7 +999,6 @@ command = [
]
exclude_patterns = [
'tools/gen_vulkan_spv.py',
'torch/__init__.py', # Skip this file to format because it's part of the public API
# We don't care too much about files in this directory, don't enforce
# formatting on them
'caffe2/**/*.py',
@@ -1099,14 +1098,12 @@ exclude_patterns = [
'test/test_namedtuple_return_api.py',
'test/test_native_functions.py',
'test/test_native_mha.py',
'test/test_nestedtensor.py',
'test/test_nn.py',
'test/test_out_dtype_op.py',
'test/test_overrides.py',
'test/test_prims.py',
'test/test_proxy_tensor.py',
'test/test_pruning_op.py',
'test/test_public_bindings.py',
'test/test_quantization.py',
'test/test_reductions.py',
'test/test_scatter_gather_ops.py',
@@ -1132,8 +1129,6 @@ exclude_patterns = [
'test/test_type_promotion.py',
'test/test_unary_ufuncs.py',
'test/test_vulkan.py',
'test/test_xnnpack_integration.py',
'test/torch_np/numpy_test/**/*.py',
'torch/_awaits/__init__.py',
'torch/_custom_op/__init__.py',
'torch/_custom_op/autograd.py',
@@ -1194,9 +1189,6 @@ exclude_patterns = [
'torch/_export/serde/upgrade.py',
'torch/_export/trace.py',
'torch/_export/verifier.py',
'torch/_higher_order_ops/__init__.py',
'torch/_higher_order_ops/out_dtype.py',
'torch/_higher_order_ops/wrap.py',
'torch/_vendor/**',
'torch/ao/__init__.py',
'torch/ao/nn/__init__.py',
@@ -1393,172 +1385,8 @@ exclude_patterns = [
'torch/contrib/_tensorboard_vis.py',
"torch/cuda/_gpu_trace.py",
'torch/cuda/_memory_viz.py', # mypy: Value of type "object" is not indexable
'torch/distributed/__init__.py',
'torch/distributed/_composable_state.py',
'torch/distributed/_shard/__init__.py',
'torch/distributed/_shard/_utils.py',
'torch/distributed/_shard/api.py',
'torch/distributed/_shard/checkpoint/__init__.py',
'torch/distributed/_shard/common_op_utils.py',
'torch/distributed/_shard/metadata.py',
'torch/distributed/_shard/op_registry_utils.py',
'torch/distributed/_shard/sharded_optim/__init__.py',
'torch/distributed/_shard/sharded_optim/api.py',
'torch/distributed/_shard/sharded_tensor/__init__.py',
'torch/distributed/_shard/sharded_tensor/_ops/__init__.py',
'torch/distributed/_shard/sharded_tensor/_ops/_common.py',
'torch/distributed/_shard/sharded_tensor/_ops/binary_cmp.py',
'torch/distributed/_shard/sharded_tensor/_ops/init.py',
'torch/distributed/_shard/sharded_tensor/_ops/misc_ops.py',
'torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py',
'torch/distributed/_shard/sharded_tensor/api.py',
'torch/distributed/_shard/sharded_tensor/logger.py',
'torch/distributed/_shard/sharded_tensor/logging_handlers.py',
'torch/distributed/_shard/sharded_tensor/metadata.py',
'torch/distributed/_shard/sharded_tensor/reshard.py',
'torch/distributed/_shard/sharded_tensor/shard.py',
'torch/distributed/_shard/sharded_tensor/utils.py',
'torch/distributed/_shard/sharder.py',
'torch/distributed/_shard/sharding_plan/__init__.py',
'torch/distributed/_shard/sharding_plan/api.py',
'torch/distributed/_shard/sharding_spec/__init__.py',
'torch/distributed/_shard/sharding_spec/_internals.py',
'torch/distributed/_shard/sharding_spec/api.py',
'torch/distributed/_shard/sharding_spec/chunk_sharding_spec.py',
'torch/distributed/_shard/sharding_spec/chunk_sharding_spec_ops/__init__.py',
'torch/distributed/_shard/sharding_spec/chunk_sharding_spec_ops/_common.py',
'torch/distributed/_shard/sharding_spec/chunk_sharding_spec_ops/embedding.py',
'torch/distributed/_shard/sharding_spec/chunk_sharding_spec_ops/embedding_bag.py',
'torch/distributed/_sharded_tensor/__init__.py',
'torch/distributed/_sharding_spec/__init__.py',
'torch/distributed/_tools/__init__.py',
'torch/distributed/_tools/memory_tracker.py',
'torch/distributed/algorithms/__init__.py',
'torch/distributed/algorithms/_checkpoint/__init__.py',
'torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py',
'torch/distributed/algorithms/_comm_hooks/__init__.py',
'torch/distributed/algorithms/_comm_hooks/default_hooks.py',
'torch/distributed/algorithms/_optimizer_overlap/__init__.py',
'torch/distributed/algorithms/_optimizer_overlap/optimizer_overlap.py',
'torch/distributed/algorithms/_quantization/__init__.py',
'torch/distributed/algorithms/_quantization/quantization.py',
'torch/distributed/algorithms/ddp_comm_hooks/__init__.py',
'torch/distributed/algorithms/ddp_comm_hooks/ddp_zero_hook.py',
'torch/distributed/algorithms/ddp_comm_hooks/debugging_hooks.py',
'torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py',
'torch/distributed/algorithms/ddp_comm_hooks/mixed_precision_hooks.py',
'torch/distributed/algorithms/ddp_comm_hooks/optimizer_overlap_hooks.py',
'torch/distributed/algorithms/ddp_comm_hooks/post_localSGD_hook.py',
'torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py',
'torch/distributed/algorithms/ddp_comm_hooks/quantization_hooks.py',
'torch/distributed/algorithms/join.py',
'torch/distributed/algorithms/model_averaging/__init__.py',
'torch/distributed/algorithms/model_averaging/averagers.py',
'torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py',
'torch/distributed/algorithms/model_averaging/utils.py',
'torch/distributed/argparse_util.py',
'torch/distributed/autograd/__init__.py',
'torch/distributed/benchmarks/benchmark_ddp_rpc.py',
'torch/distributed/c10d_logger.py',
'torch/distributed/collective_utils.py',
'torch/distributed/constants.py',
'torch/distributed/distributed_c10d.py',
'torch/distributed/elastic/__init__.py',
'torch/distributed/elastic/agent/__init__.py',
'torch/distributed/elastic/agent/server/__init__.py',
'torch/distributed/elastic/agent/server/api.py',
'torch/distributed/elastic/agent/server/local_elastic_agent.py',
'torch/distributed/elastic/events/__init__.py',
'torch/distributed/elastic/events/api.py',
'torch/distributed/elastic/events/handlers.py',
'torch/distributed/elastic/metrics/__init__.py',
'torch/distributed/elastic/metrics/api.py',
'torch/distributed/elastic/multiprocessing/__init__.py',
'torch/distributed/elastic/multiprocessing/api.py',
'torch/distributed/elastic/multiprocessing/errors/__init__.py',
'torch/distributed/elastic/multiprocessing/errors/error_handler.py',
'torch/distributed/elastic/multiprocessing/errors/handlers.py',
'torch/distributed/elastic/multiprocessing/redirects.py',
'torch/distributed/elastic/multiprocessing/tail_log.py',
'torch/distributed/elastic/rendezvous/__init__.py',
'torch/distributed/elastic/rendezvous/api.py',
'torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py',
'torch/distributed/elastic/rendezvous/dynamic_rendezvous.py',
'torch/distributed/elastic/rendezvous/etcd_rendezvous.py',
'torch/distributed/elastic/rendezvous/etcd_rendezvous_backend.py',
'torch/distributed/elastic/rendezvous/etcd_server.py',
'torch/distributed/elastic/rendezvous/etcd_store.py',
'torch/distributed/elastic/rendezvous/registry.py',
'torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py',
'torch/distributed/elastic/rendezvous/utils.py',
'torch/distributed/elastic/timer/__init__.py',
'torch/distributed/elastic/timer/api.py',
'torch/distributed/elastic/timer/file_based_local_timer.py',
'torch/distributed/elastic/timer/local_timer.py',
'torch/distributed/elastic/utils/__init__.py',
'torch/distributed/elastic/utils/api.py',
'torch/distributed/elastic/utils/data/__init__.py',
'torch/distributed/elastic/utils/data/cycling_iterator.py',
'torch/distributed/elastic/utils/data/elastic_distributed_sampler.py',
'torch/distributed/elastic/utils/distributed.py',
'torch/distributed/elastic/utils/log_level.py',
'torch/distributed/elastic/utils/logging.py',
'torch/distributed/elastic/utils/store.py',
'torch/distributed/examples/memory_tracker_example.py',
'torch/distributed/launch.py',
'torch/distributed/launcher/__init__.py',
'torch/distributed/launcher/api.py',
'torch/distributed/logging_handlers.py',
'torch/distributed/nn/__init__.py',
'torch/distributed/nn/api/__init__.py',
'torch/distributed/nn/api/remote_module.py',
'torch/distributed/nn/functional.py',
'torch/distributed/nn/jit/__init__.py',
'torch/distributed/nn/jit/instantiator.py',
'torch/distributed/nn/jit/templates/__init__.py',
'torch/distributed/nn/jit/templates/remote_module_template.py',
'torch/distributed/optim/__init__.py',
'torch/distributed/optim/apply_optimizer_in_backward.py',
'torch/distributed/optim/functional_adadelta.py',
'torch/distributed/optim/functional_adagrad.py',
'torch/distributed/optim/functional_adam.py',
'torch/distributed/optim/functional_adamax.py',
'torch/distributed/optim/functional_adamw.py',
'torch/distributed/optim/functional_rmsprop.py',
'torch/distributed/optim/functional_rprop.py',
'torch/distributed/optim/functional_sgd.py',
'torch/distributed/optim/named_optimizer.py',
'torch/distributed/optim/optimizer.py',
'torch/distributed/optim/post_localSGD_optimizer.py',
'torch/distributed/optim/utils.py',
'torch/distributed/optim/zero_redundancy_optimizer.py',
'torch/distributed/remote_device.py',
'torch/distributed/rendezvous.py',
'torch/distributed/rpc/__init__.py',
'torch/distributed/rpc/_testing/__init__.py',
'torch/distributed/rpc/_testing/faulty_agent_backend_registry.py',
'torch/distributed/rpc/_utils.py',
'torch/distributed/rpc/api.py',
'torch/distributed/rpc/backend_registry.py',
'torch/distributed/rpc/constants.py',
'torch/distributed/rpc/functions.py',
'torch/distributed/rpc/internal.py',
'torch/distributed/rpc/options.py',
'torch/distributed/rpc/rref_proxy.py',
'torch/distributed/rpc/server_process_global_profiler.py',
'torch/distributed/run.py',
'torch/distributed/tensor/__init__.py',
'torch/distributed/tensor/parallel/__init__.py',
'torch/distributed/tensor/parallel/_utils.py',
'torch/distributed/tensor/parallel/_view_with_dim_change.py',
'torch/distributed/tensor/parallel/api.py',
'torch/distributed/tensor/parallel/fsdp.py',
'torch/distributed/tensor/parallel/input_reshard.py',
'torch/distributed/tensor/parallel/multihead_attention_tp.py',
'torch/distributed/tensor/parallel/style.py',
'torch/fft/__init__.py',
'torch/func/__init__.py',
'torch/functional.py',
'torch/futures/__init__.py',
'torch/fx/__init__.py',
'torch/fx/_compatibility.py',
@ -1644,20 +1472,9 @@ exclude_patterns = [
'torch/fx/subgraph_rewriter.py',
'torch/fx/tensor_type.py',
'torch/fx/traceback.py',
'torch/hub.py',
'torch/library.py',
'torch/linalg/__init__.py',
'torch/monitor/__init__.py',
'torch/nested/__init__.py',
'torch/nn/__init__.py',
'torch/nn/_reduction.py',
'torch/nn/backends/__init__.py',
'torch/nn/backends/thnn.py',
'torch/nn/common_types.py',
'torch/nn/cpp.py',
'torch/nn/functional.py',
'torch/nn/grad.py',
'torch/nn/init.py',
'torch/nn/intrinsic/__init__.py',
'torch/nn/intrinsic/modules/__init__.py',
'torch/nn/intrinsic/modules/fused.py',
@ -1674,40 +1491,6 @@ exclude_patterns = [
'torch/nn/intrinsic/quantized/modules/bn_relu.py',
'torch/nn/intrinsic/quantized/modules/conv_relu.py',
'torch/nn/intrinsic/quantized/modules/linear_relu.py',
'torch/nn/modules/__init__.py',
'torch/nn/modules/_functions.py',
'torch/nn/modules/activation.py',
'torch/nn/modules/adaptive.py',
'torch/nn/modules/batchnorm.py',
'torch/nn/modules/channelshuffle.py',
'torch/nn/modules/container.py',
'torch/nn/modules/conv.py',
'torch/nn/modules/distance.py',
'torch/nn/modules/dropout.py',
'torch/nn/modules/flatten.py',
'torch/nn/modules/fold.py',
'torch/nn/modules/instancenorm.py',
'torch/nn/modules/lazy.py',
'torch/nn/modules/linear.py',
'torch/nn/modules/loss.py',
'torch/nn/modules/module.py',
'torch/nn/modules/normalization.py',
'torch/nn/modules/padding.py',
'torch/nn/modules/pixelshuffle.py',
'torch/nn/modules/pooling.py',
'torch/nn/modules/rnn.py',
'torch/nn/modules/sparse.py',
'torch/nn/modules/transformer.py',
'torch/nn/modules/upsampling.py',
'torch/nn/modules/utils.py',
'torch/nn/parallel/__init__.py',
'torch/nn/parallel/_functions.py',
'torch/nn/parallel/comm.py',
'torch/nn/parallel/data_parallel.py',
'torch/nn/parallel/parallel_apply.py',
'torch/nn/parallel/replicate.py',
'torch/nn/parallel/scatter_gather.py',
'torch/nn/parameter.py',
'torch/nn/qat/__init__.py',
'torch/nn/qat/dynamic/__init__.py',
'torch/nn/qat/dynamic/modules/__init__.py',
@ -1745,44 +1528,10 @@ exclude_patterns = [
'torch/nn/quantized/modules/normalization.py',
'torch/nn/quantized/modules/rnn.py',
'torch/nn/quantized/modules/utils.py',
'torch/nn/utils/__init__.py',
'torch/nn/utils/_deprecation_utils.py',
'torch/nn/utils/_expanded_weights/__init__.py',
'torch/nn/utils/_expanded_weights/conv_expanded_weights.py',
'torch/nn/utils/_expanded_weights/conv_utils.py',
'torch/nn/utils/_expanded_weights/embedding_expanded_weights.py',
'torch/nn/utils/_expanded_weights/expanded_weights_impl.py',
'torch/nn/utils/_expanded_weights/expanded_weights_utils.py',
'torch/nn/utils/_expanded_weights/group_norm_expanded_weights.py',
'torch/nn/utils/_expanded_weights/instance_norm_expanded_weights.py',
'torch/nn/utils/_expanded_weights/layer_norm_expanded_weights.py',
'torch/nn/utils/_expanded_weights/linear_expanded_weights.py',
'torch/nn/utils/_per_sample_grad.py',
'torch/nn/utils/clip_grad.py',
'torch/nn/utils/convert_parameters.py',
'torch/nn/utils/fusion.py',
'torch/nn/utils/init.py',
'torch/nn/utils/memory_format.py',
'torch/nn/utils/parametrizations.py',
'torch/nn/utils/parametrize.py',
'torch/nn/utils/prune.py',
'torch/nn/utils/rnn.py',
'torch/nn/utils/spectral_norm.py',
'torch/nn/utils/weight_norm.py',
'torch/overrides.py',
'torch/quasirandom.py',
'torch/random.py',
'torch/return_types.py',
'torch/serialization.py',
'torch/signal/__init__.py',
'torch/signal/windows/__init__.py',
'torch/signal/windows/windows.py',
'torch/sparse/__init__.py',
'torch/sparse/_semi_structured_conversions.py',
'torch/sparse/_triton_ops.py',
'torch/sparse/semi_structured.py',
'torch/special/__init__.py',
'torch/storage.py',
'torch/testing/_internal/__init__.py',
'torch/testing/_internal/autocast_test_lists.py',
'torch/testing/_internal/autograd_function_db.py',
@ -1790,9 +1539,7 @@ exclude_patterns = [
'torch/testing/_internal/codegen/__init__.py',
'torch/testing/_internal/codegen/random_topo_test.py',
'torch/testing/_internal/common_cuda.py',
'torch/testing/_internal/common_device_type.py',
'torch/testing/_internal/common_distributed.py',
'torch/testing/_internal/common_dtype.py',
'torch/testing/_internal/common_jit.py',
'torch/testing/_internal/common_methods_invocations.py',
'torch/testing/_internal/common_modules.py',
@ -1857,7 +1604,6 @@ exclude_patterns = [
'torch/testing/_internal/test_module/__init__.py',
'torch/testing/_internal/test_module/future_div.py',
'torch/testing/_internal/test_module/no_future_div.py',
'torch/utils/__init__.py',
'torch/utils/_contextlib.py',
'torch/utils/_cpp_extension_versioner.py',
'torch/utils/_crash_handler.py',
@ -1908,53 +1654,6 @@ exclude_patterns = [
'torch/utils/collect_env.py',
'torch/utils/cpp_backtrace.py',
'torch/utils/cpp_extension.py',
'torch/utils/data/__init__.py',
'torch/utils/data/_utils/__init__.py',
'torch/utils/data/_utils/collate.py',
'torch/utils/data/_utils/fetch.py',
'torch/utils/data/_utils/pin_memory.py',
'torch/utils/data/_utils/serialization.py',
'torch/utils/data/_utils/signal_handling.py',
'torch/utils/data/_utils/worker.py',
'torch/utils/data/backward_compatibility.py',
'torch/utils/data/dataloader.py',
'torch/utils/data/datapipes/__init__.py',
'torch/utils/data/datapipes/_decorator.py',
'torch/utils/data/datapipes/_hook_iterator.py',
'torch/utils/data/datapipes/_typing.py',
'torch/utils/data/datapipes/dataframe/__init__.py',
'torch/utils/data/datapipes/dataframe/dataframe_wrapper.py',
'torch/utils/data/datapipes/dataframe/dataframes.py',
'torch/utils/data/datapipes/dataframe/datapipes.py',
'torch/utils/data/datapipes/dataframe/structures.py',
'torch/utils/data/datapipes/datapipe.py',
'torch/utils/data/datapipes/gen_pyi.py',
'torch/utils/data/datapipes/iter/__init__.py',
'torch/utils/data/datapipes/iter/callable.py',
'torch/utils/data/datapipes/iter/combinatorics.py',
'torch/utils/data/datapipes/iter/combining.py',
'torch/utils/data/datapipes/iter/filelister.py',
'torch/utils/data/datapipes/iter/fileopener.py',
'torch/utils/data/datapipes/iter/grouping.py',
'torch/utils/data/datapipes/iter/routeddecoder.py',
'torch/utils/data/datapipes/iter/selecting.py',
'torch/utils/data/datapipes/iter/sharding.py',
'torch/utils/data/datapipes/iter/streamreader.py',
'torch/utils/data/datapipes/iter/utils.py',
'torch/utils/data/datapipes/map/__init__.py',
'torch/utils/data/datapipes/map/callable.py',
'torch/utils/data/datapipes/map/combinatorics.py',
'torch/utils/data/datapipes/map/combining.py',
'torch/utils/data/datapipes/map/grouping.py',
'torch/utils/data/datapipes/map/utils.py',
'torch/utils/data/datapipes/utils/__init__.py',
'torch/utils/data/datapipes/utils/common.py',
'torch/utils/data/datapipes/utils/decoder.py',
'torch/utils/data/datapipes/utils/snapshot.py',
'torch/utils/data/distributed.py',
'torch/utils/data/graph.py',
'torch/utils/data/graph_settings.py',
'torch/utils/data/sampler.py',
'torch/utils/dlpack.py',
'torch/utils/file_baton.py',
'torch/utils/flop_counter.py',
@ -1994,8 +1693,9 @@ init_command = [
'--dry-run={{DRYRUN}}',
'--no-black-binary',
'black==23.12.1',
'ufmt==2.1.0',
'usort==1.0.6',
'ufmt==2.7.0',
'usort==1.0.8.post1',
'isort==5.13.2',
]
is_formatter = true
@ -2079,7 +1779,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.4.8',
'ruff==0.5.0',
]
is_formatter = true


@ -461,7 +461,6 @@ filegroup(
filegroup(
name = "caffe2_perfkernels_srcs",
srcs = [
"caffe2/perfkernels/embedding_lookup.cc",
"caffe2/perfkernels/embedding_lookup_idx.cc",
],
)
@ -499,7 +498,6 @@ cc_library(
hdrs = [
"caffe2/core/common.h",
"caffe2/perfkernels/common.h",
"caffe2/perfkernels/embedding_lookup.h",
"caffe2/perfkernels/embedding_lookup_idx.h",
"caffe2/utils/fixed_divisor.h",
] + glob([
@ -746,6 +744,7 @@ cc_library(
"torch/csrc/cuda/python_nccl.cpp",
"torch/csrc/cuda/nccl.cpp",
"torch/csrc/distributed/c10d/intra_node_comm.cu",
"torch/csrc/distributed/c10d/CUDASymmetricMemory.cu",
"torch/csrc/distributed/c10d/Utils.cu",
"torch/csrc/distributed/c10d/quantization/quantization_gpu.cu",
],
@ -763,6 +762,7 @@ cc_library(
":torch_headers",
"@kineto",
"@cpp-httplib",
"@nlohmann",
] + if_cuda([
"@cuda//:nvToolsExt",
"@cutlass",


@ -208,7 +208,6 @@ endif()
include(CMakeDependentOption)
option(ATEN_NO_TEST "Do not build ATen test binaries" OFF)
option(BUILD_BINARY "Build C++ binaries" OFF)
option(BUILD_DOCS "Build Caffe2 documentation" OFF)
option(BUILD_CUSTOM_PROTOBUF
"Build and use Caffe2's own protobuf under third_party" ON)
option(BUILD_PYTHON "Build Python binaries" ON)
@ -750,7 +749,6 @@ if(NOT TORCH_BUILD_VERSION)
CACHE STRING "Torch build version" FORCE)
endif()
caffe2_parse_version_str(TORCH ${TORCH_BUILD_VERSION})
caffe2_parse_version_str(CAFFE2 ${TORCH_BUILD_VERSION})
set(TORCH_SOVERSION "${TORCH_VERSION_MAJOR}.${TORCH_VERSION_MINOR}")
# ---[ CMake scripts + modules
@ -1223,45 +1221,6 @@ endif()
add_subdirectory(c10)
add_subdirectory(caffe2)
# --[ Documentation
if(BUILD_DOCS)
# check if Doxygen is installed
find_package(Doxygen)
if(DOXYGEN_FOUND)
message("Generating documentation")
set(DOXYGEN_C_IN ${CMAKE_CURRENT_SOURCE_DIR}/docs/caffe2/.Doxyfile-c)
set(DOXYGEN_C_OUT ${CMAKE_CURRENT_SOURCE_DIR}/docs/caffe2/Doxyfile-c)
set(DOXYGEN_P_IN ${CMAKE_CURRENT_SOURCE_DIR}/docs/caffe2/.Doxyfile-python)
set(DOXYGEN_P_OUT ${CMAKE_CURRENT_SOURCE_DIR}/docs/caffe2/Doxyfile-python)
if(EXISTS ${CMAKE_CURRENT_BINARY_DIR}/docs)
file(REMOVE_RECURSE ${CMAKE_CURRENT_BINARY_DIR}/docs)
endif()
file(MAKE_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/docs)
configure_file(${DOXYGEN_C_IN} ${DOXYGEN_C_OUT} @ONLY)
configure_file(${DOXYGEN_P_IN} ${DOXYGEN_P_OUT} @ONLY)
add_custom_target(
doc_doxygen_c ALL
COMMAND ${DOXYGEN_EXECUTABLE} ${DOXYGEN_C_OUT}
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMENT "Generating C++ API documentation with Doxygen"
VERBATIM)
add_custom_target(
doc_doxygen_python ALL
COMMAND ${DOXYGEN_EXECUTABLE} ${DOXYGEN_P_OUT}
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMENT "Generating Python API documentation with Doxygen"
VERBATIM)
else()
message(
FATAL_ERROR "Doxygen needs to be installed to generate the documentation")
endif()
endif()
# ---[ CMake related files Uninistall option.
if(NOT TARGET caffe2_uninstall)
configure_file(


@ -43,12 +43,12 @@ nn/qat/ @jerryzh168
/torch/csrc/distributed/rpc/tensorpipe_agent.h @jiayisuse @osalpekar @lw
# ONNX Export
/torch/_dynamo/backends/onnxrt.py @bowenbao @thiagocrepaldi @wschin
/torch/csrc/jit/passes/onnx.h @bowenbao @thiagocrepaldi
/torch/csrc/jit/passes/onnx.cpp @bowenbao @thiagocrepaldi
/torch/csrc/jit/passes/onnx/ @bowenbao @thiagocrepaldi
/torch/onnx/ @bowenbao @thiagocrepaldi @wschin
/test/onnx/ @bowenbao @thiagocrepaldi @wschin
/torch/_dynamo/backends/onnxrt.py @wschin @xadupre
/torch/csrc/jit/passes/onnx.h @titaiwangms @shubhambhokare1 @xadupre
/torch/csrc/jit/passes/onnx.cpp @titaiwangms @shubhambhokare1 @xadupre
/torch/csrc/jit/passes/onnx/ @titaiwangms @shubhambhokare1 @xadupre
/torch/onnx/ @titaiwangms @shubhambhokare1 @justinchuby @wschin @xadupre
/test/onnx/ @titaiwangms @shubhambhokare1 @justinchuby @wschin @xadupre
# CI
/.ci @pytorch/pytorch-dev-infra
@ -57,6 +57,7 @@ nn/qat/ @jerryzh168
/.ci/docker/ @jeffdaily
/.ci/docker/ci_commit_pins/triton.txt @desertfire @Chillee @eellison @shunting314 @bertmaher @jeffdaily @jataylo @jithunnair-amd @pruthvistony
/.ci/docker/ci_commit_pins/triton-rocm.txt @jeffdaily @jataylo @jithunnair-amd @pruthvistony
/.ci/docker/ci_commit_pins/triton-xpu.txt @EikanWang @gujinghui
# Github Actions
# This list is for people wanting to be notified every time there's a change
@ -107,10 +108,10 @@ aten/src/ATen/detail/MTIAHooksInterface.h @egienvalue
torch/csrc/mtia/ @egienvalue
# Profiler
torch/csrc/autograd/profiler* @aaronenyeshi
torch/autograd/profiler* @aaronenyeshi
torch/csrc/profiler/ @aaronenyeshi
torch/profiler/ @aaronenyeshi
torch/csrc/autograd/profiler* @aaronenyeshi @sraikund16
torch/autograd/profiler* @aaronenyeshi @sraikund16
torch/csrc/profiler/ @aaronenyeshi @sraikund16
torch/profiler/ @aaronenyeshi @sraikund16
# AOTDispatch tests
test/functorch/test_aotdispatch.py @ezyang @Chillee
@ -132,6 +133,15 @@ caffe2/operators/hip @jeffdaily @jithunnair-amd
caffe2/operators/rnn/hip @jeffdaily @jithunnair-amd
caffe2/utils/hip @jeffdaily @jithunnair-amd
# XPU-specific files
/aten/src/ATen/xpu/ @EikanWang @gujinghui
/c10/xpu/ @EikanWang @gujinghui
/torch/csrc/xpu/ @EikanWang @gujinghui
/torch/xpu/ @EikanWang @gujinghui
/test/xpu/ @EikanWang @gujinghui
/test/test_xpu.py @EikanWang @gujinghui
/third_party/xpu.txt @EikanWang @gujinghui
# torch.export
/torch/export/ @avikchaudhuri @gmagogsfm @tugsbayasgalan @zhxchen17
/torch/_export/ @avikchaudhuri @gmagogsfm @tugsbayasgalan @zhxchen17


@ -77,6 +77,11 @@ RUN case ${TARGETPLATFORM} in \
esac && \
/opt/conda/bin/conda clean -ya
RUN /opt/conda/bin/pip install torchelastic
RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); \
echo "Is torch compiled with cuda: ${IS_CUDA}"; \
if test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"; then \
exit 1; \
fi
FROM ${BASE_IMAGE} as official
ARG PYTORCH_VERSION


@ -51,6 +51,7 @@ Following is the Release Compatibility Matrix for PyTorch releases:
| PyTorch version | Python | Stable CUDA | Experimental CUDA | Stable ROCm |
| --- | --- | --- | --- | --- |
| 2.4 | >=3.8, <=3.12 | CUDA 11.8, CUDA 12.1, CUDNN 9.1.0.70 | CUDA 12.4, CUDNN 9.1.0.70 | ROCm 6.1 |
| 2.3 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 6.0 |
| 2.2 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 5.7 |
| 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 5.6 |
@ -60,15 +61,19 @@ Following is the Release Compatibility Matrix for PyTorch releases:
## Release Cadence
Following is the release cadence for year 2023/2024. All dates below are tentative, for latest updates on the release scheduled please follow [dev discuss](https://dev-discuss.pytorch.org/c/release-announcements/27).
Following is the release cadence for year 2023/2024. All dates below are tentative, for latest updates on the release scheduled please follow [dev discuss](https://dev-discuss.pytorch.org/c/release-announcements/27). Please note: Patch Releases are optional.
| Minor Version | Release branch cut | Release date | First patch release date | Second patch release date|
| --- | --- | --- | --- | --- |
| 2.1 | Aug 2023 | Oct 2023 | Nov 2023 | Dec 2023 |
| 2.2 | Dec 2023 | Jan 2024 | Feb 2024 | Mar 2024 |
| 2.3 | Mar 2024 | Apr 2024 | Jun 2024 | Not planned |
| 2.4 | Jun 2024 | Jul 2024 | Aug 2024 | Sep 2024 |
| 2.5 | Aug 2024 | Oct 2024 | Nov 2024 | Dec 2024 |
| 2.4 | Jun 2024 | Jul 2024 | (Sept 2024) | Not planned |
| 2.5 | Aug 2024 | Oct 2024 | (Nov 2024) | (Dec 2024) |
| 2.6 | Dec 2024 | Jan 2025 | (Feb 2025) | (Mar 2025) |
| 2.7 | Mar 2025 | Apr 2025 | (May 2025) | (Jun 2025) |
| 2.8 | Jun 2025 | Jul 2025 | (Aug 2025) | (Sept 2025) |
| 2.9 | Aug 2025 | Oct 2025 | (Nov 2025) | (Dec 2025) |
## General Overview
@ -290,7 +295,7 @@ After the final RC is created. The following tasks should be performed :
* Create validation issue for the release, see for example [Validations for 2.1.2 release](https://github.com/pytorch/pytorch/issues/114904) and perform required validations.
* Run performance tests in [benchmark repository](https://github.com/pytorch/benchmark). Make sure there are no prerformance regressions.
* Run performance tests in [benchmark repository](https://github.com/pytorch/benchmark). Make sure there are no performance regressions.
* Prepare and stage PyPI binaries for promotion. This is done with this script:
[`pytorch/builder:release/pypi/promote_pypi_to_staging.sh`](https://github.com/pytorch/builder/blob/main/release/pypi/promote_pypi_to_staging.sh)
@ -429,12 +434,12 @@ need to support these particular versions of software.
## Operating Systems
Supported OS flavors are summarized in the table below:
| Operating System family | Architectrue | Notes |
| Operating System family | Architecture | Notes |
| --- | --- | --- |
| Linux | aarch64, x86_64 | Wheels are manylinux2014 compatible, i.e. they should be runnable on any Linux system with glibc-2.17 or above. |
| MacOS | arm64 | Builds should be compatible with MacOS 11 (Big Sur) or newer, but are actively tested against MacOS 14 (Sonoma). |
| MacOS | x86_64 | Requires MacOS Catalina or above, not supported after 2.2, see https://github.com/pytorch/pytorch/issues/114602 |
| Windows | x86_64 | Buils are compatible with Windows-10 or newer. |
| Windows | x86_64 | Builds are compatible with Windows-10 or newer. |
# Submitting Tutorials


@ -6,7 +6,7 @@
- [Untrusted inputs](#untrusted-inputs)
- [Data privacy](#data-privacy)
- [Using distributed features](#using-distributed-features)
- [**CI/CD security principles**](#cicd-security-principles)
## Reporting Security Issues
Beware that none of the topics under [Using Pytorch Securely](#using-pytorch-securely) are considered vulnerabilities of Pytorch.
@ -61,3 +61,27 @@ If applicable, prepare your model against bad inputs and prompt injections. Some
PyTorch can be used for distributed computing, and as such there is a `torch.distributed` package. PyTorch Distributed features are intended for internal communication only. They are not built for use in untrusted environments or networks.
For performance reasons, none of the PyTorch Distributed primitives (including c10d, RPC, and TCPStore) include any authorization protocol and will send messages unencrypted. They accept connections from anywhere, and execute the workload sent without performing any checks. Therefore, if you run a PyTorch Distributed program on your network, anybody with access to the network can execute arbitrary code with the privileges of the user running PyTorch.
## CI/CD security principles
_Audience_: Contributors and reviewers, especially if modifying the workflow files/build system.
The PyTorch CI/CD security philosophy is based on striking a balance between keeping CI pipelines open and transparent and keeping the environment efficient and safe.
PyTorch testing requirements are complex, and a large part of the code base can only be tested on specialized, powerful hardware such as GPUs, which makes the CI fleet a lucrative target for resource misuse. To prevent this, we require workflow run approval for PRs from non-member contributors. To keep the volume of those approvals relatively low, we readily extend repository write permissions to regular contributors.
More widespread write access to the repo presents challenges when it comes to reviewing changes, merging code into trunk, and creating releases. [Protected branches](https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches) are used to restrict the ability to merge to the trunk/release branches only to the repository administrators and merge bot. The merge bot is responsible for mechanistically merging the change and validating reviews against the path-based rules defined in [merge_rules.yml](https://github.com/pytorch/pytorch/blob/main/.github/merge_rules.yaml). Once a PR has been reviewed by person(s) mentioned in these rules, leaving a `@pytorchbot merge` comment on the PR will initiate the merge process. To protect merge bot credentials from leaking, merge actions must be executed only on ephemeral runners (see definition below) using a specialized deployment environment.
To speed up the CI system, build steps of the workflow rely on a distributed caching mechanism backed by [sccache](https://github.com/mozilla/sccache), making them susceptible to cache-corruption compromises. For that reason, binary artifacts generated during CI should not be executed in an environment that has access to any sensitive or non-public information and should not be published for use by a general audience. One should not have any expectation about the lifetime of those artifacts, although in practice they likely remain accessible for about two weeks after the PR has been closed.
To speed up CI system setup, PyTorch relies heavily on Docker to pre-build and pre-install dependencies. To prevent a potentially malicious PR from altering images that were published in the past, ECR has been configured to use immutable tags.
To improve runner availability and resource utilization, some of the CI runners are non-ephemeral, i.e., workflow steps from completely unrelated PRs may be scheduled sequentially on the same runner, making them susceptible to reverse shell attacks. For that reason, PyTorch does not rely on the repository secrets mechanism, as such secrets could easily be compromised in these attacks.
### Release pipelines security
To ensure safe binary releases, PyTorch release pipelines are built on the following principles:
- All binary builds/upload jobs must be run on ephemeral runners, i.e., on a machine that is allocated from the cloud to do the build and released back to the cloud after the build is finished. This protects those builds from interference from external actors, who potentially can get reverse shell access to a non-ephemeral runner and wait there for a binary build.
- All binary builds are cold-start builds, i.e., distributed caching/incremental builds are not permitted. This renders builds much slower than incremental CI builds but isolates them from potential compromises of the intermediate artifacts caching systems.
- All upload jobs are executed in [deployment environments](https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment) that are restricted to protected branches.
- Security credentials needed to upload binaries to PyPI/conda or to the stable indexes at `download.pytorch.org/whl` are never uploaded to the repo secrets storage/environment. This requires an extra manual step to publish a release, but it ensures that those credentials cannot be compromised by deliberate or accidental leaks of secrets stored in the cloud.
- No binary artifacts should be published to GitHub releases pages, as these are overwritable by anyone with write permission to the repo.


@ -174,6 +174,12 @@ new_local_repository(
path = "third_party/cpp-httplib",
)
new_local_repository(
name = "nlohmann",
build_file = "//third_party:nlohmann.BUILD",
path = "third_party/nlohmann",
)
new_local_repository(
name = "tensorpipe",
build_file = "//third_party:tensorpipe.BUILD",


@ -119,7 +119,7 @@ class PytorchJni : public facebook::jni::HybridClass<PytorchJni> {
}
deviceType_ = deviceJniCodeToDeviceType(device);
module_ = torch::jit::load(
std::move(modelPath->toStdString()), c10::nullopt, extra_files);
std::move(modelPath->toStdString()), std::nullopt, extra_files);
if (has_extra) {
static auto putMethod =
facebook::jni::JMap<facebook::jni::JString, facebook::jni::JString>::


@ -84,9 +84,9 @@ class PytorchJni : public facebook::jni::HybridClass<PytorchJni> {
}
deviceType_ = deviceJniCodeToDeviceType(device);
module_ = torch::jit::_load_for_mobile(
std::move(modelPath->toStdString()), c10::nullopt, extra_files);
std::move(modelPath->toStdString()), std::nullopt, extra_files);
torch::jit::_load_extra_only_for_mobile(
std::move(modelPath->toStdString()), c10::nullopt, extra_files);
std::move(modelPath->toStdString()), std::nullopt, extra_files);
if (has_extra) {
static auto putMethod =
facebook::jni::JMap<facebook::jni::JString, facebook::jni::JString>::


@ -53,11 +53,6 @@ if(NOT BUILD_LITE_INTERPRETER)
file(GLOB_RECURSE ATen_CORE_TEST_SRCS "core/*_test.cpp")
endif()
EXCLUDE(ATen_CORE_SRCS "${ATen_CORE_SRCS}" ${ATen_CORE_TEST_SRCS})
# Exclude TensorImpl_test.cpp if compiling without Caffe2
if(NOT BUILD_LITE_INTERPRETER)
file(GLOB_RECURSE ATen_CORE_EXCLUDED_TEST_SRCS "core/TensorImpl_test.cpp")
EXCLUDE(ATen_CORE_TEST_SRCS "${ATen_CORE_TEST_SRCS}" ${ATen_CORE_EXCLUDED_TEST_SRCS})
endif()
file(GLOB base_h "*.h" "detail/*.h" "cpu/*.h" "cpu/vec/vec512/*.h" "cpu/vec/vec256/*.h" "cpu/vec/vec256/vsx/*.h" "cpu/vec/vec256/zarch/*.h" "cpu/vec/*.h" "quantized/*.h" "functorch/*.h")
file(GLOB base_cpp "*.cpp" "detail/*.cpp" "cpu/*.cpp" "functorch/*.cpp")
@ -473,6 +468,7 @@ endif()
if(USE_CUDA AND NOT USE_ROCM)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/cutlass/include)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/cutlass/tools/util/include)
if($ENV{ATEN_STATIC_CUDA})
list(APPEND ATen_CUDA_DEPENDENCY_LIBS
${CUDA_LIBRARIES}


@ -222,7 +222,7 @@ c10::intrusive_ptr<c10::TensorImpl> CPUGeneratorImpl::get_state() const {
static const size_t size = sizeof(CPUGeneratorImplState);
static_assert(std::is_standard_layout_v<CPUGeneratorImplState>, "CPUGeneratorImplState is not a PODType");
auto state_tensor = at::detail::empty_cpu({(int64_t)size}, ScalarType::Byte, c10::nullopt, c10::nullopt, c10::nullopt, c10::nullopt);
auto state_tensor = at::detail::empty_cpu({(int64_t)size}, ScalarType::Byte, std::nullopt, std::nullopt, std::nullopt, std::nullopt);
auto rng_state = state_tensor.data_ptr();
// accumulate generator data to be copied into byte tensor


@ -3,7 +3,7 @@
#include <ATen/core/Generator.h>
#include <ATen/core/MT19937RNGEngine.h>
#include <c10/core/GeneratorImpl.h>
#include <c10/util/Optional.h>
#include <optional>
namespace at {


@ -56,6 +56,14 @@ void Context::setDeterministicCuDNN(bool b) {
deterministic_cudnn = b;
}
bool Context::deterministicMkldnn() const {
return deterministic_mkldnn;
}
void Context::setDeterministicMkldnn(bool b) {
deterministic_mkldnn = b;
}
bool Context::deterministicAlgorithms() const {
return _deterministic_algorithms;
}
@ -145,6 +153,13 @@ void Context::setSDPUseCuDNN(bool e) {
enabled_cudnnSDP = e;
}
void Context::setSDPUseOverrideable(bool e) {
enabled_overrideable = e;
}
bool Context::userEnabledOverrideableSDP() const {
return enabled_overrideable;
}
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
static const char cublas_config_var_name[] = "CUBLAS_WORKSPACE_CONFIG";
@ -263,7 +278,24 @@ void Context::setLinalgPreferredBackend(at::LinalgBackend b) {
}
}
at::BlasBackend Context::blasPreferredBackend() const {
at::BlasBackend Context::blasPreferredBackend() {
#ifdef USE_ROCM
if (blas_preferred_backend == at::BlasBackend::Cublaslt) {
static const bool hipblaslt_unsupported = []() {
static const std::vector<std::string> archs = {"gfx90a", "gfx940", "gfx941", "gfx942"};
for (auto index: c10::irange(getNumGPUs())) {
if (!detail::getCUDAHooks().isGPUArch(index, archs)) {
TORCH_WARN_ONCE(
"Attempting to use hipBLASLt on an unsupported architecture! "
"Overriding blas backend to hipblas");
return true;
}
}
return false;
}();
if (hipblaslt_unsupported) blas_preferred_backend = at::BlasBackend::Cublas;
}
#endif
return blas_preferred_backend;
}
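The hunk above drops the const qualifier from blasPreferredBackend() (matching the Context.h change further down) so the getter can lazily downgrade a hipBLASLt preference on ROCm architectures that do not support it. As a minimal sketch, not part of this diff and assuming a CUDA- or ROCm-enabled libtorch build, a caller would exercise the getter/setter pair roughly like this:

#include <ATen/Context.h>
#include <iostream>

int main() {
  // Ask for cuBLASLt/hipBLASLt; on unsupported ROCm architectures the
  // getter above warns once and falls back to the plain cuBLAS/hipBLAS backend.
  at::globalContext().setBlasPreferredBackend(at::BlasBackend::Cublaslt);
  const bool using_lt =
      at::globalContext().blasPreferredBackend() == at::BlasBackend::Cublaslt;
  std::cout << "cublaslt/hipblaslt in use: " << using_lt << std::endl;
  return 0;
}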


@ -59,12 +59,14 @@ class TORCH_API Context {
}
}
const AcceleratorHooksInterface& getAcceleratorHooksInterface(
std::optional<c10::DeviceType> opt_device_type = c10::nullopt) {
std::optional<c10::DeviceType> opt_device_type = std::nullopt) {
c10::DeviceType device_type = opt_device_type.has_value()
? opt_device_type.value()
: at::getAccelerator(true).value();
if (device_type == at::kCUDA) {
return at::detail::getCUDAHooks();
} else if (device_type == at::kXPU) {
return at::detail::getXPUHooks();
} else if (device_type == at::kMPS) {
return at::detail::getMPSHooks();
} else if (device_type == at::kPrivateUse1) {
@ -188,6 +190,8 @@ class TORCH_API Context {
void setBenchmarkLimitCuDNN(int);
bool deterministicCuDNN() const;
void setDeterministicCuDNN(bool);
bool deterministicMkldnn() const;
void setDeterministicMkldnn(bool);
bool userEnabledNNPACK() const;
void setUserEnabledNNPACK(bool e);
@ -214,10 +218,13 @@ class TORCH_API Context {
void setSDPUseCuDNN(bool);
bool userEnabledCuDNNSDP() const;
void setSDPUseOverrideable(bool);
bool userEnabledOverrideableSDP() const;
at::LinalgBackend linalgPreferredBackend() const;
void setLinalgPreferredBackend(at::LinalgBackend);
at::BlasBackend blasPreferredBackend() const;
at::BlasBackend blasPreferredBackend();
void setBlasPreferredBackend(at::BlasBackend);
// Note [Enabling Deterministic Operations]
@ -358,6 +365,7 @@ class TORCH_API Context {
c10::once_flag thp_init;
bool enabled_cudnn = true;
bool deterministic_cudnn = false;
bool deterministic_mkldnn = false;
bool _deterministic_algorithms = false;
bool _deterministic_algorithms_warn_only = false;
bool _deterministic_fill_uninitialized_memory = true;
@ -365,6 +373,7 @@ class TORCH_API Context {
bool enabled_mem_efficientSDP = true;
bool enabled_mathSDP = true;
bool enabled_cudnnSDP = true;
bool enabled_overrideable = true;
#ifdef USE_ROCM
bool benchmark_cudnn = true;
#else
@ -398,7 +407,7 @@ class TORCH_API Context {
bool release_original_weights = false;
#endif
bool display_vmap_fallback_warnings_ = false;
std::optional<at::QEngine> quantized_engine = c10::nullopt;
std::optional<at::QEngine> quantized_engine = std::nullopt;
bool enable_sparse_tensor_invariant_checks = false;
bool allow_fp16_reduction_cpu = false;


@ -115,6 +115,9 @@ static DLDevice getDLDevice(const Tensor& tensor, c10::DeviceIndex device_id) {
ctx.device_id =
at::detail::getXPUHooks().getGlobalIdxFromDevice(tensor.device());
break;
case DeviceType::MAIA:
ctx.device_type = DLDeviceType::kDLMAIA;
break;
default:
TORCH_CHECK(false, "Cannot pack tensors on " + tensor.device().str());
}
@ -141,6 +144,8 @@ static Device getATenDevice(const DLDevice& ctx, void* data) {
#endif
case DLDeviceType::kDLOneAPI:
return at::detail::getXPUHooks().getDeviceFromPtr(data);
case DLDeviceType::kDLMAIA:
return at::Device(DeviceType::MAIA, ctx.device_id);
default:
TORCH_CHECK(
false, "Unsupported device_type: ", std::to_string(ctx.device_type));
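The two new cases above wire DeviceType::MAIA into the DLPack device translation in both directions. Purely as an illustration (not from this diff), the public toDLPack()/fromDLPack() helpers declared in ATen/DLConvertor.h exercise exactly these mappings when a tensor makes a round trip through a DLManagedTensor:

#include <ATen/ATen.h>
#include <ATen/DLConvertor.h>

int main() {
  at::Tensor t = at::arange(6, at::kFloat).reshape({2, 3});
  // toDLPack() uses getDLDevice() above to fill dl->dl_tensor.device.
  DLManagedTensor* dl = at::toDLPack(t);
  // fromDLPack() uses getATenDevice() above to map the device back; it
  // takes ownership of dl and will invoke its deleter when done.
  at::Tensor roundtrip = at::fromDLPack(dl);
  return roundtrip.equal(t) ? 0 : 1;
}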


@ -1,39 +1,37 @@
#include <ATen/DeviceAccelerator.h>
#include <ATen/Context.h>
#include <ATen/DeviceAccelerator.h>
namespace at {
C10_API std::optional<DeviceType> getAccelerator(bool checked) {
#define CHECK_NO_CUDA \
TORCH_CHECK(!at::hasCUDA(), "Cannot have both CUDA and PrivateUse1");
#define DETECT_AND_ASSIGN_ACCELERATOR(device_name) \
if (at::has##device_name()) { \
device_type = k##device_name; \
TORCH_CHECK( \
!is_accelerator_detected, \
"Cannot have ", \
device_type.value(), \
" with other accelerators."); \
is_accelerator_detected = true; \
}
#define CHECK_NO_PU1 \
TORCH_CHECK(!is_privateuse1_backend_registered(), "Cannot have both CUDA and PrivateUse1");
if (is_privateuse1_backend_registered()) {
// We explicitly allow PrivateUse1 and another device at the same time as we
// use this for testing. Whenever a PrivateUse1 device is registered, use it
// first.
return kPrivateUse1;
}
std::optional<DeviceType> device_type = std::nullopt;
bool is_accelerator_detected = false;
DETECT_AND_ASSIGN_ACCELERATOR(CUDA)
DETECT_AND_ASSIGN_ACCELERATOR(MTIA)
DETECT_AND_ASSIGN_ACCELERATOR(XPU)
if (checked) {
TORCH_CHECK(
device_type, "Cannot access accelerator device when none is available.")
}
return device_type;
#define CHECK_NO_MTIA \
TORCH_CHECK(!at::hasMTIA(), "Cannot have MTIA with other devices");
if (is_privateuse1_backend_registered()) {
// We explicitly allow PrivateUse1 and another device at the same time
// as we use this for testing.
// Whenever a PrivateUse1 device is registered, use it first.
return kPrivateUse1;
} else if (at::hasCUDA()) {
CHECK_NO_PU1
CHECK_NO_MTIA
return kCUDA;
} else if (at::hasMTIA()) {
CHECK_NO_CUDA
CHECK_NO_PU1
return kMTIA;
} else {
TORCH_CHECK(!checked, "Cannot access accelerator device when none is available.")
return std::nullopt;
}
#undef CHECK_NO_CUDA
#undef CHECK_NO_PU1
#undef DETECT_AND_ASSIGN_ACCELERATOR
}
} // namespace at


@ -13,9 +13,9 @@
// - It provides a set of common APIs as defined by AcceleratorHooksInterface
//
// As of today, accelerator devices are (in no particular order):
// CUDA, MTIA, PrivateUse1
// CUDA, MTIA, XPU, PrivateUse1
// We want to add once all the proper APIs are supported and tested:
// HIP, MPS, XPU
// HIP, MPS
namespace at {


@ -17,14 +17,14 @@ namespace at {
/// Return the Device of a Tensor, if the Tensor is defined.
inline std::optional<Device> device_of(const Tensor& t) {
if (t.defined()) {
return c10::make_optional(t.device());
return std::make_optional(t.device());
} else {
return c10::nullopt;
return std::nullopt;
}
}
inline std::optional<Device> device_of(const std::optional<Tensor>& t) {
return t.has_value() ? device_of(t.value()) : c10::nullopt;
return t.has_value() ? device_of(t.value()) : std::nullopt;
}
/// Return the Device of a TensorList, if the list is non-empty and
@ -34,7 +34,7 @@ inline std::optional<Device> device_of(ITensorListRef t) {
if (!t.empty()) {
return device_of(t.front());
} else {
return c10::nullopt;
return std::nullopt;
}
}
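The device_of() overloads above, like the CPUGeneratorImpl, Context, and JNI hunks earlier, are part of the same mechanical migration from the c10 optional spellings to the standard library ones (c10::nullopt → std::nullopt, c10::make_optional → std::make_optional, <c10/util/Optional.h> → <optional>). A standalone sketch of the target style, using a hypothetical helper that is not part of the diff:

#include <optional>  // replaces <c10/util/Optional.h>
#include <vector>

// Return the first positive element, or std::nullopt if there is none.
// Before the migration this would have been written with c10::make_optional
// and c10::nullopt instead of the std:: spellings.
std::optional<int> first_positive(const std::vector<int>& values) {
  for (int v : values) {
    if (v > 0) {
      return std::make_optional(v);
    }
  }
  return std::nullopt;
}

int main() {
  return first_positive({-1, 0, 3}).value_or(-1) == 3 ? 0 : 1;
}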

Some files were not shown because too many files have changed in this diff.