Compare commits


18 Commits

SHA1 Message Date
0911360736 fix cuda 13.0 cuda-arch-list to be 8.9 for L4 gpu test 2025-11-13 05:35:51 +00:00
38de8d0d33 specify cuda_arch_list for 13.0 tests 2025-11-13 05:35:51 +00:00
a9fe64bee2 also add the cusparse cmake flag 2025-11-13 05:35:51 +00:00
84436662a3 use 13.0.2 2025-11-13 05:35:51 +00:00
794e09311c add win build too 2025-11-13 05:35:51 +00:00
641d0bae63 Add eager tests cuda 13.0 2025-11-13 05:35:51 +00:00
f9851af59b Add Attention ops to CI (#165915)
This pull request introduces a new attention operator microbenchmark workflow to the CI system, enabling automated benchmarking and reporting for attention-related operations. The main change is a new GitHub Actions workflow that adds attention benchmarks to the existing PyTorch operator microbenchmark [dashboard](https://hud.pytorch.org/benchmark/v3/dashboard/pytorch_operator_microbenchmark?renderGroupId=main&time.start=2025-10-27T00%3A00%3A00.000Z&time.end=2025-10-29T01%3A00%3A00.000Z&filters.device=cuda&filters.arch=NVIDIA+A100-SXM4-40GB&filters.deviceName=cuda%7C%7CNVIDIA+A100-SXM4-40GB&filters.operatorName=&lcommit.commit=665df0bc7288996d638fcc3da750f8cb2addd6d0&lcommit.workflow_id=18888994873&lcommit.date=2025-10-29T00%3A00%3A00Z&lcommit.branch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915&rcommit.commit=665df0bc7288996d638fcc3da750f8cb2addd6d0&rcommit.workflow_id=18888994873&rcommit.date=2025-10-29T00%3A00%3A00Z&rcommit.branch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915&lbranch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915&rbranch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165915
Approved by: https://github.com/jbschlosser
2025-11-13 05:30:04 +00:00
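For context, the new CI job ultimately runs the transformer attention benchmark script with a dashboard-output flag (see the `test_attention_microbenchmark` helper added to the test script further down in this diff). A rough local equivalent, assuming a PyTorch checkout as the working directory and the `attention-gym` dependency installed:

```python
# Hedged sketch of the benchmark invocation the CI helper performs.
# Assumes the current directory is a PyTorch checkout and attention-gym is
# installed (pip install git+https://github.com/meta-pytorch/attention-gym.git@main).
import subprocess

subprocess.run(
    [
        "python",
        "score_mod.py",
        "--config",
        "configs/config_basic.yaml",
        "--output-json-for-dashboard",
        "attention_microbenchmark.json",
    ],
    cwd="benchmarks/transformer",  # same directory the CI helper cd's into
    check=True,
)
```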
eeebf9f664 [dynamo] [3.14] Update broken numpy test (#167681)
This is related to upgrading numpy versions, not Python 3.14 specifically. See https://github.com/numpy/numpy/pull/27148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167681
Approved by: https://github.com/williamwen42
ghstack dependencies: #167619
2025-11-13 04:27:55 +00:00
d9a50bf9a8 [dynamo] [3.14] Support np._CopyMode (#167619)
Upgrading scipy to 1.16 introduced errors related to the `copy` parameter of
`np.array`. Add special handling for `np._CopyMode.IF_NEEDED`: it is not handled
strictly correctly, but it matches the existing behavior when `copy=None` (a minimal example is sketched below).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167619
Approved by: https://github.com/williamwen42
2025-11-13 04:27:55 +00:00
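A minimal sketch of the behavior this enables, modeled on the new `test_copy_mode` test near the end of this diff:

```python
import numpy as np
import torch

def f(x):
    # Passing np._CopyMode.IF_NEEDED previously failed under torch.compile;
    # it is now treated like copy=None.
    return np.array(x, copy=np._CopyMode.IF_NEEDED)

opt_f = torch.compile(f, backend="eager", fullgraph=True)
x = np.array([1, 2, 3])
y = opt_f(x)  # should run without raising
assert np.array_equal(np.asarray(y), np.asarray(f(x)))
```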
2984331c87 [inductor][NFC][2/X] extract do_autotuning/autotune/benchmark from AlgorithmSelectorCache.__call__ (#167489)
Summary: see https://github.com/pytorch/pytorch/pull/167487 for context

Test Plan: CI

Differential Revision: D86714833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167489
Approved by: https://github.com/aorenste
2025-11-13 03:29:39 +00:00
9b68682df2 [ROCm] Enable several DISABLED issues (#167183)
Profiler:
Fixes #166422

Default:
Fixes #165386
Fixes #145019
Fixes #145069
Fixes #165295
Fixes #165294
Fixes #165093
Fixes #164235
Fixes #164194
Fixes #164193
Fixes #155217
Fixes #163918
Fixes #163917
Fixes #155235
Fixes #122352
Fixes #121576
Fixes #121806
Fixes #104366

Inductor:
Fixes #164337
Fixes #148523
Fixes #115002
Fixes #111066
Fixes #107774

Distributed:
Fixes #161612
Fixes #161502
Fixes #161459
Fixes #161402
Fixes #155711
Fixes #152201
Fixes #152367
Fixes #152349
Fixes #152168
Fixes #152169
Fixes #151153
Fixes #151077
Fixes #112815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167183
Approved by: https://github.com/jeffdaily
2025-11-13 02:50:35 +00:00
8f5f89c9a0 Revert "Fix thread safety in getCurrentCUDABlasHandle and getCUDABlasLtWorkspace (#167248)"
This reverts commit 537167aa1e50a4379dca244163aaf369ed8e5161.

Reverted https://github.com/pytorch/pytorch/pull/167248 on behalf of https://github.com/pytorch-auto-revert: reverted automatically by pytorch's autorevert; to avoid this behaviour, add the tag `autorevert: disable` ([comment](https://github.com/pytorch/pytorch/pull/167248#issuecomment-3524925727))
2025-11-13 02:46:35 +00:00
8919f69362 [Inductor][2/2] Decouple flags for optimization and debug symbols (#167575)
Summary:
What: Decouple the debug-compile flag (unoptimized build) from the debug-symbols flag (optimized build with symbols)
Why: Reduce confusion around naming and usage

Test Plan: Unit test & CI

Differential Revision: D86683526

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167575
Approved by: https://github.com/jansel, https://github.com/hl475
2025-11-13 00:59:15 +00:00
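A condensed sketch of how the debug-compile flag is exercised, distilled from the new `test_inductor_generate_debug_compile` test later in this diff; the import locations for `write`, `CppBuilder`, and `CppOptions` are my assumption (`torch._inductor.codecache` and `torch._inductor.cpp_builder`), since the diff does not show the test file's imports:

```python
# Hedged sketch mirroring the test below; import locations are assumed.
import os
from unittest import mock

from torch._inductor.codecache import write  # assumed location
from torch._inductor.cpp_builder import CppBuilder, CppOptions  # assumed location

cpp_code = "int main() { return 0; }"

# TORCHINDUCTOR_DEBUG_COMPILE -> unoptimized debug build;
# TORCHINDUCTOR_DEBUG_SYMBOL -> optimized build with debug symbols.
with mock.patch.dict(os.environ, {"TORCHINDUCTOR_DEBUG_COMPILE": "1"}):
    _, source_path = write(cpp_code, "cpp")
    builder = CppBuilder(
        name="debug_compile_demo",
        sources=source_path,
        output_dir=os.path.dirname(source_path),
        BuildOption=CppOptions(),
    )
    builder.build()
    # On Linux, `readelf -S` on this path should now show a .debug_info section;
    # on Windows, a matching .pdb file should exist next to it.
    print(builder.get_target_file_path())
```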
19c867873a [opaque obj] Add attribute support (#167230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167230
Approved by: https://github.com/zou3519
ghstack dependencies: #163284, #163714, #163936
2025-11-13 00:35:20 +00:00
e3dadb1d36 [opaque obj] torch.compile support (#163936)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163936
Approved by: https://github.com/zou3519
ghstack dependencies: #163284, #163714
2025-11-13 00:35:20 +00:00
c9b09a31e8 [opaque obj] Allow non-effectful scriptobjs (#163714)
Fixes functionalization so that we can run ops using ScriptObjects without needing effects. Previously, functionalization would error when run on TorchBindOpOverloads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163714
Approved by: https://github.com/zou3519
ghstack dependencies: #163284
2025-11-13 00:35:20 +00:00
35571fe94b [effects] Add register_effectful_op (#163284)
Refactored `register_effectful_op` to return a handle, matching how fake kernels are registered. This makes it easier to deregister effects (see the usage sketch below).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163284
Approved by: https://github.com/zou3519
2025-11-13 00:35:20 +00:00
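A minimal sketch of the new handle-based API, following the updated tests later in this diff (the underscore-prefixed helpers are internal):

```python
import torch
from torch._higher_order_ops.effects import (
    _EffectType,
    _get_effect,
    _register_effectful_op,
)

# Registration now returns a handle rather than writing into a global table.
handle = _register_effectful_op(torch.ops.aten.cos.default, _EffectType.ORDERED)
try:
    assert _get_effect(torch.ops.aten.cos.default) == _EffectType.ORDERED
finally:
    # Deregistering is now just destroying the handle.
    handle.destroy()

assert _get_effect(torch.ops.aten.cos.default) is None
```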
485f2b607a ProxyTorchDispatchMode: Decomposing missing sympy.SymExpr should handle constant literals (#167585)
The previous work to decompose missing sympy.SymExpr (#164717) handled combinations of sub-nodes (like `s1*s2`), but forgot to handle explicit constant literals (like `2*s2`).

Added a unit test based on the report.

Fixes T244632748

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167585
Approved by: https://github.com/bobrenjc93
2025-11-13 00:27:10 +00:00
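The new unit test (near the end of this diff) reproduces the issue via export plus graph surgery and `torch._inductor.aot_compile`; the sketch below only shows how an expression like `2*s0` arises when a dynamic dimension is multiplied by a constant literal, which is the pattern the fix now decomposes correctly:

```python
import torch

class TestModule(torch.nn.Module):
    def forward(self, x):
        # x.shape[0] is a dynamic SymInt (s0), so the output expression
        # contains 2*s0: a symbolic sub-node times a constant literal.
        return x + (x.shape[0] * 2)

sample = torch.randn(5, 5)
dim0 = torch.export.Dim.DYNAMIC(max=100)
ep = torch.export.export(
    TestModule(),
    (sample,),
    dynamic_shapes={"x": (dim0, torch.export.Dim.STATIC)},
)
print(ep.graph)
```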
47 changed files with 1298 additions and 555 deletions

View File

@ -116,7 +116,7 @@ case "$tag" in
INSTALL_MINGW=yes
;;
pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11)
CUDA_VERSION=13.0.0
CUDA_VERSION=13.0.2
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
@ -125,6 +125,16 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc9)
CUDA_VERSION=13.0.2
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.10

View File

@ -1680,6 +1680,22 @@ test_operator_microbenchmark() {
done
}
test_attention_microbenchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
TEST_DIR=$(pwd)
# Install attention-gym dependency
echo "Installing attention-gym..."
python -m pip install git+https://github.com/meta-pytorch/attention-gym.git@main
pip show triton
cd "${TEST_DIR}"/benchmarks/transformer
$TASKSET python score_mod.py --config configs/config_basic.yaml \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/attention_microbenchmark.json"
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
@ -1737,6 +1753,8 @@ elif [[ "${TEST_CONFIG}" == *operator_benchmark* ]]; then
fi
elif [[ "${TEST_CONFIG}" == *operator_microbenchmark* ]]; then
test_operator_microbenchmark
elif [[ "${TEST_CONFIG}" == *attention_microbenchmark* ]]; then
test_attention_microbenchmark
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then

View File

@ -0,0 +1,73 @@
name: attention_op_microbenchmark
on:
push:
tags:
- ciflow/op-benchmark/*
workflow_dispatch:
schedule:
# Run at 07:00 UTC every day
- cron: 0 7 * * *
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
attn-microbenchmark-build:
if: github.repository_owner == 'pytorch'
uses: ./.github/workflows/_linux-build.yml
with:
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '8.0 9.0'
test-matrix: |
{ include: [
{ config: "attention_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.aws.a100" },
{ config: "attention_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.aws.h100" },
]}
secrets: inherit
attn-microbenchmark-test:
name: attn-microbenchmark-test
uses: ./.github/workflows/_linux-test.yml
needs: attn-microbenchmark-build
with:
timeout-minutes: 500
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image: ${{ needs.attn-microbenchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.attn-microbenchmark-build.outputs.test-matrix }}
secrets: inherit
# B200 runner
opmicrobenchmark-build-b200:
if: github.repository_owner == 'pytorch'
name: opmicrobenchmark-build-b200
uses: ./.github/workflows/_linux-build.yml
with:
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.dgx.b200" },
]}
secrets: inherit
opmicrobenchmark-test-b200:
name: opmicrobenchmark-test-b200
uses: ./.github/workflows/_linux-test.yml
needs: opmicrobenchmark-build-b200
with:
timeout-minutes: 500
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100
docker-image: ${{ needs.opmicrobenchmark-build-b200.outputs.docker-image }}
test-matrix: ${{ needs.opmicrobenchmark-build-b200.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit

View File

@ -54,6 +54,7 @@ jobs:
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm,
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9,
pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc9,
pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11,
pytorch-linux-jammy-py3.10-clang12,
pytorch-linux-jammy-py3.11-clang12,

View File

@ -204,6 +204,39 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cuda13_0-py3_10-gcc11-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda13_0-py3_10-gcc9-debug-build:
name: linux-jammy-cuda13.0-py3.10-gcc9-debug
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda13.0-py3.10-gcc9-debug
docker-image-name: ci-image:pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 7, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu", owners: ["oncall:debug-build"] },
{ config: "default", shard: 2, num_shards: 7, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu", owners: ["oncall:debug-build"] },
{ config: "default", shard: 3, num_shards: 7, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu", owners: ["oncall:debug-build"] },
{ config: "default", shard: 4, num_shards: 7, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu", owners: ["oncall:debug-build"] },
{ config: "default", shard: 5, num_shards: 7, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu", owners: ["oncall:debug-build"] },
{ config: "default", shard: 6, num_shards: 7, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu", owners: ["oncall:debug-build"] },
{ config: "default", shard: 7, num_shards: 7, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu", owners: ["oncall:debug-build"] },
]}
secrets: inherit
linux-jammy-cuda13_0-py3_10-gcc9-debug-test:
name: linux-jammy-cuda13.0-py3.10-gcc9-debug
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda13_0-py3_10-gcc9-debug-build
- target-determination
with:
build-environment: linux-jammy-cuda13.0-py3.10-gcc9-debug
docker-image: ${{ needs.linux-jammy-cuda13_0-py3_10-gcc9-debug-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda13_0-py3_10-gcc9-debug-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_8-py3-gcc11-slow-gradcheck-build:
name: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck
uses: ./.github/workflows/_linux-build.yml

View File

@ -268,6 +268,35 @@ jobs:
]}
secrets: inherit
linux-jammy-cuda13_0-py3_10-gcc9-build:
name: linux-jammy-cuda13.0-py3.10-gcc9
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda13.0-py3.10-gcc9
docker-image-name: ci-image:pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc9
cuda-arch-list: 8.9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
]}
secrets: inherit
linux-jammy-cuda13_0-py3_10-gcc9-test:
name: linux-jammy-cuda13.0-py3.10-gcc9
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cuda13_0-py3_10-gcc9-build
with:
build-environment: linux-jammy-cuda13.0-py3.10-gcc9
docker-image: ${{ needs.linux-jammy-cuda13_0-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda13_0-py3_10-gcc9-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cpu-py3_10-gcc11-bazel-test:
name: linux-jammy-cpu-py3.10-gcc11-bazel-test
uses: ./.github/workflows/_bazel-build-test.yml

View File

@ -78,6 +78,35 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm86-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda13_0-py3_10-gcc11-sm86-build:
name: linux-jammy-cuda13.0-py3.10-gcc11-sm86
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda13.0-py3.10-gcc11-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 3, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 2, num_shards: 3, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 3, num_shards: 3, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
secrets: inherit
linux-jammy-cuda13_0-py3_10-gcc11-sm86-test:
name: linux-jammy-cuda13.0-py3.10-gcc11-sm86
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda13_0-py3_10-gcc11-sm86-build
- target-determination
with:
build-environment: linux-jammy-cuda13.0-py3.10-gcc11-sm86
docker-image: ${{ needs.linux-jammy-cuda13_0-py3_10-gcc11-sm86-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda13_0-py3_10-gcc11-sm86-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-py3_10-clang12-build:
name: linux-jammy-py3.10-clang12
uses: ./.github/workflows/_linux-build.yml

View File

@ -63,6 +63,23 @@ jobs:
]}
secrets: inherit
libtorch-linux-jammy-cuda13_0-py3_10-gcc11-debug-build:
name: libtorch-linux-jammy-cuda13.0-py3.10-gcc11-debug
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: libtorch-linux-jammy-cuda13.0-py3.10-gcc11
cuda-arch-list: '7.5 8.9'
docker-image-name: ci-image:pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11
build-generates-artifacts: false
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: "linux.c7i.4xlarge"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-build:
name: linux-jammy-cuda12.8-py3.10-gcc11
uses: ./.github/workflows/_linux-build.yml
@ -99,6 +116,41 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda13_0-py3_10-gcc11-build:
name: linux-jammy-cuda13.0-py3.10-gcc11
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda13.0-py3.10-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11
cuda-arch-list: '7.5 8.9'
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "distributed", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.12xlarge.nvidia.gpu" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.12xlarge.nvidia.gpu" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.12xlarge.nvidia.gpu" },
{ config: "pr_time_benchmarks", shard: 1, num_shards: 1, runner: "linux.g4dn.metal.nvidia.gpu" },
]}
secrets: inherit
linux-jammy-cuda13_0-py3_10-gcc11-test:
name: linux-jammy-cuda13.0-py3.10-gcc11
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda13_0-py3_10-gcc11-build
- target-determination
with:
timeout-minutes: 360
build-environment: linux-jammy-cuda13.0-py3.10-gcc11
docker-image: ${{ needs.linux-jammy-cuda13_0-py3_10-gcc11-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda13_0-py3_10-gcc11-build.outputs.test-matrix }}
secrets: inherit
# no-ops builds test USE_PER_OPERATOR_HEADERS=0 where ATen/ops is not generated
linux-jammy-cuda12_8-py3_10-gcc11-no-ops-build:
@ -115,6 +167,21 @@ jobs:
]}
secrets: inherit
linux-jammy-cuda13_0-py3_10-gcc11-no-ops-build:
name: linux-jammy-cuda13.0-py3.10-gcc11-no-ops
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda13.0-py3.10-gcc11-no-ops
cuda-arch-list: '7.5 8.9'
docker-image-name: ci-image:pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
secrets: inherit
macos-py3-arm64-build:
if: github.repository_owner == 'pytorch'
name: macos-py3-arm64

View File

@ -3,7 +3,6 @@
#include <cstdint>
#include <map>
#include <shared_mutex>
#include <cuda_runtime_api.h>
#include <cusparse.h>
@ -89,13 +88,8 @@ TORCH_CUDA_CPP_API cublasHandle_t getCurrentCUDABlasHandle();
TORCH_CUDA_CPP_API cublasLtHandle_t getCurrentCUDABlasLtHandle();
TORCH_CUDA_CPP_API void clearCublasWorkspaces();
struct WorkspaceMapWithMutex {
std::map<std::tuple<void*, void*>, at::DataPtr> map;
std::shared_mutex mutex;
};
TORCH_CUDA_CPP_API WorkspaceMapWithMutex& cublas_handle_stream_to_workspace();
TORCH_CUDA_CPP_API WorkspaceMapWithMutex& cublaslt_handle_stream_to_workspace();
TORCH_CUDA_CPP_API std::map<std::tuple<void *, void *>, at::DataPtr>& cublas_handle_stream_to_workspace();
TORCH_CUDA_CPP_API std::map<std::tuple<void *, void *>, at::DataPtr>& cublaslt_handle_stream_to_workspace();
TORCH_CUDA_CPP_API size_t getChosenWorkspaceSize();
TORCH_CUDA_CPP_API size_t getCUDABlasLtWorkspaceSize();
TORCH_CUDA_CPP_API void* getCUDABlasLtWorkspace();

View File

@ -99,7 +99,7 @@ void destroyCublasHandle(cublasHandle_t handle) {
// - Comments of @soumith copied from cuDNN handle pool implementation
#ifdef NO_CUDNN_DESTROY_HANDLE
#else
cublasDestroy(handle);
cublasDestroy(handle);
#endif
}
@ -107,27 +107,19 @@ using CuBlasPoolType = DeviceThreadHandlePool<cublasHandle_t, createCublasHandle
} // namespace
WorkspaceMapWithMutex& cublas_handle_stream_to_workspace() {
static auto& instance = *new WorkspaceMapWithMutex;
std::map<std::tuple<void *, void *>, at::DataPtr>& cublas_handle_stream_to_workspace() {
static auto& instance = *new std::map<std::tuple<void *, void *>, at::DataPtr>;
return instance;
}
WorkspaceMapWithMutex& cublaslt_handle_stream_to_workspace() {
static auto& instance = *new WorkspaceMapWithMutex;
std::map<std::tuple<void *, void *>, at::DataPtr>& cublaslt_handle_stream_to_workspace() {
static auto& instance = *new std::map<std::tuple<void *, void *>, at::DataPtr>;
return instance;
}
void clearCublasWorkspaces() {
{
auto& workspace = cublas_handle_stream_to_workspace();
std::unique_lock<std::shared_mutex> lock(workspace.mutex);
workspace.map.clear();
}
{
auto& workspace = cublaslt_handle_stream_to_workspace();
std::unique_lock<std::shared_mutex> lock(workspace.mutex);
workspace.map.clear();
}
cublas_handle_stream_to_workspace().clear();
cublaslt_handle_stream_to_workspace().clear();
}
size_t parseChosenWorkspaceSize() {
@ -249,10 +241,8 @@ void* getCUDABlasLtWorkspace() {
auto stream = c10::cuda::getCurrentCUDAStream();
cudaStream_t _stream = stream;
auto key = std::make_tuple(static_cast<void *>(handle), static_cast<void *>(_stream));
auto& workspace = at::cuda::cublas_handle_stream_to_workspace();
std::shared_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it = workspace.map.find(key);
TORCH_INTERNAL_ASSERT(workspace_it != workspace.map.end());
auto workspace_it = at::cuda::cublas_handle_stream_to_workspace().find(key);
TORCH_INTERNAL_ASSERT(workspace_it != at::cuda::cublas_handle_stream_to_workspace().end());
return workspace_it->second.mutable_get();
}
#endif
@ -260,34 +250,11 @@ void* getCUDABlasLtWorkspace() {
auto stream = c10::cuda::getCurrentCUDAStream();
cudaStream_t _stream = stream;
auto key = std::make_tuple(static_cast<void *>(handle), static_cast<void *>(_stream));
auto& workspace = cublaslt_handle_stream_to_workspace();
// Fast path: check if workspace already exists
{
std::shared_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it = workspace.map.find(key);
if (workspace_it != workspace.map.end()) {
return workspace_it->second.mutable_get();
}
}
// Slow path: allocate workspace outside the lock
auto new_workspace = getNewCUDABlasLtWorkspace();
// Insert with lock (double-check in case another thread inserted while we
// were allocating)
{
std::unique_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it = workspace.map.find(key);
if (workspace_it == workspace.map.end()) {
workspace_it =
workspace.map.emplace(key, std::move(new_workspace)).first;
}
// else: another thread inserted it, our new_workspace will be automatically
// freed
return workspace_it->second.mutable_get();
auto workspace_it = cublaslt_handle_stream_to_workspace().find(key);
if (workspace_it == cublaslt_handle_stream_to_workspace().end()) {
workspace_it = cublaslt_handle_stream_to_workspace().insert(workspace_it, {key, getNewCUDABlasLtWorkspace()});
}
return workspace_it->second.mutable_get();
}
cublasHandle_t getCurrentCUDABlasHandle() {
@ -333,39 +300,11 @@ cublasHandle_t getCurrentCUDABlasHandle() {
// all the memory and cublas's cudaMallocAsync will return OOM
cudaStream_t _stream = stream;
auto key = std::make_tuple(static_cast<void *>(handle), static_cast<void *>(_stream));
auto& workspace = cublas_handle_stream_to_workspace();
size_t workspace_size = getChosenWorkspaceSize();
// Fast path: check if workspace already exists
{
std::shared_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it = workspace.map.find(key);
if (workspace_it != workspace.map.end()) {
TORCH_CUDABLAS_CHECK(cublasSetWorkspace(
handle, workspace_it->second.get(), workspace_size));
return handle;
}
}
// Slow path: allocate workspace outside the lock
auto new_workspace = getNewWorkspace();
// Insert with lock (double-check in case another thread inserted while we
// were allocating)
{
std::unique_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it = workspace.map.find(key);
if (workspace_it == workspace.map.end()) {
workspace_it =
workspace.map.emplace(key, std::move(new_workspace)).first;
}
// else: another thread inserted it, our new_workspace will be automatically
// freed
TORCH_CUDABLAS_CHECK(
cublasSetWorkspace(handle, workspace_it->second.get(), workspace_size));
auto workspace_it = cublas_handle_stream_to_workspace().find(key);
if (workspace_it == cublas_handle_stream_to_workspace().end()) {
workspace_it = cublas_handle_stream_to_workspace().insert(workspace_it, {key, getNewWorkspace()});
}
TORCH_CUDABLAS_CHECK(cublasSetWorkspace(handle, workspace_it->second.get(), getChosenWorkspaceSize()));
#if !defined(USE_ROCM)
// On CUDA >= 11, and architecture >= Ampere, cuBLAS can use TF32 to speedup
// FP32 data type calculations based on the value of the allow_tf32 flag.

View File

@ -61,7 +61,6 @@ list(APPEND ATen_CUDA_TEST_SRCS
${CMAKE_CURRENT_SOURCE_DIR}/cuda_complex_math_test.cu
${CMAKE_CURRENT_SOURCE_DIR}/cuda_complex_test.cu
${CMAKE_CURRENT_SOURCE_DIR}/cuda_cub_test.cu
${CMAKE_CURRENT_SOURCE_DIR}/cuda_cublas_handle_pool_test.cpp
${CMAKE_CURRENT_SOURCE_DIR}/cuda_device_test.cpp
${CMAKE_CURRENT_SOURCE_DIR}/cuda_distributions_test.cu
${CMAKE_CURRENT_SOURCE_DIR}/cuda_dlconvertor_test.cpp

View File

@ -1,77 +0,0 @@
#include <gtest/gtest.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAGuard.h>
#include <atomic>
#include <thread>
#include <vector>
// Test concurrent access to getCurrentCUDABlasHandle and getCUDABlasLtWorkspace
// to verify that the data race fix is working correctly
TEST(CUDABlasHandlePoolTest, ConcurrentGetAndClearWorkspaces) {
if (!at::cuda::is_available()) {
return;
}
constexpr int num_accessor_threads = 15;
constexpr int num_clear_threads = 5;
constexpr int iterations_per_thread = 50;
std::atomic<bool> stop{false};
std::atomic<int> error_count{0};
std::vector<std::thread> threads;
threads.reserve(num_accessor_threads + num_clear_threads);
// Launch accessor threads
for (int i = 0; i < num_accessor_threads; ++i) {
threads.emplace_back([&stop, &error_count]() {
try {
at::cuda::CUDAGuard device_guard(0);
while (!stop.load(std::memory_order_relaxed)) {
const auto handle = at::cuda::getCurrentCUDABlasHandle();
const auto workspace = at::cuda::getCUDABlasLtWorkspace();
if (handle == nullptr || workspace == nullptr) {
error_count++;
}
}
} catch (const std::exception& e) {
error_count++;
}
});
}
// Launch threads that clear workspaces
for (int i = 0; i < num_clear_threads; ++i) {
threads.emplace_back([&error_count]() {
try {
for (int j = 0; j < iterations_per_thread; ++j) {
at::cuda::clearCublasWorkspaces();
std::this_thread::yield();
}
} catch (const std::exception& e) {
error_count++;
}
});
}
// Let them run for a bit
std::this_thread::sleep_for(std::chrono::milliseconds(100));
stop.store(true, std::memory_order_relaxed);
for (auto& thread : threads) {
thread.join();
}
EXPECT_EQ(error_count.load(), 0);
}
int main(int argc, char* argv[]) {
::testing::InitGoogleTest(&argc, argv);
c10::cuda::CUDACachingAllocator::init(1);
return RUN_ALL_TESTS();
}

View File

@ -125,6 +125,17 @@ AttentionType = Literal[
]
DtypeString = Literal["bfloat16", "float16", "float32"]
SpeedupType = Literal["fwd", "bwd"]
# Operator Name mapping
backend_to_operator_name = {
"math": "math attention kernel",
"efficient": "efficient attention kernel",
"cudnn": "cudnn attention kernel",
"fav2": "flash attention 2 kernel",
"fav3": "flash attention 3 kernel",
"fakv": "flash attention kv cache kernel",
"og-eager": "eager attention kernel",
"flex": "flex attention kernel",
}
def benchmark_torch_function_in_microseconds(func: Callable, *args, **kwargs) -> float:
@ -1265,12 +1276,14 @@ def _output_json_for_dashboard(
model: ModelInfo
metric: MetricInfo
operator_name = backend_to_operator_name.get(backend, backend)
# Benchmark extra info
benchmark_extra_info = {
"input_config": input_config,
"device": device,
"arch": device_arch,
"operator_name": backend,
"operator_name": operator_name,
"attn_type": config.attn_type,
"shape": str(config.shape),
"max_autotune": config.max_autotune,
@ -1288,7 +1301,7 @@ def _output_json_for_dashboard(
type="attention-benchmark",
origins=["pytorch"],
extra_info={
"operator_name": backend,
"operator_name": operator_name,
"attn_type": config.attn_type,
},
),
@ -1315,7 +1328,7 @@ def _output_json_for_dashboard(
type="attention-benchmark",
origins=["pytorch"],
extra_info={
"operator_name": backend,
"operator_name": operator_name,
},
),
metric=MetricInfo(
@ -1341,7 +1354,7 @@ def _output_json_for_dashboard(
type="attention-benchmark",
origins=["pytorch"],
extra_info={
"operator_name": backend,
"operator_name": operator_name,
},
),
metric=MetricInfo(
@ -1371,7 +1384,7 @@ def _output_json_for_dashboard(
type="attention-benchmark",
origins=["pytorch"],
extra_info={
"operator_name": backend,
"operator_name": operator_name,
},
),
metric=MetricInfo(

View File

@ -1394,6 +1394,9 @@ if(NOT INTERN_BUILD_MOBILE)
# https://github.com/pytorch/pytorch/pull/55292
string(APPEND CMAKE_CUDA_FLAGS " -DCUB_WRAPPED_NAMESPACE=at_cuda_detail")
# Suppress cusparse warnings
string(APPEND CMAKE_CUDA_FLAGS " -DDISABLE_CUSPARSE_DEPRECATED")
message(STATUS "Found CUDA with FP16 support, compiling with torch.cuda.HalfTensor")
string(APPEND CMAKE_CUDA_FLAGS " -DCUDA_HAS_FP16=1"
" -D__CUDA_NO_HALF_OPERATORS__"

View File

@ -54,12 +54,10 @@ from torch.testing._internal.common_distributed import (
verify_ddp_error_logged,
)
from torch.testing._internal.common_utils import (
MI300_ARCH,
retry_on_connect_failures,
run_tests,
skip_but_pass_in_sandcastle,
skipIfRocm,
skipIfRocmArch,
TestCase,
)
@ -1233,7 +1231,7 @@ class ProcessGroupGlooTest(MultiProcessTestCase):
self._test_gather_stress(inputs, lambda t: t.clone())
@skip_if_lt_x_gpu(2)
@skipIfRocmArch(MI300_ARCH)
@skipIfRocm
@requires_gloo()
def test_gather_stress_cuda(self):
inputs = [torch.tensor([i + self.rank]).cuda() for i in range(1000)]

View File

@ -18,15 +18,16 @@ from functorch.compile import (
nop,
)
from torch._functorch.aot_autograd import aot_export_module
from torch._higher_order_ops.effects import with_effects
from torch._higher_order_ops.effects import (
_EffectType,
_get_effect,
_register_effectful_op,
with_effects,
)
from torch._higher_order_ops.torchbind import enable_torchbind_tracing
from torch.fx.experimental.proxy_tensor import make_fx
from torch.testing import FileCheck
from torch.testing._internal.common_cuda import (
_get_torch_cuda_version,
SM70OrLater,
SM80OrLater,
)
from torch.testing._internal.common_cuda import SM70OrLater, SM80OrLater
from torch.testing._internal.common_quantization import skipIfNoDynamoSupport
from torch.testing._internal.common_utils import (
IS_WINDOWS,
@ -300,7 +301,6 @@ def forward(self, arg0_1, arg1_1, arg2_1):
@unittest.skipIf(IS_WINDOWS, "triton")
@unittest.skipIf(TEST_WITH_ROCM, "triton")
@unittest.skipIf(not SM80OrLater, "triton")
@unittest.skipIf(_get_torch_cuda_version() >= (11, 7), "triton")
@unittest.skipIf(not TEST_CUDA, "triton")
@skipIfNoDynamoSupport
def test_register_effectful_custom_op(self):
@ -308,41 +308,23 @@ def forward(self, arg0_1, arg1_1, arg2_1):
torch._dynamo.config.capture_scalar_outputs = True
torch._dynamo.config.capture_dynamic_output_shape_ops = True
torch.library.define(
"mylib::record_scalar_tensor",
"(Tensor x, str prefix) -> ()",
lib=lib,
)
# global variable to store the recorded tensor and prefix.
recorded_dict = {}
# Pytorch custorm op implementation
@torch.library.impl(
"mylib::record_scalar_tensor",
"CompositeExplicitAutograd",
lib=lib,
)
def record_scalar_tensor(x, prefix):
# Pytorch custom op implementation
@torch.library.custom_op("mylib::record_scalar_tensor", mutates_args=())
def record_scalar_tensor(x: torch.Tensor, prefix: str) -> None:
recorded_dict[prefix] = x.clone()
return
# Meta function of the custom op
@torch.library.register_fake(
"mylib::record_scalar_tensor",
lib=lib,
)
@record_scalar_tensor.register_fake
def record_scalar_tensor_meta(x, prefix):
return
from torch._higher_order_ops.effects import (
_EffectType,
_register_effectful_op,
)
record_scalar_tensor.register_effect(_EffectType.ORDERED)
_register_effectful_op(
torch.ops.mylib.record_scalar_tensor.default, _EffectType.ORDERED
)
self.assertEqual(_get_effect(record_scalar_tensor), _EffectType.ORDERED)
my_config = {}
my_config["MockModule"] = "mean"
@ -469,13 +451,12 @@ def forward(self, arg0_1, arg1_1, arg2_1):
torch.library.register_autograd("_mylib::zoo", foo_bwd, lib=lib)
from torch._higher_order_ops.effects import (
_EffectType,
_register_effectful_op,
torch.library._register_effectful_op(
torch.ops._mylib.zoo.default, _EffectType.ORDERED
)
torch.library._register_effectful_op(
torch.ops._mylib.zoo2.default, _EffectType.ORDERED
)
_register_effectful_op(torch.ops._mylib.zoo.default, _EffectType.ORDERED)
_register_effectful_op(torch.ops._mylib.zoo2.default, _EffectType.ORDERED)
def fn(x, y):
return torch.ops._mylib.zoo(x) + y
@ -687,13 +668,13 @@ def forward(self, arg0_1, arg1_1):
torch.library.register_autograd("_mylib::foo", foo_bwd, lib=lib)
from torch._higher_order_ops.effects import (
_deregister_effectful_op,
_EffectType,
_register_effectful_op,
handle = _register_effectful_op(
torch.ops._mylib.foo.default, _EffectType.ORDERED
)
self.assertEqual(
_get_effect(torch.ops._mylib.foo.default), _EffectType.ORDERED
)
_register_effectful_op(torch.ops._mylib.foo.default, _EffectType.ORDERED)
try:
def fn(x, y):
@ -779,17 +760,13 @@ def forward(self, tangents_1, tangents_2, tangents_token):
else:
raise NotImplementedError
finally:
_deregister_effectful_op(torch.ops._mylib.foo.default)
handle.destroy()
self.assertEqual(_get_effect(torch.ops._mylib.foo.default), None)
@skipIfNoDynamoSupport
def test_regular_effectful_op_only_in_backward(self):
from torch._higher_order_ops.effects import (
_deregister_effectful_op,
_EffectType,
_register_effectful_op,
)
_register_effectful_op(torch.ops.aten.cos.default, _EffectType.ORDERED)
handle = _register_effectful_op(torch.ops.aten.cos.default, _EffectType.ORDERED)
try:
def fn(x):
@ -852,17 +829,11 @@ def forward(self, primals_1, primals_2, tangents_1, tangents_2, tangents_token):
return (mul, mul_1, getitem_2)""",
)
finally:
_deregister_effectful_op(torch.ops.aten.cos.default)
handle.destroy()
@skipIfNoDynamoSupport
def test_regular_effectful_op_in_forward_and_backward(self):
from torch._higher_order_ops.effects import (
_deregister_effectful_op,
_EffectType,
_register_effectful_op,
)
_register_effectful_op(torch.ops.aten.cos.default, _EffectType.ORDERED)
handle = _register_effectful_op(torch.ops.aten.cos.default, _EffectType.ORDERED)
try:
def fn(x):
@ -897,7 +868,7 @@ def forward(self, primals_2, getitem_1, tangents_1, tangents_token):
return (mul_1, getitem_2)""",
)
finally:
_deregister_effectful_op(torch.ops.aten.cos.default)
handle.destroy()
if __name__ == "__main__":

View File

@ -136,12 +136,59 @@ class TestStandaloneInductor(TestCase):
mod_opt = inductor.compile(mod, inp)
self.assertEqual(mod(*inp), mod_opt(*inp))
@mock.patch.dict(os.environ, {"TORCHINDUCTOR_DEBUG_COMPILE": "1"})
def test_inductor_generate_debug_compile(self):
cpp_code = """
int main(){
return 0;
}
"""
_, source_path = write(
cpp_code,
"cpp",
)
build_option = CppOptions()
cpp_builder = CppBuilder(
name="test_compile",
sources=source_path,
output_dir=os.path.dirname(source_path),
BuildOption=build_option,
)
cpp_builder.build()
binary_path = cpp_builder.get_target_file_path()
"""
When generate-debug-compile is turned on:
On Windows, it should create a [module_name].pdb file, which helps debugging with WinDBG.
On Linux, it should create debug sections in the binary file.
"""
def check_linux_debug_section(module_path: str):
check_cmd = shlex.split(f"readelf -S {module_path}")
output = safe_command_output(check_cmd)
has_debug_sym = ".debug_info" in output
self.assertEqual(has_debug_sym, True)
def check_windows_pdb_exist(module_path: str):
file_name_no_ext = os.path.splitext(module_path)[0]
file_name_pdb = f"{file_name_no_ext}.pdb"
has_pdb_file = os.path.exists(file_name_pdb)
self.assertEqual(has_pdb_file, True)
if _IS_WINDOWS:
check_windows_pdb_exist(binary_path)
elif _IS_MACOS:
pass  # macOS: not sure whether it should work there.
else:
check_linux_debug_section(binary_path)
@mock.patch.dict(os.environ, {"TORCHINDUCTOR_DEBUG_SYMBOL": "1"})
def test_inductor_generate_debug_symbol(self):
cpp_code = """
int main(){
return 0;
}
int main(){
return 0;
}
"""
_, source_path = write(

View File

@ -683,6 +683,16 @@ class TestNumPyInterop(TestCase):
):
f(xs)
def test_copy_mode(self):
def f(x):
return np.array(x, copy=np._CopyMode.IF_NEEDED)
opt_f = torch.compile(backend="eager", fullgraph=True)(f)
x = np.array([1, 2, 3])
# Should run without throwing an exception
y = opt_f(x)
self.assertEqual(y, f(x))
instantiate_device_type_tests(TestNumPyInterop, globals())

View File

@ -90,7 +90,7 @@ class TestOpaqueObject(TestCase):
# This is not accurate since the queue could have tensors that are
# not rank 1
ctx = torch._custom_op.impl.get_ctx()
u0 = ctx.create_unbacked_symint()
u0 = ctx.new_dynamic_size()
return torch.empty(u0)
self.lib._register_fake("queue_pop", pop_impl_fake)
@ -107,8 +107,7 @@ class TestOpaqueObject(TestCase):
@size_impl.register_fake
def size_impl_fake(q: torch._C.ScriptObject) -> int:
ctx = torch._custom_op.impl.get_ctx()
u0 = ctx.create_unbacked_symint()
torch._check_is_size(u0)
u0 = ctx.new_dynamic_size()
return u0
super().setUp()

View File

@ -1,12 +1,22 @@
# Owner(s): ["module: custom-operators"]
import random
from contextlib import ExitStack
import torch
from torch._dynamo.test_case import run_tests, TestCase
from torch._dynamo.testing import AotEagerAndRecordGraphs
from torch._functorch.aot_autograd import (
aot_compile_joint_with_descriptors,
aot_export_joint_with_descriptors,
aot_export_module,
)
from torch._library.effects import EffectType
from torch._library.fake_class_registry import FakeScriptObject
from torch._library.opaque_object import register_opaque_type
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch.testing._internal.common_utils import (
instantiate_parametrized_tests,
parametrize,
@ -41,11 +51,21 @@ class OpaqueQueue:
class RNGState:
def __init__(self, seed):
self.rng = random.Random(seed)
self.seed = seed
self.rng = random.Random(self.seed)
class Counter:
def __init__(self, start):
self.counter = torch.tensor(start)
def increment_counter(self):
self.counter += 1
register_opaque_type(OpaqueQueue, "_TestOpaqueObject_OpaqueQueue")
register_opaque_type(RNGState, "_TestOpaqueObject_RNGState")
register_opaque_type(Counter, "_TestOpaqueObject_Counter")
class TestOpaqueObject(TestCase):
@ -125,6 +145,20 @@ class TestOpaqueObject(TestCase):
def noisy_inject_fake(x: torch.Tensor, obj: RNGState) -> torch.Tensor:
return torch.empty_like(x)
@torch.library.custom_op(
"_TestOpaqueObject::increment_counter",
mutates_args=["prev"],
)
def increment_counter_impl(c: Counter, prev: torch.Tensor) -> torch.Tensor:
assert isinstance(c, Counter)
prev.copy_(c.counter)
c.increment_counter()
return c.counter
@increment_counter_impl.register_fake
def increment_counter_fake(c: Counter, prev: torch.Tensor) -> torch.Tensor:
return torch.empty_like(prev)
super().setUp()
def tearDown(self):
@ -233,6 +267,235 @@ def forward(self, arg0_1, arg1_1):
):
make_fx(f, tracing_mode=make_fx_tracing_mode)(RNGState(0), torch.ones(3))
def test_aot_export(self):
class Model(torch.nn.Module):
def __init__(self) -> None:
super().__init__()
def forward(self, rng_state, x):
x = torch.ops._TestOpaqueObject.noisy_inject(x, rng_state)
x = x * x
x = torch.ops._TestOpaqueObject.noisy_inject(x, rng_state)
x = x + x
return (x,)
mod = Model()
rng = RNGState(0)
x = torch.ones(2, 3)
fake_mode = torch._subclasses.fake_tensor.FakeTensorMode()
fake_rng = torch._library.fake_class_registry.maybe_to_fake_obj(fake_mode, rng)
fake_x = fake_mode.from_tensor(x)
gm = aot_export_module(mod, (fake_rng, fake_x), trace_joint=False)[0]
# By default we don't register ops containing PyObjs as being effectful
self.assertExpectedInline(
gm.code.strip(),
"""\
def forward(self, arg0_1, arg1_1):
noisy_inject = torch.ops._TestOpaqueObject.noisy_inject.default(arg1_1, arg0_1); arg1_1 = None
mul = torch.ops.aten.mul.Tensor(noisy_inject, noisy_inject); noisy_inject = None
noisy_inject_1 = torch.ops._TestOpaqueObject.noisy_inject.default(mul, arg0_1); mul = arg0_1 = None
add = torch.ops.aten.add.Tensor(noisy_inject_1, noisy_inject_1); noisy_inject_1 = None
return (add,)""", # noqa: B950
)
torch.library._register_effectful_op(
"_TestOpaqueObject::noisy_inject", EffectType.ORDERED
)
try:
gm = aot_export_module(mod, (rng, fake_x), trace_joint=False)[0]
# inputs: token, rng, x
# return: token, res
self.assertExpectedInline(
gm.code.strip(),
"""\
def forward(self, arg0_1, arg1_1, arg2_1):
with_effects = torch.ops.higher_order.with_effects(arg0_1, torch.ops._TestOpaqueObject.noisy_inject.default, arg2_1, arg1_1); arg0_1 = arg2_1 = None
getitem = with_effects[0]
getitem_1 = with_effects[1]; with_effects = None
mul = torch.ops.aten.mul.Tensor(getitem_1, getitem_1); getitem_1 = None
with_effects_1 = torch.ops.higher_order.with_effects(getitem, torch.ops._TestOpaqueObject.noisy_inject.default, mul, arg1_1); getitem = mul = arg1_1 = None
getitem_2 = with_effects_1[0]
getitem_3 = with_effects_1[1]; with_effects_1 = None
add = torch.ops.aten.add.Tensor(getitem_3, getitem_3); getitem_3 = None
return (getitem_2, add)""", # noqa: B950
)
finally:
torch.library._register_effectful_op(
"_TestOpaqueObject::noisy_inject", None
)
def test_compile(self):
def foo(rng_state, x):
x = torch.ops._TestOpaqueObject.noisy_inject(x, rng_state)
x = x * x
x = torch.ops._TestOpaqueObject.noisy_inject(x, rng_state)
x = x + x
return x
rng = RNGState(0)
x = torch.ones(2, 3)
res = torch.compile(foo, fullgraph=True, backend="inductor")(rng, x)
self.assertFalse(torch.allclose(res, x * x + x))
backend = AotEagerAndRecordGraphs()
torch.compile(foo, fullgraph=True, backend=backend)(rng, x)
self.assertExpectedInline(
backend.graphs[0].code.strip(),
"""\
def forward(self, L_x_ : torch.Tensor, L_rng_state_ : __main___RNGState):
l_x_ = L_x_
l_rng_state_ = L_rng_state_
x = torch.ops._TestOpaqueObject.noisy_inject(l_x_, l_rng_state_); l_x_ = None
x_1 = x * x; x = None
x_2 = torch.ops._TestOpaqueObject.noisy_inject(x_1, l_rng_state_); x_1 = l_rng_state_ = None
x_3 = x_2 + x_2; x_2 = None
return (x_3,)""", # noqa: B950
)
self.assertExpectedInline(
backend.fw_graphs[0].code.strip(),
"""\
def forward(self, arg0_1, arg1_1):
noisy_inject = torch.ops._TestOpaqueObject.noisy_inject.default(arg0_1, arg1_1); arg0_1 = None
mul = torch.ops.aten.mul.Tensor(noisy_inject, noisy_inject); noisy_inject = None
noisy_inject_1 = torch.ops._TestOpaqueObject.noisy_inject.default(mul, arg1_1); mul = arg1_1 = None
add = torch.ops.aten.add.Tensor(noisy_inject_1, noisy_inject_1); noisy_inject_1 = None
return (add,)""", # noqa: B950
)
def test_compile_intermediate(self):
counter = Counter(0)
def foo(x, y):
z = torch.ops._TestOpaqueObject.increment_counter(counter, y)
x = x * z
z = torch.ops._TestOpaqueObject.increment_counter(counter, y)
x = x + z
return x, counter
inp = (torch.tensor(1), torch.tensor(0))
backend = AotEagerAndRecordGraphs()
opt_f = torch.compile(foo, fullgraph=True, backend=backend)
res = opt_f(*inp)
self.assertEqual(res[0], torch.tensor(3))
self.assertEqual(res[1].counter, torch.tensor(2))
res = opt_f(*inp)
self.assertEqual(res[0], torch.tensor(7))
self.assertEqual(res[1].counter, torch.tensor(4))
# counter is automatically lifted as an input
# Even though we returned counter in the eager code, it does not get
# returned in the graph because dynamo does not detect that the object
# is mutated.
self.assertExpectedInline(
backend.fw_graphs[0].code.strip(),
"""\
def forward(self, arg0_1, arg1_1, arg2_1):
auto_functionalized_v2 = torch.ops.higher_order.auto_functionalized_v2(torch.ops._TestOpaqueObject.increment_counter.default, c = arg1_1, _prev_base_index = 0, _all_bases = [arg0_1])
getitem = auto_functionalized_v2[0]
getitem_1 = auto_functionalized_v2[1]; auto_functionalized_v2 = None
mul = torch.ops.aten.mul.Tensor(arg2_1, getitem); arg2_1 = getitem = None
auto_functionalized_v2_1 = torch.ops.higher_order.auto_functionalized_v2(torch.ops._TestOpaqueObject.increment_counter.default, c = arg1_1, _prev_base_index = 0, _all_bases = [getitem_1]); arg1_1 = getitem_1 = None
getitem_2 = auto_functionalized_v2_1[0]
getitem_3 = auto_functionalized_v2_1[1]; auto_functionalized_v2_1 = None
add = torch.ops.aten.add.Tensor(mul, getitem_2); mul = getitem_2 = None
copy_ = torch.ops.aten.copy_.default(arg0_1, getitem_3); arg0_1 = getitem_3 = copy_ = None
return (add,)""", # noqa: B950
)
def test_compile_attribute(self):
counter = Counter(0)
def foo(counter, x):
x = x * x
counter.increment_counter()
return x
with self.assertRaisesRegex(
RuntimeError, "Attempted to access attributes/methods on an OpaqueObject"
):
torch.compile(foo)(counter, torch.ones(2, 3))
def bar(counter, x):
x = x * x
x += counter.counter
return x
with self.assertRaisesRegex(
RuntimeError, "Attempted to access attributes/methods on an OpaqueObject"
):
torch.compile(bar)(counter, torch.ones(2, 3))
def test_export_joint(self):
class Moo(torch.nn.Module):
def forward(self, x, y):
return x * y
register_opaque_type(Moo, "_TestOpaqueObject_Moo")
torch.library.define(
"_TestOpaqueObject::module_mul",
"(_TestOpaqueObject_Moo a, Tensor b, SymInt c) -> Tensor",
tags=torch.Tag.pt2_compliant_tag,
lib=self.lib,
)
@torch.library.impl(
"_TestOpaqueObject::module_mul", "CompositeExplicitAutograd", lib=self.lib
)
def module_mul_impl(m: Moo, a: torch.Tensor, b: int) -> torch.Tensor:
assert isinstance(m, Moo)
return m(a, b)
@torch.library.register_fake("_TestOpaqueObject::module_mul", lib=self.lib)
def module_mul_fake(m: Moo, a: torch.Tensor, b: int) -> torch.Tensor:
return torch.empty_like(a)
def module_mul_setup_context(ctx, inputs, output):
m, a, b = inputs
ctx.b = b
def module_mul_backward(ctx, grad) -> torch.Tensor:
return None, grad * ctx.b, None
torch.library.register_autograd(
"_TestOpaqueObject::module_mul",
module_mul_backward,
setup_context=module_mul_setup_context,
lib=self.lib,
)
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.moo = Moo()
def forward(self, x, y):
b = y.item()
return torch.ops._TestOpaqueObject.module_mul(self.moo, x, b)
inp = (torch.randn(3, requires_grad=True), torch.tensor(4))
with ExitStack() as stack:
with FakeTensorMode(shape_env=ShapeEnv()):
joint = aot_export_joint_with_descriptors(stack, M(), inp)
self.assertExpectedInline(
joint.graph_module.code.strip(),
"""\
def forward(self, primals, tangents):
primals_1, primals_2, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec)
_local_scalar_dense = torch.ops.aten._local_scalar_dense.default(primals_2); primals_2 = None
_opaque_obj0 = self._opaque_obj0
module_mul = torch.ops._TestOpaqueObject.module_mul.default(_opaque_obj0, primals_1, _local_scalar_dense); _opaque_obj0 = primals_1 = None
mul_1 = torch.ops.aten.mul.Tensor(tangents_1, _local_scalar_dense); tangents_1 = _local_scalar_dense = None
return pytree.tree_unflatten([module_mul, mul_1, None], self._out_spec)""", # noqa: B950
)
compiled_fn = aot_compile_joint_with_descriptors(joint)
self.assertEqual(compiled_fn(*inp), M()(*inp))
instantiate_parametrized_tests(TestOpaqueObject)

View File

@ -796,6 +796,27 @@ def forward(self, x_1):
self._test(f, [torch.randn(1, 10), torch.zeros(1, dtype=torch.long)])
@unittest.skipIf(not HAS_CUDA, 'CUDA-only test')
def test_T244632748(self):
class TestModule(torch.nn.Module):
def forward(self, x):
return x + (x.shape[0] * 2)
mod = TestModule()
sample = torch.randn((5, 5)).to("cuda")
dim0 = torch.export.Dim.DYNAMIC(max=100)
dynamic_shapes = {"x": (dim0, torch.export.Dim.STATIC)}
ep = torch.export.export(mod, (sample,), dynamic_shapes=dynamic_shapes)
gm = ep.module()
symint = list(gm.graph.nodes)[3].meta["val"]
list(gm.graph.nodes)[3].replace_all_uses_with(symint)
gm.graph.eliminate_dead_code()
inductor_fx = torch._inductor.aot_compile(
gm, (sample,), options={"fx_wrapper": True, "compile_threads": 1}
)
class TestGenericProxyTensorReal(TestGenericProxyTensor):
tracing_mode = "real"

View File

@ -310,7 +310,7 @@ class TestHistogram(TestCase):
)
# these should not crash
np.histogram([np.array(0.5) for i in range(10)] + [0.500000000000001])
np.histogram([np.array(0.5) for i in range(10)] + [0.500000000000002])
np.histogram([np.array(0.5) for i in range(10)] + [0.5])
@xpassIfTorchDynamo_np # (reason="bins='auto'")

View File

@ -3657,5 +3657,15 @@
"Explanation": "Encountered triton kernel unsupported feature: {msg}",
"Hints": []
}
],
"GB0362": [
{
"Gb_type": "Attempted to access attributes/methods on an OpaqueObject",
"Context": "value={self.value}, attr={name}",
"Explanation": "Attribute/method access of OpaqueObjects is not supported.",
"Hints": [
"Use custom operators instead of direct attribute/method access."
]
}
]
}

View File

@ -56,6 +56,7 @@ from torch._guards import (
tracing,
TracingContext,
)
from torch._library.opaque_object import is_opaque_type
from torch._subclasses.fake_tensor import FakeTensor
from torch._utils_internal import signpost_event
from torch.export.dynamic_shapes import _ConstraintTarget
@ -2605,6 +2606,8 @@ class OutputGraph(OutputGraphCommon):
fake_attr_val,
)
continue
if is_opaque_type(type(node.meta["grapharg"].example)):
continue
fake = (
arg.fake_tensor if arg.fake_tensor is not None else arg.example
)

View File

@ -58,6 +58,7 @@ from torch._dynamo.utils import (
from torch._guards import TracingContext
from torch._higher_order_ops.flat_apply import flat_apply
from torch._higher_order_ops.torchbind import call_torchbind
from torch._library.opaque_object import is_opaque_type
from torch._ops import HigherOrderOperator
from torch._subclasses.fake_tensor import FakeTensor, is_fake, maybe_get_fake_mode
from torch._subclasses.meta_utils import is_sparse_any, safe_grad
@ -1452,27 +1453,32 @@ class VariableBuilder:
source=self.source,
)
# This exists to allow a smoother transition.
# The implications are:
# The script objects won't be tracked as proxies.
# Methods on these objects won't show up in the graph.
# The original script object might be mutated.
if not hasattr(value, "__obj_flatten__"):
return self.wrap_user_defined(value)
if is_opaque_type(type(value)):
self.install_guards(GuardBuilder.TYPE_MATCH)
# Install the guards on the fully qualified name of the script object
LazyVariableTracker.realize_all(
VariableBuilder(self.tx, ScriptObjectQualifiedNameSource(self.source))(
value._type().qualified_name() # type: ignore[attr-defined]
elif not hasattr(value, "__obj_flatten__"):
# This exists to allow a smoother transition.
# The implications are:
# The script objects won't be tracked as proxies.
# Methods on these objects won't show up in the graph.
# The original script object might be mutated.
return self.wrap_user_defined(value)
else:
# Install the guards on the fully qualified name of the script object
LazyVariableTracker.realize_all(
VariableBuilder(
self.tx, ScriptObjectQualifiedNameSource(self.source)
)(
value._type().qualified_name() # type: ignore[attr-defined]
)
)
)
# Install the guards on the content of the script object by setting the source
# to be FlattenScriptObjectSource, which calls __obj_flatten__() to get the contents.
LazyVariableTracker.realize_all(
VariableBuilder(self.tx, FlattenScriptObjectSource(self.source))(
value.__obj_flatten__()
# Install the guards on the content of the script object by setting the source
# to be FlattenScriptObjectSource, which calls __obj_flatten__() to get the contents.
LazyVariableTracker.realize_all(
VariableBuilder(self.tx, FlattenScriptObjectSource(self.source))(
value.__obj_flatten__()
)
)
)
fake_script_obj = torch._library.fake_class_registry.maybe_to_fake_obj(
self.tx.output.fake_mode, value

View File

@ -18,6 +18,7 @@ Key classes include:
"""
import dataclasses
import enum
import functools
import inspect
import itertools
@ -1604,11 +1605,16 @@ class NumpyVariable(VariableTracker):
return self.value
def as_proxy(self):
if config.trace_numpy and isinstance(self.value, type):
# This handles numpy dtype attributes such as np.float32
# We return a string as we don't want to serialize non-PyTorch objects in the output FX graph
# In torch/_numpy we normalize strings to their dtypes when the input is a dtype, as NumPy does
return self.value.__name__
if config.trace_numpy:
# Can replace with EnumType once we drop 3.10 support
if isinstance(self.value, enum.EnumMeta):
# This is mostly for np._CopyMode
return self.value
if isinstance(self.value, type):
# This handles numpy dtype attributes such as np.float32
# We return a string as we don't want to serialize non-PyTorch objects in the output FX graph
# In torch/_numpy we normalize strings to their dtypes when the input is a dtype, as NumPy does
return self.value.__name__
return super().as_proxy()

View File

@ -25,6 +25,7 @@ from typing_extensions import ParamSpec
import torch
from torch._guards import Source
from torch._library.opaque_object import is_opaque_type, OpaqueTypeStr
from torch.fx.proxy import Proxy
from .. import graph_break_hints
@ -61,7 +62,7 @@ class TorchScriptObjectVariable(UserDefinedObjectVariable):
@classmethod
def is_matching_cls(cls, user_cls: type) -> bool:
return issubclass(user_cls, torch.ScriptObject)
return issubclass(user_cls, torch.ScriptObject) or is_opaque_type(user_cls)
@staticmethod
def create(proxy: Proxy, value: Any, **options: Any) -> "TorchScriptObjectVariable":
@ -80,6 +81,16 @@ class TorchScriptObjectVariable(UserDefinedObjectVariable):
"Dynamo cannot safely trace script object due to graph break."
)
def var_getattr(self, tx: "InstructionTranslator", name: str) -> VariableTracker:
if getattr(self.value, "script_class_name", "") == OpaqueTypeStr:
unimplemented(
gb_type="Attempted to access attributes/methods on an OpaqueObject",
context=f"value={self.value}, attr={name}",
explanation="Attribute/method access of OpaqueObjects is not supported.",
hints=[
"Use custom operators instead of direct attribute/method access.",
],
)
from torch._higher_order_ops.torchbind import call_torchbind
from ..source import AttrSource

View File

@ -24,6 +24,7 @@ from torch._export.passes.lift_constants_pass import ConstantAttrMap
from torch._export.utils import _fakify_params_buffers
from torch._guards import Source
from torch._library.fake_class_registry import FakeScriptObject
from torch._library.opaque_object import is_opaque_type
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.export import Constraint
from torch.export.dynamic_shapes import (
@ -946,7 +947,9 @@ def _fakify_script_objects(
try:
for obj, fqns in constant_attrs.items():
if torch._library.fake_class_registry._is_script_object(obj):
if torch._library.fake_class_registry._is_script_object(
obj
) or is_opaque_type(obj):
fake_script_obj = _maybe_fakify_obj(obj)
for fqn in fqns:
cur_mod, attr = _leaf_mod_and_attr(mod, fqn)

View File

@ -8,6 +8,7 @@ from typing import Any, Optional
import torch
import torch.utils._pytree as pytree
from torch._guards import detect_fake_mode
from torch._library.opaque_object import is_opaque_type
from torch._subclasses import FakeTensor, FakeTensorMode
from torch.fx.experimental.proxy_tensor import _pytree_subclasses_that_lose_info
from torch.fx.experimental.symbolic_shapes import ShapeEnv
@ -46,7 +47,7 @@ def process_inputs(
hint=x,
source=source,
)
if isinstance(x, torch.ScriptObject):
if isinstance(x, torch.ScriptObject) or is_opaque_type(type(x)):
return torch._library.fake_class_registry.maybe_to_fake_obj(
fake_mode, x
)

View File

@ -534,6 +534,7 @@ def create_aot_state(
stack.enter_context(autograd_fallback_mode("error"))
from torch._library.fake_class_registry import FakeScriptObject, maybe_to_fake_obj
from torch._library.opaque_object import is_opaque_type
# Tracing may mutate the state of the fake script object,
# so we need to duplicate the fake script objects so that subsequent tracing
@ -541,7 +542,7 @@ def create_aot_state(
def _dup_fake_script_obj(fake_flat_args):
return [
maybe_to_fake_obj(detect_fake_mode(fake_flat_args), arg.real_obj)
if isinstance(arg, FakeScriptObject)
if isinstance(arg, FakeScriptObject) or is_opaque_type(type(arg))
else arg
for arg in fake_flat_args
]

View File

@ -1,13 +1,13 @@
# mypy: allow-untyped-defs
from enum import Enum
from typing import Any, Optional, Union
from weakref import WeakKeyDictionary
import torch
import torch.utils._pytree as pytree
from torch._C import DispatchKey
from torch._higher_order_ops.torchbind import call_torchbind
from torch._library.fake_class_registry import FakeScriptObject
from torch._library.custom_ops import CustomOpDef
from torch._library.effects import EffectType
from torch._library.utils import RegistrationHandle
from torch._ops import HigherOrderOperator
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.experimental.proxy_tensor import (
@ -17,39 +17,50 @@ from torch.fx.experimental.proxy_tensor import (
)
class _EffectType(Enum):
ORDERED = "Ordered"
_op_identifier = Union[
str,
"torch._ops.OpOverload",
"torch._library.custom_ops.CustomOpDef",
"torch._ops.HigherOrderOperator",
]
OpType = Union["torch._ops.HigherOrderOperator", "torch._ops.OpOverload"]
_EffectType = EffectType
OpType = Union[torch._ops.HigherOrderOperator, torch._ops.OpOverload]
def _get_op_qualname(op: _op_identifier) -> str:
"""Convert an op identifier to a qualified string key."""
if isinstance(op, torch._ops.OpOverload):
return op._name
elif isinstance(op, torch._ops.HigherOrderOperator):
return f"{op.namespace}::{op.name()}"
elif isinstance(op, CustomOpDef):
return op._qualname
elif isinstance(op, str):
return op
raise ValueError(f"Invalid operator input {op}")
SIDE_EFFECTS = WeakKeyDictionary[OpType, _EffectType](
[
(torch.ops.aten._print.default, _EffectType.ORDERED),
(torch.ops.aten._async_error.default, _EffectType.ORDERED),
(call_torchbind, _EffectType.ORDERED),
]
)
def _register_effectful_op(
op: _op_identifier, effect: Optional[EffectType]
) -> RegistrationHandle:
qualname = _get_op_qualname(op)
entry = torch._library.simple_registry.singleton.find(qualname)
handle = entry.effect.register(effect)
return handle
def _register_effectful_op(op: OpType, effect: _EffectType):
assert isinstance(
op, (torch._ops.OpOverload, torch._ops.HigherOrderOperator)
) and not has_aliasing(op)
if op in SIDE_EFFECTS and SIDE_EFFECTS[op] != effect:
raise RuntimeError(
f"Already registered effect type {SIDE_EFFECTS[op]} to op {op}, "
f"trying to register a different effect type {effect}."
)
SIDE_EFFECTS[op] = effect
def _get_effect(op: _op_identifier) -> Optional[_EffectType]:
qualname = _get_op_qualname(op)
entry = torch._library.simple_registry.singleton.find(qualname)
return entry.effect.effect
def _deregister_effectful_op(op: OpType):
if op not in SIDE_EFFECTS:
raise RuntimeError(f"Op {op} is not registered as effectful")
del SIDE_EFFECTS[op]
_register_effectful_op("aten::_print", _EffectType.ORDERED)
_register_effectful_op("aten::_async_error", _EffectType.ORDERED)
_register_effectful_op("profiler::_record_function_exit._RecordFunction", None)
_register_effectful_op(call_torchbind, _EffectType.ORDERED)
class WithEffects(HigherOrderOperator):
@ -78,7 +89,7 @@ class WithEffects(HigherOrderOperator):
) -> tuple[Any, ...]:
assert isinstance(op, (torch._ops.HigherOrderOperator, torch._ops.OpOverload))
assert not has_aliasing(op), "Ops with aliasing are not supported"
assert has_effects(op, args, kwargs)
assert has_effects(op)
assert isinstance(kwargs, dict)
return super().__call__(token, op, *args, **kwargs)
@ -89,7 +100,7 @@ with_effects = WithEffects()
def has_aliasing(op: OpType):
# NOT FOR PUBLIC USE
if isinstance(op, torch._ops.HigherOrderOperator):
return op not in SIDE_EFFECTS
return not _get_effect(op)
for arg in op._schema.arguments:
if arg.alias_info is not None:
@ -100,7 +111,7 @@ def has_aliasing(op: OpType):
return False
def has_effects(op, args, kwargs) -> bool:
def has_effects(op) -> bool:
# Skip over the profiler's RecordFunction as they should not show up in the graph
_skip_ops = {torch.ops.profiler._record_function_exit._RecordFunction}
if op in _skip_ops:
@ -109,31 +120,10 @@ def has_effects(op, args, kwargs) -> bool:
return (
isinstance(op, (torch._ops.HigherOrderOperator, torch._ops.OpOverload))
and not has_aliasing(op)
and get_effect_key(op, args, kwargs) is not None
and _get_effect(op) is not None
)
def get_effect_key(op, args, kwargs) -> Optional[_EffectType]:
if op in SIDE_EFFECTS:
return SIDE_EFFECTS[op]
for arg in args:
if isinstance(arg, (torch.ScriptObject, FakeScriptObject)):
# Add it to the table so that next time we see the same op we don't
# have to parse through the args again
SIDE_EFFECTS[op] = _EffectType.ORDERED
return _EffectType.ORDERED
for arg in kwargs.values():
if isinstance(arg, (torch.ScriptObject, FakeScriptObject)):
# Add it to the table so that next time we see the same op we don't
# have to parse through the args again
SIDE_EFFECTS[op] = _EffectType.ORDERED
return _EffectType.ORDERED
return None
def new_token_tensor() -> torch.Tensor:
return torch.tensor([])
@ -238,7 +228,7 @@ def handle_effects(
# Get a token. We can't do `tokens.get(op, torch.tensor([]))` because
# this will create an empty tensor during proxy mode tracing if the token
# doesn't exist. But the tokens should always exist during proxy mode tracing.
key = get_effect_key(op, args, kwargs)
key = _get_effect(op)
assert key is not None
if key not in tokens:
assert allow_token_discovery, (

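A minimal sketch (not part of the diff) of how the reworked, string-keyed effect registry resolves different op identifiers; it relies only on the registrations shown above.

import torch
from torch._higher_order_ops.effects import _EffectType, _get_effect

# An OpOverload and its qualified-name string resolve to the same entry.
assert _get_effect(torch.ops.aten._print.default) is _EffectType.ORDERED
assert _get_effect("aten::_print") is _EffectType.ORDERED

# Ops with no registered effect (and no ScriptObject arguments in their
# schema) default to None, i.e. "not effectful".
assert _get_effect(torch.ops.aten.add.Tensor) is None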
View File

@ -2122,6 +2122,10 @@ class PythonWrapperCodegen(CodeGen):
output.writeline(f"{name} = {val}")
def add_torchbind_input(name, value):
if value is None:
output.writeline(f"{name} = None")
return
import pickle
assert isinstance(value, torch.ScriptObject)

View File

@ -91,6 +91,7 @@ from torch._inductor.utils import (
tensor_is_aligned,
)
from torch._library.fake_class_registry import FakeScriptObject
from torch._library.opaque_object import is_opaque_type
from torch._logging import trace_structured
from torch._utils_internal import compile_time_strobelight_meta
from torch.fx import GraphModule
@ -2747,7 +2748,9 @@ def _compile_fx_main(
node.meta["val"] = fake_mode.from_tensor(
target, static_shapes=True
)
elif isinstance(target, torch.ScriptObject):
elif isinstance(target, torch.ScriptObject) or is_opaque_type(
type(target)
):
node.meta["val"] = (
torch._library.fake_class_registry.maybe_to_fake_obj(
fake_mode, target

View File

@ -883,11 +883,12 @@ def _get_optimization_cflags(
should_use_optimized_flags = not (
config.aot_inductor.debug_compile
or os.environ.get("TORCHINDUCTOR_DEBUG_SYMBOL", "0") == "1"
or os.environ.get("TORCHINDUCTOR_DEBUG_COMPILE", "0") == "1"
)
should_add_debug_symbol_flags = (
config.aot_inductor.debug_compile
or config.aot_inductor.debug_symbols
or os.environ.get("TORCHINDUCTOR_DEBUG_COMPILE", "0") == "1"
or os.environ.get("TORCHINDUCTOR_DEBUG_SYMBOL", "0") == "1"
)
if should_use_optimized_flags:

View File

@ -9242,12 +9242,9 @@ class EffectfulKernel(FallbackKernel):
unbacked_bindings=unbacked_bindings,
)
from torch._higher_order_ops.effects import get_effect_key
from torch._higher_order_ops.effects import _get_effect
uncovered_args = [
a.value if isinstance(a, TorchBindObject) else a for a in tensor_args
]
effect_type = get_effect_key(kernel, (*nontensor_args, *uncovered_args), kwargs)
effect_type = _get_effect(kernel)
assert effect_type is not None
self.effect_type = effect_type
self.prev_effect_buffer = V.graph.effectful_ops.get(effect_type, None)
@ -9298,6 +9295,10 @@ class TorchBindObject(NonTensorObj):
def get_buf_bytes(self) -> int:
# Returns the sum of all tensors in the flattened object
real_script_obj = self.get_real_obj()
if real_script_obj is None:
return 0
assert hasattr(real_script_obj, "__obj_flatten__")
flat_dict = dict(real_script_obj.__obj_flatten__())
flat_elems = pytree.tree_flatten(flat_dict)[0]

View File

@ -26,6 +26,7 @@ import torch.utils._pytree as pytree
from torch._dynamo.utils import counters
from torch._higher_order_ops.associative_scan import associative_scan_op
from torch._higher_order_ops.triton_kernel_wrap import triton_kernel_wrapper_mutation
from torch._library.fake_class_registry import FakeScriptObject
from torch._library.utils import get_layout_constraint_tag
from torch._prims_common import ( # pyrefly: ignore # deprecated; pyrefly: ignore [deprecated]
canonicalize_dim,
@ -2704,6 +2705,8 @@ def require_channels_last(_, *args, **kwargs):
def constrain_to_fake_tensor(arg, fake_arg):
if isinstance(fake_arg, FakeScriptObject):
return arg
if isinstance(arg, ir.IRNode):
meta_stride_expr = [
s.node.expr if isinstance(s, torch.SymInt) else s for s in fake_arg.stride()
@ -7453,9 +7456,9 @@ def _sink_tokens(tokens):
def with_effects(token, op, *args, **kwargs):
result = ir.EffectfulKernel.create(op, *args, **kwargs)
from torch._higher_order_ops.effects import get_effect_key
from torch._higher_order_ops.effects import _get_effect
effect_type = get_effect_key(op, args, kwargs)
effect_type = _get_effect(op)
assert effect_type is not None
effectful_kernel = V.graph.effectful_ops[effect_type]

View File

@ -2740,163 +2740,10 @@ class AlgorithmSelectorCache(PersistentCache):
inputs_key = create_inputs_key(input_nodes)
# TODO(nmacchioni): remove this hacky way to tell if we ran benchmarking
has_autotuned = False
def benchmark(choices, hint_override: Optional[int] = None):
nonlocal has_autotuned
# TODO(nmacchioni): remove this hacky way to tell if we ran benchmarking
has_autotuned = True
counters["inductor"]["select_algorithm_autotune"] += 1
# TODO(nmacchioni): remove this layer of abstraction
# construct `benchmark_fn` which should pick between in-process and sub-process autotuning
benchmark_fn = self.make_benchmark_fn(
choices, input_nodes, layout, input_gen_fns, hint_override=hint_override
)
# `benchmark_fn(choices)` will execute each choice, and return a dict[choice, timing] which
# maps each choice to its runtime, calculated by the specified benchmarker, in milliseconds
return benchmark_fn(choices)
def autotune(choices, hint_override: Optional[int] = None):
log.debug("Starting autotuning")
with dynamo_timed(
f"{name}_template_autotuning",
log_pt2_compile_event=True,
dynamo_compile_column_us="compile_time_autotune_time_us",
metadata=_autotune_metadata(input_nodes),
):
benchmark_results = benchmark(choices, hint_override=hint_override)
if config.max_autotune_report_choices_stats:
_log_autotune_choices_stats(
f"{name}_template_autotuning", benchmark_results
)
return benchmark_results
if config.autotune_in_subproc:
# Initialize the subprocess pool so it will warm up early.
torch._inductor.autotune_process.get_tuning_process_pool()
def do_autotuning(choices, precompile_fn, hint_override: Optional[int] = None):
precompile_start_ts = time.time()
with dynamo_timed(
f"{name}_template_precompiling",
log_pt2_compile_event=True,
dynamo_compile_column_us="compile_time_autotune_time_us",
):
precompile_fn()
precompile_elapse = time.time() - precompile_start_ts
log.debug("Precompilation elapsed time: %.02fs", precompile_elapse)
# Prune anything that failed to compile
choices = [c for c in choices if not c.failed]
if len(choices) == 0:
raise self.create_no_valid_choices(
name, "All choices failed to compile for backend."
)
candidates = self.prescreen_choices(
choices, name, inputs_key, self.prescreening_cache
)
prescreening_elapse: Optional[float] = None
if candidates:
prescreening_start_ts = time.time()
timings = self.lookup(
candidates,
name,
inputs_key,
lambda choices: autotune(choices, hint_override=hint_override),
hint_override=hint_override,
)
choices = self.prune_choices_postscreen(
choices, timings, name, inputs_key, self.prescreening_cache
)
prescreening_elapse = time.time() - prescreening_start_ts
log.debug("Prescreening elapsed time: %.02fs", prescreening_elapse)
autotune_start_ts = time.time()
if best_config_future is not None:
best_config = await_sync(best_config_future)
important_keys = [
"ACC_TYPE",
"ALLOW_TF32",
"BLOCK_K",
"BLOCK_M",
"BLOCK_N",
"EVEN_K",
"GROUP_M",
"USE_FAST_ACCUM",
"num_stages",
"num_warps",
"num_consumer_groups",
"num_buffers_warp_spec",
]
choices = [
choice
for choice in choices
if all(
f"{k}={best_config[k]}" in choice.description
for k in important_keys
)
]
log.info("Filtered to %d choices based on best_config", len(choices))
timings = self.lookup(
choices,
name,
inputs_key,
lambda choices: autotune(choices, hint_override=hint_override),
hint_override=hint_override,
)
autotune_elapse = time.time() - autotune_start_ts
log.debug("Autotuning elapsed time: %.02fs", autotune_elapse)
if timings and all(
not math.isfinite(timing) for timing in timings.values()
):
raise NoValidChoicesError
if (
has_autotuned
or log.getEffectiveLevel() == logging.DEBUG
or config.trace.log_autotuning_results
):
self.log_results(
name,
input_nodes,
timings,
autotune_elapse,
precompile_elapse,
prescreening_elapse,
hint_override=hint_override,
)
def profiler_bench_function():
# we're not running through the normal caching autotuner method here because we want to avoid returning
# the cached value.
# Avoid benchmarking in a separate process because it's not easy to signal to the TuningProcess that we
# should use the profiler.
with config.patch(
profile_bandwidth_with_do_bench_using_profiling=True,
autotune_in_subproc=False,
):
return benchmark(choices)
for feedback_fn in self.feedback_saver_fns:
# re-benchmarking the same choices with profiler is a bit expensive, so pass it in as a thunk.
feedback_fn(
timings,
name,
input_nodes,
choices,
profiler_bench_function,
)
return timings
precompile_fn = self.make_precompile_fn(
choices,
name,
@ -2913,8 +2760,16 @@ class AlgorithmSelectorCache(PersistentCache):
if not hasattr(c, "hint_override")
or c.hint_override == hint_override
]
timings = do_autotuning(
filtered_choices, precompile_fn, hint_override=hint_override
timings = self.do_autotuning(
name,
input_nodes,
layout,
input_gen_fns,
inputs_key,
filtered_choices,
precompile_fn,
hint_override=hint_override,
best_config_future=best_config_future,
)
min_extern_choice = float("inf")
for choice, timing in timings.items():
@ -2950,7 +2805,16 @@ class AlgorithmSelectorCache(PersistentCache):
)
)
timings = do_autotuning(choices, precompile_fn)
timings = self.do_autotuning(
name,
input_nodes,
layout,
input_gen_fns,
inputs_key,
choices,
precompile_fn,
best_config_future=best_config_future,
)
# if timings is empty, we really have no choice but to return a semi-random
# choice. returning the first `ExternKernelCaller` is probably the safest bet
# in this case, since it will generally be the ATen kernel. if there are no
@ -2986,6 +2850,229 @@ class AlgorithmSelectorCache(PersistentCache):
return node, choice
return node
def benchmark(
self,
choices,
input_nodes,
layout,
input_gen_fns,
hint_override: Optional[int] = None,
):
counters["inductor"]["select_algorithm_autotune"] += 1
# TODO(nmacchioni): remove this layer of abstraction
# construct `benchmark_fn` which should pick between in-process and sub-process autotuning
benchmark_fn = self.make_benchmark_fn(
choices, input_nodes, layout, input_gen_fns, hint_override=hint_override
)
# `benchmark_fn(choices)` will execute each choice, and return a dict[choice, timing] which
# maps each choice to its runtime, calculated by the specified benchmarker, in milliseconds
return benchmark_fn(choices)
def autotune(
self,
name,
input_nodes,
layout,
input_gen_fns,
choices,
hint_override: Optional[int] = None,
):
log.debug("Starting autotuning")
with dynamo_timed(
f"{name}_template_autotuning",
log_pt2_compile_event=True,
dynamo_compile_column_us="compile_time_autotune_time_us",
metadata=_autotune_metadata(input_nodes),
):
benchmark_results = self.benchmark(
choices, input_nodes, layout, input_gen_fns, hint_override=hint_override
)
if config.max_autotune_report_choices_stats:
_log_autotune_choices_stats(
f"{name}_template_autotuning", benchmark_results
)
return benchmark_results
def do_autotuning(
self,
name,
input_nodes,
layout,
input_gen_fns,
inputs_key,
choices,
precompile_fn,
hint_override: Optional[int] = None,
best_config_future=None,
):
"""Execute the autotuning process for kernel algorithm selection.
This method orchestrates the complete autotuning pipeline including precompilation,
prescreening, benchmarking, and feedback collection to select the optimal kernel
implementation for given inputs.
Args:
name: Name identifier for the operation being autotuned (e.g., 'mm', 'convolution').
input_nodes: List of input IR nodes used for benchmarking.
layout: Layout information specifying device and memory format for the operation.
input_gen_fns: Optional dict mapping argument indices to functions that generate
torch.Tensor inputs from ir.Buffer for benchmarking. If provided, these are
used instead of random tensors.
inputs_key: Cache key representing the input characteristics (sizes, strides, dtypes).
choices: List of ChoiceCaller objects representing candidate kernel implementations.
precompile_fn: Callable that precompiles all kernel choices before benchmarking.
hint_override: Optional index to override which choice is selected, used for testing
or forced selection.
best_config_future: Optional future containing pre-determined best configuration to
filter choices by specific config parameters.
Returns:
dict: Mapping from ChoiceCaller to benchmark timing in milliseconds. Choices with
non-finite timings (inf/nan) indicate failures.
Raises:
NoValidChoicesError: When all choices fail to compile or benchmark, or when all
timing results are non-finite.
"""
precompile_start_ts = time.time()
with dynamo_timed(
f"{name}_template_precompiling",
log_pt2_compile_event=True,
dynamo_compile_column_us="compile_time_autotune_time_us",
):
precompile_fn()
precompile_elapse = time.time() - precompile_start_ts
log.debug("Precompilation elapsed time: %.02fs", precompile_elapse)
# Prune anything that failed to compile
choices = [c for c in choices if not c.failed]
if len(choices) == 0:
raise self.create_no_valid_choices(
name, "All choices failed to compile for backend."
)
candidates = self.prescreen_choices(
choices, name, inputs_key, self.prescreening_cache
)
prescreening_elapse: Optional[float] = None
if candidates:
prescreening_start_ts = time.time()
timings = self.lookup(
candidates,
name,
inputs_key,
lambda choices: self.autotune(
name,
input_nodes,
layout,
input_gen_fns,
choices,
hint_override=hint_override,
),
hint_override=hint_override,
)
choices = self.prune_choices_postscreen(
choices, timings, name, inputs_key, self.prescreening_cache
)
prescreening_elapse = time.time() - prescreening_start_ts
log.debug("Prescreening elapsed time: %.02fs", prescreening_elapse)
autotune_start_ts = time.time()
if best_config_future is not None:
best_config = await_sync(best_config_future)
important_keys = [
"ACC_TYPE",
"ALLOW_TF32",
"BLOCK_K",
"BLOCK_M",
"BLOCK_N",
"EVEN_K",
"GROUP_M",
"USE_FAST_ACCUM",
"num_stages",
"num_warps",
"num_consumer_groups",
"num_buffers_warp_spec",
]
choices = [
choice
for choice in choices
if all(
f"{k}={best_config[k]}" in choice.description
for k in important_keys
)
]
log.info("Filtered to %d choices based on best_config", len(choices))
has_autotuned: bool = False
def track_has_autotuned(choices):
nonlocal has_autotuned
has_autotuned = True
return self.autotune(
name,
input_nodes,
layout,
input_gen_fns,
choices,
hint_override=hint_override,
)
timings = self.lookup(
choices,
name,
inputs_key,
track_has_autotuned,
hint_override=hint_override,
)
autotune_elapse = time.time() - autotune_start_ts
log.debug("Autotuning elapsed time: %.02fs", autotune_elapse)
if timings and all(not math.isfinite(timing) for timing in timings.values()):
raise NoValidChoicesError
if (
has_autotuned
or log.getEffectiveLevel() == logging.DEBUG
or config.trace.log_autotuning_results
):
self.log_results(
name,
input_nodes,
timings,
autotune_elapse,
precompile_elapse,
prescreening_elapse,
hint_override=hint_override,
)
def profiler_bench_function():
# we're not running through the normal caching autotuner method here because we want to avoid returning
# the cached value.
# Avoid benchmarking in a separate process because it's not easy to signal to the TuningProcess that we
# should use the profiler.
with config.patch(
profile_bandwidth_with_do_bench_using_profiling=True,
autotune_in_subproc=False,
):
return self.benchmark(choices, input_nodes, layout, input_gen_fns)
for feedback_fn in self.feedback_saver_fns:
# re-benchmarking the same choices with profiler is a bit expensive, so pass it in as a thunk.
feedback_fn(
timings,
name,
input_nodes,
choices,
profiler_bench_function,
)
return timings
def create_no_valid_choices(self, name: str, reason: str) -> NoValidChoicesError:
backend_config = (
"max_autotune_gemm_backends"

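The feedback hooks above receive the profiler re-benchmark as a thunk so that its cost is only paid if a consumer actually invokes it. A generic sketch of that pattern (plain Python, not Inductor API):

from typing import Callable

def expensive_rebenchmark() -> dict[str, float]:
    # stand-in for re-running all choices under the profiler
    return {"choice_a": 0.12, "choice_b": 0.34}

def feedback_consumer(timings: dict[str, float],
                      bench_thunk: Callable[[], dict[str, float]]) -> None:
    if not timings:  # only pay the cost when it is actually needed
        timings = bench_thunk()
    print(min(timings, key=timings.get))

feedback_consumer({"choice_a": 0.10}, expensive_rebenchmark)  # thunk never runs
feedback_consumer({}, expensive_rebenchmark)                  # thunk runs here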
View File

@ -13,6 +13,7 @@ from torch.types import _dtype
from torch.utils._exposed_in import exposed_in
from . import autograd, utils
from .effects import EffectType
device_types_t = Optional[Union[str, Sequence[str]]]
@ -471,6 +472,9 @@ class CustomOpDef:
self._abstract_fn = fn
return fn
def register_effect(self, effect: Optional[EffectType]) -> None:
self._lib._register_effectful_op(self._qualname, effect)
def register_torch_dispatch(
self, torch_dispatch_class: Any, fn: Optional[Callable] = None, /
) -> Callable:

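A hedged sketch of the register_effect hook added to CustomOpDef above; the op name "mylib::record" is hypothetical.

import torch
from torch._library.effects import EffectType

@torch.library.custom_op("mylib::record", mutates_args=())
def record(x: torch.Tensor) -> torch.Tensor:
    return x.clone()

# Mark the custom op as ordered-effectful; passing None would mark it effect-free.
record.register_effect(EffectType.ORDERED)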
torch/_library/effects.py (new file)
View File

@ -0,0 +1,68 @@
from enum import Enum
from typing import Optional
import torch
class EffectType(Enum):
ORDERED = "Ordered"
from torch._library.utils import RegistrationHandle
class EffectHolder:
"""A holder where one can register an effect impl to."""
def __init__(self, qualname: str):
self.qualname: str = qualname
self._set_default_effect()
def _set_default_effect(self) -> None:
self._effect: Optional[EffectType] = None
# If the op contains a ScriptObject input, we want to mark it as having effects
namespace, opname = torch._library.utils.parse_namespace(self.qualname)
split = opname.split(".")
if len(split) > 1:
assert len(split) == 2, (
f"Tried to split {opname} based on '.' but found more than 1 '.'"
)
opname, overload = split
else:
overload = ""
if namespace == "higher_order":
return
opname = f"{namespace}::{opname}"
if torch._C._get_operation_overload(opname, overload) is not None:
# Since we call this when destroying the library, sometimes the
# schema will be gone already at that time.
schema = torch._C._get_schema(opname, overload)
for arg in schema.arguments:
if isinstance(arg.type, torch.ClassType):
self._effect = EffectType.ORDERED
return
@property
def effect(self) -> Optional[EffectType]:
return self._effect
@effect.setter
def effect(self, _):
raise RuntimeError("Unable to directly set kernel.")
def register(self, effect: Optional[EffectType]) -> RegistrationHandle:
"""Register an effect
Returns a RegistrationHandle that one can use to de-register this
effect.
"""
self._effect = effect
def deregister_effect():
self._set_default_effect()
handle = RegistrationHandle(deregister_effect)
return handle

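A hedged sketch of how an EffectHolder is reached through the simple registry; "mylib::foo" is a hypothetical qualname used only for illustration.

from torch._library.effects import EffectType
from torch._library.simple_registry import singleton

entry = singleton.find("mylib::foo")               # entries are created lazily
handle = entry.effect.register(EffectType.ORDERED)
assert entry.effect.effect is EffectType.ORDERED

handle.destroy()                                   # runs deregister_effect()
assert entry.effect.effect is None                 # back to the schema-derived default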
View File

@ -1,6 +1,7 @@
from collections.abc import Callable
from typing import Any, Optional
from .effects import EffectHolder
from .fake_impl import FakeImplHolder
from .utils import RegistrationHandle
@ -51,6 +52,8 @@ class SimpleOperatorEntry:
GenericTorchDispatchRuleHolder(qualname)
)
self.effect: EffectHolder = EffectHolder(qualname)
# For compatibility reasons. We can delete this soon.
@property
def abstract_impl(self) -> FakeImplHolder:

View File

@ -230,6 +230,12 @@ def _coerce_to_tensor(obj, dtype=None, copy=False, ndmin=0):
if ndim_extra > 0:
tensor = tensor.view((1,) * ndim_extra + tensor.shape)
# special handling for np._CopyMode
try:
copy = bool(copy)
except ValueError:
# TODO handle _CopyMode.IF_NEEDED correctly
copy = False
# copy if requested
if copy:
tensor = tensor.clone()

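Why the try/except above is needed (assuming a NumPy version that exposes np._CopyMode): only IF_NEEDED has no fixed truth value.

import numpy as np

assert bool(np._CopyMode.ALWAYS) is True    # coerces to copy=True
assert bool(np._CopyMode.NEVER) is False    # coerces to copy=False
try:
    bool(np._CopyMode.IF_NEEDED)
except ValueError:
    pass  # the fallback above treats this case as copy=False for now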
View File

@ -1023,6 +1023,7 @@ class TorchBindOpOverload(OpOverload[_P, _T]):
DispatchKey.BackendSelect,
DispatchKey.PythonTLSSnapshot,
DispatchKey.PythonDispatcher,
DispatchKey.Functionalize,
]
def _may_use_fallthrough_instead_of_fallback(key: DispatchKey):
@ -1046,17 +1047,23 @@ class TorchBindOpOverload(OpOverload[_P, _T]):
def _register_as_effectful_op_temporarily(self):
from torch._higher_order_ops.effects import (
_EffectType,
_get_effect,
_register_effectful_op,
SIDE_EFFECTS,
)
try:
if self not in SIDE_EFFECTS:
_register_effectful_op(self, _EffectType.ORDERED)
# We don't want to register the effect if there already exists a
# registration, especially if the registration is None (explicitly
# no effect)
register_tmp_effect = _get_effect(self) is None
handle = None
if register_tmp_effect:
handle = _register_effectful_op(self, _EffectType.ORDERED)
yield
finally:
if self in SIDE_EFFECTS:
del SIDE_EFFECTS[self]
if register_tmp_effect:
assert handle is not None
handle.destroy()
# Use positional-only argument to avoid naming collision with aten ops arguments
# that are named "self". This way, all the aten ops can be called by kwargs.

View File

@ -11,7 +11,7 @@ import torch
import torch.fx.traceback as fx_traceback
import torch.utils._pytree as pytree
from torch._C import _functionalization_reapply_views_tls as _reapply_views
from torch._ops import _get_dispatch_mode_pre_dispatch
from torch._ops import _get_dispatch_mode_pre_dispatch, TorchBindOpOverload
from torch._subclasses.meta_utils import is_sparse_any
from torch.utils._python_dispatch import (
_detect_infra_mode,
@ -471,7 +471,7 @@ class FunctionalTensorMode(TorchDispatchMode):
from torch._higher_order_ops.effects import handle_effects, has_effects
if has_effects(func, args, kwargs):
if has_effects(func):
assert not torch._C._dispatch_has_kernel_for_dispatch_key(
func.name(), torch._C.DispatchKey.Functionalize
)
@ -504,65 +504,81 @@ class FunctionalTensorMode(TorchDispatchMode):
- FunctionalTensor._extra_dispatch_keys
)
# All we want to do here is reuse the existing C++ functionalization logic.
# This requires swizzling our TLS dispatch keys so that the Functionalize key is active.
with torch._C._ForceDispatchKeyGuard(include_to_set, exclude_to_set):
try:
# By default for python functionalization (for AOTAutograd), we reapply views.
old_apply_views = torch._functionalize_enable_reapply_views(True) # type: ignore[attr-defined]
if isinstance(func, TorchBindOpOverload):
# When the function is a TorchBindOpOverload, meaning some of the
# inputs are FakeScriptObjects, we need to skip the C++ dispatcher and
# dispatch in Python because the C++ dispatcher will check the schema
# and cannot recognize FakeScriptObject.
ctx = PythonFunctionalizeAPI()
fully_unwrapped_args = ctx.unwrap_tensors(args)
fully_unwrapped_kwargs = ctx.unwrap_tensors(
kwargs # pyrefly: ignore[bad-argument-type]
)
outs_unwrapped = func(
*fully_unwrapped_args,
**fully_unwrapped_kwargs,
)
outs_wrapped = ctx.wrap_tensors(outs_unwrapped)
else:
# All we want to do here is reuse the existing C++ functionalization logic.
# This requires swizzling our TLS dispatch keys so that the Functionalize key is active.
with torch._C._ForceDispatchKeyGuard(include_to_set, exclude_to_set):
try:
# By default for python functionalization (for AOTAutograd), we reapply views.
old_apply_views = torch._functionalize_enable_reapply_views(True) # type: ignore[attr-defined]
# Sometimes these functions cannot be directly dispatched to functionalize key
# because args are sometimes not functional tensors for some reason?
if func in FunctionalTensor.metadata_fns:
outs_unwrapped = func(*args_unwrapped, **kwargs_unwrapped)
outs_wrapped = pytree.tree_map_only(
torch.Tensor, wrap, outs_unwrapped
)
else:
# Note: [Functionalization View Replay Annotation]
# When functionalization encounters a mutation, it handles aliases by lazily regenerating the aliases
# at the first time they are next used.
# This is a problem when plumbing user annotations during tracing. We want the view ops from view replay
# to have the same annotation that the user specified on the original views. But view replay in
# functionalization happens the next time the alias is used (e.g. second_op(alias_with_pending_mutation)),
# so when we regenerate views before calling into second_op, those views will end up getting the metadata
# for second_op!
#
# Instead, we need to remember the node metadata from the original views, and ensure that this node metadata
# is globally set when we lazily perform view replay.
# The globally set metadata will be used to populate the fx node created for the replayed operation.
if m := torch._C._get_dispatch_mode(
torch._C._TorchDispatchModeKey.PROXY
):
for a in pytree.tree_leaves([args, kwargs]):
if not isinstance(a, FunctionalTensor):
continue
curr_node = m.tracer.tensor_tracker[
torch._from_functional_tensor(a.elem)
].proxy.node
with fx_traceback.set_current_replay_node(curr_node):
torch._sync(a)
# Sometimes these functions cannot be directly dispatched to functionalize key
# because args are sometimes not functional tensors for some reason?
if func in FunctionalTensor.metadata_fns:
outs_unwrapped = func(*args_unwrapped, **kwargs_unwrapped)
outs_wrapped = pytree.tree_map_only(
torch.Tensor, wrap, outs_unwrapped
)
else:
# Note: [Functionalization View Replay Annotation]
# When functionalization encounters a mutation, it handles aliases by lazily regenerating the aliases
# at the first time they are next used.
# This is a problem when plumbing user annotations during tracing. We want the view ops from view replay
# to have the same annotation that the user specified on the original views. But view replay in
# functionalization happens the next time the alias is used (e.g. second_op(alias_with_pending_mutation)),
# so when we regenerate views before calling into second_op, those views will end up getting the metadata
# for second_op!
#
# Instead, we need to remember the node metadata from the original views, and ensure that this node metadata
# is globally set when we lazily perform view replay.
# The globally set metadata will be used to populate the fx node created for the replayed operation.
if m := torch._C._get_dispatch_mode(
torch._C._TorchDispatchModeKey.PROXY
):
for a in pytree.tree_leaves([args, kwargs]):
if not isinstance(a, FunctionalTensor):
continue
curr_node = m.tracer.tensor_tracker[
torch._from_functional_tensor(a.elem)
].proxy.node
with fx_traceback.set_current_replay_node(curr_node):
torch._sync(a)
# When we dispatch to the C++ functionalization kernel, we might need to jump back to the
# PreDispatch mode stack afterwards, to handle any other PreDispatch modes underneath
# FunctionalTensorMode. If we call func() directly, we would need to exclude PreDispatch
# from the TLS in order to avoid infinite looping, but this would prevent us from coming
# back to PreDispatch later
outs_unwrapped = func._op_dk(
torch._C.DispatchKey.Functionalize,
*args_unwrapped,
**kwargs_unwrapped,
)
# When we dispatch to the C++ functionalization kernel, we might need to jump back to the
# PreDispatch mode stack afterwards, to handle any other PreDispatch modes underneath
# FunctionalTensorMode. If we call func() directly, we would need to exclude PreDispatch
# from the TLS in order to avoid infinite looping, but this would prevent us from coming
# back to PreDispatch later
outs_unwrapped = func._op_dk(
torch._C.DispatchKey.Functionalize,
*args_unwrapped,
**kwargs_unwrapped,
)
if self.export:
if func is torch.ops.aten.dropout.default:
torch._freeze_functional_tensor(outs_unwrapped) # type: ignore[attr-defined]
outs_wrapped = pytree.tree_map_only(
torch.Tensor, wrap, outs_unwrapped
)
finally:
torch._disable_functionalization()
torch._functionalize_enable_reapply_views(old_apply_views) # type: ignore[attr-defined]
if self.export:
if func is torch.ops.aten.dropout.default:
torch._freeze_functional_tensor(outs_unwrapped) # type: ignore[attr-defined]
outs_wrapped = pytree.tree_map_only(
torch.Tensor, wrap, outs_unwrapped
)
finally:
torch._disable_functionalization()
torch._functionalize_enable_reapply_views(old_apply_views) # type: ignore[attr-defined]
is_included = torch._C._dispatch_tls_is_dispatch_key_included(
torch._C.DispatchKey.Functionalize

View File

@ -18,6 +18,7 @@ import torch
import torch.utils._pytree as pytree
from torch._C import ScriptObject # type: ignore[attr-defined]
from torch._library.fake_class_registry import FakeScriptObject
from torch._library.opaque_object import is_opaque_type
from ._compatibility import compatibility
from ._lazy_graph_module import _make_graph_module
@ -421,8 +422,10 @@ class Tracer(TracerBase):
# a get_attr to retrieve that tensor. Otherwise, we'll store away the
# tensor value into a special attribute on the Module s.t. we can
# retrieve it with a get_attr.
if isinstance(a, _constant_attribute_types):
qualname: Optional[str] = self.tensor_attrs.get(a)
if isinstance(a, _constant_attribute_types) or is_opaque_type(type(a)):
qualname: Optional[str] = self.tensor_attrs.get(
a
) # pyrefly: ignore[no-matching-overload]
# Tensor was not found in the Module hierarchy, stow it away in a
# special attribute and set the qualname to refer to that
@ -433,13 +436,17 @@ class Tracer(TracerBase):
base_name = "_torchbind_obj"
elif isinstance(a, pytree.TreeSpec):
base_name = "_tree_spec_constant"
elif is_opaque_type(type(a)):
base_name = "_opaque_obj"
else:
raise RuntimeError(
f"cannot create constant arg for {a} of type {type(a)}."
)
qualname = self.get_fresh_qualname(base_name)
assert isinstance(qualname, str)
self.tensor_attrs[a] = qualname
self.tensor_attrs[a] = ( # pyrefly: ignore[unsupported-operation]
qualname
)
setattr(self.root, qualname, a)
return self.create_node("get_attr", qualname, (), {})

View File

@ -84,7 +84,7 @@ if TYPE_CHECKING:
from torch._ops import OpOverload
from torch.fx._symbolic_trace import PHBase
from torch.types import IntLikeType
from torch.types import BoolLikeType, FloatLikeType, IntLikeType
__all__ = [
"PythonKeyTracer",
@ -458,7 +458,7 @@ def _sympy_handlers() -> dict[type[sympy.Expr], Callable[..., Any]]:
def _build_proxy_for_sym_expr(
tracer: _ProxyTracer, expr: sympy.Expr, out: PySymType | None = None
) -> PySymType | None:
) -> IntLikeType | FloatLikeType | BoolLikeType | None:
"""
Decompose `expr` and look for the pieces as inputs. If `out` is provided
then that will be the resulting SymNode (and `out.expr` must be the same as
@ -532,6 +532,13 @@ def _build_proxy_for_sym_expr(
assert not out
return value.value
if isinstance(expr, (int, float, bool)):
return expr
if expr.is_Integer:
return int(expr)
if expr.is_Float:
return float(expr)
args = []
for arg in expr.args:
if (arg_value := _build_proxy_for_sym_expr(tracer, arg)) is None:

View File

@ -19,6 +19,7 @@ from torch._library.custom_ops import (
CustomOpDef,
device_types_t,
)
from torch._library.effects import EffectType
from torch._library.infer_schema import infer_schema # noqa: F401
from torch._library.triton import triton_op, wrap_triton
from torch._ops import OpOverload
@ -398,6 +399,22 @@ class Library:
self.m.fallback(dispatch_key, fn, with_keyset)
def _register_effectful_op(self, op_name: str, effect: Optional[EffectType]):
"""
Registers an effect for an operator. This is used to register an op that
has side effects that are not capturable by the schema.
Args:
op_name: operator name (along with the overload) or OpOverload object.
effect: The effect of the op.
"""
from torch._higher_order_ops.effects import (
_register_effectful_op as hoo_register_effect,
)
handle = hoo_register_effect(op_name, effect)
self._registration_handles.append(handle)
def _destroy(self):
if self.m is not None:
self.m.reset()
@ -1065,6 +1082,44 @@ def register_fake(
return register(func)
def _register_effectful_op(
op: _op_identifier,
effect: Optional[EffectType],
*,
lib: Optional[Library] = None,
) -> None:
r"""
To specify that an operator has side-effects, we must register an effect
type for the operator. This will prevent graph passes in torch.compile from
reordering operations with the same effect type.
Args:
op: Operator name (along with the overload) or OpOverload object.
effect: Effect type to register. None means the operator is not effectful.
"""
if not isinstance(
op, (str, torch._ops.OpOverload, torch._library.custom_ops.CustomOpDef)
):
raise ValueError(
f"register_effectful_op({op}): got unexpected type for op: {type(op)}"
)
if isinstance(op, torch._ops.OpOverload):
op = op._name
opdef = _maybe_get_opdef(op)
if opdef is not None:
opdef.register_effect(effect)
return
assert isinstance(op, str)
namespace, _ = torch._library.utils.parse_namespace(op)
if lib is None:
use_lib = Library(namespace, "FRAGMENT")
_keep_alive.append(use_lib)
else:
use_lib = lib
use_lib._register_effectful_op(op, effect)
def register_autograd(
op: _op_identifier,
backward: Callable,

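A hedged usage sketch of the private torch.library._register_effectful_op helper added above; the op "mylib::log_event" and its schema are made up for illustration.

import torch
from torch._library.effects import EffectType

torch.library.define("mylib::log_event", "(str msg) -> ()")

@torch.library.impl("mylib::log_event", "CompositeExplicitAutograd")
def _log_event(msg: str) -> None:
    print(msg)

# Mark the op as ordered-effectful so torch.compile will not reorder it
# relative to other ordered-effectful ops; passing None marks it effect-free.
torch.library._register_effectful_op("mylib::log_event", EffectType.ORDERED)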
View File

@ -37,7 +37,7 @@ import functools
import traceback
import weakref
from collections.abc import Callable
from typing import Any, TYPE_CHECKING
from typing import Any, Optional, TYPE_CHECKING # noqa: F401
import torch
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode