Compare commits

...

3960 Commits

Author SHA1 Message Date
b33a283e9a [nccl-pg] Pass pg name and desc to NCCL communicator (#124149)
Summary:
Pass the Process Group name and description to the NCCL communicator in order to access PG information in the NCCL layer.
The information is passed as the commDesc string (i.e., "<pg_desc>:<pg_name>").
The function is only valid when NCCL_COMM_DESCRIPTION is defined.
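
A minimal sketch of the string composition described above (the function name is illustrative; only the "<pg_desc>:<pg_name>" format and the NCCL_COMM_DESCRIPTION gate come from the commit):

```cpp
#include <string>

// Sketch only: compose the "<pg_desc>:<pg_name>" description string that is
// passed to the NCCL communicator when NCCL_COMM_DESCRIPTION is defined.
std::string makeCommDesc(const std::string& pgDesc, const std::string& pgName) {
  return pgDesc + ":" + pgName;  // e.g. "FSDP:0"
}
```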

Differential Revision: D55703310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149
Approved by: https://github.com/shuqiangzhang
2024-04-16 15:08:38 -07:00
7a551d81e5 [c10d/nccl-pg] allow user to pass process group description (#123472)
Summary:
We need a way to allow users to set a customized description for a process group, e.g., FSDP, PP.

Here are several use cases of a user-specified group_desc:
- Logging: we can easily match a log line and understand what a given collective/PG is used for.
- PyTorch traces (e.g., Kineto, Execution Trace) can benefit from the PG desc, since trace analysis and benchmarks will be able to easily differentiate PG purposes like FSDP and PP.
- Lower-layer collective (e.g., NCCL) debugging: we will be able to expose the PG desc to the NCCL communicator, so NCCL-layer operations can be easily correlated to a PG.

Solution: Add a group_desc field to c10d
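
A rough illustration of the idea (field names here are assumptions, not the exact c10d definitions):

```cpp
#include <string>

// Sketch: carry a human-readable purpose label next to the machine-generated
// group name on the process-group options.
struct ProcessGroupOptions {
  std::string group_name;  // machine-generated, e.g. "0", "1", ...
  std::string group_desc;  // user-supplied purpose label, e.g. "FSDP", "PP"
};
```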

Differential Revision: D55781850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
2024-04-16 15:08:38 -07:00
1515a90475 [DCP] Adds ability to create a CPU state dict that is both shared and pinned (#122338)
[DCP] Adds ability to create a CPU state dict that is both shared and pinned, as well as a new utility specific to copying the state dict

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8d5c17670f16ac4fc8fcb4181cb490c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122338
Approved by: https://github.com/fegin
2024-04-16 15:08:22 -07:00
4882ec2a91 Pass and record process_group_name when creating ProcessGroupNCCL (#123117)
Summary:
Pass the Python c10d group_name to the C++ ProcessGroupNCCL so that the PG name is consistent across the different layers.
Also record pg_name in the flight recorder entry.

Differential Revision: D55597200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117
Approved by: https://github.com/wconstab
2024-04-16 13:48:35 -07:00
972b8060bd [c10d] make monitorThread sleep when we try to dump (#123788)
Summary:
We separated the FR dump logic from the desync debug logic,
so we no longer set collectiveDebugInfoMode_ to true when we just need an FR
dump. That's why the monitor thread did not sleep and tried to kill the
process without waiting for the dump.

The fix is simple: we should sleep whenever shouldDump_ is true.
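
A minimal sketch of that fix (names mirror the commit text; the wait duration is an illustrative placeholder):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch: the monitor thread sleeps to give the flight-recorder dump time to
// finish whenever shouldDump_ is set, instead of only when the desync-debug
// flag (collectiveDebugInfoMode_) is set.
std::atomic<bool> shouldDump_{false};

void maybeSleepBeforeKill() {
  if (shouldDump_.load()) {
    std::this_thread::sleep_for(std::chrono::seconds(30));  // placeholder wait
  }
}
```
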
Test Plan:
Existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123788
Approved by: https://github.com/wconstab
2024-04-11 09:19:15 -07:00
3e7683ae18 [c10d] dump on any exception (timeout + nccl error) (#123023)
Summary:
The existing flight recorder dumping logic is: dump only on timeout, but not
on NCCL error. This resulted in the faulty ranks missing dumps when an NCCL
error happens.

So in this PR, we revise the dump logic such that records are dumped
when any exception is detected. An exception could be (1) an NCCL async error
or (2) a watchdog timeout.

Also, the existing code tends to mix the flight recorder dump logic
with desync debug, which is not desirable. We now dump the desync debug
report only when a timeout is detected.
Test Plan:
Added a new unit test to trigger nccl error and dump, and make sure the
dump is triggered by the error.

Also existing dump on timeout tests should still pass.

(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (84bf9d4c)]$ python
test/distributed/test_c10d_nccl.py NcclErrorDumpTest
NCCL version 2.19.3+cuda12.0
[E329 19:15:11.775879730 ProcessGroupNCCL.cpp:565] [Rank 0] Watchdog
caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=10, NumelOut=10, Timeout(ms)=10000) ran for
10028 milliseconds before timing out.
[E329 19:15:11.777459894 ProcessGroupNCCL.cpp:1561] [PG 0 Rank 0]
Exception hit in NCCL work: 2
[E329 19:15:12.660717323 ProcessGroupNCCL.cpp:1332] [PG 0 Rank 0]
Received a timeout signal from this local rank and will start to dump
the debug info. Last enqueued NCCL work: 2, last completed NCCL work: 1.
[E329 19:15:12.660932242 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 0]
ProcessGroupNCCL preparing to dump debug info.
[E329 19:15:12.661192990 ProcessGroupNCCL.cpp:1174] [PG 0 Rank 0]
ProcessGroupNCCL dumping nccl trace to /tmp/tmp06psqil3/trace_0
[F329 19:15:12.661485601 ProcessGroupNCCL.cpp:1185] [PG 0 Rank 0] [PG 0
Rank 0] ProcessGroupNCCL's watchdog detected a collective timeout from
the local rank. This is most likely caused by incorrect usages of
collectives, e.g., wrong sizes used across ranks, the order of
collectives is not same for all ranks or the scheduled collective, for
some reason, didn't run. Additionally, this can be caused by GIL
deadlock or other reasons such as network errors or bugs in the
communications library (e.g. NCCL), etc. We tried our best to dump the
debug info into the storage to help you debug the issue.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123023
Approved by: https://github.com/wconstab
2024-04-02 15:41:15 -07:00
f2e9ec2dc5 [c10d] dump from one and only one thread (PG0's monitor thread) (#120893)
Summary:
When there are multiple PGs in a process and a hardware failure happens,
we found that multiple PGs/threads in the same
process compete to dump the same records at the same time. This
affects the reliability of dumps.

In this PR, we will try to make the change such that only one thread/PG
could dump: PG0's monitor thread. We use a static variable to indicate
that something (e.g., collective timeout) has triggered the dump
locally.

The monitor thread dumps debug info under any one of three conditions:
1. the static variable is set to true by the watchdog thread when it detects
a timeout or a pipe dump signal;
2. a timeout signal is received from other ranks through TCPStore;
3. the watchdog heartbeat has stopped.
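
A sketch of those three trigger conditions combined (names illustrative, not the actual PyTorch code):

```cpp
#include <atomic>

// Sketch: set by the watchdog thread on a local timeout or pipe dump signal.
std::atomic<bool> localDumpRequested{false};

bool shouldMonitorThreadDump(bool timeoutSignalFromStore,
                             bool watchdogHeartbeatStale) {
  return localDumpRequested.load() ||  // 1: local watchdog flagged a dump
         timeoutSignalFromStore ||     // 2: another rank signaled via TCPStore
         watchdogHeartbeatStale;       // 3: watchdog heartbeat stopped
}
```
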
Test Plan:
python test/distributed/test_c10d_nccl.py -k
test_timeout_dumps_on_stuck_ranks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
2024-04-02 15:36:05 -07:00
dde4324d8e [NCCL PG] Enable ncclCommDevIdxMap unconditionally (#122049)
Differential Revision: D54993977

The initial purpose of ncclCommDevIdxMap was to support NCCL zero-copy algorithms, so it was only enabled (with its values filled) if useTensorRegisterAllocatorHook_ was set to true. However, we now rely on it to support dumping NCCL information in a single PG, so we need it to be always available, regardless of whether useTensorRegisterAllocatorHook_ is enabled.
Move the code that fills ncclCommDevIdxMap out of the if (useTensorRegisterAllocatorHook_) statement.

See diff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122049
Approved by: https://github.com/shuqiangzhang
2024-03-26 17:14:06 -07:00
94c079104d [c10d] fix the macro definition of NCCL_COMM_DUMP (#120502)
Summary:
We should emit the comm dump only if both macros are defined;
otherwise, use the original definition.

The previous implementation missed the function definition when IS_NCCL_EXP is defined but NCCL_COMM_DUMP is not defined.
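
A sketch of the guard structure being fixed (comments only; the macro names come from the commit):

```cpp
// Sketch: the dump-capable definition is selected only when BOTH macros are
// defined; every other combination, including IS_NCCL_EXP without
// NCCL_COMM_DUMP, falls back to the original definition.
#if defined(IS_NCCL_EXP) && defined(NCCL_COMM_DUMP)
  // definition that also dumps the NCCL comm state
#else
  // original definition
#endif
```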

Test Plan:
Build and unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120502
Approved by: https://github.com/dsjohns2, https://github.com/Skylion007
2024-03-26 14:09:00 -07:00
a6afee6d94 [c10d][flight recorder] dump additional NCCL debug info (#120063)
Summary:
This PR is mainly about the flight recorder side of the changes: it takes a
map of maps as input and dumps it as a picklable object. It also adds functions that
should be compiled only when NCCL_COMM_DUMP is defined.
Test Plan:
Integration tests with NCCL will be done later; here we only do the
c10d side of the dump test, aka NCCLTraceTest.

Testing the dump function is a bit tricky as we don't have
existing C++ unit tests for it. So we still use the Python NCCLTraceTest with
the Python binding of _dump_nccl_trace(): we manually feed
dump_nccl_trace with a map of test info, assert on the pickle result, and
print the converted Python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$  python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
2024-03-26 14:08:19 -07:00
d092857531 [Caffe2 CPU tests] Update CMakeLists.txt 2024-02-24 12:18:10 -08:00
6aad5e444a Fix missing MAST log when there is Unicode non-decodable text in logs (#119298)
Summary:
## Issue
When there is Unicode non-decodable text in logs, `tail_logger` will stop working afterwards, i.e. f527390102

In the example, the process stopped producing Python logs after 17:20:21 until the job finished:
```
[0]:I0201 17:20:21.338000 3429 gen_ai/genie_projects/llm/metaformers/reward_model_score.py:335] Progress: 118 batches out of 512 total batches. 23.05 % | (gpu mem: 25.8GB, free CPU mem: 1387.8GB)
I0201 17:39:14 Stopping twtask-main.service with Service Result: [success] Exit Code: [exited] Exit Status: [0]
```
At the end, a `UnicodeDecodeError` was thrown with no call stack.

## Fix
Use `errors="replace"` to avoid throwing exception when `UnicodeDecodeError` happens.

Test Plan: f528854819

Differential Revision: D53483644

Co-authored-by: Jack Zhang <jackzh@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119298
Approved by: https://github.com/XilunWu
2024-02-24 12:16:39 -08:00
c54ce9313b [c10d][flight recorder] store a copy of string in entry (#119837)
Summary:
Previously, we just stored the char pointer in the entry; the string is a
temporary object that will already have been destructed when we want to dump/access it.

A quick fix is to store a copy of the string, without changing the
upstream char*.

An alternative is to change every profilingTitle into std::string; this,
however, would need a comprehensive overhaul of the code up to the
c10d::work layer above workNCCL, RecordFunction, etc.

We chose the first option for this change.

Resolve #119808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
2024-02-14 11:38:10 -08:00
1fe59f4ef7 [c10d][flight recorder] remove unintended assignment of entry (#119748)
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The above lines of code have the unintended consequence of invoking copy assignment
of entry objects, as the reference itself cannot be re-assigned.

Also, what could cause the crash is that the entry reference could become invalid if entries_ is
resized by other threads, and this could result in a copy to a garbage
location. The fix is to use a pointer, which can be re-assigned after
re-acquiring the lock.
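
A self-contained sketch of the bug and the fix (Entry and the sizes are placeholders):

```cpp
#include <cstddef>
#include <vector>

struct Entry { /* fields omitted */ };
std::vector<Entry> entries_(1024);
const size_t max_entries_ = 1024;

void update(size_t id) {
  // Buggy pattern: the second statement does NOT rebind the reference; it
  // copy-assigns one stored entry onto another.
  //   auto& entry = entries_.at(id % max_entries_);
  //   entry = entries_.at(id % max_entries_);   // unintended copy assignment

  // Fix per the commit: a pointer can be genuinely re-assigned, e.g. after
  // re-acquiring the lock, so we re-fetch a valid address instead of copying.
  Entry* entry = &entries_.at(id % max_entries_);
  // ... later, after re-acquiring the lock ...
  entry = &entries_.at(id % max_entries_);  // real re-assignment
  (void)entry;
}
```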

Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
2024-02-14 11:38:10 -08:00
e693fb2bb1 [nccl flight recorder] record time we discover start and complete (#119249)
Some APIs like ncclCommAbort can cause NCCL kernels to finish even if
they were previously stuck. Because we may gather the trace buffer after
those calls, we can end up seeing some collectives marked completed even though
that completion happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change; even if the collective eventually gets marked complete,
we can observe that it happened minutes after it was scheduled.
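
A hedged sketch of the idea (types and names are illustrative):

```cpp
#include <chrono>

// Sketch: record the wall-clock time at which we *discover* each state
// transition, so a "completed" observed long after the timeout is visible.
enum class State { Scheduled, Started, Completed };

struct CollectiveRecord {
  State state = State::Scheduled;
  std::chrono::steady_clock::time_point lastStateChange;

  void observe(State s) {
    if (s != state) {
      state = s;
      lastStateChange = std::chrono::steady_clock::now();  // discovery time
    }
  }
};
```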

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
2024-02-14 11:38:10 -08:00
4fe510baf6 [NCCL PG] log NCCL comm at creation and abort (#118335)
Summary: It helps correlate NCCL PG with corresponding NCCL comm in separate logs.

Differential Revision: D53107647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118335
Approved by: https://github.com/wconstab
2024-02-14 11:38:04 -08:00
7c507b78c4 [c10d] Expose check method to Python for store via pybind (#116144)
Differential Revision: [D52310987](https://our.internmc.facebook.com/intern/diff/D52310987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116144
Approved by: https://github.com/wconstab
2024-01-31 11:08:27 -08:00
0019901601 [C10D] Fix nccl flightrecorder ignored dump timeout (#118142)
Don't call future.get() unless it's ready, because it waits.
Also, refactor the code a bit for simplicity.

We should do a follow-on PR to clean up the timeouts further, but this
should fix the glaring timeout bug.
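
A minimal sketch of the non-blocking pattern described above (names and the return type are assumptions):

```cpp
#include <chrono>
#include <future>

// Sketch: only call get() once the future is actually ready, so the
// configured dump timeout is honored instead of blocking indefinitely.
bool tryCollectDump(std::future<bool>& fut, std::chrono::milliseconds timeout) {
  if (fut.wait_for(timeout) == std::future_status::ready) {
    return fut.get();  // safe: will not block
  }
  return false;  // abandon the dump rather than wait forever
}
```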

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118142
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118044, #118046, #118047
2024-01-26 16:48:00 -08:00
18be18535b [C10D] Make Flight Recorder report time_created in ns (#118047)
Addresses (6) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118047
Approved by: https://github.com/zdevito
ghstack dependencies: #118044, #118046
2024-01-26 16:48:00 -08:00
2729367313 [C10D] Add version tag to NCCL Flight Recorder Dump (#118046)
Addresses (3) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118046
Approved by: https://github.com/zdevito
ghstack dependencies: #118044
2024-01-26 16:48:00 -08:00
33537aae24 [C10D] Make NCCL Flight Recorder dump produce a dict (#118044)
Putting the list of entries into a particular key of a top-level dict
paves the way for adding other metadata as other top level keys.

Addresses 1 and 2 from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118044
Approved by: https://github.com/zdevito
2024-01-26 16:48:00 -08:00
dcdb1337dd [C10D] Finer-grain nccl heartbeat, avoid false positive hangs (#118016)
Summary:
Previously, the heartbeat was incremented once per finishing a for loop over a list
of in-progress work items, under the assumption that the processing
would either be predictably quick or hang completely.

In fact, there can be cuda API contention that causes the processing of works
to slow down arbitrarily but not truly deadlock.  To guard against this, we
bump the heartbeat at the smallest unit of progress, one work item being
successfully processed.
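
A sketch of the finer-grained bump (types illustrative):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Sketch: bump the heartbeat once per processed work item (the smallest unit
// of progress) rather than once per pass over the whole list.
std::atomic<uint64_t> heartbeat_{0};

template <typename Work>
void processWorks(std::vector<Work>& works) {
  for (auto& w : works) {
    w.process();   // may be slowed by CUDA API contention, but not stuck
    heartbeat_++;  // progress signal after every item, not after the loop
  }
}
```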

Test Plan: CI

Differential Revision: D52973948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118016
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2024-01-26 16:48:00 -08:00
9cf0f2bd59 Move getDurationFromFirstEvent to USE_C10D_NCCL ifdef (#117738)
Fixes #117517

Try to move nccl related function *getDurationFromFirstEvent* to USE_C10D_NCCL ifdef (Related to https://github.com/pytorch/pytorch/issues/114575)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117738
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-01-26 16:48:00 -08:00
1d2e877c05 [ProcessGroup] Make watchdog check work queue more frequently (#117297)
Today the watchdog's sleep interval is 1 s. That's a bit long compared to modern GPU link (or network link) speeds.

Take DDP and Ampere for example:

DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s

25 MB / 250 GB/s = 0.1 ms, so in principle the check could be far more frequent.
So we are updating the interval to 100 ms for now,
and we'll see how it goes before making the checking more aggressive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj
2024-01-26 16:48:00 -08:00
f27b979b0c [c10d] Move the timeout dump check from watchdog to monitoring thread (#117168)
To avoid a potential hang in the watchdog thread, which would prevent us from dumping timeout debugging info, we move the check of global collective timeout signals and the dumping of debugging info to the monitoring thread. We also need to ensure that we don't wait very long to check the timeout signal from the store; otherwise, we will miss the signal and won't get the debugging info dumped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117168
Approved by: https://github.com/wconstab
2024-01-26 16:48:00 -08:00
f30d6047ad [c10d] Add a timeout check interval variable for timeout dump (#117093)
The current timeout check frequency relies on the monitoring thread's timeout interval, which can be too long (even if we set it to 2 minutes), so let's use a separate timeout variable that users can configure. And we only let the default PG check TCPStore, so an even more frequent check should be fine. (Our stress test is performed every half second.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117093
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-01-26 16:48:00 -08:00
75311510ef [C10D] Add duration_ms to flight recorder (#114817)
Measures the duration of a collective operation using nccl start/end
events and includes this duration (in ms) in the flight recorder data.

duration_ms will be an optional field, since it only works when
timing is enabled.  Currently timing is enabled when flight recorder
is enabled, but this is not a strict requirement.  Duration is also
not available for collectives not in a completed state.

Note: computing duration can lead to a hang due to calling cudaEventDuration when
the cuda driver queue is full.

We don't ever want dump() api to hang, since we might want dump to help
debug a hang. Hence, we only query durations from the watchdog thread,
and it's possible during dump() call, some of the most recent
collectives durations won't have been computed yet at time of dump.  We
make this tradeoff to ensure that dump() itself will never hang.
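
A sketch of the duration query (the commit refers to an event-duration call; cudaEventElapsedTime is the standard CUDA API for this, and error handling is elided):

```cpp
#include <cuda_runtime.h>

// Sketch: measure a collective's duration from its start/end CUDA events.
// Per the commit, this is queried only from the watchdog thread, and only for
// completed work, because the event query can block when the driver queue is
// full.
float collectiveDurationMs(cudaEvent_t start, cudaEvent_t end) {
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, end);  // valid only after both events fired
  return ms;
}
```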

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114817
Approved by: https://github.com/fduwjj, https://github.com/zdevito
ghstack dependencies: #116905
2024-01-26 16:48:00 -08:00
dbd6094d05 [C10D](reland) Add GIL checker to NCCL watchdog monitor (#117312)
Whenever the monitor thread kills the watchdog thread for being stuck, we do so to save cluster time and get a faster failure signal, but we want to know more about why it got stuck.

One possible reason for watchdog stuckness is GIL contention, which could be ruled out or observed by making an attempt to acquire the GIL at exit time.

If we cannot acquire the GIL within a short time window (1s) we abort the attempt and report GIL contention, otherwise we report that GIL was acquired successfully.
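
A hedged sketch of the check (the real code reportedly uses a function pointer to dodge destructor-ordering issues on dlclose; this only shows the try-with-budget idea, and the helper thread may linger past the budget, which is acceptable at exit time):

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <Python.h>

// Sketch: try to take the GIL on a helper thread and give up after ~1s.
// A timeout suggests GIL contention; success rules it out.
bool gilAcquirableWithin(std::chrono::seconds budget) {
  auto done = std::make_shared<std::promise<bool>>();
  auto fut = done->get_future();
  std::thread([done] {
    PyGILState_STATE s = PyGILState_Ensure();  // blocks while GIL is contended
    PyGILState_Release(s);
    done->set_value(true);
  }).detach();
  return fut.wait_for(budget) == std::future_status::ready;
}
```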

Reland: uses a function pointer to avoid destructor ordering issues on dlclose. (Looks like the destructor for the std::function was being run later than the libtorchpython lib was unloaded, leading to a crash).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117312
Approved by: https://github.com/zdevito
2024-01-26 16:48:00 -08:00
397b9d47e9 [ProcessGroup] Do not print NCCL_DEBUG before NCCL init (#117328)
In case /etc/nccl.conf is used, `NCCL_DEBUG` is not set in the process environment until NCCL initializes.
The deleted print point is before NCCL init, and hence may be inaccurate.
This PR removes it and relies on the other print point, which is after NCCL comm creation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117328
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-01-26 16:48:00 -08:00
36a01a8ab9 [c10d][EZ] Add more logs in the destructor of ProcessGroupNCCL for better root cause investigation (#117291)
Add logs to the place where we inspect whether a hang happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117291
Approved by: https://github.com/XilunWu, https://github.com/shuqiangzhang
2024-01-26 16:48:00 -08:00
ee336cf58a [c10d] Add comments to the rest environment variable within NCCLPG (#117092)
Not every environment variable within NCCLPG has comments; let's add comments to each of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117092
Approved by: https://github.com/kwen2501
ghstack dependencies: #116545
2024-01-26 16:48:00 -08:00
a9e2e745d7 [c10d] Add extra sleep in waitForDumpOrTimeout to ensure enough time for all ranks dump debug info (#116545)
We added an extra sleep and made it configurable so that users can set an extra wait to ensure all ranks have dumped the debug info.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116545
Approved by: https://github.com/wconstab
2024-01-26 16:48:00 -08:00
ab4df89eea [C10D] Rename flightrecorder key vars to avoid confusion (#116905)
Key vars are strings used as dict keys (e.g., duration_s was the string
"duration_ms").

The _s suffix suggested time in seconds, which was confusing since duration_s was a key string and
duration_ms is another variable holding a time value.

Now duration_key is "duration_ms".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116905
Approved by: https://github.com/zdevito
2024-01-26 16:48:00 -08:00
9d02ebe876 [c10d] To make ProcessGroupNCCL to use globalStore for coordination (#117075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117075
Approved by: https://github.com/wconstab
ghstack dependencies: #117074
2024-01-26 16:48:00 -08:00
b61e01cce9 [c10d] Add a recursive method to get the inner most store (#117074)
In c10d PG initialization, we wrap TCPStore with multiple layers of PrefixStore which adds layers of prefix.

One example is:
"default_pg/0//cuda//timeout_dump"
When initializing the default PG, because there is no store passed in, we first add the prefix "default_pg" to the TCPStore returned from rendezvous:

bdeaaad70c/torch/distributed/distributed_c10d.py (L1240)

We then add pg_name (aka 0) bdeaaad70c/torch/distributed/distributed_c10d.py (L1376) and device (aka cuda) bdeaaad70c/torch/distributed/distributed_c10d.py (L1387)

to the prefix. Then, when we call store_->set("timeout_dump"), the actual key used for writing into TCPStore is "default_pg/0//cuda//timeout_dump".

For sub-PGs, things get even more interesting: we put the store wrapped with the default PG name into a cache:
bdeaaad70c/torch/distributed/distributed_c10d.py (L1517)

And when creating each sub-PG, its PG name is appended right after the cached store's prefix. The example keys are:
'default_pg/0//10//cuda//timeout_dump', 'default_pg/0//12//cuda//timeout_dump', 'default_pg/0//38//cuda//timeout_dump', 'default_pg/0//39//cuda//timeout_dump'. (10, 12, 38 and 39 are all PG names of each subPG created)

The reason the number in the name gets so high is that for each sub-PG creation, all
ranks have to call the API together, and the global variable used for the PG name is bumped up monotonically:
bdeaaad70c/torch/distributed/distributed_c10d.py (L3666)

Similar things happen for using hashing for PG names.

This has a potential issue: each sub-PG has an instance of ProcessGroupNCCL, and if we want to set something global to notify all sub-PGs (and all ranks), this added prefix causes bugs. For example, if on sub-PG 1 we set a value in TCPStore with the key 'default_pg/0//1//cuda//timeout_dump', while the default PG instances check the TCPStore using the key 'default_pg/0//cuda//timeout_dump', the default PG instances will never see the signal. So in this PR, we added a new API in PrefixStore that gets the innermost non-PrefixStore for set and check operations. The next PR will make the corresponding changes in the NCCL watchdog.
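
A simplified sketch of that lookup (interfaces reduced to the essentials; not the actual c10d classes):

```cpp
#include <memory>

// Sketch: peel PrefixStore wrappers until the underlying (e.g. TCP) store is
// reached, so keys written via any sub-PG land at the same unprefixed location.
struct Store { virtual ~Store() = default; };

struct PrefixStore : Store {
  std::shared_ptr<Store> underlying;
};

std::shared_ptr<Store> getInnermostStore(std::shared_ptr<Store> s) {
  while (auto* p = dynamic_cast<PrefixStore*>(s.get())) {
    s = p->underlying;
  }
  return s;
}
```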

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117074
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-01-26 16:48:00 -08:00
f7ce61ba53 [C10D] Dump cpp stacktraces on heartbeat monitor timeout (#116717)
Summary:
If heartbeat monitor times out and kills the process, we want to know why.

It's convenient to use an internal tool for this, but we plan to later
integrate with torchelastic to call into pyspy or something else, which will be
both better (including py stacks) and compatible with OSS.

Test Plan: tested manually, observed c++ stacktraces were dumped

Reviewed By: fduwjj

Differential Revision: D52370243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116717
Approved by: https://github.com/zdevito
2024-01-26 16:48:00 -08:00
e7bae15ab1 [C10D] Make heartbeat_ atomic (#116702)
Summary:
Currently, the code is working. We know this because we observe heartbeat
timeouts.

However, there is a chance that if the code were refactored, the compiler could
optimize away the load of heartbeat_ inside heartbeatMonitor, and we wouldn't
know.

Using atomic here is not really for thread synchronization, but more to ensure
that compiler optimizations (hoisting the read outside the loop) can never be
allowed to happen. Again, we know this isn't currently happening, because if it
were, it would not be an intermittent failure; it would always fail
(at least with a fixed compiler/platform).

I previously avoided atomics because we didn't want shared locks between the heartbeat
monitor and the watchdog thread. Why? If the watchdog held the lock and hung, the monitor
could also hang. However, this really can't happen (AFAIK) when using an
atomic.
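
A minimal sketch of the pattern (names illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Sketch: heartbeat_ becomes std::atomic so the monitor's read cannot legally
// be hoisted out of its polling loop; no lock is shared with the watchdog, so
// the monitor can't be blocked by a hung watchdog.
std::atomic<uint64_t> heartbeat_{0};

void watchdogTick() { heartbeat_.fetch_add(1, std::memory_order_relaxed); }

bool madeProgressSince(uint64_t last) {
  return heartbeat_.load(std::memory_order_relaxed) != last;
}
```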

Test Plan: existing CI tests

Differential Revision: D52378257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116702
Approved by: https://github.com/fduwjj, https://github.com/zdevito
2024-01-26 16:48:00 -08:00
e71b422908 [C10D] Improve Heartbeat Monitor exit logs (#116268) (#116661)
Summary:

- add workMetaList_.size() so we know how many outstanding works there
  were when killing
- Print our first log before debuginfo dump instead of after, since it
  is clearer when reading the logs that we time out and then dump
- Organize the log strings: put them near where they are used

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: fduwjj

Differential Revision: D52369167

Pulled By: wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116661
Approved by: https://github.com/fduwjj
2024-01-26 16:48:00 -08:00
389940ce60 [c10d] Make DebugInfoWriter Singleton across all PG objects (#116489)
Previously, the writer was registered to each NCCL PG (backend), so for every PG we have an NCCL PG instance; if we use a customized writer when multiple sub-PGs are in use, we need to make sure the user registers the writer for every backend, which is bad UX. Furthermore, the debug info is global, so it does not make sense to have a writer per instance. We even have a static mutex in `dumpDebuggingInfo` to serialize the writes, which makes it more obvious that we can make the writer a singleton so that we only have one writer instance for all PG instances.

Although the rationale is clear, the implementation may vary a lot, so this PR is an RFC for now to see whether this implementation makes sense.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
2024-01-26 16:48:00 -08:00
b2237a7c85 [C10d] Fix Log Prefix in NCCLPG so that each instance gets its own prefix (#116520)
Somehow the log prefix only had ProcessGroup 0 and rank [global rank]. This does not give the expected result, since per the comment it should be "a prefix that is unique to this process group and rank". So this PR fixes it and makes the prefix different for different sub-PGs.

The reason is that the prefix was made static, so it is shared across all NCCLPG instances, and whoever calls this function first sets `rank_` and `uid_` in the prefix. We always initialize PG 0 first; that's why we always see PG[0] + global ranks for all sub-PGs.

[Screenshot: https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116520
Approved by: https://github.com/wconstab
ghstack dependencies: #116218
2024-01-26 16:48:00 -08:00
ef5dfe3f3e [c10d] Fix timeout dump path write path overlap when there are multiple PGs (#116218)
Basically, we observed that if there are multiple PGs and the timeout happens on one of the sub-PGs, we somehow use the local rank in the dump file name. We realized that:
1. For setting the timeout signal in the store, any watchdog thread from any PG can do it.
2. For checking and dumping, only the watchdog thread of the default PG is needed, since we always create the default PG and it contains all ranks (so there is no file-name conflict), and the store signal and the dumped debug info are both global.
3. Since the dump is global, we want to avoid ranks from a sub-PG polluting logs from global ranks (local rank 0 vs. global rank 0), so we use global ranks here to initialize the debug info writer. (Down the road, we are thinking about making it a singleton so that users only register it once for the multi-PG case.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116218
Approved by: https://github.com/wconstab
2024-01-26 16:48:00 -08:00
e303dc3c08 [c10d] Add stream info during nccl comm abort call (#116076)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116076
Approved by: https://github.com/XilunWu
2024-01-26 16:48:00 -08:00
265efad2de [C10D] Increase TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC (#116267)
Change default from 2 min to 10 min.

Why? Many cases of heartbeat timeout were reported, but increasing
timeout led to the same job hanging in a different place, suggesting
heartbeat kill was working well and not a false positive.  However, some
others reported jobs running fine with increased timeouts.  One such
case was investigated below, and suggests that indeed a 2 min timeout is
too aggressive.  While we have not fully root caused the issue, it
is better to avoid killing jobs that would otherwise complete.

The current theory is that the watchdog is not totally deadlocked, but is slowed
down in its processing of work objects due to some intermittent resource
contention. Hence, allowing more time is more of a workaround than a
fix.

Debug/Analysis:
https://docs.google.com/document/d/1NMNWoTB86ZpP9bqYLZ_EVA9byOlEfxw0wynMVEMlXwM

Differential Revision: [D52368791](https://our.internmc.facebook.com/intern/diff/D52368791)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116267
Approved by: https://github.com/fduwjj
2024-01-26 16:48:00 -08:00
60f0455905 [C10D] Make all PGNCCL LOG usages use logPrefix() (#116060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116060
Approved by: https://github.com/fduwjj
ghstack dependencies: #116059
2024-01-26 16:48:00 -08:00
4898313791 [C10D] Add logPrefix to abortCommsFromMap (#116059)
Prints additional info such as PG ID/Rank.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116059
Approved by: https://github.com/fduwjj
2024-01-26 16:48:00 -08:00
f4da9adf6b [C10D] Add waitForDumpOrTimeout to log on dump abandonment (#115876)
Helps call attention to any cases where the dump actually times out.

The timeout is likely to hit if we run into slow stacktrace processing.

Log any exceptions encountered in the background thread, but don't raise
them; we're already willing to abandon the debug dump, and want to
proceed with our normal execution (in the case of DumpPipe) or the shutdown
process (when dumping happens on timeout and shutdown is already
initiated).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876
Approved by: https://github.com/zdevito
ghstack dependencies: #115807
2024-01-26 16:48:00 -08:00
8f7f35273e [c10d] Polish NCCL PG monitor thread log message (#115888)
We turned on monitor thread by default in https://github.com/pytorch/pytorch/pull/112518, and we want the error message that is displayed when the monitor kills the process to be more informative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115888
Approved by: https://github.com/wconstab
2024-01-26 16:48:00 -08:00
44ec9612ed [C10D] Log PG size in init log (#115807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115807
Approved by: https://github.com/XilunWu
2024-01-26 16:48:00 -08:00
4d3bea2b29 [nccl flight recorder] nullptr profiling name (#115851)
Sometimes the profiling name can be a nullptr, which
throws on conversion to std::string. This adds a check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115851
Approved by: https://github.com/wconstab
2024-01-26 16:48:00 -08:00
0bcdddc3c1 [C10D] Make dumpDebuggingInfo share a mutex across PGs (#115803)
The mutex was originally added to avoid racing to dump debuginfo,
where a race in this case would result in a corrupted dump file.

The reason a mutex helps is that it forces all dump requests to be
serialized, so that an observer would either see an in-progress file, a
complete file, or no file.  Without a mutex, a fourth state is possible
(a file that has been written to by multiple threads and is invalid).

Because the mutex was a ProcessGroupNCCL class member, and each PG
instance has its own watchdog thread that can launch a dump, it was not
doing its job. Making the mutex static shares it between instances of
the class and ensures serialization of dumps triggered by any PG.

(Note: dumps triggered by different PGs have the same, global contents
anyway- there is only one global flight recorder, so it doesn't matter
who triggers it.)
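
A minimal sketch of the fix (function body elided):

```cpp
#include <mutex>

// Sketch: a function-local static mutex is shared by every ProcessGroupNCCL
// instance in the process, so concurrent dump requests from different PGs'
// watchdogs serialize instead of interleaving in the dump file.
void dumpDebuggingInfo() {
  static std::mutex writeDebugInfoMutex;  // one per process, not per PG
  std::lock_guard<std::mutex> guard(writeDebugInfoMutex);
  // ... write the (global) flight-recorder contents to the dump file ...
}
```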

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803
Approved by: https://github.com/kwen2501
ghstack dependencies: #115771, #115798, #115800, #115801
2024-01-26 16:48:00 -08:00
28b6220312 [C10D] Change PGNCCL logs to prefix [PG {} Rank {}] (#115801)
Adds a PG {process group uid} prefix component to logs.

This is helpful in situations where there are multiple process groups,
and rank information by itself is confusing. (For example, rank 0 on PG1
may correspond to rank 3 on PG0. People may assume 'rank0' references
the global (PG0) world, but it may reference a sub-PG. Prefacing the PG
helps clarify this.)

Does NOT change logs from inside WorkNCCL functions, since WorkNCCL
doesn't know what PG ID it corresponds to. Will address these logs
separately.

Example:

```
[I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798, #115800
2024-01-26 16:48:00 -08:00
210b7b65e2 [C10D] Refactor NCCL logs to use common prefix helper (#115800)
Put the repeated code that string formats [Rank {rank}] in one place.

Sets up for the next PR that also adds more info to this prefix.

(Does not change exception messages, which could be done as well;
exception messages are not formatted quite the same way. This PR tries
instead to avoid changing log behavior and only
refactors code.)

Did limited testing (some logs were observed OK).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798
2024-01-26 16:48:00 -08:00
4da10b5cd3 [C10D] Only open NCCL dump pipe file once per process (#115798)
The NCCL flight recorder is per-process (it is shared by all
process groups), but individual process groups used to construct their
own pipe for being signaled to dump the flight recorder.

This ensures that only one pipe per process is created, by only creating
the pipe on the first ProcessGroup (uid_ == 0) which should be the world
group.

Filenames are still keyed off of rank, but this should now be global
rank instead of sub-pg rank, making the filenames unique across the
whole trainer process.
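
A sketch of the gating described above (names are illustrative):

```cpp
#include <string>

// Sketch: only the first process group (the world group, uid 0) opens the
// dump-trigger pipe, keyed by global rank so filenames are process-unique.
void maybeOpenDumpPipe(int uid, int globalRank, const std::string& pipeDir) {
  if (uid != 0) {
    return;  // sub-PGs reuse the world group's pipe; do not create another
  }
  std::string pipePath = pipeDir + "/dump_pipe_" + std::to_string(globalRank);
  // ... create/open pipePath and poll it for dump requests ...
  (void)pipePath;
}
```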

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798
Approved by: https://github.com/zdevito
ghstack dependencies: #115771
2024-01-26 16:48:00 -08:00
f09763814f [C10D] Make DumpPipe disabled when FlightRecorder disabled (#115771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115771
Approved by: https://github.com/fduwjj
2024-01-26 16:48:00 -08:00
80923ed5a6 [C10D] Make DumpPipe pipe file configurable (#115770)
Add TORCH_NCCL_DEBUG_INFO_PIPE_FILE env, allowing separate pipe file
location from dump file location.

Defaults PIPE_FILE to empty, meaning disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115770
Approved by: https://github.com/zdevito
2024-01-26 16:48:00 -08:00
0ff155fb65 Fix SDPA for SAM (#115636)
Addresses the regression for Segment Anything Fast in https://github.com/pytorch-labs/segment-anything-fast/issues/99
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115636
Approved by: https://github.com/soulitzer, https://github.com/ani300
2023-12-12 18:52:38 +00:00
8885128dcc Fix backward for SDPA NT jagged layout (#115576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115576
Approved by: https://github.com/jbschlosser, https://github.com/ani300
2023-12-12 18:35:40 +00:00
7553c49514 [S382174] Fix distributed debug w/ non-equal split (#115483)
Summary:
In collectives, it's possible to have a non-equal split, which has a different implementation, and the output tensor sizes will differ, e.g. https://www.internalfb.com/code/fbsource/[460afb1172b5]/fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp?lines=3104. However, TORCH_DISTRIBUTED_DEBUG=DETAIL assumes the output tensor sizes are the same, performs the check, and fails the job if they don't match: https://fburl.com/code/mhte9ty8. c10d code should handle this.

Ideally, we should check the input sizes across ranks and make sure they're the same. Maybe in the next diff.

Test Plan: Tested torchrec's TWRW with a non-even split, and it's working now.

Reviewed By: zhangruiskyline

Differential Revision: D52010942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115483
Approved by: https://github.com/kwen2501, https://github.com/fegin, https://github.com/XilunWu
2023-12-12 18:02:05 +00:00
d521857411 Terminate handler (#101332)
Fixes #50051.
This PR is based on #50320, and I address the last feedback.
On Windows it is enabled by default. It can be enabled or disabled via the USE_CUSTOM_TERMINATE env variable.

This PR adds support for overriding the terminate handler in order to log uncaught exceptions in threads.
If an exception is thrown and not caught, it will print `<Unhandled exception caught in c10/util/AbortHandler.h>`.
The point of doing this is that in issue #50051, exceptions were thrown but not logged. With this logging system it will be easier to debug such issues in the future.
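
A self-contained sketch of the mechanism (the message string mirrors the one quoted above; everything else is illustrative, not the actual AbortHandler.h code):

```cpp
#include <cstdlib>
#include <exception>
#include <iostream>

// Sketch: install a custom std::terminate handler so an exception escaping
// any thread is logged before the process dies, instead of vanishing.
void logAndAbort() {
  std::cerr << "<Unhandled exception caught in c10/util/AbortHandler.h>\n";
  if (auto e = std::current_exception()) {
    try { std::rethrow_exception(e); }
    catch (const std::exception& ex) { std::cerr << ex.what() << "\n"; }
    catch (...) { std::cerr << "unknown exception\n"; }
  }
  std::abort();
}

struct InstallHandler {
  InstallHandler() { std::set_terminate(logAndAbort); }
} installHandler;  // runs at static-init time, matching the opt-in described above
```
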
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-12 17:55:27 +00:00
36b5136270 [inductor] Don't print disable_cudagraphs_reason when cudagraphs is disabled (#115489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115489
Approved by: https://github.com/yanboliang
2023-12-12 17:50:18 +00:00
670eb83573 Enable test_sparse_addmm for crossref tests (#115536)
Fixes https://github.com/pytorch/pytorch/issues/97284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115536
Approved by: https://github.com/cpuhrsch
2023-12-12 17:26:40 +00:00
a8dc9d8e35 [8/n] Update XNNPACK Version Part 8 Everything Remaining to get it to work (#115587)
> **__Note:__** The XNNPACK upgrade is very large, in the range of **40k** files and **10M** lines of code, so we break the update of the library into multiple parts. All parts [1 - 6/n] must be landed together for it to work. ***This also means that if there is a revert, please revert the entire stack.***

This change is everything remaining requiring XNNPACK version to work.

Differential Revision: [D52044420](https://our.internmc.facebook.com/intern/diff/D52044420/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115587
Approved by: https://github.com/digantdesai
2023-12-12 17:17:19 +00:00
e918461377 Add instructions for generating optimal Triton kernel parameters of bsr_dense_addmm (#115504)
As in the title.

In addition, enable verbose output when executing the torch/sparse/_triton_ops_meta.py script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115504
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #115499
2023-12-12 16:44:51 +00:00
32286512cc Add tune_bsr_dense_addmm as an API to find optimal triton kernel parameters for bsr_dense_addmm (#115499)
As in the title.

In addition:
- improve the algorithm for finding a minimum of operation timings: break out of the inner loop early when the next minimum candidate is found
- add tests and fix bugs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115499
Approved by: https://github.com/cpuhrsch
2023-12-12 16:44:51 +00:00
40dc0580a6 [inductor] De-duplicate triton helper functions (#115546)
Previously, if two calls to cumsum were generated in the same Triton kernel,
we would generate identical helper functions with different names. Now this
recognizes identical functions and defines each only once. To do this, I defer
choosing the name until after codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115546
Approved by: https://github.com/lezcano
ghstack dependencies: #109132
2023-12-12 16:30:50 +00:00
02196c21ac [inductor] Parameterize ir.Scan on combine_fn (#109132)
This replaces `tl.cumsum` and `tl.cumprod` with calls to `tl.associative_scan`
where the combine function is generated from inductor IR.

So before we had:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr):
    xnumel = 20
    rnumel = 30
    RBLOCK: tl.constexpr = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (30*x0)), rmask & xmask, other=0).to(tl.float32)
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
    tmp2 = tl.where(rmask & xmask, tmp1, 0)
    tmp3 = tl.cumsum(tmp2, 1)
    tl.store(out_ptr0 + (r1 + (30*x0)), tmp3, rmask & xmask)
```

Now we have:
```python
@triton.jit
def _triton_helper_fn0(arg0, arg1):
    tmp0 = arg0 + arg1
    return tmp0

@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr):
    xnumel = 20
    rnumel = 30
    RBLOCK: tl.constexpr = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (30*x0)), rmask & xmask, other=0).to(tl.float32)
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
    tmp2 = tl.where(rmask & xmask, tmp1, 0)
    tmp3 = tl.associative_scan(tmp2, 1, _triton_helper_fn0)
    tl.store(out_ptr0 + (r1 + (30*x0)), tmp3, rmask & xmask)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109132
Approved by: https://github.com/lezcano
2023-12-12 16:30:50 +00:00
d5286d7ea8 [export] Add canonical form for differentiating IR (#115589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115589
Approved by: https://github.com/suo
2023-12-12 16:21:57 +00:00
de4b2e59a7 [PyTorch] AOTI: add more basic aoti_torch getters (#112799)
There is a lot of simple information about tensors that we couldn't get. In
particular, we didn't know the lengths of the arrays returned by sizes
and strides.

Differential Revision: [D50949929](https://our.internmc.facebook.com/intern/diff/D50949929/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112799
Approved by: https://github.com/desertfire, https://github.com/aakhundov
ghstack dependencies: #112116, #112174, #112405, #112798
2023-12-12 15:56:33 +00:00
c5c4d81b1b Switched stale workflow to linux.large.arc (#115635)
Switched stale workflow to linux.large.arc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115635
Approved by: https://github.com/jeanschmidt
2023-12-12 15:33:59 +00:00
4fafc36c33 [MPS] Fix sum and prod for complex types (#115554)
By not force-casting dtype to float

Test plan: `python -c "import torch;print(torch.linspace(-3.0, 3.0, 50, dtype=torch.cfloat, device='mps').sqrt().sin().sum())"`

Before:
```
tensor(21.1778+0.j, device='mps:0')
```
After
```
tensor(21.1778+39.1377j, device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115554
Approved by: https://github.com/lezcano
ghstack dependencies: #115512, #115513
2023-12-12 15:04:45 +00:00
07f03b4a62 [MPS] Add support for MPSDataTypeComplexFloat[16|32] (#115513)
But limit it to macOS Sonoma+.

Before this change, calling `torch.cat` with complex types failed, but now it works.
Before:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
TypeError: Trying to convert ComplexFloat to the MPS backend but it does not have support for that dtype.
```
After:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
tensor([[0.4857+0.0030j, 0.9375+0.8630j, 0.3544+0.9911j],
        [0.5293+0.8652j, 0.8440+0.1991j, 0.5152+0.8276j],
        [0.0136+0.7469j, 0.1403+0.4761j, 0.2943+0.0896j],
        [0.6458+0.0035j, 0.3579+0.4577j, 0.1723+0.1508j],
        [0.4420+0.3554j, 0.4396+0.7272j, 0.2479+0.1191j],
        [0.3895+0.2292j, 0.7886+0.1613j, 0.9243+0.4180j]], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115513
Approved by: https://github.com/kulinseth
ghstack dependencies: #115512
2023-12-12 15:04:45 +00:00
21cf6e76c2 Revert "Use linux.large.arc for stale workflow (#115440)"
This reverts commit dadb3694ffaa2a0bfe78516c294a46566430c1ad.

Reverted https://github.com/pytorch/pytorch/pull/115440 on behalf of https://github.com/DanilBaibak due to Did not merge properly ([comment](https://github.com/pytorch/pytorch/pull/115440#issuecomment-1852126050))
2023-12-12 14:20:29 +00:00
dadb3694ff Use linux.large.arc for stale workflow (#115440)
* Try linux.large.arc for stale workflow

* Run stale workflow on PR changes

* Added arc runner label to the list of self-hosted runners

* Added concurrency for the linux job

* Cleanup

* Added workflow_dispatch for testing purpose
2023-12-12 15:11:09 +01:00
7350dcb307 [CI] Fix lint errors on master (#115627)
Differential Revision: [D52073432](https://our.internmc.facebook.com/intern/diff/D52073432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115627
Approved by: https://github.com/atalman
2023-12-12 13:53:14 +00:00
bc51a0c22f Revert "[PyTorch] AOTI: add more basic aoti_torch getters (#112799)"
This reverts commit 3de2596abed9717a166635b48126302fcf46527a.

Reverted https://github.com/pytorch/pytorch/pull/112799 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112799#issuecomment-1852076887))
2023-12-12 13:52:34 +00:00
f98b0f3ebc Add bfloat16 support to torch.sparse.addmm for CPU (#115535)
Fixes https://github.com/pytorch/pytorch/issues/73145.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115535
Approved by: https://github.com/cpuhrsch
2023-12-12 13:26:33 +00:00
d6f8850653 Revert "[Export] Test non-strict mode on existing test cases (#115399)"
This reverts commit 36527df344c0c33dae8bc6c94eded8646013b736.

Reverted https://github.com/pytorch/pytorch/pull/115399 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115399#issuecomment-1851988651))
2023-12-12 13:02:18 +00:00
a8acd6c410 Add Half support for AvgPool2d on CPU (#109578)
Add Half support for AvgPool2d (both channels last and channels first) on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109578
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-12-12 12:59:47 +00:00
92fd3927b0 [export][reland] Add math.* ops to pass base (#115559)
Reland of https://github.com/pytorch/pytorch/pull/115271/
Fixes https://github.com/pytorch/pytorch/issues/115209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115559
Approved by: https://github.com/zhxchen17, https://github.com/atalman
ghstack dependencies: #115556, #115557, #115558
2023-12-12 10:46:41 +00:00
36527df344 [Export] Test non-strict mode on existing test cases (#115399)
Summary:
Dynamo's test methodology provides a good example of applying various
treatments to the same set of test cases. A pitfall is the global config,
which can easily be modified somewhere. Here we change the behavior of
the export API by hijacking it with self-defined code.

To support the non-strict test suite, `strict=False` is explicitly
passed into the export API whether it's called with or without the strict arg.

* For existing failing strict test cases, non-strict also fails.
* For cases that pass in strict mode but fail in non-strict, we mark them as
`@testing.expectedFailureNonStrict`.
* Moreover, I manually checked the failure reasons, and some of them are not
related to the nn.Module assertion exception. I mark them as `# Need to fix
for non-strict mode`.

Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2023-12-12 07:11:53 +00:00
fdf814c6ca Revert "[MPS] Add support for MPSDataTypeComplexFloat[16|32] (#115513)"
This reverts commit a4bb4a237348ff8d688e43ba542ee59a9d7ed4a6.

Reverted https://github.com/pytorch/pytorch/pull/115513 on behalf of https://github.com/malfet due to Broke Mac x86 periodic builds ([comment](https://github.com/pytorch/pytorch/pull/115513#issuecomment-1851398773))
2023-12-12 06:50:47 +00:00
46694e92b7 Revert "[MPS] Fix sum and prod for complex types (#115554)"
This reverts commit 8b28380c8ed5b5bfe479392bcffeccf8b89be328.

Reverted https://github.com/pytorch/pytorch/pull/115554 on behalf of https://github.com/malfet due to Broke MacOS x86 builds ([comment](https://github.com/pytorch/pytorch/pull/115554#issuecomment-1851395982))
2023-12-12 06:47:39 +00:00
f28687dfb2 Do not use pytorchbot-env from upload-test-stats (#115606)
As it was only needed to check our token rate limits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115606
Approved by: https://github.com/huydhn
2023-12-12 06:42:33 +00:00
1eca63c6ac [DeviceMesh] Move helper function 'get_mesh_dim_by_name' to MeshEnv class (#115572)
Move the helper function `get_mesh_dim_by_name` outside of the DeviceMesh class to keep the public class cleaner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115572
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2023-12-12 06:29:46 +00:00
3de2596abe [PyTorch] AOTI: add more basic aoti_torch getters (#112799)
There is a lot of simple information about tensors that we couldn't get. In
particular, we didn't know the lengths of the arrays returned by sizes
and strides.

Differential Revision: [D50949929](https://our.internmc.facebook.com/intern/diff/D50949929/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112799
Approved by: https://github.com/desertfire, https://github.com/aakhundov
ghstack dependencies: #112116, #112174, #112405, #112798
2023-12-12 06:19:45 +00:00
2b323e61ad [PyTorch] AOTI: Use static_cast, not dynamic_cast (#112798)
dynamic_cast is for when we aren't certain about the type. We are certain (and will crash anyway if we're wrong).

Differential Revision: [D50812978](https://our.internmc.facebook.com/intern/diff/D50812978/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112798
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel, https://github.com/khabinov
ghstack dependencies: #112116, #112174, #112405
2023-12-12 06:19:45 +00:00
ca52195112 [PyTorch] AOTI: Avoid aoti_torch_data_ptr calls for constants at inference time (#112405)
Cache aoti_torch_get_data_ptr at constants update time.

Differential Revision: [D50708982](https://our.internmc.facebook.com/intern/diff/D50708982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112405
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
ghstack dependencies: #112116, #112174
2023-12-12 06:19:45 +00:00
24c67fe8cf [PyTorch] AOTI: Emit static constexpr int array vars when possible (#112174)
No need to populate a stack-based array for a shape/stride array when it's statically known.

Differential Revision: [D50699889](https://our.internmc.facebook.com/intern/diff/D50699889/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112174
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #112116
2023-12-12 06:19:45 +00:00
ff6f987adc [PyTorch] Replace cached thread_locals with stack allocation in AOTI (#112116)
This changes cached thread_local tensors to stack-allocated buffers. Since we were incidentally caching output in a thread_local, I had to add manual thread_local caching of outputs, which I implemented by caching a buffer and a Tensor whose storage is that buffer and then just memcpying the result into the cached buffer every time. Ideally, memory planning would be able to identify allocations that are the backing storage for outputs, but this should be good enough in the absence of planning.

Differential Revision: [D50416438](https://our.internmc.facebook.com/intern/diff/D50416438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112116
Approved by: https://github.com/jansel, https://github.com/desertfire
2023-12-12 06:19:45 +00:00
405a0040cf Adds tool to visualize sharding (#114307)
This pull request adds a tool to visualize sharding. It uses the device_mesh and placement details to construct a visualization of the split of a torch dtensor.

Things to fix:

- [x] This implementation only uses the first element of the placement tuple; when can there be more than one element?
- [x] The calculation of the split is happening here, but maybe it is already done somewhere internally in the Shard class; can we directly call that here?

Fixes #108746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114307
Approved by: https://github.com/wanchaol
2023-12-12 06:18:03 +00:00
65651d970b Optimize the copy of Half to Float and Float to Half on CPU (#103148)
### Description
Optimize the copy of Half to Float and Float to Half on CPU.

### Testing

Single core:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (1, 777) | 0.00345 | 0.00344 | 0.00411 | 0.00410
size: (2, 512) | 0.00355 | 0.00344 | 0.00431 | 0.00400
size: (10, 555) | 0.00473 | 0.00391 | 0.00562 | 0.00477
size: (1, 2048, 1024) | 0.488 | 0.480 | 0.498 | 0.499
size: (32, 100, 777) | 0.584 | 0.568 | 0.571 | 0.587

28 cores:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (10, 555) |  0.00472 | 0.00369 | 0.00576 |  0.00481
size: (1, 2048, 1024) |  0.0189 | 0.0188 | 0.0173 | 0.0251
size: (64, 512, 1024) | 3.159 | 2.375 |  3.152 | 2.358
size: (32, 100, 777) | 0.0225 | 0.0195 | 0.0193 | 0.0261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103148
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-12 05:57:52 +00:00
b6a4866330 [export][reland][refactor][3/n] Move unlift to separate file (#115558)
Reland of https://github.com/pytorch/pytorch/pull/114787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115558
Approved by: https://github.com/zhxchen17, https://github.com/atalman
ghstack dependencies: #115556, #115557
2023-12-12 05:37:07 +00:00
36199747f3 [export][reland][refactor][2/n] Move tracing logic (#115557)
Reland of https://github.com/pytorch/pytorch/pull/114768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115557
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115556
2023-12-12 05:37:07 +00:00
dd9a989b83 [export][reland][refactor][1/n] Split dynamic shapes (#115556)
Reland of https://github.com/pytorch/pytorch/pull/114764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115556
Approved by: https://github.com/zhxchen17
2023-12-12 05:36:41 +00:00
744d74c456 [inductor][optimus] enable smart fusion (#115471)
Summary: Enable gmm smart fusion in D51698686

Test Plan: buck test

Differential Revision: D52002137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115471
Approved by: https://github.com/mengluy0125
2023-12-12 05:04:36 +00:00
fbb744fd49 [dtensor] enable radam foreach optimizer (#115566)
As titled, test both non-foreach and foreach optim

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115566
Approved by: https://github.com/XilunWu
ghstack dependencies: #115297, #115564, #115565
2023-12-12 03:57:00 +00:00
c322e5b5e9 [dtensor] add test for nadam optimizer (#115565)
as titled, foreach ops already supported, just add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115565
Approved by: https://github.com/XilunWu
ghstack dependencies: #115297, #115564
2023-12-12 03:57:00 +00:00
4bd661c472 [dtensor] enable adadelta foreach optimizer (#115564)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115564
Approved by: https://github.com/XilunWu
ghstack dependencies: #115297
2023-12-12 03:56:55 +00:00
8a27352d6b [dtensor] add a implicit replication flag (#115297)
This PR adds experimental implicit replication support for DTensor to
interoperate with torch.Tensor: under this context manager, DTensor
can work together with torch.Tensor by assuming the torch.Tensor's
sharding layout is replicated.

Note that this is risky for DTensor, so we don't turn it on by default;
but for certain cases where the tensor is for sure replicated, users can use
this to allow DTensor and torch.Tensor computation to work together (see the sketch below).
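
A minimal sketch, assuming the flag landed as the `implicit_replication` context manager under `torch.distributed._tensor.experimental` (treat the import path as an assumption):

```python
import torch
from torch.distributed._tensor import DeviceMesh, Replicate, distribute_tensor
from torch.distributed._tensor.experimental import implicit_replication  # assumed path

mesh = DeviceMesh("cuda", [0, 1])
dt = distribute_tensor(torch.ones(4, 4), mesh, placements=[Replicate()])

with implicit_replication():
    # the plain torch.Tensor operand is assumed to be replicated
    out = dt + torch.ones(4, 4)
```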

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115297
Approved by: https://github.com/awgu
2023-12-12 03:56:48 +00:00
c70f995b5c [DeviceMesh] Add mesh_dim_names to DeviceMesh __repr__ if it exists (#115579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115579
Approved by: https://github.com/wanchaol
2023-12-12 02:18:34 +00:00
0fc04e274d [inductor] Fix an aliased output bug (#115373)
Summary: addresses the aliased-output issue reported in https://github.com/pytorch/pytorch/issues/97083.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373
Approved by: https://github.com/jansel
2023-12-12 01:18:59 +00:00
89ee3af076 [Reland][Dynamo] Don't log compilation metrics for PyTorch unit tests (#115571)
Reland #115452, which was reverted to simplify a merge conflict with #115386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115571
Approved by: https://github.com/yanboliang
2023-12-12 01:15:54 +00:00
064846dbc2 [cpu] flash attention optimization (#115151)
### Modifications
- **EXP**: Add a fast version with a reduced accuracy (ULP20) to vec exp `exp_u20` and use it in flash attention.
- **FUSION**: Do fusion for `softmax` ops.
- **SCALE**: Move the calculation of `scaling_factor` after `gemm`.

### Performance
_Model: Stable Diffusion V2.1_

| Version | BF16 Kernel latency (s) | BF16 speedup | FP32 Kernel latency (s) | FP32 speedup |
| ----- | ----- | ----- | ----- | ----- |
| PT | 15.865 |  | 35.362 |  |
| PT + EXP | 12.518 | 21.10% | 19.327 | 45.35% |
| PT + EXP + FUSION | 11.774 | 25.79% | 18.306 | 48.23% |
| PT + EXP + FUSION + SCALE | 11.053 | 30.33% | 18.360 | 48.08% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115151
Approved by: https://github.com/jgong5, https://github.com/drisspg
2023-12-12 01:09:55 +00:00
0379c11248 [c10d] Enable PG NCCL monitor thread by default (#115577)
We added a monitor thread in NCCL PG in https://github.com/pytorch/pytorch/pull/112518. To summarize what we are doing in the monitor thread: it listens to the heartbeat from the watchdog thread and detects unhealthy NCCL watchdog hangs (due to several reasons such as NCCL/CUDA API bugs or unexpected blocking behaviors). This is the last resort to ensure that we don't silently let the training job run for hours.

We didn't enable this feature by default at first, since we wanted to perform more due diligence and have some customers try it out. So far, we haven't seen any obstacle to turning this feature on and have received positive feedback from users. We now turn it on by default in this PR.

If this feature turns out not to work as expected and disturbs one's training process, one can set `TORCH_NCCL_ENABLE_MONITORING=0` to disable it (see the sketch below). Please kindly file an issue with us so that we can see if we missed any corner cases during the design.
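
For reference, a minimal sketch of the opt-out described above (the env var name is taken from this PR):

```python
# Opt out of the monitor thread, e.g. while debugging a suspected false positive.
# Must be set before the NCCL process group is created.
import os
os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"

import torch.distributed as dist
# dist.init_process_group("nccl")  # the monitor thread now stays disabled
```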

Differential Revision: [D52045911](https://our.internmc.facebook.com/intern/diff/D52045911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115577
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2023-12-12 00:45:54 +00:00
6988e40b48 [quant][fx] Lower operator.matmul in convert_fx (#113954)
Summary: We support lowering `torch.matmul` but not
`operator.matmul`. This commit adds support for the latter,
which enables lowering the shorthand `@`. This addresses
https://github.com/pytorch/pytorch/issues/111450.
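
A minimal sketch of the newly supported pattern (the wrapper module is illustrative):

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class MatMul(torch.nn.Module):
    def forward(self, x, y):
        return x @ y  # the `@` shorthand dispatches to operator.matmul

m = MatMul().eval()
example_inputs = (torch.randn(2, 3), torch.randn(3, 4))
prepared = prepare_fx(m, get_default_qconfig_mapping(), example_inputs)
converted = convert_fx(prepared)  # `@` is now lowered like torch.matmul
```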

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers: jerryzh168

Subscribers: jerryzh168, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113954
Approved by: https://github.com/jerryzh168
2023-12-12 00:34:58 +00:00
0a464ad1a7 [dtensor] turn back on symbolic shape in tests (#115568)
As titled: now that @jbschlosser has enabled dynamic shape support for traceable
subclasses, turn the tests back on with the default setting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115568
Approved by: https://github.com/XilunWu
2023-12-12 00:26:23 +00:00
078773b32b [ROCm] Add owners for more HIP-specific paths (#113989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113989
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-12-12 00:24:38 +00:00
17de38c9af [Dynamo] Check duplication when loading dynamo tracing rules (#115059)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115059
Approved by: https://github.com/jansel
2023-12-12 00:22:20 +00:00
0692240b90 [dtensor] account for empty list when turning to OpStrategy (#115298)
Trying to fix https://github.com/pytorch/pytorch/issues/115065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115298
Approved by: https://github.com/XilunWu
2023-12-12 00:11:16 +00:00
19c67a9db5 [dynamo] Fix a closure cell empty error (#115541)
Summary: Fixes https://github.com/pytorch/pytorch/issues/97115. The solution given by @jansel in that issue works. Checking in the code so it won't get lost.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115541
Approved by: https://github.com/jansel
2023-12-12 00:01:51 +00:00
617c228fba [CI] Lower the smoketest speedup threshold for nangpt (#115562)
Summary:
https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314
shows the variance can be larger than previously expected. Lowering it
for now and if it continues to be a problem, we should switch to some
other more stable model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115562
Approved by: https://github.com/chenyang78
2023-12-11 23:46:30 +00:00
4471fe6c39 [sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_sparse_mm_search (#115178)
Summary:

cuSPARSELt supports different algorithm ids (alg_id), which are set via
`cusparseLTMatmulAlgSetAttribute`; in total there are 4 different
alg_ids, 0 - 3.

Previously we were just using the default alg_id, since our initial
experiments found that for most shapes the default alg_id is the
fastest, and that the choice made no difference to numerical correctness,
only performance. From those experiments the fastest alg_id seemed to
differ only on small matmul shapes.

@danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0 on LLM shapes, which match these
characteristics (activations are small, weights are large).

However, it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0:
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```

This PR adds the following:
- support for passing an alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
  a given matmul

_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that @danthe3rd provided.

We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile. A hedged sketch of the resulting flow follows below.
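
A hedged sketch of the search-then-run flow (these are private ops; the `alg_id` keyword follows the PR description and is not a stable API):

```python
import torch

# A must already be 2:4 semi-structured sparse for cuSPARSELt to compress it.
A = torch.tensor([[1, 0, 2, 0]], dtype=torch.float16, device="cuda").tile(64, 32)
B = torch.randn(128, 128, dtype=torch.float16, device="cuda")
A_compressed = torch._cslt_compress(A)

alg_id = torch._cslt_sparse_mm_search(A_compressed, B)  # probe for the fastest alg_id
out = torch._cslt_sparse_mm(A_compressed, B, alg_id=alg_id)
```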

Test Plan:
```
python test/test_sparse_semi_structured.py -k cslt
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
2023-12-11 23:08:51 +00:00
8b28380c8e [MPS] Fix sum and prod for complex types (#115554)
By not force-casting dtype to float

Test plan: `python -c "import torch;print(torch.linspace(-3.0, 3.0, 50, dtype=torch.cfloat, device='mps').sqrt().sin().sum())"`

Before:
```
tensor(21.1778+0.j, device='mps:0')
```
After
```
tensor(21.1778+39.1377j, device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115554
Approved by: https://github.com/lezcano
ghstack dependencies: #115512, #115513
2023-12-11 23:03:44 +00:00
a4bb4a2373 [MPS] Add support for MPSDataTypeComplexFloat[16|32] (#115513)
But limit it to macOS Sonoma+.

Previously, calling `torch.cat` with complex types failed; now it works.
Before:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
TypeError: Trying to convert ComplexFloat to the MPS backend but it does not have support for that dtype.
```
After:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
tensor([[0.4857+0.0030j, 0.9375+0.8630j, 0.3544+0.9911j],
        [0.5293+0.8652j, 0.8440+0.1991j, 0.5152+0.8276j],
        [0.0136+0.7469j, 0.1403+0.4761j, 0.2943+0.0896j],
        [0.6458+0.0035j, 0.3579+0.4577j, 0.1723+0.1508j],
        [0.4420+0.3554j, 0.4396+0.7272j, 0.2479+0.1191j],
        [0.3895+0.2292j, 0.7886+0.1613j, 0.9243+0.4180j]], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115513
Approved by: https://github.com/kulinseth
ghstack dependencies: #115512
2023-12-11 23:03:44 +00:00
288822c968 Increase ROCm test shards to 6 (#110997)
To reduce signal time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110997
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-12-11 22:30:16 +00:00
4307ccde99 Move ONNX's TorchModelType to pytorch_test_common to fix circ. dep. (#115353)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115353
Approved by: https://github.com/BowenBao
2023-12-11 22:23:03 +00:00
suo
ccd5bde6a3 [export] Reintroduce InterpreterModule to unflatten (#115436)
InterpreterModule is better than GraphModule codegen; it's more debuggable and
has better stack traces. The only reason we don't use it today is because
torch.compile doesn't work with it.

I work around this by constructing a GraphModule separately for usage during
dynamo tracing, but otherwise using torch.fx.Interpreter.

Differential Revision: [D51971661](https://our.internmc.facebook.com/intern/diff/D51971661/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115436
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115408
2023-12-11 22:15:32 +00:00
suo
c137335b5c [export] make UnflattenedModule not inherit from GraphModule (#115408)
UnflattenedModule doesn't really behave like a graph module; we customize `__call__` to do something completely different from what GraphModule does. So, things that test `isinstance(unflattened_module, GraphModule)` and then operate on the GraphModule are often broken.

This change makes UnflattenedModule its own thing.

Differential Revision: [D51959097](https://our.internmc.facebook.com/intern/diff/D51959097/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115408
Approved by: https://github.com/zhxchen17
2023-12-11 22:15:21 +00:00
8c1567d021 [c10d] Change watchdog inner loop function name to make it more accurate (#115404)
The name `workCleanupLoop` does not reflect all the things we do in the watchdog thread, so we propose a new name here that reflects what the watchdog thread is actually doing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115404
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2023-12-11 22:00:06 +00:00
99f06c0cc2 [BE] update errors to be more descriptive (#115443)
we call `_check_single_tensor` and `_check_tensor_list` as validation but don't print out the param types that were invalid

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115443
Approved by: https://github.com/XilunWu
2023-12-11 21:21:10 +00:00
b706c4116d [MPS] Add MacOS 14 runtime check (#115512)
Prerequisite for adding more complex type support and FFT operation

Check using `conjugateWithTensor:name:` selector defined as follows
```objc
/// Returns the complex conjugate of the input tensor elements.
///
/// - Parameters:
///   - tensor: The input tensor.
///   - name: An optional string which serves as an identifier for the operation..
/// - Returns: A valid `MPSGraphTensor` object containing the elementwise result of the applied operation.
-(MPSGraphTensor *) conjugateWithTensor:(MPSGraphTensor *) tensor
                                   name:(NSString * _Nullable) name
MPS_AVAILABLE_STARTING(macos(14.0), ios(17.0), tvos(17.0))
MPS_SWIFT_NAME( conjugate(tensor:name:) );
```

- Rename the `isOnMacOS13orNewer(unsigned minor)` hook to `isOnMacOSorNewer(major, minor)`
- Replace `torch._C.__mps_is_on_macos_13_or_newer` with `torch._C._mps_is_on_macos_or_newer`
- Add the `torch.backends.mps.is_macos_or_newer` public API (see the usage sketch below)
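
A usage sketch of the new public API (assuming it takes `(major, minor)` like the renamed C++ hook):

```python
import torch

# Gate complex-dtype MPS work on the new runtime check.
if torch.backends.mps.is_available() and torch.backends.mps.is_macos_or_newer(14, 0):
    x = torch.rand(3, 3, dtype=torch.cfloat, device="mps")
    print(x.conj())
```
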
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115512
Approved by: https://github.com/albanD
2023-12-11 21:11:42 +00:00
03ff44c958 [c10d] Fix Store check condition in NCCL PG watchdog (#115475)
In https://github.com/pytorch/pytorch/pull/115449/, after turning on `DUMP_ON_TIMEOUT=1`, some existing tests somehow failed. Upon checking, the failure is caused by the TCPStore check call within the watchdog thread.

1. It's not because TCPStore creation has not completed: even if we make the test sleep for a long time, it still fails. Rather, it's because we query the TCPStore after we have shut down the PG.

2. The reason for that is: the `std::chrono::steady_clock::now()` function in C++ returns a `time_point` object representing the current point in time according to the steady clock. The default unit of this time_point is not directly specified in terms of seconds or nanoseconds; rather, it depends on the internal representation of the steady clock, which can vary between implementations. In reality it's nanoseconds, which makes the delta so big that we check the store every time the watchdog thread wakes up. To make things even worse, `terminateProcessGroup_` might be set to `true` after the outermost while-loop check but before the TCPStore check, so the watchdog gets stuck checking a TCPStore which has already been deleted, while the main thread is still waiting for the watchdog to join.

The solution here is:
1. Add back `std::chrono::duration_cast` to ensure the delta is indeed in milliseconds, so that the timeout check logic works as expected.
2. Check `terminateProcessGroup_` as well, so that we don't do any dump when the main thread has already marked the process as exited.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115475
Approved by: https://github.com/wconstab
2023-12-11 21:06:05 +00:00
ccc9e5f5bc Optimize conv2d pw quantized (#115221)
Summary:
In order to get better performance for pointwise (pw) conv2d, it's better to read the input together in a batch.

With this optimization on CUNET-enc ops:

Kernel Name              Workgroup Size         Duration P50 (ns)
===========              ==============         =================
vulkan.quantized_conv2d_pw_2x2{96, 72, 2}                       891332
vulkan.quantized_conv2d_pw_2x2{48, 36, 4}                       528528
vulkan.quantized_conv2d_pw_2x2{24, 18, 8}                       557336

Without this optimization:
Kernel Name              Workgroup Size         Duration P50 (ns)
===========              ==============         =================
vulkan.quantized_conv2d_pw_2x2{96, 72, 2}                      1633268
vulkan.quantized_conv2d_pw_2x2{48, 36, 4}                      1177228
vulkan.quantized_conv2d_pw_2x2{24, 18, 8}                      1343264

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
...
[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.zero_size_tensor
[       OK ] VulkanAPITest.zero_size_tensor (83 ms)
...
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7593: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 395 tests from VulkanAPITest (6515 ms total)

[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin

Differential Revision: D50997530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115221
Approved by: https://github.com/yipjustin
2023-12-11 20:59:15 +00:00
585aea6e77 [xla hash update] update the pinned xla hash (#115528)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115528
Approved by: https://github.com/clee2000
2023-12-11 20:22:46 +00:00
505574c46a Add decomposition for torch.block_diag (#115096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115096
Approved by: https://github.com/peterbell10
2023-12-11 20:04:22 +00:00
5fe2b138e3 Revert "[inductor] Fix an aliased output bug (#115373)"
This reverts commit 1310f0bf38293b68a781287d1de8cf699a76974d.

Reverted https://github.com/pytorch/pytorch/pull/115373 on behalf of https://github.com/atalman due to Sorry for reverting your change it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/115373#issuecomment-1850792869))
2023-12-11 20:02:15 +00:00
c52b78ebc2 [ez] Remove some args from run_test.py (#115459)
Don't think anyone uses these
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115459
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:56:37 +00:00
b5578cb08b [ez] Remove unittest retries (#115460)
Pytest is now used in CI for reruns, and I doubt people are using the env vars when running locally. IMO, removing this code makes the run function easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115460
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:46:09 +00:00
5c0976fa04 Revert "[dynamo] guarded config (#111299)" (#115386)
This reverts commit 5927e9cbf2ac18aaaaecaab02258b7a35ac10969.

Differential Revision: [D51959266](https://our.internmc.facebook.com/intern/diff/D51959266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115386
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #115384, #115401, #115385
2023-12-11 19:35:42 +00:00
6db7b30db4 Revert "[dynamo] Cache size calc for differing config (#111300)" (#115385)
This reverts commit 78318d024989cf86e1ede424997cd42d2d291694.

Differential Revision: [D51959268](https://our.internmc.facebook.com/intern/diff/D51959268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115385
Approved by: https://github.com/malfet
ghstack dependencies: #115384, #115401
2023-12-11 19:35:42 +00:00
f06f51b152 Revert "[Dynamo] Don't log compilation metrics for PyTorch unit tests (#115452)"
This reverts commit cd444aa075dd1e9c5d85cf3fbca9e078c74a7580.

Reverted https://github.com/pytorch/pytorch/pull/115452 on behalf of https://github.com/davidberard98 due to Merge conflict with #115385, which already landed in fbcode ([comment](https://github.com/pytorch/pytorch/pull/115452#issuecomment-1850729965))
2023-12-11 19:21:40 +00:00
f5f6618813 [executorch hash update] update the pinned executorch hash (#115311)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115311
Approved by: https://github.com/pytorchbot
2023-12-11 18:31:44 +00:00
40a14e07ef Revert "[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_sparse_mm_search (#115178)"
This reverts commit 1e5636f7915035b09dce22ad1d2170a65f344214.

Reverted https://github.com/pytorch/pytorch/pull/115178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the Window build failure looks legit 1e5636f791 ([comment](https://github.com/pytorch/pytorch/pull/115178#issuecomment-1850605711))
2023-12-11 18:07:17 +00:00
5f41fc7619 [c10d] Change NCCL PG watchdog error msg and test comments (#115403)
Address the nit comments in https://github.com/pytorch/pytorch/pull/115226/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115403
Approved by: https://github.com/wconstab
ghstack dependencies: #115226
2023-12-11 17:55:28 +00:00
794545c11f [BE]: Enable RUF015 codebase wide (#115507)
Constant-time access of the first value in a collection. `next(iter(...))` is a constant-time operation, instead of converting the collection to a list just to get the first item, which is linear (see the sketch below). The rule is turned on, which automatically autofixes and enforces this.
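
For illustration, the before/after pattern the rule enforces:

```python
items = {"apple", "banana", "cherry"}

first = list(items)[0]     # O(n): copies the whole collection to read one element
first = next(iter(items))  # O(1): the form RUF015 rewrites it to
```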

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115507
Approved by: https://github.com/malfet
2023-12-11 15:51:01 +00:00
1e5636f791 [sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_sparse_mm_search (#115178)
Summary:

cuSPARSELt supports different algorithm ids (alg_id), which are set via
`cusparseLTMatmulAlgSetAttribute`; in total there are 4 different
alg_ids, 0 - 3.

Previously we were just using the default alg_id, since our initial
experiments found that for most shapes the default alg_id is the
fastest, and that the choice made no difference to numerical correctness,
only performance. From those experiments the fastest alg_id seemed to
differ only on small matmul shapes.

@danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0 on LLM shapes, which match these
characteristics (activations are small, weights are large).

However, it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0:
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```

This PR adds the following:
- support for passing an alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
  a given matmul

_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that @danthe3rd provided.

We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.

Test Plan:
```
python test/test_sparse_semi_structured.py -k cslt
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
2023-12-11 15:47:28 +00:00
b88be1686d Revert "[export][refactor][1/n] Move dynamic shapes logic (#114764)" (#115508)
GitHub first oncall.
This reverts commit 53bf8cfcf9c966096e829247380462d0a3a61e8d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115508
Approved by: https://github.com/malfet, https://github.com/angelayi
2023-12-11 14:54:51 +00:00
f017a1af3f [MPS] add complex_out to MPS backend (#110851)
Adds support for at::complex_out to the MPS backend

Implemented in a binary kernel using the view_as_real pattern for handling complex dtypes in the mps backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110851
Approved by: https://github.com/kulinseth
2023-12-11 13:37:55 +00:00
de89a53df8 [benchmarking] Reduce box_detections_per_img for vision_maskrcnn (#115487)
This fixes a failure on the [perf dashboard](https://hud.pytorch.org/benchmark/compilers) with `--amp` mode.  I believe boxes 5 and 6 were getting swapped.  The existing comment explains the issue.

Before
```
$ ./benchmarks/dynamo/torchbench.py --training --accuracy --no-translation-validation --amp --backend=inductor --disable-cudagraphs --only vision_maskrcnn
...
[2023-12-09 13:21:27,292] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.00171, (ref-fp64): 0.00054 and shape=torch.Size([256, 256, 3, 3])
[2023-12-09 13:21:27,292] torch._dynamo.utils: [ERROR] Accuracy failed for key name backbone.fpn.layer_blocks.2.0.weight.grad
fail_accuracy
```

After
```
$ ./benchmarks/dynamo/torchbench.py --training --accuracy --no-translation-validation --amp --backend=inductor --disable-cudagraphs --only vision_maskrcnn
...
pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115487
Approved by: https://github.com/yanboliang
2023-12-11 08:42:25 +00:00
274fdc81f8 [Dynamo][6.3/N] Further cleanup torch.py (#114669)
A follow-up PR to clean up what I found during the refactor of torch.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114669
Approved by: https://github.com/jansel
2023-12-11 07:16:03 +00:00
fe01605830 [aotinductor] replace lld with the default ld linker (#115478)
Currently, we place constants in the .so. To avoid cases
where constants are too large (i.e. >2G), we put the
constants into .lrodata, which doesn't have the 2G limit.
Not sure why, but lld still issues errors like below even if
those large constant data are stored in the .lrodata section:

"relocation R_X86_64_PC32 out of range: 5459191920 is not in
[-2147483648, 2147483647]"

In contrast, the default GNU ld linker works fine. Let's
switch back to using ld to unblock some internal models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115478
Approved by: https://github.com/desertfire, https://github.com/htyu
2023-12-11 02:35:26 +00:00
1310f0bf38 [inductor] Fix an aliased output bug (#115373)
Summary: addresses the aliased-output issue reported in https://github.com/pytorch/pytorch/issues/97083.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373
Approved by: https://github.com/jansel
2023-12-10 23:52:39 +00:00
2e6b809d6b [AOTI] Fix a missing declaration for the result of item() (#115175)
Differential Revision: [D51968539](https://our.internmc.facebook.com/intern/diff/D51968539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115175
Approved by: https://github.com/chenyang78
2023-12-10 22:49:45 +00:00
9b3cb1c66c Fix environment condition for docker-release.yml
As those are run on nightlies and release tags, the environment should be set accordingly.

Also simplify `WITH_PUSH` condition.

Should fix https://github.com/pytorch/pytorch/actions/runs/7156407285/job/19494049140
2023-12-10 14:09:39 -08:00
38f890341d Implement pass-through state_dict and load_state_dict for dynamo OptimizedModule (#113423)
Fixes #113422
Fixes #94575

This is now possible:
```py
model = Model()
compiled_model = torch.compile(model)

model.load_state_dict(compiled_model.state_dict())  # previously key mismatch!
```

This also makes it much easier to checkpoint and load models that were wrapped like so:
```py
FSDP(torch.compile(model))
# or
DDP(torch.compile(model))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113423
Approved by: https://github.com/msaroufim
2023-12-10 22:09:19 +00:00
26266c9718 [CI] Call torch.cuda.empty_cache to release device memory (#114663)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114663
Approved by: https://github.com/eellison
2023-12-10 21:27:42 +00:00
694cc6af56 [benchmarks] Fix NameError: name 'args' is not defined (#115494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115494
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2023-12-10 21:22:21 +00:00
21a1d31ed8 [caffe2] update Meta-internal googletest references (#115407)
Summary: Update test dependencies to point to the new internal googletest location.

Test Plan: CI

Differential Revision: D51951643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115407
Approved by: https://github.com/cccclai
2023-12-10 20:37:13 +00:00
24a463c46c Revert "[export][refactor][2/n] Move tracing logic (#114768)" (#115503)
Github first oncall.
This reverts commit 0ab57ee7eab5391289d30e8c49fceee3f503f539.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115503
Approved by: https://github.com/angelayi, https://github.com/kit1980
2023-12-10 19:30:15 +00:00
b4ef59f740 Revert "[dynamo] remove unused OptimizeCtx field - export (#113901)" (#115401)
This reverts commit b62230a685666e8c2b8a5cb31b16352d286bcf9f.

Differential Revision: [D52001024](https://our.internmc.facebook.com/intern/diff/D52001024)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115401
Approved by: https://github.com/malfet
ghstack dependencies: #115384
2023-12-10 18:17:24 +00:00
b36fc6790e Revert "[dynamo] Guard on HAS_GRAPH_BREAKS if graph breaks are present (i.e. cache miss if compiled object requires nopython) (#114073)" (#115384)
This reverts commit 0bb29f945079ac4c83d674f7b3ff755cfb5396cf.

Differential Revision: [D51959267](https://our.internmc.facebook.com/intern/diff/D51959267)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115384
Approved by: https://github.com/malfet
2023-12-10 18:16:02 +00:00
6c1e75e646 Revert "[HigherOrderOp] make MapHigherOrder create map_impl call_function node instead of map (#115205)"
This reverts commit 8b747358783d2411afe1136dcc9da95c01bfbdaa.

Reverted https://github.com/pytorch/pytorch/pull/115205 on behalf of https://github.com/atalman due to ghfirst broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/115205#issuecomment-1848995376))
2023-12-10 15:25:55 +00:00
100c466bff [CI][Inductor] Skip CPU tests when running on GPU (#115430)
This just follows the standard practice for CI: when one specifies `PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda`, only tests targeting that device should be run

Do it by refactoring part of `instantiate_device_type_tests` into `get_desired_device_type_test_bases` and using it from test_torchinductor.py to skip CPU tests

Fixes https://github.com/pytorch/pytorch/issues/115423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115430
Approved by: https://github.com/seemethere
2023-12-10 15:21:24 +00:00
08d63a75a4 Revert "[HigherOrderOp] Remove additional get item calls in MapHigherOrder. (#115207)"
This reverts commit dd6ae6d3b473906d32fcb8a319895e31b039f224.

Reverted https://github.com/pytorch/pytorch/pull/115207 on behalf of https://github.com/atalman due to ghfirst broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/115207#issuecomment-1848991919))
2023-12-10 15:12:12 +00:00
fbeca60b1f Remove replace_all and make VTs mutable (#113725)
1.  Removes calls to `replace_all` and `clone` and makes VTs mutable.
2. Properly handles Tuple Iterator mutation. Previously TupleIterator variables would only be properly reconstructed if they were advanced at least once in a frame. On calls to `next`, the source information would be lost (due to constructing a new iterator without using builder), which would ensure that during codegen the variable would be reconstructed from scratch. Now that VTs are mutated, the source is never lost, so we need to properly track mutation and handle it by replaying calls to `next` at the end of the modified bytecode.
3. Added test for checking iadd side effects, this was missing in our unit test coverage.
4. Fixed two incorrect sources: DelayGraphBreakVariable and UserMethodVariable both relied on setting the source to AttrSource(parent, name) at the callsite of `var_getattr`.
5. Fixed a bug in inplace adding for lists: it would set the resulting VariableTracker's source to `None`, which would utilize a different reconstruct path in codegen. Now this is handled explicitly by reconstructing vars when allow_cache=`False`, so that during side-effect replay the mutated var is correctly updated.

In subsequent PRs:
* Refactoring side effect tracking to be significantly simpler (I think we only need an `is_modified` flag)
* Refactor `next_variables` iterator to match the signature of `next`
* Remove all references to `options` in the code
* Refactor VTs representing mutable collections to implement their own mutation update handling
* Remove clone and/or make it specific to lists for creating slices
* Add mutation tracking/replay for sets
* Add mutation tracking/replay for iter.py
* Removing setting source in builder (it's set at the top level after a var is returned)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113725
Approved by: https://github.com/jansel
2023-12-10 09:31:21 +00:00
f71d931b32 [Dynamo][6.2/N] Dump the in graph function list(~2600 ops) and add unit tests. (#114196)
This is the second PR; please check https://github.com/pytorch/pytorch/pull/113009#issuecomment-1804417925 for more details

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114196
Approved by: https://github.com/jansel
2023-12-10 06:41:51 +00:00
4eb5838e18 Revert "Enable builtin tests for ONNX Export with ExportedProgram models (#114762)"
This reverts commit 13d2e3eba79000028291f4739a6e9c937dbe4264.

Reverted https://github.com/pytorch/pytorch/pull/114762 on behalf of https://github.com/huydhn due to Sorry for reverting your change but ONNX test is failing from this commit 13d2e3eba7 ([comment](https://github.com/pytorch/pytorch/pull/114762#issuecomment-1848831147))
2023-12-10 01:55:47 +00:00
2ee240d14a Revert "Move ONNX's TorchModelType to pytorch_test_common to fix circ. dep. (#115353)"
This reverts commit 960ad9d94e365c758b19298b45bcba5225b79e0c.

Reverted https://github.com/pytorch/pytorch/pull/115353 on behalf of https://github.com/huydhn due to Sorry for reverting your change but ONNX test is failing from the commit below in the stack 13d2e3eba7 ([comment](https://github.com/pytorch/pytorch/pull/115353#issuecomment-1848830883))
2023-12-10 01:53:50 +00:00
4490d4692b [doc] Rewrite benchmarks/dynamo/README.md (#115485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115485
Approved by: https://github.com/yanboliang
2023-12-10 00:37:53 +00:00
8ddc549c0f [BE][JIT] Do not wrap shared_ptr with optional (#115473)
While reviewing https://github.com/pytorch/pytorch/pull/115381 I noticed that `torch::jit::GraphFunction::optimized_graph_` is an `std::array<c10::optional<std::shared_ptr<Graph>>, N>`, which feels excessive, as `shared_ptr` is already nullable and has `operator bool()`. Looking at https://github.com/pytorch/pytorch/pull/26488, which introduced the change, also does not hint that this indirection is necessary.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115473
Approved by: https://github.com/davidberard98, https://github.com/Skylion007
2023-12-09 20:43:40 +00:00
641ec2115f [AOTI] move model runner into a library (#115220)
Summary: So that we can import it in fbcode and do AOTI runs in a Python environment.

Test Plan: existed AOTI tests

Reviewed By: chenyang78

Differential Revision: D51780021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115220
Approved by: https://github.com/desertfire
2023-12-09 19:03:32 +00:00
c039f01bd9 Increased hardcoded limit for number of GPUs. (#115368)
Fixes #115331.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115368
Approved by: https://github.com/albanD
2023-12-09 18:10:51 +00:00
cyy
99f222372b [5/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#115354)
This PR continues to fix clang-tidy warnings for headers in c10/core and c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115354
Approved by: https://github.com/Skylion007
2023-12-09 17:16:04 +00:00
937d616e82 Re-enable type checking for distributed_c10d.py (#115223)
Re-enable type checking for distributed_c10d.py

Type checking for distributed_c10d.py was inadvertently turned off, and type errors have accumulated since. This re-enables it and addresses those errors.

Note: the backwards-compatibility linter does not like some of these changes, but they were incorrect before. This needs human verification, however.

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
2023-12-09 11:07:54 +00:00
485ea9a70a [DTensor] Add DTensor experimental op for LayerNorm backward sharding rule propagation (#115398)
Summary: This diff is only a prototype to unblock the TP work. The PyTorch distributed team is working on a more generic backward op for `aten.layer_norm`. We will remove this op from the experimental file once it is ready.

Test Plan:
**Local Test**:
Accuracy:
- Dtensor + Checkpoint: first run loss: P884569822 (on-par with baseline: P884213363)
- 2nd by loading saved checkpoint: P884583429 (on-par with baseline: P884271869)

Trace:
- Collective functions are inserted automatically.
- Example: https://fburl.com/perfdoctor/l567ww1x

**MAST Test**:
With: trainer = 128, batch_size=512
- NE on-par:
(see: 4441_ep_bs512_2fsdp_tp_sp_dtensor)

Differential Revision: D51490868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115398
Approved by: https://github.com/wanchaol
2023-12-09 09:38:56 +00:00
eb3aa424ce [Reland][Dynamo] Added support for math.radians on ints with dynamic shapes (#115477)
Reland #114507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115477
Approved by: https://github.com/larryliu0820
2023-12-09 08:58:18 +00:00
960ad9d94e Move ONNX's TorchModelType to pytorch_test_common to fix circ. dep. (#115353)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115353
Approved by: https://github.com/BowenBao
ghstack dependencies: #114407, #115281, #114762
2023-12-09 07:47:03 +00:00
13d2e3eba7 Enable builtin tests for ONNX Export with ExportedProgram models (#114762)
Fixed by https://github.com/pytorch/pytorch/pull/113982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114762
Approved by: https://github.com/BowenBao
ghstack dependencies: #114407, #115281
2023-12-09 07:46:43 +00:00
7e941a932b Store user model to simplify ONNXProgram.{adapt_torch_*,__call__} APIs (#115281)
Currently (after https://github.com/pytorch/pytorch/pull/114407), the user must pass the original user ``model`` to APIs such as ``ONNXProgram.__call__``, ``ONNXProgram.adapt_torch_inputs_to_onnx`` and ``ONNXProgram.adapt_torch_outputs_to_onnx``.

This was needed because when the model is fakefied, a version of the non-fakefied model is needed so that the initializers, buffers and constants can be extracted from a real model (and used as input to the ONNX model). That approach brings an unnecessary usability burden to the user when the model is not fakefied, because the model that was already passed to ``torch.onnx.dynamo_export`` could be used to extract the ``state_dict``.

This PR adds an ``ONNXProgram._model_torch`` attribute to store the user model and demotes the ``model`` argument of the aforementioned APIs from required to optional.

As a result, for the fakefied-model scenario the user still needs to pass the model, but for non-fakefied models the persisted model is implicitly used to extract the model ``state_dict``, making the API easier to use (see the sketch below).
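
A hedged sketch of the simplified non-fakefied path (the model is illustrative):

```python
import torch

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

model, x = MLP(), torch.randn(1, 4)
onnx_program = torch.onnx.dynamo_export(model, x)
out = onnx_program(x)  # no need to re-pass `model`; its state_dict is persisted
```
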
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115281
Approved by: https://github.com/BowenBao
ghstack dependencies: #114407
2023-12-09 07:46:12 +00:00
da341d0d48 [Dynamo][6.1/N] Refactor out TorchInGraphFunctionVariable and improve heuristic (#113432)
This is split from #113009; please check https://github.com/pytorch/pytorch/pull/113009#issuecomment-1804417925 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113432
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-12-09 05:11:44 +00:00
1c1f2bbe8a Add a space in the error message (#115465)
Summary:
As title says

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
waitforsandcastle

Sandcastle run

Reviewed By: eeggl

Differential Revision: D52000286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115465
Approved by: https://github.com/kwen2501
2023-12-09 04:35:51 +00:00
3ebf9acea1 [Triton] Replace triton.runtime.jit.get_cuda_stream with torch.cuda.c… (#115397)
triton.runtime.jit.get_cuda_stream was removed in https://github.com/openai/triton/pull/2756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115397
Approved by: https://github.com/jansel
2023-12-09 04:30:42 +00:00
cyy
516bd4a72c [1/N] Use std::in_place (#115170)
It is time to gradually replace c10::in_place with std::in_place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115170
Approved by: https://github.com/colesbury
2023-12-09 03:52:39 +00:00
2ed47fecc5 Robustify torch.multiprocessing.spawn error reporting to be less deadlock prone (#114688)
multiprocessing.Queue relies on, among other things, background threads to send messages between processes. This works in the happy path but can cause issues if a process exits by bypassing atexit handlers or crashes, because the writer to the Queue can terminate while the reader is blocked reading the queue. The reader sees the queue as non-empty, yet even with a timeout it will actually block forever.

An example of a Queue deadlock is here: https://gist.github.com/chipturner/342f72341f087737befe9df84d0e41ce

Since the error-reporting case here is a simple one-shot message from the dying child to the parent, we can just use a file-based rendezvous (sketched below). This eliminates the deadlock when a large traceback is still being flushed to the network as a child exits.
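
An illustrative sketch of the file-based handoff idea (hypothetical names; not the actual implementation):

```python
import pickle
import traceback

def child_entry(fn, error_file: str):
    """Run `fn`; on failure, persist the traceback to a file the parent reads."""
    try:
        fn()
    except Exception:
        with open(error_file, "wb") as fh:
            pickle.dump(traceback.format_exc(), fh)  # one-shot write, no queue threads
        raise  # let the child die; the parent reads error_file after it exits
```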

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114688
Approved by: https://github.com/suo, https://github.com/yifuwang
2023-12-09 03:36:43 +00:00
2962271f58 [ONNX][dynamo_export] Extend expected fx output types for int, float, bool (#115431)
Fixes exporting ops, such as `aten::_scaled_dot_product_flash_attention` that returns int, float, bool typed outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115431
Approved by: https://github.com/titaiwangms, https://github.com/thiagocrepaldi
2023-12-09 03:24:48 +00:00
41b1919208 [nested_tensor]Python subclass NT overhead improvement (2/n): avoid getting from WeakTensorKeyDictionary twice during __init__ (#115450)
Summary:
Most NT operations end with creating a new NestedTensor, which is time-consuming; this tries to reduce overhead during NestedTensor creation.

The ops return a new NestedTensor with the same offsets, so `tensor not in _tensor_symint_registry` would be false in most cases. The `in` (`__contains__`) check takes ~8 us; if we use `get` directly, we save a few microseconds for most NT operations (see the sketch below).
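
A sketch of the micro-optimization (`make_symint` is a hypothetical stand-in for the real factory):

```python
from torch.utils.weak import WeakTensorKeyDictionary

_tensor_symint_registry = WeakTensorKeyDictionary()

def get_tensor_symint(tensor, make_symint):
    # Before: `if tensor not in registry` cost an extra ~8 us __contains__ lookup.
    # After: a single .get() covers the common hit path.
    symint = _tensor_symint_registry.get(tensor)
    if symint is None:
        symint = _tensor_symint_registry[tensor] = make_symint()
    return symint
```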

Test Plan:
Before:
get_tensor_symint take 15us
https://pxl.cl/3XF83
After
get_tensor_symint take 10us
https://pxl.cl/3XFc9

Differential Revision: D51992836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115450
Approved by: https://github.com/soulitzer
2023-12-09 03:12:31 +00:00
d40a7c6026 Add decompositions for replication_pad (#115113)
Fixes #115395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115113
Approved by: https://github.com/peterbell10
2023-12-09 02:44:07 +00:00
d7705f325d Patch --save-xml when TEST_IN_SUBPROCESS (#115463)
Patch `--save-xml` when `TEST_IN_SUBPROCESS`

When `--save-xml` is given as a unit-test argument and the test is handled by a `TEST_IN_SUBPROCESS` handler (e.g., `run_test_with_subprocess` for `distributed/test_c10d_nccl`), the `--save-xml` args were first "consumed" by the argparser in `common_utils.py`. When a following subprocess in this `if TEST_IN_SUBPROCESS:` section starts, there are no `--save-xml` args left, leaving `args.save_xml` as `None`.

Since the argparser for the `--save-xml` option defaults to `_get_test_report_path()` when the arg is `None`, it's not a problem for GitHub CI runs. It could be an issue when people run those tests without `CI=1`: test reports won't be saved in this case even if they passed `--save-xml=xxx` (see the sketch below).
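
A minimal sketch of the fix's idea (hypothetical helper): re-append the already-parsed flag so the child process sees it.

```python
import subprocess
import sys
from typing import Optional

def run_in_subprocess(test_file: str, save_xml: Optional[str]) -> int:
    cmd = [sys.executable, test_file]
    if save_xml is not None:
        # the parent argparser consumed --save-xml, so forward it explicitly
        cmd.append(f"--save-xml={save_xml}")
    return subprocess.call(cmd)
```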

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115463
Approved by: https://github.com/clee2000
2023-12-09 02:38:31 +00:00
c9c4cdf9a9 [AOTAutograd] Do not call ctx.mark_dirty on mutations hidden from autograd (#115324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115324
Approved by: https://github.com/bdhirsh
2023-12-09 02:23:13 +00:00
3361496f96 Fix the corner case of index_add (#114929)
Fixes #114864

As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114929
Approved by: https://github.com/mikaylagawarecki
2023-12-09 01:57:25 +00:00
3c54ff6bcd Update ONNX's IO Adapter to support FakeTensor with ExportedProgram (#114407)
Currently, the ONNX exporter using torch.nn.Module as input can support
FakeTensor because the ONNX model stores all initializers.

When using torch.export.ExportedProgram as input, the initializers are
lifted as inputs. In order to execute the ONNX model, we need to pass a
reference to the non-fake model to the
ONNXProgram.adapt_torch_inputs_to_onnx API, so that initializers can be
fetched from the model and fed to the ONNX model as input

ps: https://github.com/pytorch/pytorch/issues/115461 will track the API revision for the cases where additional `model_with_state_dict` are required to produce complete ONNX files exported with fake support. This is also tracked by the umbrella fake tensor issue https://github.com/pytorch/pytorch/issues/105464 FYI @BowenBao
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114407
Approved by: https://github.com/BowenBao
2023-12-09 01:48:27 +00:00
495054545c Allow preserve_rng_state=True when torch.compile + selective checkpointing + CUDA (#113718)
Fixes https://github.com/pytorch/pytorch/issues/113717.

When `preserve_rng_state=True`, we let AOTAutograd trace through the `torch.random.fork_rng` op, and that tracing doesn't work under CUDA, hence the original error reported in the issue.

But since we are already doing RNG functionalization at the Inductor level, we don't actually need to trace this `fork_rng` op. So we should just rewrite `preserve_rng_state` to False when using torch.compile (and let Inductor do the RNG functionalization it's already been doing); a usage sketch follows.
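
A usage sketch of the now-working combination (the checkpointed block is illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.nn.functional.dropout(torch.relu(x), p=0.1, training=True)

@torch.compile
def fn(x):
    # preserve_rng_state=True is rewritten to False internally under compile
    return checkpoint(block, x, use_reentrant=False, preserve_rng_state=True)

out = fn(torch.randn(8, 8, device="cuda", requires_grad=True))
```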

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113718
Approved by: https://github.com/wanchaol
2023-12-09 01:47:25 +00:00
cd444aa075 [Dynamo] Don't log compilation metrics for PyTorch unit tests (#115452)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115452
Approved by: https://github.com/zou3519
2023-12-09 01:39:36 +00:00
e1370ff80f Vectorize CPU ATen mean kernel for BF16 & FP16 dtypes (#114582)
## Summary
Since #97351, the CPU ATen kernel for `mean` for BF16 & FP16 dtypes has been unvectorized (it's not even implicitly vectorized).

This PR vectorizes `mean` for BF16 & FP16 on CPU in a `cast_fp32 -> sum -> div -> cast_bf16_or_fp16` fashion.

The perf benefit would be especially pronounced on machines with `AVX512_BF16` and/or `AVX512_FP16` ISA support.

## Benchmarking data for BF16 (collected before & after the change in this PR)

**Machine:** Intel&reg; Xeon&reg; (4th generation series, formerly codenamed Sapphire Rapids) Platinum 8468H
One socket (48 physical cores) - used `numactl --membind=0 --cpunodebind=0`
libtcmalloc & Intel OpenMP were preloaded

Environment variable used -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48`

**Workload:** E2E performance on BS 32 resnet50 (using BF16 via AMP) inference using oneDNN Graph JIT fuser (`mean` kernel is dispatched to eager mode ATen kernel, and is the bottleneck right now)

| **BEFORE:** Latency with unvectorized mean (lower is better)| **AFTER:** Latency with vectorized mean (lower is better)| Speedup due to vectorizing mean|
|----------------------------|-------------------------|------------|
|                19.1 ms           |                10.8  ms       | latency reduced by ~43.45%      |

**Benchmarking script for BF16 -**

 ```
import time
import torch
import torchvision

# enable oneDNN Graph JIT fuser
torch.jit.enable_onednn_fusion(True)
# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart
torch._C._jit_set_autocast_mode(False)

# sample input should be of the same shape as expected inputs
example_input = torch.rand(32, 3, 224, 224)
# Using resnet50 from torchvision in this example for illustrative purposes,
# but the line below can indeed be modified to use custom models as well.
model = getattr(torchvision.models, "resnet50")().eval()

with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16):
    # Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used
    import torch.fx.experimental.optimization as optimization
    # Please note that optimization.fuse need not be called when AMP is not used
    model = optimization.fuse(model)
    model = torch.jit.trace(model, (example_input))
    model = torch.jit.freeze(model)
    # a couple of warm-up runs
    model(example_input)
    model(example_input)
    # speedup would be observed in subsequent runs
    start = time.time()
    model(example_input)
    end = time.time()
    inference_time = (end - start) * 1000
    print("Inference time is ", inference_time)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114582
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-12-09 01:02:13 +00:00
f614ed78b8 [docs, dynamo] fix typos in dynamo custom backend docs (#115444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115444
Approved by: https://github.com/eellison
2023-12-08 23:58:26 +00:00
fb19947962 Add decompositions for reflection_pad{1, 2, 3}d (#115100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115100
Approved by: https://github.com/peterbell10
2023-12-08 23:05:57 +00:00
9f7b3a4e18 Move autolabeler to "oncall: distributed" not "module:.." (#115447)
Reasoning for the change is spelled out in this issue

https://github.com/pytorch/pytorch/issues/115168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115447
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-12-08 22:53:20 +00:00
749f0c90e1 Revert "[export][refactor][3/n] Move unlift to separate file (#114787)" (#115457)
Github First Oncall: This reverts commit 967863d91dbe0a56fa7bcc4e075a25cc4ad67c81.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115457
Approved by: https://github.com/osalpekar
2023-12-08 22:33:28 +00:00
28de29fdda [releng] version 2.2 -> 2.3 (#115446)
The release 2.2 branch cut is completed, hence bump the nightly version to 2.3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115446
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2023-12-08 22:25:52 +00:00
3e47e3f441 Revert "[export] Fix graph output mismatch issue with constant outputs. (#115280)"
This reverts commit 622688fab9fc6d20ff3475a8a0a1fdb6af9d837e.

Reverted https://github.com/pytorch/pytorch/pull/115280 on behalf of https://github.com/atalman due to ghfirst issue when importing, will reland this PR ([comment](https://github.com/pytorch/pytorch/pull/115280#issuecomment-1847903624))
2023-12-08 22:10:03 +00:00
3dab46fe19 Revert "[export] Dont skip output caching for now. (#115374)"
This reverts commit fd79995fd6d9f599ff60b721ae56bb7b0aa4eb93.

Reverted https://github.com/pytorch/pytorch/pull/115374 on behalf of https://github.com/atalman due to ghfirst issue when importing, will reland this PR ([comment](https://github.com/pytorch/pytorch/pull/115374#issuecomment-1847899901))
2023-12-08 22:06:21 +00:00
aaaf5c08fb [ez] Don't run workflows on forks (#115429)
Adds the `if: github.repository_owner == 'pytorch'` to some jobs to make sure they don't run on forks, since they usually either fail or remain pending due to not having the correct machines to run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115429
Approved by: https://github.com/huydhn, https://github.com/botmethere, https://github.com/malfet, https://github.com/atalman
2023-12-08 21:41:58 +00:00
b5d3d3ebf0 [ao] making hist_obs handle torch.inf and closeby values (#103467)
Summary: This PR does 2 things:

1) Previously this would simply error; now it will ignore any
torch.inf values that it receives. Note: the code checks for torch.inf after
aminmax, so that if no torch.inf values are found, perf is
relatively unchanged.

2) As mentioned in https://github.com/pytorch/pytorch/issues/100051,
values close to (but not quite at) the maximum/minimum float value could
overflow to infinity in the course of _adjust_min_max() (when such a large
value is multiplied by something in the middle of a calculation
that would otherwise produce a non-inf value). This was fixed by
rearranging the order of operations for the lines in question without
altering the actual equations. Specifically, where the operations in lines
1095, 1098 and 1100 multiply and divide large values,
it's better to divide the two large values before multiplying, rather
than multiplying the two large values together (creating overflow) before dividing, as the code previously did. A tiny numeric illustration follows below.
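
A tiny numeric illustration of why the ordering matters:

```python
import torch

big = torch.tensor([torch.finfo(torch.float32).max / 2])

print(big * big / big)  # tensor([inf]): multiplying first overflows
print(big / big * big)  # tensor([1.7014e+38]): dividing first stays finite
```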

Test Plan: python test/test_quantization.py
TestObserver.test_histogram_observer_ignore_infinity

python test/test_quantization.py TestObserver.test_histogram_observer_handle_close_to_infinity

Differential Revision: [D51489345](https://our.internmc.facebook.com/intern/diff/D51489345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103467
Approved by: https://github.com/andrewor14
2023-12-08 21:41:31 +00:00
1215f2ffe2 [dtensor] readme typo (#115383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115383
Approved by: https://github.com/awgu
ghstack dependencies: #115365
2023-12-08 21:40:40 +00:00
af925a56a1 Revert "[export] Add math.* ops to pass base (#115271)"
This reverts commit 6c0a4ced530dab78db455c37508931de2eb56239.

Reverted https://github.com/pytorch/pytorch/pull/115271 on behalf of https://github.com/atalman due to ghfirst issue when importing, will reland this PR ([comment](https://github.com/pytorch/pytorch/pull/115271#issuecomment-1847852211))
2023-12-08 21:17:56 +00:00
12d7ea19af [Indcutor][fx pass] Add sub and div pointwise ops to the post grad fusion (#115389)
Summary: As titled.

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/792c58db-c369-487d-9a42-b5da471657c0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2814749981661407
Network: Up: 74KiB  Down: 29KiB  (reSessionID-b47c266b-12d6-4e88-8dc3-4af1dd7ecbb4)
Jobs completed: 20. Time elapsed: 2:09.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce
OC: P899142918
MAI: P899175452
# e2e (oc)

Differential Revision: D51957242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115389
Approved by: https://github.com/dshi7, https://github.com/jackiexu1992, https://github.com/xuzhao9
2023-12-08 21:07:03 +00:00
e8e4141773 Revert "[Dynamo][6.1/N] Refactor out TorchInGraphFunctionVariable and improve heuristic (#113432)"
This reverts commit e61d6b42f0f4e4fa5bb816e03fb81e5bbcc9fa06.

Reverted https://github.com/pytorch/pytorch/pull/113432 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing dynamo tests in trunk e61d6b42f0, landrace? ([comment](https://github.com/pytorch/pytorch/pull/113432#issuecomment-1847787981))
2023-12-08 20:15:39 +00:00
d7180161b5 Revert "[SparseCsr] Remove triton sdpa skip after triton pin update (#109601)"
This reverts commit f64b10803f5fdd34e43fba7f421401bcfe247c19.

Reverted https://github.com/pytorch/pytorch/pull/109601 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing in trunk with this error ZeroDivisionError: integer division or modulo by zero ([comment](https://github.com/pytorch/pytorch/pull/109601#issuecomment-1847784383))
2023-12-08 20:12:53 +00:00
4186932bac Revert "[export] Remove runtime assertion pass (#115196)"
This reverts commit c163b3c03563c11640d4dbee504ef63101b019fe.

Reverted https://github.com/pytorch/pytorch/pull/115196 on behalf of https://github.com/atalman due to Broke internal test ([comment](https://github.com/pytorch/pytorch/pull/115196#issuecomment-1847778344))
2023-12-08 20:07:04 +00:00
317486edb0 [C10D] Decouple flight recorder from enableTiming (#115358)
RE #115301

Decoupling gives us a path to disable timing without disabling the
flight recorder.

Flight recorder is still useful for stuckness analysis without 'timing'.

Disabling timing makes it miss the 'started'
state that comes from using an extra nccl event at the start of each
collective.  It will also be missing 'duration_ms' of collectives, which
hasn't been landed yet, but is useful for timing/perf work more than
stuckness analysis.

Hopefully we can enable timing by default and leave both on, but it's
nice to have the flexibility for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115358
Approved by: https://github.com/fduwjj
2023-12-08 19:44:45 +00:00
suo
3d999d2f2c [export] optimize unflattener (#115364)
Unflattening was slow on the APS FM model (which has thousands of nn.EmbeddingBag modules).

A quick glance at the profile shows that 75% of the time in unflattening was spent copying this node list, which is immutable and globally shared. So just passing it around as a tuple yields a 4x speedup.

Differential Revision: [D51929775](https://our.internmc.facebook.com/intern/diff/D51929775/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115364
Approved by: https://github.com/zhxchen17
2023-12-08 19:32:01 +00:00
494cb28231 [PyTorch] AOTI: add ArrayRefTensor (#112115)
This adds a shim for AOTI generated code to pretend a raw array works like an AtenTensorHandle. This allows parts of AOTI that generate uses of tensors to continue to be unaware of how those tensors are allocated. See the following diff/PR for usage.

Differential Revision: [D50570252](https://our.internmc.facebook.com/intern/diff/D50570252/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112115
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-12-08 19:31:50 +00:00
a2b89154bf New swap function (#111747)
This PR proposes a new approach to solving the problem of nn/optim being linked only by Python object identity.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up.

This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.

The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the usage of slots in subclasses.
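A minimal sketch of the intended behavior, assuming the function lands as `torch.utils.swap_tensors` (the name of the eventual public API; the PR text itself doesn't name the entry point):

```python
import torch

t1 = torch.ones(2)
t2 = torch.zeros(3)
alias = t1                        # an external reference to t1's PyObject

torch.utils.swap_tensors(t1, t2)  # assumed public entry point for this PR

print(alias)  # tensor([0., 0., 0.]) -- the old reference now sees t2's content
```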

Draft right now to see what @colesbury thinks about doing this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
2023-12-08 18:49:35 +00:00
5f2ff29569 Fix typo in https://pytorch.org/docs/stable/sparse.html (#115282)
Fixes #111473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115282
Approved by: https://github.com/svekars
2023-12-08 18:31:33 +00:00
68f74dd162 Add python and C++ support for LPPool3d (#114199)
Add Python and C++ support for LPPool3d. Fixes #114114
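A brief usage sketch, assuming the new module mirrors the LPPool1d/LPPool2d signature:

```python
import torch
import torch.nn as nn

pool = nn.LPPool3d(norm_type=2, kernel_size=2)  # power-average (L2) pooling
x = torch.randn(1, 3, 8, 8, 8)                  # (N, C, D, H, W)
print(pool(x).shape)                            # torch.Size([1, 3, 4, 4, 4])
```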

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114199
Approved by: https://github.com/mikaylagawarecki
2023-12-08 18:18:44 +00:00
1c3a4a864c Remove always restore (#115317)
Removes always restore, assuming that a HOP will cleanup any leftover state from tracing fwd + bwd

This required a minor change to the autograd fn variable higher order op. If we are tracing forward DON'T add the call_function node into the main graph, since we are only tracing it for the purposes of speculation. Instead return the result directly to be passed to the backward for speculation. This was the only observable side effect on the output graph that I found.

Test plan:
test_smoke_from_test_autograd in test_autograd_function.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115317
Approved by: https://github.com/voznesenskym, https://github.com/jansel
2023-12-08 18:17:37 +00:00
a3f93dc44d [EZ] [CD] Enable Triton 3.12 conda builds (#115424)
Currently there is a chicken and egg problem with enabling triton builds for the platform, as the package depends on `torch`, so I can only submit this change a few days after https://github.com/pytorch/pytorch/pull/114819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115424
Approved by: https://github.com/clee2000, https://github.com/seemethere
2023-12-08 18:10:45 +00:00
81b565b142 [CI] Fix a missing write_csv_when_exception problem (#115370)
Summary: Fix a problem shown in https://github.com/pytorch/pytorch/actions/runs/7124839624/job/19400589129 when a model times out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115370
Approved by: https://github.com/eellison
2023-12-08 18:09:53 +00:00
c370450f02 [inductor] Remove hashing of tensor data for constants (#115356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115356
Approved by: https://github.com/eellison
2023-12-08 18:05:34 +00:00
e61d6b42f0 [Dynamo][6.1/N] Refactor out TorchInGraphFunctionVariable and improve heuristic (#113432)
This is split from #113009; please check https://github.com/pytorch/pytorch/pull/113009#issuecomment-1804417925 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113432
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-12-08 17:15:14 +00:00
898554a3a3 [torchgen] Add logic in custom ops to return empty tensor (#114143)
Summary: Add two pieces of logic:

1. If the custom op returns a `Tensor` but doesn't have an out tensor as input, return an empty tensor.
2. If the custom op returns more than one Tensor and the number of out tensors differs from the number of returned Tensors, return a tuple of empty tensors.

Test Plan: Rely on new unit tests

Differential Revision: D51471651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114143
Approved by: https://github.com/cccclai
2023-12-08 17:03:44 +00:00
b3b5bd51ea [raas][torch][jit] Allow not storing the optimized graph (#115381)
Summary:
GraphFunction internally stores the optimized graph after generating it, and it is then passed into the executor, which makes a copy of it. So we effectively store the optimized graph twice.

This diff allows setting a flag to not store the optimized graph inside the GraphFunction.

The code is a no-op until the flag is enabled.

Test Plan:
I ran SL with this on raas, with good memory savings on the raas server. From the command line:

example model run
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362

I1207 11:04:58.657143 3556226 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 255646 Kb
```

then with flag enabled:
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true
I1207 11:06:25.245779 3577383 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 165167 Kb
```
And combining this flag with the flag from D51950418:
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true --torch_jit_enable_profiling_graph_executor=false

I1207 11:09:17.502743 3592345 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 114848 Kb
```

Differential Revision: D51931895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115381
Approved by: https://github.com/malfet
2023-12-08 16:29:13 +00:00
f64b10803f [SparseCsr] Remove triton sdpa skip after triton pin update (#109601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109601
Approved by: https://github.com/desertfire, https://github.com/amjames
2023-12-08 15:49:16 +00:00
72e58a756c Set markDynamoStrictTest in functorch/test_vmap.py (#115274)
We set markDynamoStrictTest in most of functorch/test_vmap.py. This
revealed many existing failing tests, so we mark those all as expected
failures or skip them.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115274
Approved by: https://github.com/guilhermeleobas, https://github.com/kshitij12345
ghstack dependencies: #115267, #115276, #115268
2023-12-08 14:51:19 +00:00
cc8f6f56dc [quant][pt2e] Add convert callback to Observer module (#115001)
Summary:
This is to allow easier extension of the quant workflow in the future, as we are seeing more
diverse ways of doing quantization.

Putting this up for feedback first.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_observer_callback

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115001
Approved by: https://github.com/kimishpatel
2023-12-08 13:47:37 +00:00
ca15671c30 Fix failing test_invalid_input_csr_large (#114940)
The test introduced in #102530 has a bug:
Construction of `crow_indices` raises an exception: "value cannot be converted to type int32 without overflow", which is obviously correct.
This makes the test fail, even though it is supposed to check for an overflow in nnz.
Fix by making the construction of `crow_indices` pass, albeit with an invalid value that would error later, so that the correct check is triggered.

Given that I'm not sure it is even worth checking for an overflow in nnz:
- `crow_indices[..., -1] == nnz` is already enforced
- this can only hold if `crow_indices` is able to hold `nnz` without overflow
- `col_indices` has to be of the same type as `crow_indices`
- Hence the type of `col_indices` has to be able to hold the value of `nnz`

So in conclusion: The situation being checked for cannot reasonably occur

CC @pearu as the test author for additional insight

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114940
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-12-08 11:55:21 +00:00
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for the public class and methods in torch.distributed.device_mesh so that it can be imported whether or not distributed is available.
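A minimal sketch of the now-public import path (assumes a 4-rank job launched via torchrun):

```python
from torch.distributed.device_mesh import init_device_mesh

# 2x2 mesh: outer dim for data parallel, inner dim for tensor parallel
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
dp_mesh = mesh["dp"]  # slice out a sub-mesh by dimension name
```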

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals passed. Shipit added the "ci/trunk" label to the PR, DID NOT wait for it, and went ahead with committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
6c585de076 [CUDA] baddmm should fall back to addmm for batch=1 (#114992)
I.e. it feels reasonable to always call `at::cuda::gemm` rather than `at::cuda::bgemm` when num_batches == 1
After the change, benchmarking torch built with CUDA-12 using  [following perf script](https://gist.github.com/malfet/6a17156d7f5663b8b12054a1beff3fe1) on A100  are as follows:
|      Shape     |  bmm_time |  mm_time  | slow down (%) |
| -------------- | --------- | --------- | ------------- |
|    1x1x4096    |   14.18   |   14.31   |     -0.89     |
|    1x1x8192    |   14.37   |   14.37   |     -0.05     |
|   1x1x16384    |   14.03   |   14.12   |     -0.68     |
|   1x1x32768    |   14.19   |   14.24   |     -0.35     |
|   1x1x65536    |   14.85   |   14.52   |     2.30      |
|   1x1x131072   |   14.03   |   14.07   |     -0.33     |
|  128x128x128   |   11.34   |   11.06   |     2.56      |
|  256x256x256   |   14.85   |   14.40   |     3.15      |
|  512x512x512   |   27.22   |   27.22   |     -0.01     |
| 1024x1024x1024 |  129.66   |  129.50   |     0.12      |
| 2048x2048x2048 |  972.18   |  973.24   |     -0.11     |
|  129x127x129   |   11.21   |   11.25   |     -0.39     |
|  257x255x257   |   14.50   |   14.43   |     0.44      |
|  513x511x513   |   29.01   |   29.01   |     0.01      |
| 1025x1023x1025 |  137.65   |  137.64   |     0.01      |
| 2049x2047x2049 |  982.58   |  982.65   |     -0.01     |
|  4097x3x4097   |   86.65   |   86.64   |     0.01      |
|  8193x3x8193   |  384.02   |  383.96   |     0.02      |
| 16385x3x16385  |  1106.73  |  1107.32  |     -0.05     |
| 32769x3x32769  |  4739.49  |  4739.48  |     0.00      |
| 65537x3x65537  | 17377.78  | 17378.74  |     -0.01     |
|  4097x5x4097   |   87.09   |   87.12   |     -0.03     |
|  8193x5x8193   |  301.38   |  301.36   |     0.01      |
| 16385x5x16385  |  1107.38  |  1108.04  |     -0.06     |
| 32769x5x32769  |  4743.73  |  4744.07  |     -0.01     |
| 65537x5x65537  | 17392.32  | 17395.42  |     -0.02     |
|  4097x7x4097   |   87.17   |   87.19   |     -0.02     |
|  8193x7x8193   |  301.94   |  302.00   |     -0.02     |
| 16385x7x16385  |  1107.17  |  1106.79  |     0.03      |
| 32769x7x32769  |  4747.15  |  4747.13  |     0.00      |
| 65537x7x65537  | 17403.85  | 17405.02  |     -0.01     |

Fixes perf problem reported in https://github.com/pytorch/pytorch/issues/114911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114992
Approved by: https://github.com/Skylion007, https://github.com/eqy
2023-12-08 07:53:17 +00:00
4d70802133 [c10d] Use TCPStore to record NCCL timeout and dump debug info (#115226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115226
Approved by: https://github.com/wconstab
2023-12-08 06:19:40 +00:00
2c84616a94 Move the shape env symint cache to a symbol cache, better routing for subclass fakification [re-pr 115227] (#115396)
Context:

Joel sees that unless he manually writes to the fake tensor memo, fakification seems to produce spurious symbols! Voz (me) objects, saying that not only is directly writing to memo a bad pattern, recursively invoking fakification on tensor subclass elements in dynamo should suffice! Joel says that while he morally agrees, he has a test proving otherwise, a most perplexing situation.

Digging in, I figured out that while *we were* making fake tensors correctly, with properly cached symbols and the like, we were *also* incorrectly creating spurious symbols, leading the test to fail.

Before this PR, we would only cache source->symint. This was generally fine, but meant that you would create a symbol, then potentially throw it out due to symint cache. For example, the cache hit flow was:

make a symbol (ex: s2) -> use it to make a symint -> hit the cache (my_source-s1)

Now, in this example,  you have a symbol in your val_to_var/var_to_val (s2) that is unused. This is sound, but wasteful, and furthermore, misleading.

This was causing a test added in a PR in this stack to fail, specifically, because the test was using

```
curr_var_to_val = {
    str(k): v for k, v in context.fake_mode.shape_env.var_to_val.items()
}
```

To validate that no new symbols were being created (that is, that recursively creating fake tensors for subclasses was working).

The test is correct, but the implementation of caching would make (by this method of observation) cache hits look like cache misses.

So, the fix here is to move the cache up to be a general symbol cache, rather than only a cache for symints.

The initial implementation did that! But then, it ran into some interesting errors when it came to replay. When replaying symbol creation, behaviors would diverge in the new shape env! How could that be? The answer is because creating a new shape_env resulted in us replaying symbol creation... but with a cache from a different shape env! This was short circuiting symbol creation - and so, adding an extra layer to the cache for id(shape_env) fixes the problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115396
Approved by: https://github.com/mlazos
2023-12-08 05:02:21 +00:00
d0f161eae4 [vision hash update] update the pinned vision hash (#111264)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111264
Approved by: https://github.com/pytorchbot
2023-12-08 03:33:33 +00:00
9521331ba5 [pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219)
Summary:
[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process
We have seen a handful of training jobs get stuck where one of the trainers goes down
while the others are stuck in C++ land and hence not handling the SIGTERM.

Test Plan: Manually validated by attaching gdb to one of the processes and sending a kill -9 to another. Saw the log ```[WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```

Differential Revision: D51862545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115219
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-12-08 02:26:19 +00:00
459845b82d [cuDNN][cuDNN frontend] Bump cudnn_frontend submodule to 1.0 (#115218)
A prerequisite for cuDNN flash attention #113713 .

CC @malfet @atalman @drisspg @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115218
Approved by: https://github.com/drisspg, https://github.com/malfet
2023-12-08 02:24:26 +00:00
e071d6a9eb [Nested tensor]avoid using shape in python subclass NT, use _size instead (#115371)
Summary:
calling tensor.shape will call torch_dispatch which adds more overhead.

Testing overhead difference in "NT + NT" operation:
**Before:**
the add operation takes ~300us
{F1167963824}
**After:**
the add operation takes ~200us
 {F1167964056}

Test Plan: unit tests in test_nestedtensor

Differential Revision: D51949135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115371
Approved by: https://github.com/soulitzer, https://github.com/jbschlosser
2023-12-08 02:08:36 +00:00
5432088098 Adds Checkpointer Wrapper for DCP [3/N] (#114603)
Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers.

Instead of doing:

```
DCP.save(
    state_dict={...},
    storage_writer=StorageWriter(...)
)

DCP.load(
    state_dict={...},
    storage_reader=StorageReader(...)
)
```

We can now do:

```
checkpointer = Checkpointer(...)

checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-08 01:03:21 +00:00
3b01f30b20 Prevent invalid pointwise ops on jagged with transposed ragged dim (#115190)
TODO: tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115190
Approved by: https://github.com/soulitzer, https://github.com/ani300
2023-12-08 00:54:03 +00:00
784e20e3d7 [C10D] Make dumpPipe use async launcher (#115375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115375
Approved by: https://github.com/fduwjj
ghstack dependencies: #115332
2023-12-08 00:16:22 +00:00
bb7746275c Add is_integer to SymFloat (#114703)
Fixes #114676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114703
Approved by: https://github.com/peterbell10
2023-12-07 23:23:53 +00:00
f5919335db Fix _load_from_state_dict for num_batches_tracked in batchnorm (#115285)
I approved https://github.com/pytorch/pytorch/pull/110850 which did the following

Previously:
`num_batches_tracked` not in state_dict when doing `m.load_state_dict(state_dict)` --> always overwrite module's `num_batches_tracked` in `load_from_state_dict` with a 0 cpu tensor

Now:
`num_batches_tracked` not in state_dict loaded when doing `m.load_state_dict(state_dict)` --> only overwrite module's `num_batches_tracked`  in `load_from_state_dict` with a 0 cpu tensor if module does not have `num_batches_tracked`

This causes the following issue:

```
with torch.device('meta'):
     m = BatchNorm(...)
m.load_state_dict(state_dict, assign=True)
```

If `num_batches_tracked` is not in `state_dict`, since the module's `num_batches_tracked` is present on the meta device, it is not overwritten with a 0 cpu tensor. When compiling, this error is raised

```
AssertionError: Does not support mixing cuda+meta
```

I am not sure whether the explicit check for the meta device makes sense as a fix; will add testing if this fix is OK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115285
Approved by: https://github.com/albanD
2023-12-07 22:48:26 +00:00
18d57dde2d Remove remaining uses of copy_graphstate (#115321)
After auditing higher_order_ops.py, the graph checkpoints were only getting used in the event of an exception, so it is safe to remove because we restart analysis in this case now.

To make this clearer the current state is the following:
```
Checkpoint side effects
Capture subgraph
if graph break:
  restore as usual
else:
  throw away inlining translator and subgraph tracer
Restore side effects
```

This will change to the following after this change:
```
Checkpoint side effects
Capture subgraph:
if graph break:
  restart analysis
else:
  throw away inlining translator and subgraph tracer
Restore side effects
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115321
Approved by: https://github.com/jansel, https://github.com/zou3519
2023-12-07 22:35:02 +00:00
ecba053cff [quant][pt2e] XNNPACKQuantizer skip inserting observers for non-float Tensors (#114999)
Summary:
As titled.

Test Plan:
python test/test_quantization.py -k test_add_mul_long

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114999
Approved by: https://github.com/kimishpatel, https://github.com/guangy10
2023-12-07 22:13:36 +00:00
dacf5d6e92 [DTensor] Remove assert to allow tensor sharding dimension < Shard(x).ndim (#115114)
Consolidates changes made by @yoyoyocmu: https://www.internalfb.com/diff/D51821717
Remove the assert to allow tensor sharding dimension < Shard(x).ndim. With the current padding, we do support this already.

Follow up: we will still need to fix the size mismatch and `full_tensor()` hang when tensor is uneven-sharded.
Created issue here: https://github.com/pytorch/pytorch/issues/115310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115114
Approved by: https://github.com/yoyoyocmu, https://github.com/wanchaol
2023-12-07 21:57:30 +00:00
7562b45454 Reland "[C10D] Use future for flight recorder dump (#115176)" (#115332)
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.

Allows multiple calls to dump, which will be serialized.

Renames tryWriteDebugInfo to launchAsyncDebugDump in spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.

Adds a test for dumping on timeout.

This reverts commit ac7d14baad53fa7d63119418f760190f289d8a01.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115332
Approved by: https://github.com/fduwjj
2023-12-07 21:20:58 +00:00
fd79995fd6 [export] Dont skip output caching for now. (#115374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115374
Approved by: https://github.com/tugsbayasgalan
2023-12-07 20:31:30 +00:00
6a6a1e3ef7 [dtensor] update README to make all example runnable (#115365)
As titled; also add torchrun commands.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115365
Approved by: https://github.com/fegin
2023-12-07 20:23:37 +00:00
c06ab369e8 [OAT] toggle for forcing matmul precision matching (#115326)
Summary: Add a toggle to inductor config that will force matmul precision dtypes to match between cublas and triton backends for addmm, bmm, and mm operations.

Test Plan: CI + model launches

Reviewed By: jansel

Differential Revision: D51442001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115326
Approved by: https://github.com/jansel
2023-12-07 20:22:12 +00:00
7faa67f6ef [inductor] enable mkldnn op weight pre-packing on aarch64 (#115037)
This PR enables the fx passes and mkldnn optimizations for aarch64. It improved BERT inference performance by up to 5.8x on an AWS c7g instance when comparing torch.compile() against the no-compile path. This is enabled when pytorch is built with the USE_MKLDNN_ACL option for aarch64.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115037
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-12-07 19:58:38 +00:00
7201edc0a5 Fix RNN class constructor signature (#115341)
Fixes #114617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115341
Approved by: https://github.com/mikaylagawarecki
2023-12-07 19:46:33 +00:00
21cca2494d Move test_multi_tensor_optimizers to use OptimizerInfos (#114797)
This PR aims for parity+ compared to the old testing for the simplest foreach test case.

Test coverage increase: we now test foreach optimizers with CPU as well as on GPU.

Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok

----------------------------------------------------------------------
Ran 1 test in 7.253s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 22 tests in 30.954s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Why the increase in time?
Two reasons:
1. overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__` when called for the first time will go through all visible CUDA devices and synchronize each of them, thus forcing the CUDAContext to be init'd. Doing this for all 8 devices takes ~10-15s. Also, test parametrization costs a little overhead too, but not to the level init'ing CUDA context does.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.

Next steps for the future:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
2023-12-07 19:37:56 +00:00
16373bbc1f fix error message in pytorch (#115349)
Fixes https://dev-discuss.pytorch.org/t/typo-in-error-message/1709 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115349
Approved by: https://github.com/Skylion007
2023-12-07 19:27:29 +00:00
suo
eb4ba35b07 fix test_weak.py on mac (#115367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115367
Approved by: https://github.com/albanD
2023-12-07 19:19:56 +00:00
b0a9641815 [Inductor][fx pass] Fuse pointwise operators in the post grad (#114778)
Summary: We construct a unified API that makes it easy to add pointwise ops to be batched in the post grad pass

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/19b3f641-782f-4f94-a953-3ff9ce2cfa7b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900251953016
Network: Up: 67KiB  Down: 32KiB  (reSessionID-c2a80f26-8227-4f78-89fc-bcbda0ae8353)
Jobs completed: 18. Time elapsed: 1:19.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
# local reproduce
### cmf
P881792289
### igctr
### dsnn
### icvr

Reviewed By: xuzhao9

Differential Revision: D51332067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114778
Approved by: https://github.com/xuzhao9
2023-12-07 19:04:03 +00:00
3a5fb0d456 markDynamoStrictTest in functorch/test_eager_transforms.py (#115268)
We're doing some more work around the functorch-torch.compile
interaction. The current state is that these tests might not get run in
the Dynamo CI shard. Using this decorator makes them actually run (by
resetting the Dynamo state before/after each test).

Test Plan:
Wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115268
Approved by: https://github.com/voznesenskym, https://github.com/guilhermeleobas
ghstack dependencies: #115267, #115276
2023-12-07 18:42:21 +00:00
a1bfaf75dc markDynamoStrictTest: add nopython flag, set default to False (#115276)
Default should be False because in general, we're interested
in reliability and composability: we want to check that
running PyTorch with and without Dynamo has the same semantics (with
graph breaks allowed).

Test Plan:
Existing tests?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115276
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115267
2023-12-07 18:42:21 +00:00
2847045ed9 Set _dynamo.config.capture_func_transforms=False (#115267)
Due to not all tests in the Dynamo shard actually running in CI, we've
started to bitrot on this implementation. Since our plan is to trace
into the functorch implementations instead of construct a HOP
(which is what capture_func_transforms=True does), let's turn off this
config by default.

Test Plan:
- Tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115267
Approved by: https://github.com/voznesenskym, https://github.com/guilhermeleobas
2023-12-07 18:42:15 +00:00
3e66385ddd Add Work to distributed docs (#115172)
Summary:
Documenting the `Work` object

For a collective (broadcast, all_reduce, etc.), when async_op=True we return a `Work` object on which users can call `.wait()`, `.is_success()`, among other things, but this class was not documented
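A minimal sketch of the pattern being documented (assumes an initialized process group, e.g. under torchrun):

```python
import torch
import torch.distributed as dist

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)  # returns a Work handle immediately
work.wait()                               # block until the collective completes
```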

Test Plan: Preview the docs build in OSS

Differential Revision: D51854974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172
Approved by: https://github.com/wconstab
2023-12-07 18:12:10 +00:00
ee8b33f7d5 Fixed crash when calling pad_packed_sequence on sequences packed with cuda tensors and enforce_sorted=False, due to indexing with tensors on different devices (#115028)
Fixes #115027

Fix in csrc as done in the python code [here](https://github.com/pytorch/pytorch/blob/main/torch/nn/utils/rnn.py#L338).
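A hedged repro sketch of the scenario described (requires a CUDA device):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

x = torch.randn(3, 5, 2, device="cuda")   # (batch, time, feature)
lengths = torch.tensor([3, 5, 2])         # unsorted -> enforce_sorted=False
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
out, out_lengths = pad_packed_sequence(packed, batch_first=True)  # crashed before the fix
```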

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115028
Approved by: https://github.com/drisspg
2023-12-07 18:09:18 +00:00
suo
686a3e0bf0 [pytorch][PR] introduce WeakHashRef (#115216)
We would like weak dictionaries that have `torch.ScriptObject` keys. Similar to tensors, we need to override the behavior of the ref to do the right thing under comparison.

This change also makes it so that WeakIdKeyDictionary works with a pluggable ref_type.

Differential Revision: [D51828205](https://our.internmc.facebook.com/intern/diff/D51828205/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115216
Approved by: https://github.com/albanD
2023-12-07 17:48:11 +00:00
684ce1b21d Revert "Assert that output could only be the last node of the FX graph (#115179)"
This reverts commit 4a9fb9832abc00dff9729b7d7a9647b376882f38.

Reverted https://github.com/pytorch/pytorch/pull/115179 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115179#issuecomment-1845776365))
2023-12-07 17:26:27 +00:00
dd6ae6d3b4 [HigherOrderOp] Remove additional get item calls in MapHigherOrder. (#115207)
As titled, this PR removes the unnecessary getitem call from the graph that's manipulated in MapHigherOrder. There we want to get the first-dim slice of the original tensor for speculation, but using call_method would accidentally create a get_item call in the graph, so we avoid it by calling unpack_var_sequence on the input tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115207
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115, #115204, #115205
2023-12-07 17:06:44 +00:00
8b74735878 [HigherOrderOp] make MapHigherOrder create map_impl call_function node instead of map (#115205)
We want to remove the map_wrapper and replace it with dynamo always on. This is the first step of this plan.

In this PR, we make dynamo directly generate map_impl nodes. This hasn't touched the eager logic yet. So the execution path after this PR looks like: 1. `dynamo -> map_impl` when torch.compile is on (before this PR, it was `dynamo -> map_wrapper -> map_impl`), and 2. `map_wrapper -> map_impl` (this PR didn't touch the logic here).

The added TODO(yidi) is addressed in the following PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115205
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115, #115204
2023-12-07 17:06:44 +00:00
be3efbebb6 [HigherOrderOp] make MapHigherOrder use should_flatten_output=True (#115204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115204
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115
2023-12-07 17:06:35 +00:00
998c87f93c [BE][HigherOrderOp] extract redundant code that unflattens the output (#115115)
We need this function to unflatten the variable tracker for HOPs that want pytree output support, e.g. map.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115115
Approved by: https://github.com/yanboliang
2023-12-07 17:06:28 +00:00
43f42bf3cb Updated docs for deprecated torch.set_default_tensor_type (#115041)
Added deprecation note for torch.set_default_tensor_type. Updated docs that referenced this method.
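A minimal sketch of the migration path mentioned in the updated docs:

```python
import torch

# torch.set_default_tensor_type(torch.DoubleTensor)  # deprecated pattern
torch.set_default_dtype(torch.float64)  # replaces the dtype half of the old call
# torch.set_default_device("cuda")      # replaces the device half, if needed
print(torch.empty(2).dtype)             # torch.float64
```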

Fixes #113646.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115041
Approved by: https://github.com/janeyx99
2023-12-07 16:17:36 +00:00
441ecf03e2 Update gloo submodule (#115158)
Updates to pull ROCm 6.0 related changes and few minor updates in gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115158
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-12-07 15:55:08 +00:00
cyy
7b8084d1c6 [5/N] Fixes clang-tidy warnings in c10/core/*.h (#115232)
This PR continues to fix clang-tidy warnings for headers in c10/core.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115232
Approved by: https://github.com/Skylion007
2023-12-07 15:48:03 +00:00
d08b20d534 Update FlashAttention to v2.3.6 (#115313)
# Summary
This PR updates the FlashAttention code from:
02ac572f3f.
Or Tag 2.3.2

To 92dd5703ec

Or tag 2.3.6.

I also think this should be cherry-picked into the 2.2.0 release, since there was a temporary ~15% perf regression for causal masking. It is not technically a regression, since this Flash version wasn't released yet, but it would be nice to have in the release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115313
Approved by: https://github.com/Skylion007
2023-12-07 15:47:16 +00:00
78b945484b [c10d] Extend NCCL communicator splitting to more use cases (#114916)
Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world.

This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank.

This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number).  And by doing the early connect, we have the guarantee ranks are connected and can perform nocolor splits when needed.
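A minimal sketch of the opt-in (the `device_id` keyword name is assumed from the eventual public API):

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
# Binding this rank to one device lets the PG connect eagerly and use
# ncclCommSplit for subgroups instead of NOCOLOR splits.
dist.init_process_group("nccl", device_id=torch.device(f"cuda:{local_rank}"))
```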

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916
Approved by: https://github.com/kwen2501
2023-12-07 15:13:01 +00:00
a6736ac851 Add call to run_tests for a few tests (#115097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115097
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-12-07 08:27:40 +00:00
3c882925da Make subclass type instances constants (like UserDefinedClasses) (#115323)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115323
Approved by: https://github.com/oulgen
2023-12-07 08:10:59 +00:00
5e3631db31 [DTensor] force re-compute sharding when normalized_shape differs in fwd layer norm (#115250)
**Summary**:
#114174 did not test the case where `elementwise_affine=False` (i.e. `weight` and `bias` are `None`), and this test would fail due to cached sharding propagation. The difference in sharding prop between these cases is that when `weight` and `bias` are None, the forward layer norm op is recognized as a "static shape op" and `propagate_op_sharding` is applied rather than `propagate_op_sharding_non_cached`. A fix is to force re-computing sharding when `normalized_shape` changes, by setting the op schema's `RuntimeSchemaInfo.static_argnum` to include `normalized_shape` (i.e. 1)

**Test**:
pytest test/distributed/_tensor/test_math_ops.py -s -k layer_norm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115250
Approved by: https://github.com/wanchaol
2023-12-07 07:44:06 +00:00
622688fab9 [export] Fix graph output mismatch issue with constant outputs. (#115280)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115280
Approved by: https://github.com/tugsbayasgalan
2023-12-07 06:11:08 +00:00
e1f159e6b2 Remove redundant API named is_int_list (#115136)
Fixes #114933

As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115136
Approved by: https://github.com/zou3519
2023-12-07 04:55:13 +00:00
5309ac1b98 Add test case to prove non-strict export supports external call (#115245)
Current non-strict test cases (added in #114697) are already supported by strict mode, so they can't demonstrate the incremental value of non-strict mode. How about adding test cases that fail in strict mode but pass in non-strict mode?

Test Plan:
python test/export/test_export.py -k test_external_call_non_strict_real_tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115245
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2023-12-07 04:51:15 +00:00
a93b9ee9d8 [quant][be] Add a test for per channel quant for groupwise conv (#115224)
Summary:
just making sure this works

Test Plan:
python test/test_quantization.py -k test_groupwise_per_channel_quant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115224
Approved by: https://github.com/andrewor14
2023-12-07 04:46:20 +00:00
b7eb9b1e7e [Autotune] Enable register pressure handling logic for H100. (#115295)
I have seen the register pressure handling logic help performance on H100 for a couple of kernels. Also, my local runs of Huggingface and timm_models both show neutral results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115295
Approved by: https://github.com/jansel
2023-12-07 04:37:44 +00:00
f55ab176fc [OAT] move matmul precision out of system info (#115242)
Summary: move matmul precision out of the system info (system hash) and into the cache in preparation for switching precisions during compile

Test Plan: CI

Reviewed By: jansel

Differential Revision: D51442000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115242
Approved by: https://github.com/jansel
2023-12-07 04:30:06 +00:00
7ec145bfed [Quant] [PT2] Fix XNNPACKQuantizer set_module_type issue (#115252)
**Summary**
Fix the issue https://github.com/pytorch/pytorch/issues/115251; the root cause is that we passed the `filter_fn` parameter of `find_sequential_partitions` in the wrong position. Use a keyword argument to fix this issue.

**Test Plan**
```
python -u -m pytest -s -v test_quantization.py -k test_set_module_type_case_2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115252
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-12-07 03:08:20 +00:00
6c0a4ced53 [export] Add math.* ops to pass base (#115271)
Fixes https://github.com/pytorch/pytorch/issues/115209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115271
Approved by: https://github.com/ydwu4
2023-12-07 02:47:04 +00:00
d7160c9223 Handle potential ValueError exception when stringifying signals (#114696)
On some systems it is possible to receive a signal that does not have a name.  Rare, but possible.  This prevents our error handler from crashing and instead properly reports the signal.
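A minimal sketch of the defensive pattern (hypothetical helper, not the actual elastic code):

```python
import signal

def signal_name(signum: int) -> str:
    try:
        return signal.Signals(signum).name
    except ValueError:  # some signal numbers have no named Signals member
        return f"<Unknown signal {signum}>"

print(signal_name(int(signal.SIGTERM)))  # SIGTERM
print(signal_name(50))                   # <Unknown signal 50> on most platforms
```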

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114696
Approved by: https://github.com/xmfan
2023-12-07 02:10:30 +00:00
ac7d14baad Revert "[C10D] Use future for flight recorder dump (#115176)"
This reverts commit 0e07e3dbe434ce31a5aea634628c7d39747f265f.

Reverted https://github.com/pytorch/pytorch/pull/115176 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test_timeout_dumps is failing in trunk 0e07e3dbe4 ([comment](https://github.com/pytorch/pytorch/pull/115176#issuecomment-1844076455))
2023-12-07 02:09:58 +00:00
3a18211622 Guard on subclass inner tensors (#114965)
This PR introduces guarding on subclass inner tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114965
Approved by: https://github.com/voznesenskym
ghstack dependencies: #114311, #115212
2023-12-07 01:47:48 +00:00
c163b3c035 [export] Remove runtime assertion pass (#115196)
Reland of https://github.com/pytorch/pytorch/pull/111949/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115196
Approved by: https://github.com/avikchaudhuri
2023-12-07 01:44:11 +00:00
73c0035160 Add reset_storage method to FunctionalTensorWrapper (#115235)
In certain edge cases when using lazy tensors, the base tensor stored in the `FunctionalStorageImpl` and the `value_` tensor stored in the `FunctionalTensorWrapper` diverge. For instance, take this simple example
```python
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4, 2, bias=False)

    def forward(self, x):
        return x @ self.fc1.weight.transpose(0, 1)

with torch.device("lazy"):
    model = Model()

    x = torch.ones(4)
    out = model(x)
```
The call to `transpose` on the lazily initialized weight `fc1.weight` applies a view op on the functional tensor, which only gets propagated to the functional tensor wrapper and not the base tensor in the storage, thus causing them to diverge.

To fix this behaviour, we need to reset the functional tensor's storage. To facilitate this, we add a `reset_storage` method to `FunctionalTensorWrapper` which clears away the old storage and view metas.

CC: @behzad-a @GlebKazantaev @wconstab @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115235
Approved by: https://github.com/bdhirsh
2023-12-07 01:32:01 +00:00
cyy
4e9fe496cd Remove c10::either (#112733)
Time to remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112733
Approved by: https://github.com/albanD
2023-12-07 01:31:53 +00:00
240f4b2d25 make __lookup_backend return None when cache misses (#114766)
Fixes #114674. The error occurs because cached_backends is a thread-local object; when it's accessed from another thread, we get a cache miss. The naive fix is to just return None and recompile on a cache miss. This could also be related to making dynamo more thread-safe, but I'm not sure whether there is an ongoing effort or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114766
Approved by: https://github.com/IvanYashchuk, https://github.com/Neilblaze, https://github.com/jansel
2023-12-07 00:25:01 +00:00
7457a5f4be [inductor] adapt to the get_max_simd_tflops Triton API change (#115288)
Differential Revision: D51907617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115288
Approved by: https://github.com/hl475, https://github.com/chenyang78
2023-12-07 00:22:06 +00:00
ae5365819d [ONNX] Extend test_fx_op_consistency.py to cover ExportedProgram model type (#114886)
This PR covers `ExportedProgram` to `test_fx_op_consistency.py`, which helps us identify the necessary but missing io_steps.
Next, we should refactor the tests to actually cover all ops supported by registry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114886
Approved by: https://github.com/thiagocrepaldi
2023-12-07 00:03:23 +00:00
3642f29a64 DistributedDataParallel._post_forward, fix return (#114678)
Fix `return` in case of `_delay_all_reduce_all_params`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114678
Approved by: https://github.com/Skylion007, https://github.com/fegin
2023-12-06 23:44:52 +00:00
0e07e3dbe4 [C10D] Use future for flight recorder dump (#115176)
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort".  The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.

Allows multiple calls to dump, which will be serialized.

Renames `tryWriteDebugInfo` to `launchAsyncDebugDump` in spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.

Adds a test for dumping on timeout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115176
Approved by: https://github.com/zdevito
2023-12-06 23:42:19 +00:00
0757e2ba84 [aotautograd] Fix an output shape error when inputs are aliased (#115279)
Summary: Per https://github.com/pytorch/pytorch/issues/97083, when an output
is marked as OutputType.is_input but a synthetic base is constructed
because of aliased inputs, we may need to update the output type to
OutputType.alias_of_input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115279
Approved by: https://github.com/bdhirsh
2023-12-06 23:10:21 +00:00
7e0e124a5d Automated submodule update: FBGEMM (#115103)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: dbc3157bf2

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115103
Approved by: https://github.com/malfet
2023-12-06 22:47:40 +00:00
83cb6a75ad [dynamo] add list iterator contains (#115237)
Fixes https://github.com/pytorch/pytorch/issues/115236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115237
Approved by: https://github.com/jansel
2023-12-06 22:26:16 +00:00
71bf4f3b87 [CI] Add torch/_functorch/_aot_autograd to auto-label rule (#115283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115283
Approved by: https://github.com/bdhirsh
2023-12-06 20:07:53 +00:00
1489e4bcf3 [Quant] [PT2] Enable batchnorm in _move_exported_model_to_eval (#114547)
**Summary**
Add standalone batchnorm into `_move_exported_model_to_eval` to move it from training mode into eval mode

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_bn_conv2d
python -u -m pytest -s -v test_quantize_pt2e.py -k test_bn_move_exported_model_to_eval
```

Differential Revision: [D51853407](https://our.internmc.facebook.com/intern/diff/D51853407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114547
Approved by: https://github.com/jgong5, https://github.com/andrewor14
2023-12-06 19:51:22 +00:00
c99db5617a Introduce general metadata cache to jagged layout NestedTensor (#115212)
Slight refactor to:
* Lazily compute the min / max seq_len used for flash; this avoids unnecessary graph breaks / specialization when we're not accessing these.
* Store min / max seq_len in a general `metadata_cache`; condensing these should make it easier to avoid specializing on these and others we may add in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115212
Approved by: https://github.com/soulitzer, https://github.com/ani300
ghstack dependencies: #114311
2023-12-06 19:40:35 +00:00
b6de337d16 [funcol] a few optimizations to funcol (#113324)
Apply a few optimizations to funcol:

- For allgather on a non-0 dim, the resulting tensor already needs to access
data in order to do torch.cat, so we sync wait here so that we don't
need to go through ACT dispatch for chunk + cat altogether.
- Add fast-return logic for aten.view, as it's a commonly hit op for
view-related ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113324
Approved by: https://github.com/XilunWu
2023-12-06 19:25:35 +00:00
2cf0cf8137 [dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)
Fixes https://github.com/pytorch/pytorch/issues/113812, https://github.com/pytorch/pytorch/issues/102591, Probably fixes: https://github.com/pytorch/pytorch/issues/113740, https://github.com/pytorch/pytorch/issues/113786, https://github.com/pytorch/pytorch/issues/113788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114154
Approved by: https://github.com/wconstab, https://github.com/yf225
2023-12-06 18:50:14 +00:00
967863d91d [export][refactor][3/n] Move unlift to separate file (#114787)
Differential Revision: [D51823960](https://our.internmc.facebook.com/intern/diff/D51823960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114787
Approved by: https://github.com/ydwu4
ghstack dependencies: #114764, #114768
2023-12-06 16:46:47 +00:00
0ab57ee7ea [export][refactor][2/n] Move tracing logic (#114768)
2/n of refactoring export code:

* Moved tracing logic in torch/_export/__init__.py to torch/export/_tracer.py

Differential Revision: [D51823961](https://our.internmc.facebook.com/intern/diff/D51823961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114768
Approved by: https://github.com/ydwu4
ghstack dependencies: #114764
2023-12-06 16:46:47 +00:00
53bf8cfcf9 [export][refactor][1/n] Move dynamic shapes logic (#114764)
1/n of refactoring export code:
* Moved dynamic shapes/constraints/dynamic_dims logic in torch/_export/__init__.py and torch/export/__init__.py to torch/export/dynamic_shapes.py

Differential Revision: [D51823962](https://our.internmc.facebook.com/intern/diff/D51823962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114764
Approved by: https://github.com/ydwu4
2023-12-06 16:46:38 +00:00
5f939e32e3 [CI] Log load_model failures in csv (#114784)
Summary: Right now when load_model fails (either because of loading error or validation eager run failure), the result won't be logged in generated csv files. Let's log them in csv so that they are monitored by the expected results checking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114784
Approved by: https://github.com/malfet
2023-12-06 15:19:16 +00:00
67c8ad7285 Fix autograd.Function x enum input x torch.compile (#115206)
Fixes https://github.com/pytorch/pytorch/issues/114777. We treat Enums
like we do ConstantVariable.

Test Plan:
New test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115206
Approved by: https://github.com/yanboliang
ghstack dependencies: #115185, #115186, #115187
2023-12-06 15:18:25 +00:00
233ce0d24b Support GPU annotations for auto-trace jobs, similar to on-demand support (#114638)
Summary: When using auto_trace, gpu_user_annotation is not shown in the results. Fixing this by including `GPU_USER_ANNOTATION` in `kCudaTypes`.

Differential Revision: D51597995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114638
Approved by: https://github.com/aaronenyeshi
2023-12-06 09:38:13 +00:00
d4c79a3078 Add an attention bias subclass for a lower right causal masking (#114823)
# Summary
This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do do causal masking without the need to actually create the "realized" attention bias and pass into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain ( the kernels only need to iterate over ~0.5x the sequence, and for very large sequence lengths this can provide vary large memory improvements.

The flag was introduced early on in the kernel development, and at the time it implicitly meant "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed breakdown see: https://github.com/pytorch/pytorch/issues/108108. The kernels' default behavior has since changed, largely due to the rise of autoregressive text generation, and unfortunately changing it here would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking.

The larger theme, though, is laid out here: https://github.com/pytorch/pytorch/issues/110681. The thesis is that there is a lot of innovation in SDPA revolving around the attention_bias being used. This is the first of hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window`, which is used by the popular Mistral model family.

Benchmark results are below. I improved the meff_attention perf, hence the slightly decreased max speedup.
```Shell
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
|  Type   |      Speedup       | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim |     dtype      | head_dim |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Average | 1.2388050062214226 |            |           |           |           |           |                |          |
|   Max   | 1.831672915579016  |    128     |    32     |   1024    |   2048    |   2048    | torch.bfloat16 |    64    |
|   Min   | 0.9430534166730135 |     1      |    16     |    256    |    416    |   2048    | torch.bfloat16 |   128    |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
```
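
For illustration, a usage sketch; the `causal_lower_right` helper name is an assumption here (it is the name this subclass eventually shipped under, not necessarily the import path at the time of this PR):
```python
import torch
from torch.nn.attention.bias import causal_lower_right
from torch.nn.functional import scaled_dot_product_attention

# Non-square attention: 4 queries attend over 8 keys (e.g. decoding with a
# KV cache), where upper-left vs. lower-right alignment actually differs.
q = torch.randn(1, 8, 4, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 8, 64, device="cuda", dtype=torch.bfloat16)

# The bias is never materialized; sdpa recognizes the subclass and picks the
# fused kernel with lower-right causal alignment.
bias = causal_lower_right(q.size(-2), k.size(-2))
out = scaled_dot_product_attention(q, k, v, attn_mask=bias)
```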

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114823
Approved by: https://github.com/cpuhrsch
2023-12-06 08:29:26 +00:00
4a9fb9832a Assert that output could only be the last node of the FX graph (#115179)
Test Plan: unit tests

Differential Revision: D51856848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115179
Approved by: https://github.com/Chillee
2023-12-06 08:17:16 +00:00
fcf6a76108 [aot_inductor][pass] fuse parallel linear based on pre grad aten IR (#114776)
Summary:
This work is for PT2 inference. Since the IR from Export will change to pre-grad aten IR in a few months, we need to start this work now. Here is what I do in this diff:
1) Copy the fuse parallel linear pass to the fb folder and adapt it to aten IR. We still want to keep the original `group_batch_fusion.py` because it is still used in training. At some point in the future, when PT2 training decides to retire the torch IR based group_batch_fusion, we can remove it. But right now it's better to keep the torch IR and aten IR versions separate.

Our plan is to gradually transform the existing and important pre-grad passes to aten IR based passes.

Differential Revision: D51017854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114776
Approved by: https://github.com/zhxchen17
2023-12-06 05:48:20 +00:00
cyy
d250b2158e [4/N] Fixes clang-tidy warnings in header files (#115163)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115163
Approved by: https://github.com/Skylion007
2023-12-06 05:00:01 +00:00
f4c67ffff4 [dynamo] Improve support for dynamic shapes str.format and _assert (#115203)
This removes a graph break in vision_maskrcnn.
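
A minimal sketch of the pattern this targets (the exact message and shapes are made up):
```python
import torch

@torch.compile(dynamic=True)
def f(x):
    # x.size(0) is a SymInt under dynamic shapes; formatting it into the
    # assert message previously caused a graph break
    torch._assert(x.size(0) >= 2, "expected at least 2 rows, got {}".format(x.size(0)))
    return x.sum(dim=0)

f(torch.randn(4, 8))
```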

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115203
Approved by: https://github.com/yanboliang
2023-12-06 04:54:45 +00:00
4ff4e06b5b Update xla pin (#115211)
This updates the pin past 062aa91a9c so the flaky test can be skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115211
Approved by: https://github.com/malfet
2023-12-06 04:52:37 +00:00
534f25887b [inductor] avoid inplace for ComplexView (#115166)
Fix https://github.com/pytorch/pytorch/issues/115071
A regression introduced by https://github.com/pytorch/pytorch/pull/112875/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115166
Approved by: https://github.com/Skylion007
2023-12-06 04:52:28 +00:00
490f2d7570 Skip privateuse1's checkZeroPoints (#114117)
We want to use ``quantize_per_channel`` to create a quantized tensor, but we found that ``checkZeroPoints`` for ``privateuse1`` backend failed.

``quantize_tensor_per_channel_affine`` will ``checkZeroPoints`` for all backends except ``CUDA``:
140c54e6cc/aten/src/ATen/native/quantized/AffineQuantizer.cpp (L162-L164)

However, our ``privateuse1`` backend will get a segmentation fault if we try to cast our data to int64_t in ``checkZeroPoints``:
140c54e6cc/aten/src/ATen/native/quantized/AffineQuantizer.cpp (L82-L88)

Can we skip ``privateuse1``'s ``checkZeroPoints`` and perform this check in the actual device function instead? What do you think?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114117
Approved by: https://github.com/jerryzh168
2023-12-06 04:44:49 +00:00
acdd06e00f [executorch hash update] update the pinned executorch hash (#115215)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115215
Approved by: https://github.com/pytorchbot
2023-12-06 04:33:25 +00:00
a548e80536 Use test_vulkan to validate run_test without boto3 (#115233)
Since `test_weak` can undergo changes while `test_vulkan` is a no-op for CPU builds, use the latter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115233
Approved by: https://github.com/suo
2023-12-06 03:45:52 +00:00
2bff36bb0e [c10d] Change set timeout API name to _set_default_timeout (#115197)
Somehow the feedback did not show up; this PR addresses the comment in https://github.com/pytorch/pytorch/pull/115141.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115197
Approved by: https://github.com/XilunWu, https://github.com/wconstab
2023-12-06 03:38:39 +00:00
b56b002842 Fix NULL dereference in binary CPU ops (#115183)
Targeted fix for https://github.com/pytorch/pytorch/issues/113037

A more fundamental fix, where those functions are not even called for empty tensors, is coming later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115183
Approved by: https://github.com/drisspg, https://github.com/atalman, https://github.com/huydhn
2023-12-06 03:37:47 +00:00
892a14a450 [vision hash update] update the pinned vision hash (#111408)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111408
Approved by: https://github.com/pytorchbot
2023-12-06 03:25:52 +00:00
ef6cbf4e1f remove myself from CODEOWNERS (#115230)
Trying to rein in my notifications ;-)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115230
Approved by: https://github.com/malfet
2023-12-06 02:50:50 +00:00
b0b190f7c0 More descriptive error message for unsupported inputs to HOP (#115187)
Test Plan:
See updated tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115187
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
ghstack dependencies: #115185, #115186
2023-12-06 01:29:03 +00:00
b5b011a5cd Expand input types for HOPs that use manually_set_subgraph_inputs=False (#115186)
Previously we only supported Tensor, Constant, and SymNode. We lift
that restriction (there's not really a good reason for it). HOPs like
torch.cond, torch.map already do input validation (those are the ones
that can only support Tensor, Constant, and SymNode inputs).

Test Plan:
New test for `wrap`, which is a HOP that has
manually_set_subgraph_inputs=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115186
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
ghstack dependencies: #115185
2023-12-06 01:29:03 +00:00
bc46347152 Refactor how HOPs create new args to subgraphs (#115185)
This PR combines the logic for Tensor and SymNode.

Test Plan:
- Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115185
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
2023-12-06 01:29:03 +00:00
f6291a5e93 [Quant] [Inductor] Enable QLinear weight prepack when input dimension size exceeds 2 (#113928)
**Summary**
Enable the qlinear weight prepack when the input dimension size exceeds 2. There are extra reshape nodes before and after the `addmm` or `mm` node if the input dimension size exceeds 2.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k input_dim_exceeds_2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113928
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #113733, #113912
2023-12-06 01:24:15 +00:00
6d0cf26c3a [Quant] [Inductor] Enable Dequant Promotion when Linear input dimension size exceeds 2 (#113912)
**Summary**
When decomposing `Linear` to `addmm` or `mm` within Inductor, if the input dimension size exceeds 2, `reshape` nodes are introduced to convert the input into a 2-dimensional form before and after the `addmm` or `mm` node. It is essential to identify and match this pattern during quantization for dequantization promotion. For instance,
```
        #            quant
        #      + - - - | - - - +
        #      |    dequant    |
        #      |       |       |
        #      |    reshape    |
        #      |    /     \    |
        #      |  node1  node2 |
        #      + - | - - - | - +
        #        reshape reshape
        #      + - | - - - | - +
        #        quant    quant
```
In this PR, we mainly do 2 things:

- Extend support for the dequantization pattern in QLinear when the input dimension size exceeds 2.
- Revise the implementation of the dequant promotion pass, as it now needs to accommodate the matching of four different patterns.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k input_dim_exceeds_2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113912
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #113733
2023-12-06 01:20:36 +00:00
4a624d1f8a [Quant] [PT2] Enable QLinear input with multi dims (#113733)
**Summary**
In the previous QLinear implementation, it was assumed that inputs have a dimension of 2. In this update, we have modified QLinear to accept inputs with a dimension greater than 2, incorporating input and output reshaping accordingly.

**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qlinear_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113733
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-12-06 01:16:51 +00:00
b8ce05456c enable cat for cuda bits types (#115044)
It was already working for CPU, so this brings parity.
Also, this slightly reduces the number of compiled kernels by using OpaqueType.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115044
Approved by: https://github.com/malfet
2023-12-06 00:05:18 +00:00
b9c4fb68c5 [ONNX][Bench] Fix model name retrieval and remove unused argument (#115108)
There might have been some upstream updates; the previous hack started failing to pick up model names. This updates it to use the other, more appropriate variable.
Also fix a bug with an unused argument that was supposed to be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115108
Approved by: https://github.com/thiagocrepaldi
2023-12-05 23:55:12 +00:00
ae457a2c4a [PyTorch] Change test_aot_inductor CPU test failures syntax (#115180)
This portion of D50416438 is extremely subject to merge conflicts. It can also be safely landed without full CI round trip because it changes just one test file that we can simply run to make sure it works.

Differential Revision: [D51856943](https://our.internmc.facebook.com/intern/diff/D51856943/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115180
Approved by: https://github.com/mikekgfb, https://github.com/desertfire
2023-12-05 23:55:08 +00:00
01ec71e466 [NFC][Autotune] Use device_prop.regsPerMultiprocessor instead of hardcoded reg number. (#115094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115094
Approved by: https://github.com/jansel
2023-12-05 23:49:46 +00:00
1102d37958 remove aot_config.keep_inference_input_mutations from assert_functional_graph (#115195)
We technically allow backends to aot_autograd to pass a config saying "yes I am ok with seeing input mutations in my graph".

With https://github.com/pytorch/pytorch/pull/112906 though, there can be input mutations that show up in the backward (which we need to handle for correctness) and that are a large pain to keep out of the graph. The meta-point is that it's been ~a year since we added the config, and it almost always makes sense for backends to support input mutations for performance reasons (inductor does). So I just allow these input mutations in the graph in this rare backward situation, even if the backend didn't explicitly set the config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115195
Approved by: https://github.com/drisspg
2023-12-05 23:36:37 +00:00
7aac689b19 [inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.

Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.

Fixes https://github.com/pytorch/pytorch/issues/93631
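
No user-facing change is needed; a sketch of code that now lowers through the new node on CUDA:
```python
import torch

@torch.compile
def prefix_sum(x):
    # on CUDA this now lowers through ir.Scan (reusing the reduction kernel
    # machinery, with prologue/epilogue fusion) instead of falling back to eager
    return torch.cumsum(x, dim=-1)

y = prefix_sum(torch.randn(4096, 4096, device="cuda"))
```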

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
2023-12-05 23:31:49 +00:00
d78fe039eb Introduce OptimizerInfos + add a test_errors (#114178)
Introduce OptimizerInfos + use them to refactor out the error testing.

Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing. currently, lots of functionality is tested in `_test_basic_cases` and some of that should be broken down more.

What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed also is migrating to the toplevel test/test_optim.py now because importing TestOptimRenewed does not work (because of test instantiation, TestOptimRenewed gets replaced with TestOptimRenewedDevice for CPU, CUDA, and whatever other device).

Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) is passed in has now been expanded to all optims instead of only LBFGS
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)

Possible but not urgent next step: test ALL possible error cases by going through all the constructors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
2023-12-05 22:58:36 +00:00
99257002fa Extend auto_functionalized to support ops that return Tensors (#115135)
We can auto-functionalize operators that mutate their inputs as long as
the outputs of the operator do not alias their inputs. The user needs
to provide an abstract impl for the operator if it has non-trivial
returns.
- We update can_auto_functionalize(op) to include ops that return (but
  do not alias) Tensors
- We update auto_functionalized(op, mutated_args_names, kwargs) to
  return (out, mutated_args), where `out = op(**kwargs)` and
  `mutated_args` are the new values of the inputs that would have been
  mutated.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115135
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956, #115134
2023-12-05 22:43:06 +00:00
d0aad93249 Refactor can_auto_functionalize (#115134)
In preparation for the next PR up in the stack, which is going to update
"can_auto_functionalize" to support more operators than just ones that
return nothing. We are unable to auto-generate FakeTensor kernels for
operators that do not return nothing, but we are able to generate
functionalization kernels for operators that return something.

Test Plan:
Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115134
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956
2023-12-05 22:43:06 +00:00
4620170008 [Dynamo] Revert multiple PRs since they triggered compilation stuck internally (#115126)
Revert the following PRs to mitigate an internal compilation hang:
#113432
#114016
#114507
#114196
#114739
#114669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115126
Approved by: https://github.com/xush6528
2023-12-05 22:35:37 +00:00
80527c0cf2 [AOTInductor] Double buffering for Weights (#114446)
Summary:
This adds a function to the model container that does weight swapping with double buffering.

There are 2 parts to double buffering:
a) writing constants into the inactive buffer
b) swapping the active buffer

For (a), we write the constants into the buffer that's currently not in use, and store the information in both the constants map and the corresponding constant array to read.
For (b), we obtain the lock, activate the constant map/constant array that was inactive, and flag the one that was in use as inactive.

Test Plan:
test/cpp/aot_inductor/test.cpp

Differential Revision: [D51543732](https://our.internmc.facebook.com/intern/diff/D51543732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114446
Approved by: https://github.com/chenyang78, https://github.com/eellison
2023-12-05 22:31:56 +00:00
12085914b8 Replace bsr_dense_mm triton kernel with bsr_dense_addm triton kernel (#115030)
The `bsr_dense_addmm` triton kernel introduced in https://github.com/pytorch/pytorch/pull/114595 is a generalization of the `bsr_dense_mm` triton kernel and a more efficient version of it, because it uses an extra kernel parameter `SPLIT_N` that has a notable effect on performance when the r.h.s. operand has a larger number of columns.

This PR eliminates the `bsr_dense_mm` triton kernel in favor of using `bsr_dense_addmm` triton kernel.

The performance increase of `bsr_dense_mm` is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 50/71 %
- with 32x32 blocks, the average/maximal speed up is 30/63 %
- with 64x64 blocks, the average/maximal speed up is 12/26 %
- with 128x128 blocks, the average/maximal speed up is 7/17 %
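
For context, a minimal sketch of a call that exercises these kernels; that `F.linear` with a half-precision BSR weight on CUDA routes through this triton path is my assumption, not something stated in this commit:
```python
import torch
import torch.nn.functional as F

# Build a BSR-layout weight with 32x32 blocks and run a half-precision linear.
w = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
w_bsr = w.to_sparse_bsr(blocksize=(32, 32))
x = torch.randn(16, 2048, device="cuda", dtype=torch.bfloat16)
y = F.linear(x, w_bsr)  # assumed to dispatch to the triton BSR kernels
```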

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115030
Approved by: https://github.com/cpuhrsch
2023-12-05 22:29:24 +00:00
f35f52e4a6 Update auto_request_review.yml (#115182)
remove myself to avoid notification noise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115182
Approved by: https://github.com/huydhn, https://github.com/albanD
2023-12-05 21:36:18 +00:00
f09e8381b7 [Inductor][fx pass] Fix a bug in batch linear fusion in the post grad (#115061) (#115131)
Summary:

As titled.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/ab4b918c-9ffa-4d00-a747-880521a27851
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16607023638890043
Network: Up: 11MiB  Down: 117MiB  (reSessionID-079402d0-8fd7-4797-9ed5-dd0f778dce1a)
Jobs completed: 189430. Time elapsed: 2:02.5s.
Cache hits: 99%. Commands: 77000 (cached: 76995, remote: 5, local: 0)
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

Reviewed By: mengluy0125

Differential Revision: D51796899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115131
Approved by: https://github.com/mengluy0125
2023-12-05 21:20:17 +00:00
ab120e65fb Fix FSDP + TP state dict in param unflattening (#115105)
Summary:
This diff fixes the param unflattening when using FSDP together with TP. Currently we hardcode the `reshape_size` multiplier to 2, when it should instead be the size of the process group.

Before the fix, example exception: `shape '[257, 514]' is invalid for input of size 264196`, where the process group size is 4 instead of 2.

Test Plan:
**CI**:
CI test

**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/tensor/parallel:fsdp_2d_parallel`
- Passed

**Test model with WHEN**:
- Verified that checkpoint can be saved and resumed successfully;
- Verified the accuracy with window_ne, which is on-par with baseline.
https://pxl.cl/3Wp8w

Differential Revision: D51826120

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115105
Approved by: https://github.com/fegin
2023-12-05 21:19:56 +00:00
22704426c3 Expand dynamic dims support for traceable subclasses (#114311)
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).

Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
    * Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
    * Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
    * Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
    * Signatures now:
    ```python
    # attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
    # ctx is anything useful for rebuilding the class we want to guard on
    attrs, ctx = x.__tensor_flatten__()
    ...
    # inner_tensors is a dict of {attr -> tensor}
    # ctx is taken unmodified from flattening and (eventually) guarded on
    # outer_size is the expected size of the output; possibly symbolic
    # outer_stride is the expected strides of the output; possibly symbolic
    y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)

    # at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
    # the assert simplifies symbols when there are relationships between outer and inner symbols
    ```
    * Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
    * Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
    * Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
2023-12-05 21:09:25 +00:00
259a99669d [NCCL flight recorder] Dump when writing to pipe (#115139)
If TORCH_NCCL_DUMP_ON_TIMEOUT is set, then along with producing a dump file when a timeout happens, you can trigger a dump by writing to the local pipe `<TORCH_NCCL_DEBUG_INFO_TEMP_FILE>_<rank>.pipe` (by default `/tmp/nccl_trace_rank_<rank>.pipe`).
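
A hypothetical trigger from Python on the affected host; the default prefix and the idea that any write acts as the trigger are assumptions:
```python
import os

rank = 0  # hypothetical: the rank whose watchdog should produce a dump
pipe = f"/tmp/nccl_trace_rank_{rank}.pipe"  # assumes the default file prefix
if os.path.exists(pipe):
    with open(pipe, "w") as f:
        f.write("\n")  # assumption: any write acts as the trigger
```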

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115139
Approved by: https://github.com/wconstab
2023-12-05 20:44:23 +00:00
5fdae89c03 [docs][aoti] Link to export docs in AOTI docs (#115088)
Context: https://fb.workplace.com/groups/1075192433118967/posts/1341833143121560/?comment_id=1341841786454029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115088
Approved by: https://github.com/desertfire
2023-12-05 20:22:42 +00:00
a8bd593252 [c10d] Add _reset_nccl_collective_timeout so users can change timeout of a NCCL PG (#115141)
There are some use cases when users want to change the timeout for a NCCL process group in the middle of training. This PR enables it by adding a pybind api.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115141
Approved by: https://github.com/wconstab
2023-12-05 19:55:28 +00:00
85d4708512 HTA docs (#115060)
Added documentation for Holistic Trace Analysis

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115060
Approved by: https://github.com/aaronenyeshi
2023-12-05 19:38:09 +00:00
063423edf5 Revert "enable cat for cuda bits types (#115044)"
This reverts commit 4cf97c40f7145b1bd1ab76b2240327d7000c27d2.

Reverted https://github.com/pytorch/pytorch/pull/115044 on behalf of https://github.com/malfet due to This breaks ROCM ([comment](https://github.com/pytorch/pytorch/pull/115044#issuecomment-1841494814))
2023-12-05 19:37:25 +00:00
01afa54df5 [dynamo][FSDP] unit test: FSDP should not be lifted as fx graph attrs (#115112)
There was a SEV where FSDP modules were registered as graph attributes; this unit test prevents it from happening again.

without SEV fix: D48810186
```
python test/distributed/test_dynamo_distributed.py -k
test_fsdp_skip_register_attr_or_module

  File "/data/users/weif/pytorch/torch/_dynamo/repro/after_dynamo.py",
line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File
"/data/users/weif/pytorch/test/distributed/test_dynamo_distributed.py", line 897, in debug_compiler
    self.assertFalse(name in node.name, f"FSDP module {name} should not
be registered as attributes")
torch._dynamo.exc.BackendCompilerFailed: backend='debug_compiler' raised:
AssertionError: True is not false : FSDP module l__self___net_0_weight should not be registered as attributes
```

with SEV fix: D48810186
```
python test/distributed/test_dynamo_distributed.py -k test_fsdp_skip_register_attr_or_module

Ran 1 test in 6.438s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115112
Approved by: https://github.com/mlazos
2023-12-05 19:16:03 +00:00
4b8ddbbc7e [dynamo] Improve graph break message for copy.deepcopy (#115120)
I was curious what hf_T5_generate was trying to deepcopy, so I updated the error message:
Before:
```
STATS graph_break
  ("'skip function deepcopy in file /home/jansel/conda/envs/pytorch/lib/python3.10/copy.py'', skipped according skipfiles.SKIP_DIRS'", 3)
  ...
```
After:
```
STATS graph_break
  ('copy.deepcopy UserDefinedObjectVariable(GenerationConfig)', 3)
  ...
```

Related issue: #115122

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115120
Approved by: https://github.com/oulgen
ghstack dependencies: #115095, #115046, #115057, #115119
2023-12-05 19:01:31 +00:00
522bae20df [dynamo] Support any() on SymNodeVariable (#115119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115119
Approved by: https://github.com/yanboliang
ghstack dependencies: #115095, #115046, #115057
2023-12-05 19:01:31 +00:00
88642d44d9 [dynamo] Add RestrictedListSubclassVariable (#115057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115057
Approved by: https://github.com/yanboliang
ghstack dependencies: #115095, #115046
2023-12-05 19:01:23 +00:00
a97ed2470a [dynamo] Support hasattr on dataclass (#115046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115046
Approved by: https://github.com/yanboliang
ghstack dependencies: #115095
2023-12-05 19:01:14 +00:00
aa70e31610 [dynamo] Fix MutableSideEffects returning alias (#115095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115095
Approved by: https://github.com/yanboliang
2023-12-05 19:01:03 +00:00
5f89cedf9b Add note to set_default_device about functions with shared memory (#114825)
Fixes #114691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114825
Approved by: https://github.com/mikaylagawarecki
2023-12-05 18:52:54 +00:00
a987ad3d89 [BE]: Update ruff to v0.1.7 (#115169)
Update ruff to v0.1.7 with the latest and greatest fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115169
Approved by: https://github.com/albanD
2023-12-05 18:50:11 +00:00
4c5fe66880 [DTensor][BE] fix bug in OpStrategy for Tuple output (#115161)
**Summary**:
DTensor sharding propagation returns a single `OpStrategy` object in the case of a Tuple of multiple DTensors with the same `placements`, and this object is later expanded into a tuple of `DTensorSpec`s. However, the expansion copied the object's reference instead of copying/creating new objects, and this led to a wrong-overriding issue in the Tensor Meta propagation logic.

**Test**:
pytest test/distributed/_tensor/test_math_ops.py
pytest test/distributed/_tensor/test_dtensor_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115161
Approved by: https://github.com/wanchaol
2023-12-05 18:28:40 +00:00
c9853ccadc Relax tensor contiguity requirement for P2P ops (#114982)
I hit the following error when performing pipeline parallel for T5:
```
    return default_pg.send([tensor], dst, tag)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Tensors must be contiguous
```

In theory, we shouldn't require the tensors to be contiguous, especially for P2P ops, because we are just doing bit-wise "copy".

Thus, this PR relaxes the requirement and instead calls out that it is the user's responsibility to guarantee that the source and destination tensors have the same contiguity setting.
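
A minimal sketch, assuming a 2-rank process group has already been initialized:
```python
import torch
import torch.distributed as dist

# Both peers use a transposed (non-contiguous) view with matching contiguity.
t = torch.arange(32, dtype=torch.float32).reshape(4, 8).t()
if dist.get_rank() == 0:
    dist.send(t, dst=1)
else:
    buf = torch.empty(4, 8).t()
    dist.recv(buf, src=0)
```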

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114982
Approved by: https://github.com/H-Huang
2023-12-05 18:25:42 +00:00
daf89b4101 Update oneDNN submodule to v3.3.2 (#112700)
Update oneDNN submodule to v3.3.2.
Add a macro to check the version of `third_party/ideep`.
Since we have versioning now, the changes won't break any pipeline even if `third_party/ideep` is not updated at the same time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112700
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2023-12-05 17:51:55 +00:00
4cf97c40f7 enable cat for cuda bits types (#115044)
It was already working for CPU, so this brings parity.
Also, this slightly reduces the number of compiled kernels by using OpaqueType.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115044
Approved by: https://github.com/malfet
2023-12-05 17:14:42 +00:00
a827ac71f2 Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)"
This reverts commit eaa64339d640ed1d36520ada379213f8361be5ff.
2023-12-05 08:59:36 -08:00
0a9819e3e1 Prefer is_number over is_constant() (#114513)
`is_constant` tries really hard to check whether an expression is
constant. `is_number` is often enough. Note that `sympy.nan.is_number`
is true. The same holds for infinities.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114513
Approved by: https://github.com/peterbell10
2023-12-05 16:56:15 +00:00
5de0dff7ea Disable bugprone-unchecked-optional-access as it can cause clang-tidy to hang (#115124)
Let's see if it helps https://github.com/pytorch/pytorch/issues/114913

The issues on llvm are at https://github.com/llvm/llvm-project/issues/55530 and https://github.com/llvm/llvm-project/issues/69369.  In my CI test, I saw the following process hang:

```
/pytorch/pytorch/.lintbin/clang-tidy -p=/pytorch/pytorch/build --extra-arg -I/usr/lib/llvm-11/include/openmp --extra-arg -I/opt/conda/envs/py_3.9/include/python3.9 --extra-arg -I/pytorch/pytorch/third_party/pybind11/include --extra-arg -I/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11 --extra-arg -I/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11 --extra-arg -I/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/backward --extra-arg -I/usr/lib/llvm-14/lib/clang/14.0.0/include --extra-arg -I/usr/local/include --extra-arg -I/usr/include/x86_64-linux-gnu --extra-arg -I/usr/include /pytorch/pytorch/torch/csrc/autograd/python_nested_functions_manual.cpp
```

and the core dump matches the description found in https://github.com/llvm/llvm-project/issues/69369, showing it stuck in `clang::tidy::bugprone::UncheckedOptionalAccessCheck::check`:

```
#0  0x00000000030c7420 in clang::dataflow::WatchedLiteralsSolverImpl::updateWatchedLiterals() ()
#1  0x00000000030c6c2a in clang::dataflow::WatchedLiteralsSolverImpl::solve() && ()
#2  0x00000000030c6572 in clang::dataflow::WatchedLiteralsSolver::solve(llvm::DenseSet<clang::dataflow::BoolValue*, llvm::DenseMapInfo<clang::dataflow::BoolValue*, void> >) ()
#3  0x00000000030b3bd3 in clang::dataflow::DataflowAnalysisContext::querySolver(llvm::DenseSet<clang::dataflow::BoolValue*, llvm::DenseMapInfo<clang::dataflow::BoolValue*, void> >) ()
#4  0x00000000030b3ca5 in clang::dataflow::DataflowAnalysisContext::flowConditionImplies(clang::dataflow::AtomicBoolValue&, clang::dataflow::BoolValue&) ()
#5  0x00000000030b1213 in clang::dataflow::(anonymous namespace)::diagnoseUnwrapCall(clang::Expr const*, clang::Expr const*, clang::dataflow::Environment const&) ()
#6  0x00000000030b1357 in std::_Function_handler<std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> > (clang::CallExpr const*, clang::ast_matchers::MatchFinder::MatchResult const&, clang::dataflow::Environment const&), clang::dataflow::(anonymous namespace)::buildDiagnoseMatchSwitch(clang::dataflow::UncheckedOptionalAccessModelOptions const&)::$_7>::_M_invoke(std::_Any_data const&, clang::CallExpr const*&&, clang::ast_matchers::MatchFinder::MatchResult const&, clang::dataflow::Environment const&) ()
#7  0x00000000030b1292 in std::_Function_handler<std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> > (clang::Stmt const*, clang::ast_matchers::MatchFinder::MatchResult const&, clang::dataflow::Environment const&), clang::dataflow::MatchSwitchBuilder<clang::dataflow::Environment const, std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> > >::CaseOf<clang::CallExpr>(clang::ast_matchers::internal::Matcher<clang::Stmt>, std::function<std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> > (clang::CallExpr const*, clang::ast_matchers::MatchFinder::MatchResult const&, clang::dataflow::Environment const&)>) &&::{lambda(clang::Stmt const*, clang::ast_matchers::MatchFinder::MatchResult const&, clang::dataflow::Environment const&)#1}>::_M_invoke(std::_Any_data const&, clang::Stmt const*&&, clang::ast_matchers::MatchFinder::MatchResult const&, clang::dataflow::Environment const&) ()
#8  0x00000000030b1995 in clang::dataflow::MatchSwitchBuilder<clang::dataflow::Environment const, std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> > >::Build() &&::{lambda(clang::Stmt const&, clang::ASTContext&, clang::dataflow::Environment const&)#1}::operator()(clang::Stmt const&, clang::ASTContext&, clang::dataflow::Environment const&) const ()
#9  0x00000000030b170c in std::_Function_handler<std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> > (clang::Stmt const&, clang::ASTContext&, clang::dataflow::Environment const&), clang::dataflow::MatchSwitchBuilder<clang::dataflow::Environment const, std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> > >::Build() &&::{lambda(clang::Stmt const&, clang::ASTContext&, clang::dataflow::Environment const&)#1}>::_M_invoke(std::_Any_data const&, clang::Stmt const&, clang::ASTContext&, clang::dataflow::Environment const&) ()
#10 0x00000000030a7c27 in clang::dataflow::UncheckedOptionalAccessDiagnoser::diagnose(clang::ASTContext&, clang::Stmt const*, clang::dataflow::Environment const&) ()
#11 0x0000000002931286 in std::_Function_handler<void (clang::Stmt const*, clang::dataflow::DataflowAnalysisState<clang::dataflow::NoopLattice> const&), clang::tidy::bugprone::analyzeFunction(clang::FunctionDecl const&, clang::ASTContext&)::$_0>::_M_invoke(std::_Any_data const&, clang::Stmt const*&&, clang::dataflow::DataflowAnalysisState<clang::dataflow::NoopLattice> const&) ()
#12 0x0000000002930b41 in clang::dataflow::runDataflowAnalysis<clang::dataflow::UncheckedOptionalAccessModel>(clang::dataflow::ControlFlowContext const&, clang::dataflow::UncheckedOptionalAccessModel&, clang::dataflow::Environment const&, std::function<void (clang::Stmt const*, clang::dataflow::DataflowAnalysisState<clang::dataflow::UncheckedOptionalAccessModel::Lattice> const&)>)::{lambda(clang::Stmt const*, clang::dataflow::TypeErasedDataflowAnalysisState const&)#1}::operator()(clang::Stmt const*, clang::dataflow::TypeErasedDataflowAnalysisState const&) const ()
#13 0x00000000030c18cc in std::_Function_handler<void (clang::CFGStmt const&, clang::dataflow::TypeErasedDataflowAnalysisState const&), clang::dataflow::runTypeErasedDataflowAnalysis(clang::dataflow::ControlFlowContext const&, clang::dataflow::TypeErasedDataflowAnalysis&, clang::dataflow::Environment const&, std::function<void (clang::Stmt const*, clang::dataflow::TypeErasedDataflowAnalysisState const&)>)::$_1>::_M_invoke(std::_Any_data const&, clang::CFGStmt const&, clang::dataflow::TypeErasedDataflowAnalysisState const&) ()
#14 0x00000000030bf069 in clang::dataflow::transferBlock(clang::dataflow::ControlFlowContext const&, std::vector<llvm::Optional<clang::dataflow::TypeErasedDataflowAnalysisState>, std::allocator<llvm::Optional<clang::dataflow::TypeErasedDataflowAnalysisState> > >&, clang::CFGBlock const&, clang::dataflow::Environment const&, clang::dataflow::TypeErasedDataflowAnalysis&, std::function<void (clang::CFGStmt const&, clang::dataflow::TypeErasedDataflowAnalysisState const&)>) ()
#15 0x00000000030bfaa5 in clang::dataflow::runTypeErasedDataflowAnalysis(clang::dataflow::ControlFlowContext const&, clang::dataflow::TypeErasedDataflowAnalysis&, clang::dataflow::Environment const&, std::function<void (clang::Stmt const*, clang::dataflow::TypeErasedDataflowAnalysisState const&)>) ()
#16 0x00000000029301b3 in llvm::Expected<std::vector<llvm::Optional<clang::dataflow::DataflowAnalysisState<clang::dataflow::UncheckedOptionalAccessModel::Lattice> >, std::allocator<llvm::Optional<clang::dataflow::DataflowAnalysisState<clang::dataflow::UncheckedOptionalAccessModel::Lattice> > > > > clang::dataflow::runDataflowAnalysis<clang::dataflow::UncheckedOptionalAccessModel>(clang::dataflow::ControlFlowContext const&, clang::dataflow::UncheckedOptionalAccessModel&, clang::dataflow::Environment const&, std::function<void (clang::Stmt const*, clang::dataflow::DataflowAnalysisState<clang::dataflow::UncheckedOptionalAccessModel::Lattice> const&)>) ()
#17 0x000000000292fbe8 in clang::tidy::bugprone::UncheckedOptionalAccessCheck::check(clang::ast_matchers::MatchFinder::MatchResult const&) ()
#18 0x00000000022e1572 in clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor::MatchVisitor::visitMatch(clang::ast_matchers::BoundNodes const&) ()
#19 0x0000000002797a1c in clang::ast_matchers::internal::BoundNodesTreeBuilder::visitMatches(clang::ast_matchers::internal::BoundNodesTreeBuilder::Visitor*) ()
#20 0x00000000022e0dc6 in clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor::matchWithFilter(clang::DynTypedNode const&) ()
#21 0x00000000022e3b57 in clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor::TraverseDecl(clang::Decl*) ()
#22 0x00000000022e4c0c in clang::RecursiveASTVisitor<clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor>::TraverseDecl(clang::Decl*) ()
#23 0x00000000022e3b62 in clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor::TraverseDecl(clang::Decl*) ()
#24 0x00000000022e4c0c in clang::RecursiveASTVisitor<clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor>::TraverseDecl(clang::Decl*) ()
#25 0x00000000022e3b62 in clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor::TraverseDecl(clang::Decl*) ()
#26 0x00000000022e4c0c in clang::RecursiveASTVisitor<clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor>::TraverseDecl(clang::Decl*) ()
#27 0x00000000022e3b62 in clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor::TraverseDecl(clang::Decl*) ()
#28 0x00000000022e4c0c in clang::RecursiveASTVisitor<clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor>::TraverseDecl(clang::Decl*) ()
#29 0x00000000022e3b62 in clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor::TraverseDecl(clang::Decl*) ()
#30 0x00000000022e8791 in clang::RecursiveASTVisitor<clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor>::TraverseDecl(clang::Decl*) ()
#31 0x00000000022e3b62 in clang::ast_matchers::internal::(anonymous namespace)::MatchASTVisitor::TraverseDecl(clang::Decl*) ()
#32 0x00000000022c017a in clang::ast_matchers::MatchFinder::matchAST(clang::ASTContext&) ()
#33 0x000000000370ad3c in clang::MultiplexConsumer::HandleTranslationUnit(clang::ASTContext&) ()
#34 0x00000000038ed4bb in clang::ParseAST(clang::Sema&, bool, bool) ()
#35 0x000000000369eda7 in clang::FrontendAction::Execute() ()
#36 0x000000000360d3f6 in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) ()
#37 0x00000000027c475c in clang::tooling::FrontendActionFactory::runInvocation(std::shared_ptr<clang::CompilerInvocation>, clang::FileManager*, std::shared_ptr<clang::PCHContainerOperations>, clang::DiagnosticConsumer*) ()
#38 0x00000000022ad486 in clang::tidy::runClangTidy(clang::tidy::ClangTidyContext&, clang::tooling::CompilationDatabase const&, llvm::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, llvm::IntrusiveRefCntPtr<llvm::vfs::OverlayFileSystem>, bool, bool, llvm::StringRef)::ActionFactory::runInvocation(std::shared_ptr<clang::CompilerInvocation>, clang::FileManager*, std::shared_ptr<clang::PCHContainerOperations>, clang::DiagnosticConsumer*) ()
#39 0x00000000027c44c6 in clang::tooling::ToolInvocation::runInvocation(char const*, clang::driver::Compilation*, std::shared_ptr<clang::CompilerInvocation>, std::shared_ptr<clang::PCHContainerOperations>) ()
#40 0x00000000027c360b in clang::tooling::ToolInvocation::run() ()
#41 0x00000000027c5bb1 in clang::tooling::ClangTool::run(clang::tooling::ToolAction*) ()
#42 0x00000000022a90c7 in clang::tidy::runClangTidy(clang::tidy::ClangTidyContext&, clang::tooling::CompilationDatabase const&, llvm::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, llvm::IntrusiveRefCntPtr<llvm::vfs::OverlayFileSystem>, bool, bool, llvm::StringRef) ()
#43 0x0000000001ebc7f2 in clang::tidy::clangTidyMain(int, char const**) ()
#44 0x0000000004c54ba0 in __libc_start_main ()
#45 0x0000000001eb76ae in _start ()
```

Another note is that clang-tidy is CPU-bound, so we could consider running the lintrunner job on 4xlarge if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115124
Approved by: https://github.com/kit1980, https://github.com/Skylion007, https://github.com/malfet
2023-12-05 16:27:56 +00:00
ee96399bb4 Revert "[Reland2] Update NVTX to NVTX3 (#109843)"
This reverts commit dcb486232d3eb61024ad9e76cca367c60019c84c.

Reverted https://github.com/pytorch/pytorch/pull/109843 on behalf of https://github.com/atalman due to Diff broke internal builds and tests ([comment](https://github.com/pytorch/pytorch/pull/109843#issuecomment-1841105398))
2023-12-05 16:10:20 +00:00
e06bff8bbe [AOTI] Handle empty input args (#114682)
Summary: When the model takes no inputs, AOTInductor relies on checking weights to figure out which device to compile the model for. Currently, recording the buffer device type happens too late; this PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114682
Approved by: https://github.com/chenyang78
2023-12-05 15:02:17 +00:00
3d8c174069 Tie some torch.library def/impls to library objects in testing (#114956)
This should deflake some of the tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114956
Approved by: https://github.com/williamwen42
ghstack dependencies: #114955
2023-12-05 14:53:32 +00:00
cfa4370c07 torch.compile should auto-functionalize certain mutable ops (#114955)
Users may wish to torch.compile custom ops that mutate their inputs
and return nothing (this is a common class of operators).
torch.compile will automatically support this op without anyone needing
to provide a functionalization kernel for it. Here's how.

Let's say we have a hypothetical mylib::sin_(Tensor(a!) x) -> ()
op. First, when FakeTensor sees this op, it can just return None.
This is the case because custom ops are not allowed to mutate input
metadata, so the FakeTensor rule for one that returns nothing is trivial.

Next, when Python FunctionalTensor sees the op, it will functionalize
it by emitting a call to an auto_functionalize(op, ["x"], {"x": ...})
HOP and replacing the mutated inputs with the outputs of this HOP.
This HOP effectively runs the functional version of the op when
called: it clones inputs that will be mutated, runs the op, and
then returns Tensors with the new values.

In the future we can teach Inductor how to do re-inplacing when it sees
this HOP (like how triton kernels do it) but this isn't urgent (and is
more of a performance problem).
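
A sketch of the class of op described above; the `mylib` namespace and CPU-only registration are illustrative:
```python
import torch
from torch import Tensor

# A custom op that mutates its input and returns nothing.
lib = torch.library.Library("mylib", "DEF")
lib.define("sin_(Tensor(a!) x) -> ()")

def sin_(x: Tensor) -> None:
    x.sin_()

lib.impl("sin_", sin_, "CPU")

@torch.compile
def f(x):
    torch.ops.mylib.sin_(x)  # rewritten into the auto_functionalize HOP
    return x + 1

print(f(torch.randn(3)))
```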

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114955
Approved by: https://github.com/bdhirsh
2023-12-05 14:53:08 +00:00
94faba5224 [nccl-pg] Revert accidental renaming of env variables (#115082)
Summary:

In 9cc040fef64154a2424b2ccd2c0909641e245cf0, we accidentally changed some of the environment variable names to the non-deprecated form. The intent was to support both the deprecated and the new forms of the env variables (with a warning thrown for the deprecated form).

Test Plan:

OSS CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115082
Approved by: https://github.com/zdevito
2023-12-05 14:52:30 +00:00
0ee1e469cb Revert "Modify pointwise cat heuristic to only apply when inputs are all pointwise and outputs are all pointwise (#114520)"
This reverts commit 3d47b92dfbe19362fb6e98f142b2c79b9db7645c.

Reverted https://github.com/pytorch/pytorch/pull/114520 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/114520#issuecomment-1840890210))
2023-12-05 14:24:30 +00:00
1224acc018 [3/N] Fixes clang-tidy warnings in header files (#114431)
This PR series tries to enable clang-tidy for headers in torch/csrc and c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114431
Approved by: https://github.com/Skylion007
2023-12-05 12:58:27 +00:00
89569be2bd Pin z3-solver on Windows to 4.12.2.0 (#115150)
Windows trunk jobs started to fail with the new version 4.12.3.0 published today (Dec 4th, 2023): https://pypi.org/project/z3-solver/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115150
Approved by: https://github.com/kit1980
2023-12-05 10:48:57 +00:00
58809e8914 [Inductor][Optimus]Move group/batch fusion logic out of inductor (#115128)
Summary:
As discussed in D51695982, fusion may not always be beneficial. We want to let the user customize the fx passes.

Some examples of the new configs:
* Use the batch_fusion config: this automatically applies the following batch fusions: batch linear, layernorm, relu, tanh, sigmoid, and post-grad batch linear fusion
* Or use an explicit config:
```
"pre_grad_fusion_options": {
            "batch_linear": {"min_fuse_set_size": 10},
            "batch_linear_lhs": {},
            "batch_layernorm": {"max_fuse_search_depth": 100},
            "batch_tanh": {},
            "batch_relu": {},
            "batch_sigmoid": {}
          },
```

Test Plan:
with flag: f509168388

with config: f509168595

Reviewed By: frank-wei, mengluy0125

Differential Revision: D51817314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115128
Approved by: https://github.com/mengluy0125
2023-12-05 08:19:17 +00:00
d5af6b0301 Dont pad broadcasting bias dimension in pad mm (#115098)
Fix for https://github.com/pytorch/pytorch/issues/99649. As title - we shouldn't pad a broadcasting dimension.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115098
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-12-05 08:02:51 +00:00
1dc4588c6a Add an SDPA dispatcher for nested tensors with jagged layouts (#114164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114164
Approved by: https://github.com/jbschlosser
2023-12-05 06:33:45 +00:00
Chi
fb92983c9b Added More Information About Adadelta Optimizer (#106290)
I have added more information about the Adadelta optimizer so developers can understand more quickly what it is doing.
My changes look like this:
![Screenshot from 2023-07-31 10-01-54](https://github.com/pytorch/pytorch/assets/93595990/72d7cd00-8acb-4ab0-820b-7ece4943c7c1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106290
Approved by: https://github.com/janeyx99
2023-12-05 05:55:16 +00:00
eaa64339d6 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing a public module binding test on macOS due to the change in import order in torch/distributed/fsdp/_common_utils.py. Since the original import still works, we removed the changes to this file.

Test Plan: CI.

Differential Revision: D51825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-05 05:44:52 +00:00
e199b769b6 Unbreak vectorization (#115086)
Summary: Unbreak vectorization

Test Plan: sandcastle

Differential Revision: D51818065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115086
Approved by: https://github.com/malfet, https://github.com/seemethere
2023-12-05 04:15:54 +00:00
7843df60e4 [executorch hash update] update the pinned executorch hash (#115116)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115116
Approved by: https://github.com/pytorchbot
2023-12-05 04:09:48 +00:00
1d0e70ad65 Add get_mutation_names to ir.Wait (#115104)
`ir.Wait` generates the last 2 lines of this code:
```python
buf1_work = dist.all_gather_into_tensor(buf1[0], buf1_inputs[0], async_op=True, group=buf1_pg)
fun_col_impl._register_tensor_work(buf1, buf1_work)
buf2 = buf1[0]
del buf1

buf2 = _wait_tensor(buf2)  #  <- generated by ir.Wait
buf3 = buf2;  # reuse  <- generated by ir.Wait
```
`_wait_tensor` technically is a "mutation" op that changes `buf2` in place. So we should mark `ir.Wait` as a mutation op (by overriding its `get_mutation_names()`).

This fixes a very peculiar issue when inductor comm reordering is used for the llama model: downstream nodes that use the all-gather comm output sometimes take a dependency on `buf2` (the node before `ir.Wait`) instead of on `buf3` (`ir.Wait`); it's still unclear why it behaves like this. To work around the issue, we add the missing annotation that `buf3` is a mutation of `buf2`, so that the scheduler knows to schedule `buf3` before any of the `buf2` users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115104
Approved by: https://github.com/wanchaol
2023-12-05 03:54:33 +00:00
3cf5348239 [inductor] Replace rand[n].generator with inductor prim if generator=None (#115051)
This fixes the "should have been handled in replace_random.py" error
raised during lowering.

I also fixed `test_randn_generator` to catch any regressions.
Previously, it did not use the result of randn(), so dynamo tracing
omitted that node entirely.

Fixes #114203.
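
A minimal sketch of the now-working pattern (the model code in the issue differs):
```python
import torch

@torch.compile
def f(x):
    # an explicit generator=None is now treated like omitting the generator,
    # so inductor replaces the rand with its own prim instead of erroring
    return x + torch.randn(x.shape, generator=None, device=x.device)

f(torch.randn(8, device="cuda"))
```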

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115051
Approved by: https://github.com/eellison
2023-12-05 01:53:41 +00:00
3d0bbb24a1 [dynamo] Improve support for list subclasses (#115052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115052
Approved by: https://github.com/oulgen, https://github.com/eellison
ghstack dependencies: #114830, #115047, #115048
2023-12-05 01:31:33 +00:00
fe690f430a [dynamo] Fix dict.get with no default (#115048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115048
Approved by: https://github.com/eellison, https://github.com/oulgen
ghstack dependencies: #114830, #115047
2023-12-05 01:31:33 +00:00
f6b6fad136 Fix torch.inductor._utils.get_device_tflops on ROCm (#115102)
The function caused numerous test regressions after https://github.com/pytorch/pytorch/pull/114772 changed the triton APIs a bit to use the `nvsmi` function, which is not available on the `hip` platform.

Fixes https://github.com/pytorch/pytorch/issues/115087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115102
Approved by: https://github.com/desertfire, https://github.com/huydhn
2023-12-05 00:56:31 +00:00
c56d91ba39 Log pt2_compliant custom ops used with torch.compile (#115083)
Summary:
We already log non-pt2_compliant ops. This PR extends the logging to
include pt2_compliant custom ops. We do not log all pt2_compliant ops (i.e., builtin ops are excluded) because logging everything would probably take too much memory.

Test Plan:
Tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115083
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
2023-12-05 00:51:33 +00:00
288b1acaa9 [dtensor] fix empty shape init for dtensor constructors (#115091)
As titled, this PR fixes the empty-shape init case: if we pass in something like `torch.dtensor.zeros([])`, it should call `torch.zeros([])` under the hood, not `torch.empty(0)`. This makes the dtensor constructors align with the torch constructors.
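
To see the distinction in plain torch:
```python
import torch

a = torch.zeros([])  # 0-dim scalar tensor: shape torch.Size([]), 1 element
b = torch.empty(0)   # 1-dim empty tensor: shape torch.Size([0]), 0 elements
assert a.shape == torch.Size([]) and a.numel() == 1
assert b.shape == torch.Size([0]) and b.numel() == 0
```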

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115091
Approved by: https://github.com/XilunWu
2023-12-05 00:51:29 +00:00
5cfda9b7f8 Revert "Add an SDPA dispatcher for nested tensors with jagged layouts (#114164)"
This reverts commit aafa8233a4a1f336014cb122d16941e5b593706c.

Reverted https://github.com/pytorch/pytorch/pull/114164 on behalf of https://github.com/malfet due to Broke ROCM, see aafa8233a4 ([comment](https://github.com/pytorch/pytorch/pull/114164#issuecomment-1839798986))
2023-12-05 00:35:20 +00:00
aa6920c542 Fix hang in VonMises rejection sampling for small values of concentration (#114498)
Fixes #88443

Forces the internal `dtype` of `torch.distributions.von_mises.VonMises` to be `torch.double` and mirrors the numpy implementation of the second order Taylor expansion for `concentration < 1e-5`. Samples and log probs are returned with `dtype` of argument `loc`.

In principle one could also use masking in the rejection sampler to return uniformly distributed numbers for `concentration < 1e-8`, as in numpy. This may be slightly more efficient, but isn't required to solve the hanging issue.
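
A minimal sketch of the previously hanging case:
```python
import torch
from torch.distributions import VonMises

# Tiny concentration: this sampling call could previously hang.
d = VonMises(loc=torch.tensor(0.0), concentration=torch.tensor(1e-6))
s = d.sample((1000,))                 # computed internally in double
print(s.dtype, d.log_prob(s).dtype)   # both returned in loc's dtype
```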

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114498
Approved by: https://github.com/fritzo
2023-12-04 23:07:06 +00:00
1474dad28c [quant][pt2e][xnnpack] Add support for QAT dynamic quantization for linear in XNNPACKQuantizer (#113288)
Summary:
The FX graph mode quant workflow and the pt2e flow rely on the `is_dynamic` flag in the observer/QuantizationSpec to convert an observer to the dynamic quantization pattern (choose_qparams -> q -> dq). This PR adds an is_dynamic flag to all observers so that it's possible to convert these observers to the pattern.

However, this dynamic quantization pattern (choose_qparams -> q -> dq) is actually only valid for MovingAverageObserver(averaging_constant=1)
for the computation before convert and after convert to match in the context of QAT. So we'll have some sanity
checks in other observers to make sure the is_dynamic is False.

Test Plan:
python test/test_quantization.py TestXNNPACKQuantizer.test_qat_dynamic_linear

Differential Revision: [D51124725](https://our.internmc.facebook.com/intern/diff/D51124725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113288
Approved by: https://github.com/kimishpatel
2023-12-04 23:06:38 +00:00
a7bcc78bff Make it clearer that current selective AC is PT2-only and private (#115081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115081
Approved by: https://github.com/albanD
2023-12-04 23:01:22 +00:00
4ba37e1804 Add tests for bsr_dense_addmm and bsr_dense_mm triton kernels (#114800)
As in the title.

In addition,
- resolve https://github.com/pytorch/pytorch/pull/114757#discussion_r1409547917 re triton-contiguous inputs
- support non-contiguous inputs and outputs in triton kernels
- fix a couple of minor bugs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114800
Approved by: https://github.com/cpuhrsch
2023-12-04 22:07:47 +00:00
aafa8233a4 Add an SDPA dispatcher for nested tensors with jagged layouts (#114164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114164
Approved by: https://github.com/jbschlosser
2023-12-04 21:54:02 +00:00
43e3242490 [BE] Remove test corner cases for CUDA older than supported 11.8 (#114989)
Remove deprecated CUDA use cases from tests.
Similar to: https://github.com/pytorch/pytorch/pull/112873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114989
Approved by: https://github.com/malfet
2023-12-04 21:41:03 +00:00
8ef44e6110 [autograd.Function] Fix torch.compile w/ once_differentiable leads to opaque graph break (#113625)
Fixes #106893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113625
Approved by: https://github.com/zou3519
2023-12-04 21:37:06 +00:00
8dbae73e62 Use 2d weight and bias texture for conv2d quantized op (#114902)
Summary:
The performance with a 2D texture for weight and bias is better for quantized conv2d; the un-quantized version of conv2d also uses 2D textures.
The performance gain is:

With 3D:
Kernel Name                     Workgroup Size    Duration P50 (ns)
===========                     ==============    =================
vulkan.quantized_conv2d         {96, 72, 2}                 5965440
vulkan.quantized_conv2d         {96, 72, 2}                11316968
vulkan.quantized_conv2d_dw      {96, 72, 2}                 2735564
vulkan.quantized_conv2d_pw_2x2  {96, 72, 2}                 1645696

With 2D:
Kernel Name                     Workgroup Size    Duration P50 (ns)
===========                     ==============    =================
vulkan.quantized_conv2d         {96, 72, 2}                 4295772
vulkan.quantized_conv2d         {96, 72, 2}                 7874620
vulkan.quantized_conv2d_dw      {96, 72, 2}                 2658552
vulkan.quantized_conv2d_pw_2x2  {96, 72, 2}                 1632020

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
....
[----------] 78 tests from VulkanAPITest (1519 ms total)
[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
......
[----------] 395 tests from VulkanAPITest (6515 ms total)

[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin

Differential Revision: D50997534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114902
Approved by: https://github.com/yipjustin
2023-12-04 20:54:40 +00:00
6317a0350e [PyTorch][Vulkan] Refactor performance test binary (#114712)
Summary:
We create two files `vulkan_perf_utils.h` and `vulkan_perf_utils.cpp` which hosts several shared functions among the `perf_test` source files:
- `makeStack`
- `callOpByHandle`
- `callOpByName`
- `extractTotalShaderResultsAndSetState`
- `extractTotalOpResultsAndSetState`

so that they can be used for all perf tests.

Test Plan:
We test `vulkan_conv_arithmetic_perf_test`, `vulkan_layernorm_perf_test` and `vulkan_mm_perf_test` respectively as below.
- build binary, at `fbsource`
```
buck2 build  -c ndk.debug_info_level=0  -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_layernorm_perf_test_binAndroid  --show-output  -c pt.vulkan_full_precision=1
buck2 build  -c ndk.debug_info_level=0  -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_conv_arithmetic_perf_test_binAndroid  --show-output  -c pt.vulkan_full_precision=1
buck2 build  -c ndk.debug_info_level=0  -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_mm_perf_test_binAndroid  --show-output  -c pt.vulkan_full_precision=1
```
- push to device
```
adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_conv_arithmetic_perf_test_binAndroid__/pt_vulkan_conv_arithmetic_perf_test_binAndroid /data/local/tmp
adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_mm_perf_test_binAndroid__/pt_vulkan_mm_perf_test_binAndroid /data/local/tmp
adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_layernorm_perf_test_binAndroid__/pt_vulkan_layernorm_perf_test_binAndroid /data/local/tmp
```
- test on device

```
adb shell /data/local/tmp/pt_vulkan_mm_perf_test_binAndroid
adb shell /data/local/tmp/pt_vulkan_layernorm_perf_test_binAndroid
adb shell /data/local/tmp/pt_vulkan_conv_arithmetic_perf_test_binAndroid
```
full results:
vulkan_mm_perf_test: P887658084
vulkan_layernorm_perf_test P887687924
vulkan_conv_arithmetic_perf_test P887689880

Reviewed By: yipjustin, liuk22

Differential Revision: D51451751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114712
Approved by: https://github.com/yipjustin
2023-12-04 19:49:50 +00:00
62df4f3428 Revert "Update oneDNN submodule to v3.3.2 (#112700)"
This reverts commit afbaa0c1650cf15100fb5dc579ceeba24fb8665a.

Reverted https://github.com/pytorch/pytorch/pull/112700 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/112700#issuecomment-1839350284))
2023-12-04 19:41:12 +00:00
a70c85ce90 [dynamo] Improve support for inspect.signature().parameters (#115047)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115047
Approved by: https://github.com/oulgen
ghstack dependencies: #114830
2023-12-04 19:08:36 +00:00
40218436c4 Remove size asserts from fx_insert_profiling (#114830)
These are pretty old, don't work with dynamic shapes, and are failing
with --coverage mode in torchbench.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114830
Approved by: https://github.com/oulgen
2023-12-04 19:08:36 +00:00
8bb3cd192f Revert "Assert that output could only be the last node of the FX graph (#114973)"
This reverts commit a85df9eb0b35ed8c03e7db3c3cee01c2180fa3ed.

Reverted https://github.com/pytorch/pytorch/pull/114973 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/114973#issuecomment-1839290400))
2023-12-04 19:07:48 +00:00
dcb486232d [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing NVTX header inclusion of existing code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10
2023-12-04 19:02:07 +00:00
753c07bbe0 All gather keys before processing Stateful objects in save/load [2/N] (#114304)
Accounts for the case where `state_dict` keys may be present in different orders. Since users may be calling collectives inside `state_dict` and `load_state_dict`, differently ordered keys could cause a deadlock. This is mostly a defensive move, meant to match the feature in TSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114304
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-04 18:31:14 +00:00
f1c8c427da Fix https://github.com/pytorch/pytorch/issues/114892 (#115054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115054
Approved by: https://github.com/bdhirsh
2023-12-04 18:29:33 +00:00
a9e9590934 FF inductor failure (#114980)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114980
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2023-12-04 18:26:34 +00:00
4cb7dd0fc9 [sparse][quant] Add support for vector alpha in cusparselt mm (#112056)
Summary:

This PR adds support for passing in an alpha Tensor, which represents
a tensor of alpha values to fuse into the matmul.

```
cusparselt_sparse_mm = alpha * (A @ B) + bias
```

This operation is necessary for quantization, where we would like to
fuse one of the dequant matmuls into the sparse op.

Test Plan:

```
python test/test_sparse_semi_structured -k alpha
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112056
Approved by: https://github.com/cpuhrsch
2023-12-04 16:56:06 +00:00
f101426790 Revert "Move class definition of DebugInfoWriter to TraceUtil as well (#114901)"
This reverts commit fb325bbd46f69bea8b2debd3ab5830c9eedadc0d.

Reverted https://github.com/pytorch/pytorch/pull/114901 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/114901#issuecomment-1838815178))
2023-12-04 14:55:39 +00:00
453d509b73 [xla hash update] update the pinned xla hash (#114586)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114586
Approved by: https://github.com/pytorchbot
2023-12-04 11:24:17 +00:00
bfa2c844a8 [inductor][cpp] avoid redundant lowp type cast for direct load/store (#115006)
Fix https://github.com/pytorch/pytorch/issues/114879. See https://github.com/pytorch/pytorch/issues/114879#issuecomment-1836977610 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115006
Approved by: https://github.com/jansel
2023-12-04 06:39:27 +00:00
3da67ffad1 [Inductor] Do not promote int to float for torch.mm (#115043)
This PR fixes inductor silently promoting int to float for torch.mm, which caused a behavior difference from eager

Fixes #98978
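
A minimal sketch of the intended behavior (in the spirit of the linked issue, not the exact repro):

```python
import torch

def f(a, b):
    return torch.mm(a, b)

a = torch.randint(0, 10, (4, 4), dtype=torch.int64)
b = torch.randint(0, 10, (4, 4), dtype=torch.int64)

eager = f(a, b)
compiled = torch.compile(f)(a, b)
assert eager.dtype == compiled.dtype == torch.int64  # no silent promotion to float
```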

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115043
Approved by: https://github.com/jansel
2023-12-04 06:36:55 +00:00
3fbfa8cd0a [dynamo] support dict.copy() / OrderedDict.copy() / defaultdict.copy() (#115012)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115012
Approved by: https://github.com/jansel
ghstack dependencies: #115010, #115011
2023-12-04 01:50:10 +00:00
917a52d2a2 [dynamo] support dict.update(seq2) / OrderedDict.update(seq2) / defaultdict.update(seq2) (#115011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115011
Approved by: https://github.com/jansel
ghstack dependencies: #115010
2023-12-04 01:50:10 +00:00
2e8ac5ea93 [dynamo] support dict.fromkeys() / OrderedDict.fromkeys() / defaultdict.fromkeys() (#115010)
Add support for `dict.fromkeys`, `OrderedDict.fromkeys`, and `defaultdict.fromkeys`.
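
A small sketch of the kind of program this makes traceable (illustrative, not the actual test case):

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    d = dict.fromkeys(("a", "b"), 0)  # now supported by dynamo
    d["a"] = x.sum()
    return d["a"] + d["b"]

print(f(torch.ones(3)))  # tensor(3.)
```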

Fixes #114963

- #114963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115010
Approved by: https://github.com/jansel
2023-12-04 01:49:59 +00:00
541591dd79 Add the appropriate check on div_value to the cpp frontend (#114671)
Fixes #114334

As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114671
Approved by: https://github.com/mikaylagawarecki
2023-12-04 01:28:11 +00:00
50833021dd [Inductor] We re-enable the batch_fusion and group_fusion flags in order not to disturb the current production model implementation (#114841)
Summary:
We did two things:
1. We add back the batch_fusion and group_fusion flags to keep the current production model implementation working.

2. We distinguish batch and group fusion in the post grad pass, since group fusion needs fbgemm.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/13d152d2-5d4d-4c7a-ab88-51f8e8218942
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900253044737
Network: Up: 376KiB  Down: 44KiB  (reSessionID-c508aedc-8cc2-434a-8c17-bbe075a05562)
Jobs completed: 17. Time elapsed: 1:23.1s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D51695982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114841
Approved by: https://github.com/jackiexu1992
2023-12-03 23:59:10 +00:00
491f3c8037 [CI] Small follow up for triton conda builds
Forgot to modify conda upload rules in https://github.com/pytorch/pytorch/pull/115039

Also remove redundant parentheses
2023-12-03 15:55:00 -08:00
bf16fec463 Fix up triton builds (#115039)
Follow ups after https://github.com/pytorch/pytorch/pull/114772 and https://github.com/pytorch/pytorch/pull/108187

- Triton builds should be published from `main` rather than `nightly` branch, as:
   - They are independent of any PyTorch changes
   - Every nightly is pinned to a specific commit therefore publishing updated triton binaries will not affect previous nightlies
   - If this is not the case, nightly promotion will never happen, as binary builds on main would continue to fail in perpetuity while searching for the new triton binary
- `patch_setup_py` is still needed to modify name of the package for ROCm builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115039
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/huydhn
2023-12-03 23:14:41 +00:00
7979ba7b43 [inductor] Add dropout type check to match eager (#115040)
Fixes #98970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115040
Approved by: https://github.com/oulgen
2023-12-03 23:05:02 +00:00
69a8f9b07e [inductor] Fix shape mismatch in sdpa pattern matcher (#115038)
Fixes #100316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115038
Approved by: https://github.com/oulgen
2023-12-03 22:32:12 +00:00
55064a4ef9 [BE] add parentheses to kwargs unpacking func(*args, **(kwargs or {})) (#115026)
This PR adds parentheses to kwargs unpacking `func(*args, **(kwargs or {}))` for better code readability.

With and without the parentheses are semantically equivalent because they produce the same bytecode.

```console
$ echo "func(*args, **kwargs or {})" | python3 -m dis -
  0           0 RESUME                   0

  1           2 PUSH_NULL
              4 LOAD_NAME                0 (func)
              6 LOAD_NAME                1 (args)
              8 BUILD_MAP                0
             10 LOAD_NAME                2 (kwargs)
             12 JUMP_IF_TRUE_OR_POP      1 (to 16)
             14 BUILD_MAP                0
        >>   16 DICT_MERGE               1
             18 CALL_FUNCTION_EX         1
             20 POP_TOP
             22 LOAD_CONST               0 (None)
             24 RETURN_VALUE

$ echo "func(*args, **(kwargs or {}))" | python3 -m dis -
  0           0 RESUME                   0

  1           2 PUSH_NULL
              4 LOAD_NAME                0 (func)
              6 LOAD_NAME                1 (args)
              8 BUILD_MAP                0
             10 LOAD_NAME                2 (kwargs)
             12 JUMP_IF_TRUE_OR_POP      1 (to 16)
             14 BUILD_MAP                0
        >>   16 DICT_MERGE               1
             18 CALL_FUNCTION_EX         1
             20 POP_TOP
             22 LOAD_CONST               0 (None)
             24 RETURN_VALUE
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115026
Approved by: https://github.com/Skylion007
2023-12-03 20:03:26 +00:00
4d8b9964e1 [aotinductor] support at::convolution for AOTInductor (#114961)
This PR adds support to at::convolution for AOTInductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114961
Approved by: https://github.com/desertfire
2023-12-03 07:52:28 +00:00
7f49603ed3 Fix https://github.com/pytorch/pytorch/issues/114899 (#114985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114985
Approved by: https://github.com/ydwu4
2023-12-03 05:24:02 +00:00
3cdfba0a7c Make DynamicShapes*Tests show up properly in the test failure repro string (#115019)
Set their `__module__` attributes so that Python thinks the test classes
are defined in test_dynamic_shapes and not in torch._dynamo.testing.
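
A sketch of the pattern (class and module names hypothetical):

```python
import unittest

class ReproTests(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)

# The derived class is generated elsewhere; pointing __module__ at the test
# file makes failure repro strings name test_dynamic_shapes instead.
DynamicShapesReproTests = type("DynamicShapesReproTests", (ReproTests,), {})
DynamicShapesReproTests.__module__ = "test_dynamic_shapes"
```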

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115019
Approved by: https://github.com/Skylion007
ghstack dependencies: #115003
2023-12-03 04:48:43 +00:00
c808a84680 Better logging for "cannot fuse" reasons (#115003)
This was invaluable when I was debugging #114917. Without the node names
in the log message, it was difficult to make sense of them.

However, I did not want to bloat the number of LOC with this change.
Thus, instead of calling `debug()` directly with the node arguments, I
made a new callable class WhyNoFuse to partially apply the node
arguments at the top of each fusion-checking method. WhyNoFuse generates
the logging string only when its `__str__` method gets called, so there
is minimal overhead when logging is disabled.
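
A minimal sketch of the lazy-formatting pattern (not the exact inductor code):

```python
import logging

log = logging.getLogger(__name__)

class WhyNoFuse:
    def __init__(self, node1, node2):
        self.node1 = node1
        self.node2 = node2

    def __call__(self, reason, *args):
        self.reason = reason
        self.args = args
        log.debug("%s", self)  # logging formats lazily; __str__ runs only if enabled

    def __str__(self):
        return f"cannot fuse {self.node1} with {self.node2}: " + self.reason % self.args

why = WhyNoFuse("buf0", "buf1")
why("unaligned strides: %s", (4, 1))  # near-zero cost unless DEBUG logging is on
```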

I also removed the various logging 'tags' like "vert:1" / "triton:1" --
the log messages themselves are unique enough that the user can identify
them without the tag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115003
Approved by: https://github.com/Skylion007
2023-12-03 04:48:43 +00:00
a797821fd6 [executorch hash update] update the pinned executorch hash (#115021)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115021
Approved by: https://github.com/pytorchbot
2023-12-03 04:20:41 +00:00
3f366aa317 [audio hash update] update the pinned audio hash (#114997)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114997
Approved by: https://github.com/pytorchbot
2023-12-03 03:45:49 +00:00
a6294d8b9f [RelEng] Enable Py312 conda builds (#114819)
Once [sympy-1.12](https://anaconda.org/anaconda/sympy/files?version=1.12) has been added, it can be built across the board

Majority of the changes are in the builder repo:
* 6b8c73fecb tweaks numpy and openssl deps
* fc773dde97 <- tweak MLK requirements for Windows
* ca378c16f8 do not depend on Triton
* 3c7404d80c <- build without GLOO_SSL

And finally, to work around the chicken-and-egg problem from [smoke_test.bat:97](b92da8cd64/windows/internal/smoke_test.bat (L97))
```cmd
call conda install -yq numpy pytorch %CONDA_EXTRA_ARGS%
```

Manually upload binaries to pytorch-nightly channel (will fix it akin to Nova in followup PRs)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114819
Approved by: https://github.com/huydhn
2023-12-03 01:30:03 +00:00
2391f3717e [BE] Same install command for aarch64 and x86_64 wheels (#115017)
`--extra-index-url` should no longer be necessary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115017
Approved by: https://github.com/kit1980
2023-12-03 00:33:52 +00:00
3cbe7a53a9 Automated submodule update: FBGEMM (#114444)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 84c7b278be

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114444
Approved by: https://github.com/malfet
2023-12-02 22:15:06 +00:00
d7b303dcf8 [BE]: Enable a PLC0131, PLC0132, PLC0205. Fix PLC0132 bug. (#115015)
Enable pylint rules `PLC0131` and `PLC0132`. There was a violation of `PLC0132`, so this commit also fixes it and enables the rules so the violation does not occur again. `PLC0205` checks for accidentally setting your `__slots__` to a string, which is almost always a bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115015
Approved by: https://github.com/jansel, https://github.com/malfet
2023-12-02 20:35:10 +00:00
13410d0eda Moving target/code path to non-pytorch repo (#114095)
Differential Revision: D51460806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114095
Approved by: https://github.com/digantdesai
2023-12-02 19:27:09 +00:00
8a90249bc2 [inductor] Update triton pin (#114772)
Differential Revision: [D51761353](https://our.internmc.facebook.com/intern/diff/D51761353)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114772
Approved by: https://github.com/shunting314, https://github.com/atalman
2023-12-02 19:13:56 +00:00
3a2e2044cd Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)"
This reverts commit 729ac7317a50a6a195b324cf6cefd748bf4f5498.

Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))
2023-12-02 17:55:51 +00:00
af5a3bda45 [merge rule] add CPU quantization (#114994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114994
Approved by: https://github.com/jerryzh168, https://github.com/malfet
2023-12-02 08:34:55 +00:00
28925902fa [TP] fully rewrite Tensor Parallel APIs (#114732)
This PR rewrites the Tensor Parallel implementation. The Tensor Parallel APIs
are supposed to be a very thin wrapper over the DTensor APIs, but the current
implementation got too messy and buggy. It's really hard to debug what
went wrong when using it. It's crucially important for advanced users and
developers to understand the API and its implementation easily, without
going through all the different types of functions and utils, so that
they can trust what happens under the hood.

In particular this PR:

* Make ParallelStyle a real contract API for parallelize_module to
  take; each concrete ParallelStyle only needs to implement `apply` to
  apply the sharding to an nn.Module, and all unnecessary fields are removed. This
  also enables easier ParallelStyle authoring going forward.
* Keep the ColwiseParallel and RowwiseParallel public interfaces, but
  refactor them so that the parameter sharding and the input/output
  handling live within the style itself, making it easy to
  understand how Linear/Embedding layers are sharded and how the input/output
  transformations are performed (see the usage sketch after the TODOs below).
* Remove the private _prepare_input/_prepare_output_fn fields for
  both ColwiseParallel/RowwiseParallel. Since we have thrown deprecation
  messages in nightly for a while, TP is a prototype release, and the
  fields are private, it should be safe to remove them.
* Refactor the recently landed PrepareModuleInput/Output styles: change
  output_layouts to desired_input/output_layouts, group
  the functions inside the style itself, and drop the default arguments for these
  two styles so users have to specify them and think about the sharding
  layouts. Fixed bugs about not handling the
  `use_local_output` flag.
* Make default arguments None instead of a Placement object; it is
  standard python practice not to have a custom object instance as a default
  argument.
* Remove all dead APIs (i.e. the PairwiseParallel and SequenceParallel
  styles and all prepare input/output functions) as we have thrown deprecation
  msgs for a while, and we are in the process of removing all of them from the tests.
* Throw a deprecation warning for `tp_mesh_dim`, as we recommend using device
  mesh slicing/indexing instead of manually specifying the mesh dim.
* Rewrite the documentation for every ParallelStyle and make it
  clearer what each style is doing.

TODOs:
* Rewrite TP tests to adjust for the changes we have in this PR
* add more tests to guard the bug fixes
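
For orientation, a rough sketch of the post-rewrite usage shape (module paths and mesh size are assumptions, not part of this PR):

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# shard an MLP block colwise -> rowwise across an 8-GPU mesh
mesh = init_device_mesh("cuda", (8,))
mlp = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
mlp = parallelize_module(mlp, mesh, {"0": ColwiseParallel(), "2": RowwiseParallel()})
```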

Differential Revision: [D51761183](https://our.internmc.facebook.com/intern/diff/D51761183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114732
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-12-02 08:18:12 +00:00
729ac7317a [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)
Summary:

Same content of changes as https://github.com/pytorch/pytorch/pull/114710

Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation.
ghstack-source-id: 208980207
exported-using-ghexport

Test Plan: CI.

Reviewed By: wanchaol

Differential Revision: D51629761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin
2023-12-02 04:39:41 +00:00
0fef82b3df [dcp] fix fsdp state_dict to use run_check=False (#114995)
from_local with a replicate placement would run mesh_broadcast when
run_check=True, and from_local has run_check=True by default. But in the FSDP
state_dict case we know for sure that these are replicas already, so we
don't need to check/force-check it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114995
Approved by: https://github.com/fegin, https://github.com/XilunWu, https://github.com/wz337
2023-12-02 04:16:37 +00:00
1f51f977ae misc visualization/utility improvements (#114984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114984
Approved by: https://github.com/weifengpy
ghstack dependencies: #114520
2023-12-02 04:02:39 +00:00
3d47b92dfb Modify pointwise cat heuristic to only apply when inputs are all pointwise and outputs are all pointwise (#114520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114520
Approved by: https://github.com/eellison
2023-12-02 04:02:39 +00:00
a5a1f0a6b1 [executorch hash update] update the pinned executorch hash (#114996)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114996
Approved by: https://github.com/pytorchbot
2023-12-02 03:57:47 +00:00
f1fd02503b Reland #113487 and #112527 (sdpa shim & fp8 AOTInductor support) (#114974)
This is a backout of #113747 which reverted the above two commits. Now that
#113997 has landed, this diff can be landed safely without breaking ABI compatibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114974
Approved by: https://github.com/chenyang78
2023-12-02 03:25:51 +00:00
fe08d995ef [vision hash update] update the pinned vision hash (#111523)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111523
Approved by: https://github.com/pytorchbot
2023-12-02 03:04:19 +00:00
2882d7fdaf [BE] Remove stale workaround for CUDA<=11.2 (#114979)
It's been dead code for the last 3+ releases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114979
Approved by: https://github.com/Skylion007
2023-12-02 02:41:41 +00:00
a9aad4ea21 [AOTInductor] Generate Triton header even if scheduler is not invoked. (#114972)
Summary:
Generate Triton header for profiling.
If the Triton header isn't generated through the Scheduler, generate it directly
in the wrapper codegen.

Test Plan:
Test included in commit.
(test_aot_inductor.py:test_with_no_triton_profiler)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114972
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-12-02 02:03:38 +00:00
fb806f487f [AOTInductor] Add method to get storage size in shim (#114976)
Summary:
Add a method to get storage size.

Test Plan:
N/A for FC; tests will come after it is packaged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114976
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-12-02 01:54:18 +00:00
8f164017ee [quant][pt2e][xnnpack] XNNPACKQuantizer skip quantization for input and output to workaround histogram observer problem (#113405)
Summary:
As titled. This is because the histogram observer does not work for a corner case in mobilebert (observing a scalar tensor of the float32 max value):
the histc operator errors out when the value is larger than a certain number.
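
Roughly, the trigger looks like this (a sketch per the description above, not the exact test):

```python
import torch

x = torch.tensor(torch.finfo(torch.float32).max)
# HistogramObserver calls histc under the hood, which errors out for values
# this large, so the quantizer now skips observing such inputs/outputs.
torch.histc(x, bins=256)  # raises
```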

Test Plan:
python test/test_quantization.py -k test_mul_float32_max

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113405
Approved by: https://github.com/mcr229
2023-12-02 00:44:42 +00:00
7bbc19adc4 [dynamo] Unskip DALLE2_pytorch (#114960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114960
Approved by: https://github.com/eellison
ghstack dependencies: #114959
2023-12-02 00:40:25 +00:00
4cfe997490 [dynamo] handle setting .data on a tensor (#113080)
**Dynamo**

We don't want setattr in the graph. Setting data has interesting implications on both aliasing and on the autograd engine.

The safe recipe is:

1) Disable grad
2) Call set_()
3) Manually lower the version counter on the object to hide it from the autograd engine

This is effectively the same exact thing as setting .data, and it composes properly with aot_autograd and inductor.
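
Spelled out in eager code, the recipe looks roughly like this (`_unsafe_set_version_counter` is a private autograd helper, so treat this as a sketch):

```python
import torch

x = torch.zeros(6, requires_grad=True)
new = torch.ones(3)

# equivalent of `x.data = new`, written explicitly:
with torch.no_grad():
    old_version = x._version
    x.set_(new)  # swap in the new storage/metadata
    torch._C._autograd._unsafe_set_version_counter(x, old_version)  # hide from autograd
```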

**aot_autograd**

For aot_autograd, there's another snag.

Specifically, when we invoke aot_autograd, we call `fake_mode.from_tensor()`, relying on memo to get the right tensor out. For .data mutations, this doesn't work, because the memoized fake_tensor is in the state it will be in at the end of the trace, not at the beginning. This means that the .data call is already applied, and the tensor shape (as in the case of these tests) mismatches. aot_autograd produces an invalid graph, with illegal calls like `torch.ops.aten.view.default(primals_2, [0])` where primals is actually sized `([6])` on input.

The new plan here is to:
1) Record tensor fakification policy in dynamo
2) provide a fresh fake mode to all backends
3) Invoke from_tensor with the stored policy to get fresh new fake tensors in aot_autograd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113080
Approved by: https://github.com/bdhirsh
2023-12-02 00:35:44 +00:00
77c4565d58 [ONNX][Bench] Remove double export and session init in perf test (#114907)
Previously, both the `optimize_ctx` call and the `experiment` call would do export and session creation, doubling the resource cost. This PR makes the `experiment` call re-use the onnx model created by `optimize_ctx`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114907
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #110178
2023-12-02 00:17:07 +00:00
b0a36944cc [ONNX] Add sanity check in CI for onnxbench (#110178)
ONNX CI to run benchmark with `--quick` to validate the onnxbench infra.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110178
Approved by: https://github.com/thiagocrepaldi
2023-12-02 00:17:07 +00:00
1fce51037e Add profiler/unwind to the package (#114981)
Needed by `torch/csrc/profiler/combined_traceback.h`
Fixes https://github.com/pytorch/pytorch/issues/114978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114981
Approved by: https://github.com/atalman
2023-12-01 23:55:01 +00:00
d47f715d29 Expose Flash attn to autograd (#114378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114378
Approved by: https://github.com/drisspg
2023-12-01 23:42:06 +00:00
80d8a2a237 improve mkldnn_linear_pointwise performance for contiguous tensor with non default contiguous strides (#114939)
This PR converts the stride to the default contiguous stride in `mkldnn_linear_pointwise` before calling oneDNN, to hit an optimization path similar to https://github.com/pytorch/pytorch/pull/99511. Also refactored the code to provide a common utility function.

https://github.com/pytorch/pytorch/pull/111976 ignores dims of value 1 in Require_Stride_order. For a tensor with `size = [1, 1280]`, `stride = [0, 1]`:
**Before the above PR**, it was considered non-contiguous, and thus in the below call it was converted to `size = [1, 1280]`, `stride = [1280, 1]`:
25b83521be/torch/_inductor/ir.py (L5263)

**After the above PR**, dims of value 1 are ignored, so this tensor is already considered contiguous and we'd feed a tensor with `stride = [0, 1]` to oneDNN, which results in poor performance.
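
A small demonstration of the stride situation described above (the actual fix lives in the C++ mkldnn path):

```python
import torch

t = torch.as_strided(torch.randn(1280), (1, 1280), (0, 1))
print(t.is_contiguous())  # True: size-1 dims don't constrain contiguity
print(t.stride())         # (0, 1) rather than the default (1280, 1)

# restride to the default contiguous layout, as the fix does before calling oneDNN
fixed = t.as_strided(t.shape, torch.empty(t.shape).stride())
print(fixed.stride())     # (1280, 1)
```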

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114939
Approved by: https://github.com/jgong5
2023-12-01 23:30:07 +00:00
e666159e2f Fix lint in group_batch_fusion.py (#114993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114993
Approved by: https://github.com/janeyx99
2023-12-01 23:17:12 +00:00
c546ca9f80 AOTAutograd: support mutations on buffers that happen during the bw (#114953)
Re-land of https://github.com/pytorch/pytorch/pull/112906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114953
Approved by: https://github.com/zou3519, https://github.com/drisspg
2023-12-01 23:09:37 +00:00
a85df9eb0b Assert that output could only be the last node of the FX graph (#114973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114973
Approved by: https://github.com/Chillee
2023-12-01 23:04:19 +00:00
3c78ea4c9d [DDP][Compile] Test to Ensure torch.compile works w/static_graph=True (#114621)
Resolves https://github.com/pytorch/pytorch/issues/93672. This was
actually fixed by https://github.com/pytorch/pytorch/pull/103487 but I didn't
realize that PR also fixes torch compile at the time.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114621
Approved by: https://github.com/wconstab
2023-12-01 22:18:45 +00:00
6e495eef60 [tgif] allow preserving non-forward methods during deepcopy (#114849)
Summary:
bypass-github-export-checks
force-merge-on-github

Reviewed By: sayitmemory

Differential Revision: D51629520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114849
Approved by: https://github.com/houseroad
2023-12-01 21:51:05 +00:00
4ee80fd7f4 [dynamo] Support UNPACK_SEQUENCE nn.ModuleList (#114959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114959
Approved by: https://github.com/oulgen, https://github.com/yanboliang
2023-12-01 21:42:23 +00:00
68a8d74f3f [inductor] benchmark epilogue fused matmul template (#114809)
We want to be able to benchmark epilogue-fused triton matmul kernels for a couple of reasons:
1. @eellison found that certain TB models (resnet50, resnet152, moco) sometimes fail in max-autotune mode on the dashboard. The issue is quite hard to repro due to flakiness, and it only gets triggered when a certain triton config for a certain epilogue-fused kernel gets picked (disabling epilogue fusion bypasses the issue). It would be nice to have a runnable script that directly runs that kernel to ease further debugging.
2. This is a necessary piece for doing benchmark fusion for triton matmul kernels. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler for this

Example runnable kernel script: https://gist.github.com/shunting314/00bdbc1b6b46bfa73d1389d8f40cd669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114809
Approved by: https://github.com/eellison
2023-12-01 21:05:01 +00:00
8a51845b38 [C10D] Add filename to dump finished log (#114957)
Just shows you where to look.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114957
Approved by: https://github.com/fduwjj
2023-12-01 20:38:02 +00:00
9cc040fef6 Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880)
Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear.  They were introduced in #114077

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
2023-12-01 20:08:23 +00:00
1bcefaf575 [inductor] post_grad batched linear fusion (#112504)
Summary: Fusing independent nn.Linear() functions with aten.bmm and aten.cat.
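
Schematically, the rewrite does something like this (shapes illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 16)
weights = [torch.randn(32, 16) for _ in range(3)]

# three independent aten::mm calls
outs = [F.linear(x, w) for w in weights]

# fused form: one aten::bmm over stacked weights
wb = torch.stack(weights)                   # (3, 32, 16)
xb = x.unsqueeze(0).expand(3, -1, -1)       # (3, 8, 16)
fused = torch.bmm(xb, wb.transpose(1, 2))   # (3, 8, 32)
assert all(torch.allclose(o, f, atol=1e-5) for o, f in zip(outs, fused))
```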

Test Plan:
Without the BMM fusion:
```
buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 0
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/torchbench_test_module_20231030_072536_6535183793.json.gz&bucket=pyper_traces

100 aten::mm operators

With the BMM fusion:
```
buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 1
```

20 aten::bmm operators

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/torchbench_test_module_20231030_072157_6535183793.json.gz&bucket=pyper_traces

Passes accuracy test:
```
$ buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 1 --accuracy
Running eval method from test_module on cuda in dynamo inductor mode with input batch size 4 and precision tf32.
Accuracy:                            pass
```
Looks like the bmm and the input cat have been fused successfully.

Checking the triton codegen:

```
TORCH_LOGS=+dynamo,+aot,+inductor buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 1 --dump_triton 1
```

Triton code dump: https://www.internalfb.com/intern/everpaste/?handle=GHp1ABaqYuTjYCUBALiTWmteaI1PbsIXAAAB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112504
Approved by: https://github.com/yanboliang
2023-12-01 19:26:29 +00:00
f073dcd4f7 Stateful Checkpointing for Distributed [1/N] (#113867)
First pass at adding a save/load API, as well as a definition of Stateful objects.

Amongst a couple of TODOs, we still need to explore adding an `all_gather` & potentially a `barrier` while iterating through state keys.
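
The Stateful contract amounts to this protocol (a sketch; the real definition lives under torch.distributed.checkpoint):

```python
from typing import Any, Dict, Protocol, runtime_checkable

@runtime_checkable
class Stateful(Protocol):
    def state_dict(self) -> Dict[str, Any]: ...
    def load_state_dict(self, state_dict: Dict[str, Any]) -> None: ...
```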

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113867
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-01 19:21:03 +00:00
6f32eb7eef Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-12-01 18:56:09 +00:00
c6e975bc0e Revert "[Quant] [PT2] Enable batchnorm in _move_exported_model_to_eval (#114547)"
This reverts commit bab054063c7fd6c4b3b8d55a932f2e7fa0a057bb.

Reverted https://github.com/pytorch/pytorch/pull/114547 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114547#issuecomment-1836612143))
2023-12-01 18:52:51 +00:00
afbaa0c165 Update oneDNN submodule to v3.3.2 (#112700)
Update oneDNN submodule to v3.3.2.
Add a macro to check the version of `third_party/ideep`.
Since we have versioning now, the changes won't break any pipeline even if `third_party/ideep` is not updated at the same time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112700
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2023-12-01 18:40:07 +00:00
93b1e47586 [inductor][Observability] Add log for Optimus to enable easier debug (#110452)
Summary: The log breaks one of ads-model export flows, and we change the log to debug

Test Plan: see details in D49710166

Differential Revision: D49844303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110452
Approved by: https://github.com/jackiexu1992
2023-12-01 18:25:56 +00:00
32b928e582 Tests have main linter (#114882)
The linter uses libcst to check for a call to run_tests or a raised exception when the test file is run as main, to ensure that all test files either get run in OSS CI or don't run and are expected not to run.

A better option instead of making this into a linter might be to add this code in run_test since there's also a list of blocklisted tests there that needs to be updated when a test file raises an exception.

This is possibly overkill, since run on its own the code takes ~1 minute to run without multiprocessing on all the files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114882
Approved by: https://github.com/kit1980
2023-12-01 17:24:08 +00:00
3fc58a6bbe Revert "Make offsets dynamic by default (#113734)" (#114889)
This reverts commit 7c38b76efec65249e39ae2b8fd8280dfebd1d415.

if a graph has a lot of inputs which are views (with nonzero storage offset), then the check for overlapping tensor views will add a lot of guards (n^2?)

b35ca2cb94/torch/_functorch/_aot_autograd/input_output_analysis.py (L256-L260)

this was causing very slow compilations on an internal model.

Differential Revision: [D51733774](https://our.internmc.facebook.com/intern/diff/D51733774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114889
Approved by: https://github.com/ckluk2, https://github.com/YuqingJ, https://github.com/aaronenyeshi
2023-12-01 16:49:42 +00:00
ec124b90b8 [pytree] hardcode values for none_is_leaf and namespace in C++ pytree (#114858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114858
Approved by: https://github.com/zou3519
2023-12-01 15:01:33 +00:00
5eb36166f8 Fix hard-coded cuda device in ConstructorMoverPass. (#114932)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114932
Approved by: https://github.com/eellison
ghstack dependencies: #114626
2023-12-01 14:23:48 +00:00
833200c54f s390x: fix build (#114508)
Follow up to d18e6b07aa61

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114508
Approved by: https://github.com/huydhn
2023-12-01 14:23:44 +00:00
76362cc9a0 [BE] Do not use AT_ERROR (#114883)
As the latter is just an alias for `TORCH_CHECK(false,)`

Proposed as a suggestion to https://github.com/pytorch/pytorch/pull/110303, but it wasn't noticed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114883
Approved by: https://github.com/atalman
2023-12-01 13:44:17 +00:00
d90d67a146 Added a check to prevent accessing blocksize during Tensor.to_sparse conversion if empty (#114905)
The main problem was that blocksize is an `optional<ArrayRef>`, so checking for `.has_value()` will be true even if the containing `ArrayRef` is empty.

Fixes #114865.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114905
Approved by: https://github.com/malfet
2023-12-01 12:36:15 +00:00
9e94c951a8 Fix missing meta for proxy.node (#114659)
Hello community,

There are nodes like SDPA whose meta field holds a basic type that extract_val failed to capture. Without this fix, third-party modules plugged into Dynamo that try to analyse node.meta['val'] will fail. See https://github.com/nod-ai/SHARK-Turbine/issues/206

Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114659
Approved by: https://github.com/Chillee
2023-12-01 12:17:23 +00:00
57083542ee Added support for custom pre-grad passes (#113823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113823
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #113913
2023-12-01 12:10:03 +00:00
25b83521be [c10d] Log NCCL trace buffer size (#114926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114926
Approved by: https://github.com/zdevito
ghstack dependencies: #114901
2023-12-01 08:06:10 +00:00
9a075d9a8f Update expected values after #114828 (#114918)
This is failing in trunk 7b3429d97c, updating the value after chatting with @jansel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114918
Approved by: https://github.com/jansel
2023-12-01 07:55:13 +00:00
67562c8cf8 Add DALLE2_pytorch to skips (#114924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114924
Approved by: https://github.com/huydhn
2023-12-01 07:15:59 +00:00
38e1440bae [MPS] Remove redundant topk test and move all pad tests inside a class (#113313)
Summary:
1. The removed `topk` test is essentially very similar to the following test, so I removed it:
```python
def test_topk(self):
        def helper(shape):
            cpu_x = torch.randn(shape, device='cpu', dtype=torch.float, requires_grad=False)
            x = cpu_x.detach().clone().to('mps')
            for largest_val in [True, False]:
                if (type(shape) == tuple):
                    for curr_dim in range(0, len(shape)):
                        dim_size = shape[curr_dim]
                        for k in range(1, dim_size + 1):
                            topk_values, topk_indices = torch.topk(x, k, dim=curr_dim, largest=largest_val)
                            topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=curr_dim, largest=largest_val)
                            self.assertEqual(topk_values, topk_values_cpu)
                            self.assertEqual(topk_indices, topk_indices_cpu)
                else:
                    for k in range(1, shape):
                        topk_values, topk_indices = torch.topk(x, k, dim=0, largest=largest_val)
                        topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=0, largest=largest_val)
                        self.assertEqual(topk_values, topk_values_cpu)
                        self.assertEqual(topk_indices, topk_indices_cpu)

        helper(2)
        helper((5, 1))
        helper((1, 5))
        helper((5, 9, 7, 4))
        helper((50, 20, 7, 4))
```
297c26bb8e/test/test_mps.py (L8054-L8091)

2. Move all pad tests to one standalone class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113313
Approved by: https://github.com/kulinseth
ghstack dependencies: #113312
2023-12-01 06:52:07 +00:00
88a659e752 [MPS] Move non-nll loss tests outside TestNLLLoss (#113312)
The diff looks messy, but this PR essentially does one thing: move the non-nll loss tests in the `TestNLLLoss` class to the `TestMPS` class. After doing so, we end up having two stack tests with the same name `test_stack`; therefore, I rename one of them to `test_stack_storage_offset`, which is what the test actually does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113312
Approved by: https://github.com/kulinseth
2023-12-01 06:52:07 +00:00
4875e4d63f [tp] delete dead code (#114731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114731
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-01 06:35:42 +00:00
1b27eae65e [MPS] Fix out-of-bounds fill to sliced tensor (#114838)
This fixes a regression introduced by https://github.com/pytorch/pytorch/pull/81951 that caused an out-of-bounds access when a sliced tensor is filled with zeros

Remove bogus `TORCH_INTERNAL_ASSERT(length >= offset)` as [NSMakeRange](https://developer.apple.com/documentation/foundation/1417188-nsmakerange?language=objc) arguments are location and length rather than start and end offset.

In `fill_mps_tensor_`:
- Pass `value` argument to `MPSStream::fill`
- Pass `self.nbytes()` rather than `self.storage().nbytes()` as the length of the buffer to fill, as the latter always results in an out-of-bounds write if the offset within the storage is non-zero

Add regression test

Fixes https://github.com/pytorch/pytorch/issues/114692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114838
Approved by: https://github.com/atalman, https://github.com/kulinseth
2023-12-01 06:24:42 +00:00
aa390cec21 [profiler] Fix description to use nelems rather than size (#114735)
We were storing the number of elements in the tensor, rather than the actual bytes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114735
Approved by: https://github.com/aaronenyeshi, https://github.com/yoyoyocmu, https://github.com/kwen2501, https://github.com/fduwjj
2023-12-01 06:21:47 +00:00
373f2060ba fix extending torch native API docs (#114863)
Couldn't think of a better `release notes:` label. Feel free to set a more fitting one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114863
Approved by: https://github.com/mikaylagawarecki
2023-12-01 06:09:35 +00:00
5687285ca5 Skip quantization tests running from BaseTestQuantizePT2EQAT_ConvBn (#114829)
Summary: This is a follow-up from D51428979. These tests should be run only from `TestQuantizePT2EQAT_ConvBn1d` and `TestQuantizePT2EQAT_ConvBn2d`; the base class doesn't have the necessary setup to run them, and they are expected to fail. I previously ignored the failures on D51428979, and these failed tests have been disabled.

Test Plan:
Run an example test there and confirm that two versions from `TestQuantizePT2EQAT_ConvBn1d` and `TestQuantizePT2EQAT_ConvBn2d` are run while the one from `BaseTestQuantizePT2EQAT_ConvBn` is skipped

```
$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --run-disabled 'caffe2/test/quantization:test_quantization - test_qat_conv_bn_fusion_literal_args'
File changed: fbcode//caffe2/test/quantization/pt2e/test_quantize_pt2e_qat.py
↷ Skip: caffe2/test/quantization:test_quantization - test_qat_conv_bn_fusion_literal_args (caffe2.test.quantization.pt2e.test_quantize_pt2e_qat.BaseTestQuantizePT2EQAT_ConvBn) (0.0s)

/data/users/huydo/fbsource/buck-out/v2/gen/fbcode/689edf96bfbb5738/caffe2/test/quantization/__test_quantization__/test_quantization#link-tree/torch/_utils_internal.py:230: NCCL_DEBUG env var is set to None
/data/users/huydo/fbsource/buck-out/v2/gen/fbcode/689edf96bfbb5738/caffe2/test/quantization/__test_quantization__/test_quantization#link-tree/torch/_utils_internal.py:239: NCCL_DEBUG is WARN from /etc/nccl.conf
INFO:2023-11-29 19:20:33 3049620:3049620 CuptiActivityProfiler.cpp:225] CUDA versions. CUPTI: 18; Runtime: 12000; Driver: 12000
/data/users/huydo/fbsource/buck-out/v2/gen/fbcode/689edf96bfbb5738/caffe2/test/quantization/__test_quantization__/test_quantization#link-tree/torch/_utils_internal.py:158: DeprecationWarning: This is a NOOP in python >= 3.7, its just too dangerous with how we write code at facebook. Instead we patch os.fork and multiprocessing which can raise exceptions if a deadlock would happen.
  threadSafeForkRegisterAtFork()
test_qat_conv_bn_fusion_literal_args (caffe2.test.quantization.pt2e.test_quantize_pt2e_qat.BaseTestQuantizePT2EQAT_ConvBn) ... skipped 'Skipping test running from BaseTestQuantizePT2EQAT_ConvBn'

----------------------------------------------------------------------
Ran 1 test in 0.001s

OK (skipped=1)

Skipped: Skipping test running from BaseTestQuantizePT2EQAT_ConvBn

Buck UI: https://www.internalfb.com/buck2/7b70fb33-44cb-4745-92e1-64031bb413b8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924660765251
Network: Up: 12KiB  Down: 0B  (reSessionID-0399f0c3-e671-4770-a41c-75c06ae709d5)
Jobs completed: 11. Time elapsed: 1:07.2s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 1. Build failure 0
```

Differential Revision: D51694959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114829
Approved by: https://github.com/clee2000
2023-12-01 05:13:27 +00:00
d6c0d1b58b [pytree] support collections.deque type for Python pytree (#113256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113256
Approved by: https://github.com/zou3519
ghstack dependencies: #112485, #113255
2023-12-01 05:12:09 +00:00
0019196f1b Refactor move_constructor_to_cuda. (#114626)
Follow-up: #114539

This PR introduces a minor change to the `move_constructor_to_cuda` implementation, while
refactoring the whole pass into a class. Here's a brief summary of the changes:

- Create a new `ConstructorMoverPass`
- Rephrase the condition:

```python
if not isinstance(
    node.target, torch._ops.OpOverload
) or node.target.namespace not in ("prims", "aten"):
    ...

if not (
    isinstance(node.target, torch._ops.OpOverload)
    and node.target.namespace in ("prims", "aten")
):
    ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114626
Approved by: https://github.com/eellison
2023-12-01 05:09:29 +00:00
9267ab9032 [executorch hash update] update the pinned executorch hash (#114915)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114915
Approved by: https://github.com/pytorchbot
2023-12-01 04:32:35 +00:00
ab5385fc50 [Dynamo][6.3/N] Further cleanup torch.py (#114669)
A follow-up PR to clean up what I found during the refactor of torch.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114669
Approved by: https://github.com/jansel
2023-12-01 04:08:29 +00:00
64fd706b21 [quant][pt2e] Add generate_numeric_debug_handle pass (#114315)
Summary:
This is a util for numeric suite in pt2 export so that we can build
a more streamlined UX for numerical debugging in quant + executorch stack

Test Plan:
python test/test_quantization.py TestGenerateNumericDebugHandle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114315
Approved by: https://github.com/zhxchen17
2023-12-01 03:38:17 +00:00
2dd2fb91d9 [DeviceMesh] Add get_local_rank() API to DeviceMesh (#114709)
As title.

Differential Revision: [D51625152](https://our.internmc.facebook.com/intern/diff/D51625152/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114709
Approved by: https://github.com/wanchaol, https://github.com/fegin
ghstack dependencies: #114708
2023-12-01 03:28:55 +00:00
fb325bbd46 Move class definition of DebugInfoWriter to TraceUtil as well (#114901)
Since we moved the implementation of the class to TraceUtils in https://github.com/pytorch/pytorch/pull/114367, we want to move the class definition there as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114901
Approved by: https://github.com/XilunWu
2023-12-01 03:28:16 +00:00
2a2f74727a [dynamo, test] add test for backend registration API (#114908)
Add tests for backend registration API, per https://github.com/pytorch/pytorch/pull/114820.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114908
Approved by: https://github.com/eellison
ghstack dependencies: #114820
2023-12-01 03:10:56 +00:00
033f98b7e0 Remove confusing warning message from SDPA about mask alignment (#114909)
# Summary
Users have reported that this warning message leads to confusion about the correctness of the mask even though it is only concerned with performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114909
Approved by: https://github.com/Chillee
2023-12-01 03:02:20 +00:00
235eaabfed [inductor][easy] print out exception message upon failing to write to a file (#114836)
To address Oleg's internal review feedback.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114836
Approved by: https://github.com/khabinov
2023-12-01 02:40:43 +00:00
1aa54bdebf [ONNX] Fix op level debug on complex dtype support (#114885)
Prior to this PR, op-level debug reported mismatches whenever complex dtypes were involved, because ONNX supports the real representation. This PR makes sure we use the real representation to compare the results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114885
Approved by: https://github.com/BowenBao
2023-12-01 02:17:27 +00:00
1d95644740 [Execution Trace] record root rank for broadcast/gather/reduce/scatter (#113828)
Summary:
Collectives like broadcast/gather/reduce/scatter need root-rank info in order to be replayed in PARAM benchmarks. Log the root rank instead of the local rank in RECORD_PARAM_COMMS_DATA.

Reference: distributed/c10d/Types.hpp

Test Plan: Tested in HPC

Differential Revision: D51381196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113828
Approved by: https://github.com/fduwjj
2023-12-01 01:28:49 +00:00
6cba8b584d [Dynamo] Support torch.cuda.amp.custom_fwd/custom_bwd by inlining (#114891)
Fixes #114693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114891
Approved by: https://github.com/zou3519
2023-12-01 01:23:51 +00:00
7f40640342 [Dynamo] Support torch.amp.autocast as decorator (#114845)
Fixes #114818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114845
Approved by: https://github.com/jansel
2023-11-30 23:54:57 +00:00
ad09d81694 Allow functionalization to work with optional mutable (#114803)
Summary: Added functionalization support for optional mutable arguments.

Test Plan: CI tests.

Reviewed By: zou3519

Differential Revision: D51209981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114803
Approved by: https://github.com/zou3519
2023-11-30 23:48:03 +00:00
7b3e45be59 [DeviceMesh] Rename get_dim_groups to get_group (#114708)
Rename get_dim_groups to get_group and update all callsites.

Differential Revision: [D51629801](https://our.internmc.facebook.com/intern/diff/D51629801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114708
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin
2023-11-30 23:40:14 +00:00
38ae17d166 [dynamo, docs] update dynamo backend registration docs (#114820)
Update docs to reflect current backend registration API. Add `lookup_backend` to root `dynamo` module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114820
Approved by: https://github.com/eellison
2023-11-30 21:41:05 +00:00
1f845d5898 [CI] Fix a REQUIRE_HIGHER_TOLERANCE comparison bug (#114870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114870
Approved by: https://github.com/jansel
2023-11-30 21:11:15 +00:00
386b9c2adc build small pip wheels for CUDA 11.8 (#114620)
As discussed, we would like to start building all wheels using the CUDA PyPI dependencies.
Adding the "small wheel" workflow for CUDA 11.8 as it's already used for 12.1U1.

CC @malfet @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114620
Approved by: https://github.com/atalman, https://github.com/malfet
2023-11-30 20:50:31 +00:00
2ab2e8e1c0 [pytree] support collections.defaultdict type for Python pytree (#113255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113255
Approved by: https://github.com/zou3519
ghstack dependencies: #112485
2023-11-30 20:46:25 +00:00
baeb0705fe [ONNX][Bench] Add warmup for onnx cuda runs (#114821)
Improves perf-measurement accuracy, especially for low-iteration runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114821
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #112179, #114767
2023-11-30 20:41:44 +00:00
c867fddab5 [inductor] Fix in CppPrinter._print_Pow (#114872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114872
Approved by: https://github.com/lezcano
2023-11-30 20:21:44 +00:00
81adbb6131 Sort the output of TORCH_LOGS=help (#114657)
Previously the order was random because it was based on the order of dictionary keys.
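A sketch of the fix pattern (illustrative only, not the actual torch._logging code):

```python
# Before: iterating a dict in insertion/hash order made the help text
# nondeterministic. After: sort the registered names before printing.
registry = {"inductor": "...", "dynamo": "...", "aot": "..."}
for name in sorted(registry):
    print(f"{name}: {registry[name]}")
```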

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114657
Approved by: https://github.com/lezcano
2023-11-30 20:13:51 +00:00
b35ca2cb94 Better error message for misconfigured torchbench model (#114827)
```
  File "/home/jansel/pytorch/./benchmarks/dynamo/torchbench.py", line 381, in load_model
    benchmark_cls.name = model_name
AttributeError: 'NoneType' object has no attribute 'name
```
becomes
```
  File "/home/jansel/pytorch/./benchmarks/dynamo/torchbench.py", line 381, in load_model
    raise NotImplementedError(f"{model_name}.Model is None")
NotImplementedError: torchrec_dlrm.Model is None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114827
Approved by: https://github.com/xuzhao9, https://github.com/yanboliang
2023-11-30 19:11:01 +00:00
57e482010a Fix build-deps in benchmarks/dynamo/Makefile (#114815)
This works around an error caused by a missing git python-versioning dependency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114815
Approved by: https://github.com/yanboliang
2023-11-30 19:10:56 +00:00
7b3429d97c Fix error with int+SymBool (#114828)
Fixes #104797

```
  File "/home/jansel/pytorch/torch/_dynamo/utils.py", line 1486, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/home/jansel/pytorch/torch/_dynamo/utils.py", line 1591, in run_node
    raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
  File "/home/jansel/pytorch/torch/_dynamo/utils.py", line 1570, in run_node
    return node.target(*args, **kwargs)
  File "/home/jansel/conda/envs/pytorch/lib/python3.10/site-packages/einops/packing.py", line 153, in unpack
    n_unknown_composed_axes = sum(x == -1 for x in lengths_of_composed_axes)
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function unpack at 0x7f644b962710>(*(FakeTensor(..., device='cuda:0', size=(1, s0*s1, 128)), [(s0, s1)], 'b * c'), **{}):
unsupported operand type(s) for +: 'int' and 'SymBool'
```
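A minimal illustration of the failure mode (the actual fix teaches dynamo to handle `int + SymBool`; the cast shown below was only a user-level workaround):

```python
lengths_of_composed_axes = [3, -1, 5]

# Eagerly this is fine: bool is a subclass of int, so sum() works.
n_unknown = sum(x == -1 for x in lengths_of_composed_axes)

# Under torch.compile with dynamic shapes, the lengths can be SymInts, so
# `x == -1` yields a SymBool, and `int + SymBool` used to raise. An explicit
# cast sidestepped the error before this fix:
n_unknown = sum(int(x == -1) for x in lengths_of_composed_axes)
```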

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114828
Approved by: https://github.com/lezcano
2023-11-30 18:30:36 +00:00
2a3d8e50fb [pytree] test aligned API signature for C++ and Python pytree (#112485)
Add tests to ensure the C++ and Python pytree provide the same APIs with identical signatures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112485
Approved by: https://github.com/zou3519
2023-11-30 17:50:06 +00:00
e6b3a8ce5f [export] Refactor export() and separate the non-strict part. (#114697)
Summary: Refactor torch.export to separate the strict and non-strict parts, adding an option to torch.export called `strict=True`.
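A minimal sketch of the new flag (the module and inputs are illustrative):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep_strict = torch.export.export(M(), (torch.randn(2),), strict=True)
ep_nonstrict = torch.export.export(M(), (torch.randn(2),), strict=False)
```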

Test Plan: buck2 test mode/opt caffe2/test:test_export -- -r non_strict

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114697
Approved by: https://github.com/ydwu4, https://github.com/tugsbayasgalan
2023-11-30 16:47:50 +00:00
e3c42d3fb3 Inductor cpp wrapper: fix buffer free in non-AOT mode (#114741)
We found a performance regression when using the cpp wrapper in non-AOT mode due to the change in https://github.com/pytorch/pytorch/pull/110892.
https://github.com/pytorch/pytorch/pull/110892 only handles the buffer cache in AOT mode but removes the `reset` call without checking whether AOT mode is on or off. This PR updates the buffer free change to only happen when `V.graph.aot_mode is True`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114741
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-11-30 16:46:55 +00:00
f93ea14309 [dynamo] Added support for math ops on ints with dynamic shapes (#114507)
Fixes #114218

```
import math
import torch

def func(x, a):
    b = math.floor(a + 0.5)
    b = math.radians(a) + b
    y = x + b
    return y

cfunc = torch.compile(func, dynamic=True, fullgraph=True, backend="eager")
x = torch.tensor([0, 1, 2, 3], dtype=torch.float32)
a = 12

out = cfunc(x, a)
```

```
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] TRACED GRAPH
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  ===== __compiled_fn_0 =====
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_a_ : torch.SymInt, s1 : torch.SymInt, L_x_ : torch.Tensor):
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_a_ = L_a_
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:7, code: b = math.floor(a + 0.5)
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add = l_a_ + 0.5
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         floor = math_floor(add);  add = None
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: /pytorch/torch/_dynamo/polyfill.py:28, code: return math.pi / 180.0 * x
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         mul = 0.017453292519943295 * l_a_;  l_a_ = None
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:9, code: b = math.radians(a) + b
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_1 = mul + floor;  mul = floor = None
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:13, code: y = x + b
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         y = l_x_ + add_1;  l_x_ = add_1 = None
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (y,)
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-29 18:10:08,385] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114507
Approved by: https://github.com/lezcano
2023-11-30 14:11:57 +00:00
69f112d586 Call triton bsr_dense_mm/bsr_dense_addmm kernels on mm/addmm float32 inputs when appropriate (#114757)
As in the title.

In addition, this PR fixes a bug in the `bsr_dense_mm` and `bsr_dense_addmm` return-value handling: computations are performed on the `make_triton_contiguous` return value, while `bsr_dense_mm`/`bsr_dense_addmm` return a tensor that is an input to `make_triton_contiguous`. If `make_triton_contiguous` makes a copy of the input, the return values of `bsr_dense_mm`/`bsr_dense_addmm` will contain garbage.

The PR increases the performance of nn.linear as follows (float32, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 67/78 %
- with 32x32 blocks, the average/maximal speed up is 72/79 %
- with 64x64 blocks, the average/maximal speed up is 71/79 %
- with 128x128 blocks, the average/maximal speed up is 62/76 %

The performance increase is illustrated also by the following sparsity-speedup graphs (before and after this PR):
<img src="https://github.com/pytorch/pytorch/assets/402156/55ce0bf7-8ef2-47ab-99e8-8878f159037d" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/df256175-a594-4bd7-b244-90867fb9a45e" width="48%">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114757
Approved by: https://github.com/cpuhrsch
2023-11-30 13:38:07 +00:00
d4128b164d Fix nn.utils.parametrizations.weight_norm for BFloat16 (#114785)
Fixes #107914.
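A minimal repro-style sketch of the now-working path (the layer and shapes are illustrative):

```python
import torch
from torch.nn.utils.parametrizations import weight_norm

m = weight_norm(torch.nn.Linear(4, 4).to(torch.bfloat16))
out = m(torch.randn(2, 4, dtype=torch.bfloat16))  # previously failed for BFloat16
```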

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114785
Approved by: https://github.com/lezcano
2023-11-30 13:18:47 +00:00
272e38e78b [DeviceMesh] Update DeviceMesh's hash (#114812)
Currently, when we create two DeviceMeshes with the same mesh_tensor, their hashes are the same.

To follow the pattern of `dist.new_group()`, the two DeviceMeshes should be different. Therefore, we add an id field to DeviceMesh creation to distinguish different DeviceMeshes.
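A sketch of the new behavior (import path and setup are illustrative; assumes an initialized 4-rank process group):

```python
from torch.distributed._tensor import DeviceMesh

# Two meshes built from the same mesh tensor are now distinct objects with
# distinct hashes, mirroring dist.new_group() semantics.
mesh_a = DeviceMesh("cuda", [0, 1, 2, 3])
mesh_b = DeviceMesh("cuda", [0, 1, 2, 3])
assert hash(mesh_a) != hash(mesh_b)
```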

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114812
Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu, https://github.com/fegin
2023-11-30 12:14:19 +00:00
db698f733d Update fbgemm_gpu pin (#114847)
Should have been landed together with https://github.com/pytorch/pytorch/pull/101995
Includes de731af65b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114847
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-11-30 09:53:50 +00:00
92cd78b1df [C10D] logging/comment clean ups (#114625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114625
Approved by: https://github.com/fduwjj, https://github.com/XilunWu
ghstack dependencies: #114810
2023-11-30 07:46:32 +00:00
5c3f03e2dd [inductor] add a config to specify the shape attribute for the generated svg graphs (#114811)
We draw our fx graphs with the "record" shape attribute by default.
Sometimes, when the graph is very complex, we may hit dot errors like below:
  "flat edge between adjacent nodes one of which has a record shape -
   replace records with HTML-like labels"
and thus fail to generate a graph. So, let's give the user an option
to specify the shape attribute for the dot graph. For example, passing
INDUCTOR_DOT_GRAPH_SHAPE_SVG = "none" would let us generate HTML-like labels
to work around the above failure.
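For example (treating the name from the commit message as an environment variable, set before inductor draws the graph):

```python
import os

# Switch from the default "record" shape to "none" so dot emits HTML-like labels
os.environ["INDUCTOR_DOT_GRAPH_SHAPE_SVG"] = "none"
```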

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114811
Approved by: https://github.com/weifengpy
2023-11-30 06:10:37 +00:00
e97e2ff445 [CI][MacOS] Cleanup left over local site-packages (#114843)
Once a janitor always a janitor!

Partially addresses https://github.com/pytorch/pytorch/issues/114840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114843
Approved by: https://github.com/yanboliang
2023-11-30 05:37:53 +00:00
8ae3835323 further deprecate PairwiseParallel and SequenceParallel from test (#114402)
**Remaining Issue**
When replacing SequenceParallel, tests would pass even when setting `input_layouts=Replicate()`. Still looking into it...

**Summary**
This is a follow-up PR to #114314.

**Test Plan**
`python test_files.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114402
Approved by: https://github.com/wanchaol
2023-11-30 05:06:08 +00:00
c1e51fcbfc [ONNX][Bench] Relax tolerance for cuda accuracy check (#114767)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114767
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #112179
2023-11-30 04:43:46 +00:00
fd7201029a [Quant] [PT2] Enable Inplace Dropout in _move_exported_model_to_eval (#114725)
**Summary**
Enable Inplace Dropout replacement in `_move_exported_model_to_eval`

**Test Plan**
```
python -u -m pytest -s -v test_quantize_pt2e.py -k test_move_exported_model_to_eval
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114725
Approved by: https://github.com/andrewor14, https://github.com/jgong5
ghstack dependencies: #114547
2023-11-30 04:43:22 +00:00
06eb28c32a [executorch hash update] update the pinned executorch hash (#114814)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114814
Approved by: https://github.com/pytorchbot
2023-11-30 04:35:53 +00:00
bab054063c [Quant] [PT2] Enable batchnorm in _move_exported_model_to_eval (#114547)
**Summary**
Add standalone batchnorm into `_move_exported_model_to_eval` to move it from training mode into eval mode

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_bn_conv2d
python -u -m pytest -s -v test_quantize_pt2e.py -k test_bn_move_exported_model_to_eval
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114547
Approved by: https://github.com/jgong5, https://github.com/andrewor14
2023-11-30 04:31:27 +00:00
4ed9e65038 [C10D] Add time_created_us to flight recorder (#114810)
time_created_us is the cpu-side epoch_time (in usec) when a flight-recorder
event was created. It loosely corresponds to the time the c10d collective
API was called and a work object was created.  It does NOT correspond to
the time the collective started on the GPU.

We follow the precedent of microsecond epoch time from this PR, which added timestamps
to the cuda caching allocator:
https://github.com/pytorch/pytorch/pull/112266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114810
Approved by: https://github.com/zdevito
2023-11-30 04:15:56 +00:00
1f5726708b [PyTorch][ET] Collect Execution Traces in Chakra schema (#114753)
Summary:
Collect execution traces in the Chakra schema

Created a new diff to change email address: D48030418

Test Plan:
```
$ cd ~/fbcode
$ binary_path=$(buck2 build //param_bench/train/compute/python:pytorch_run_benchmark --show-output | tail -1 | awk '{print $2}')
$ cd ~/fbsource
$ $binary_path -c ~/fbcode/param_bench/train/compute/python/examples/pytorch/configs/alex_net.json --et

$ cat ~/is_json.py
import json
import sys

def is_json_file(filename):
    try:
        with open(filename, 'r') as f:
            json.load(f)
        return True
    except Exception as e:
        return False

if len(sys.argv) != 2:
    print("Usage: python check_json.py [filename]")
    sys.exit(1)

filename = sys.argv[1] # get filename from command-line argument
print(is_json_file(filename))

$ python3 ~/is_json.py ~/fbsource/benchmark_result_2244333_1691065899_et.json
True
```

Differential Revision: D51662384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114753
Approved by: https://github.com/aaronenyeshi
2023-11-30 04:07:11 +00:00
3b7d60b6ff Fix keep-going (#112098)
Adds a new function for continue-on-error behavior.

Another solution might be to run the entire suite to the end and use last-failed, but I'm worried about concurrent processes writing to the same last-failed cache entry. It's also a bit different from the usual test-rerunning strategy we use, especially regarding segfaults and other ways the test suite can suddenly end, and there are some cases where the entire test suite should immediately get rerun in a new process (e.g. a CUDA error that causes sync to fail).

Find example logs on commit 2f1510839727f6ef2631040d5f0edde26265015d

TODO: continue on error for --subprocess and test_distributed aren't working fully
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112098
Approved by: https://github.com/huydhn
2023-11-30 04:01:57 +00:00
d5544125a0 [distributed] NCCLflight recorder timeout fix (#114804)
Because isCompleted() returns true on an exception, a timeout exception
will cause the flight recorder to consider the event completed even though it timed out.

This changes the logic to explicitly query the completion events on "retirement"
when the work item leaves the workMetaList. We mark events as retired so
we can distinguish between an event still in the queue but not completed and one
that timed out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114804
Approved by: https://github.com/wconstab
2023-11-30 03:46:48 +00:00
e70a7c3296 [CI] Update torchbench pin (#114694)
Summary: also revert the regressed graph breaks count for DALLE2_pytorch in https://github.com/pytorch/pytorch/pull/114598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114694
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-11-30 03:41:03 +00:00
f1fe0b685c [export] Remove combine_args_kwargs (#114782)
Test Plan: CI

Differential Revision: D51676479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114782
Approved by: https://github.com/zhxchen17
2023-11-30 02:49:21 +00:00
165f4f6ccf [PyTorch] Redirect c10::optional to std::optional (#101995)
We have C++17 now!

I am intentionally dropping the `c10::optional<c10::ArrayRef>` size optimization. It was intended to improve dispatch, but thanks to D34602980 / #70864 we don't use `optional<ArrayRef>` in function arguments anymore anyway.

Differential Revision: [D46079028](https://our.internmc.facebook.com/intern/diff/D46079028/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101995
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/ezyang
2023-11-30 02:46:41 +00:00
013675ff59 Revert "Add decomp for replication_pad2d and use for CUDA deterministic (#111590)"
This reverts commit f1286161a637e9fc0797a22a7b7d90eaa04ddc4f.

Reverted https://github.com/pytorch/pytorch/pull/111590 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing XLA job.  The job is also failing on the PR, but the log classifier failed to find the failed test which lead to it being marked wrongly as flaky ([comment](https://github.com/pytorch/pytorch/pull/111590#issuecomment-1833004794))
2023-11-30 02:28:14 +00:00
9f3ec2ad45 deprecate PairwiseParallel from test (#114314)
**Summary**
To solve issue #113706:
1. replace `PairwiseParallel` with `ColwiseParallel` and `RowwiseParallel` (see the sketch after this list).
2. replace ColwiseParallel's `make_input_replicate_1d`/`make_output_replicate_1d` inputs with `input_layouts` and `output_layouts`.
3. deprecate the tests for `_parallelize_mlp`, as it only supports `PairwiseParallel`.
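A sketch of the replacement pattern (mesh setup omitted; submodule names and layouts are illustrative):

```python
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# assumes an initialized process group, a 1-D device_mesh, and a model with
# submodules "net1"/"net2"
model = parallelize_module(
    model,
    device_mesh,
    {"net1": ColwiseParallel(), "net2": RowwiseParallel()},
)
```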

**Test Plan**
`pytest pytorch/test/distributed/tensor/parallel/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114314
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2023-11-30 02:19:30 +00:00
5262484ece [easy][aotinductor] fix typos & add static typing (#114728)
```
// check all references
$ grep -rl 'cpp_kernel_overlad_name' *
ir.py
```

```
$ lintrunner --take MYPYINDUCTOR torch/_inductor/codegen/wrapper.py torch/_inductor/ir.py
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114728
Approved by: https://github.com/Skylion007, https://github.com/chenyang78
2023-11-30 02:10:56 +00:00
4ba649e207 [FSDP][state_dict] Avoid assigning the root _device_mesh to the children _device_mesh (#114384)
Assigning the root _device_mesh to the children _device_mesh is not correct as each FSDP state can have a different DeviceMesh. We are also replacing fully_shard with a new implementation. So there is no need to worry about the fully_shard behavior.

Differential Revision: [D51507959](https://our.internmc.facebook.com/intern/diff/D51507959/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114384
Approved by: https://github.com/wz337
2023-11-30 02:08:31 +00:00
8cfc95368f [Experimental][ONNX] Export with symbolic shapes in proto (#112179)
Experimental feature to store symbolic shapes produced by torch dynamo inside the exported onnx model.
There is no official ONNX spec to support nodes within FunctionProto to have value info, https://github.com/onnx/onnx/issues/5487. The names for value info are generated uniquely to be retrievable based on the call site and call stack.
This requires onnxscript with https://github.com/microsoft/onnxscript/tree/bowbao/export_symbolic_shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112179
Approved by: https://github.com/titaiwangms, https://github.com/thiagocrepaldi
2023-11-30 02:03:32 +00:00
f0cc6364ed [export] Remove convert_to_cpu flag (#114775)
Test Plan: CI

Differential Revision: D51674158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114775
Approved by: https://github.com/zhxchen17, https://github.com/SherlockNoMad
2023-11-30 01:59:52 +00:00
34ea0a2bdc [Pytorch][Vulkan] Create context for layernorm (#114701)
Summary:
`Layernorm` has two arguments, weight and bias, which are stored as constant tensors on the CPU and transferred to the GPU at every inference call. We create a context for this op to avoid the repeated transfer. Specifically, we
- created `create_layernorm_context` and `run_layernorm_context` in `Layernorm.h` and `Layernorm.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`

Test Plan:
## Numerical test
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (b6ccc956c)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm*"
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.

If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa

  Targets matching .buckconfig buck2.supported_projects:
  {'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}

  To suppress this warning: touch ~/.config/.dont_hint_buck2

Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.packed_layer_norm_2d
[       OK ] VulkanAPITest.packed_layer_norm_2d (342 ms)
[ RUN      ] VulkanAPITest.packed_layer_norm_3d
[       OK ] VulkanAPITest.packed_layer_norm_3d (284 ms)
[ RUN      ] VulkanAPITest.packed_layer_norm_4d
[       OK ] VulkanAPITest.packed_layer_norm_4d (5 ms)
[ RUN      ] VulkanAPITest.layer_norm_invalid_inputs
[       OK ] VulkanAPITest.layer_norm_invalid_inputs (28 ms)
[ RUN      ] VulkanAPITest.layer_norm_2d
[       OK ] VulkanAPITest.layer_norm_2d (1 ms)
[ RUN      ] VulkanAPITest.layer_norm_3d
[       OK ] VulkanAPITest.layer_norm_3d (2 ms)
[ RUN      ] VulkanAPITest.layer_norm_4d
[       OK ] VulkanAPITest.layer_norm_4d (4 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_2d
[       OK ] VulkanAPITest.native_layer_norm_2d (1 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_3d
[       OK ] VulkanAPITest.native_layer_norm_3d (2 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_4d
[       OK ] VulkanAPITest.native_layer_norm_4d (6 ms)
[----------] 10 tests from VulkanAPITest (679 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (679 ms total)
[  PASSED  ] 10 tests.
```
Full test result in P888496077, summary as below
```
[----------] 419 tests from VulkanAPITest (21652 ms total)

[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (21652 ms total)
[  PASSED  ] 418 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```

## Graph representation comparison
We created a model using `layer_norm` and traced it as below
```
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layer_norm = torch.nn.LayerNorm(normalized_shape=10)

    def forward(self, x):
        return self.layer_norm(x)

# Create an instance of the model
model = MyModel()

# Create a dummy input tensor for tracing
input_tensor = torch.randn(1, 10)

# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to Vulkan backend using `optimize_for_mobile`
```
from torch.utils import mobile_optimizer

vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Then we can print the graph of the `vulkan_model` as `print(vk_model.graph)`

- Before this diff
```
  %4 : bool = prim::Constant[value=1](), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
  %5 : float = prim::Constant[value=1.0000000000000001e-05](), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
  %14 : int[] = prim::Constant[value=[10]]()
  %33 : Tensor = aten::to(%x, %53, %30, %31, %31)
  %10 : Tensor = aten::layer_norm(%33, %14, %self.layer_norm.weight, %self.layer_norm.bias, %5, %4), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
```

- after this diff
```
  %14 : int[] = prim::Constant[value=[10]]()
  %47 : Tensor = aten::to(%x, %78, %44, %45, %45)
  %16 : Tensor = vulkan_prepack::run_layernorm_context(%47, %14, %17)
```

Reviewed By: SS-JIA

Differential Revision: D51530478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114701
Approved by: https://github.com/yipjustin
2023-11-30 01:33:50 +00:00
597d3fb86a Add additional guard for index_put fallback for bfloat16 on whether it's accumulating or not (#114788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114788
Approved by: https://github.com/cpuhrsch
2023-11-30 00:33:50 +00:00
80ae00d11a [AOT Refactor] jit compile runtime wrappers (#114564)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Total reduction in lines: 5200 lines -> 1100 lines

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114564
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553, #114554, #114555, #114556, #114557, #114558, #114559, #114561, #114562, #114563
2023-11-30 00:28:57 +00:00
741414b739 [AOT Refactor] dispatch compile graph (#114563)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114563
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553, #114554, #114555, #114556, #114557, #114558, #114559, #114561, #114562
2023-11-30 00:28:43 +00:00
abb84051a3 [AOT Refactor] alias runtime wrappers (#114562)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114562
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553, #114554, #114555, #114556, #114557, #114558, #114559, #114561
2023-11-30 00:24:43 +00:00
4d4093a5de [AOT Refactor] traced function transforms pt. 2 (#114561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114561
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553, #114554, #114555, #114556, #114557, #114558, #114559
2023-11-30 00:24:05 +00:00
dab89d546c [AOT Refactor] traced function transforms pt. 1 (#114559)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Current progress: 5200 lines -> 2400 lines

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114559
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553, #114554, #114555, #114556, #114557, #114558
2023-11-30 00:24:05 +00:00
0f41a0e99d [AOT Refactor] (missed) graph signature to i/o analysis (#114558)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114558
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553, #114554, #114555, #114556, #114557
2023-11-30 00:23:59 +00:00
5ab61c1ae1 [AOT Refactor] runtime wrappers (#114557)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114557
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553, #114554, #114555, #114556
2023-11-30 00:23:52 +00:00
7eafdee4d6 [AOT Refactor] input/output analysis (#114556)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Current progress: 5200 lines -> 3000 lines

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114556
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553, #114554, #114555
2023-11-30 00:21:00 +00:00
7cb2e8387b [AOT Refactor] collect metadata analysis (#114555)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114555
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553, #114554
2023-11-30 00:21:00 +00:00
e9b03ac36d [AOT Refactor] subclass utils (#114554)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114554
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552, #114553
2023-11-30 00:17:57 +00:00
721d99181e [AOT Refactor] schemas (#114553)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Current progress: 5200 lines -> 4200 lines

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114553
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551, #114552
2023-11-30 00:15:28 +00:00
1971eda1db [AOT Refactor] functional utils (#114552)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114552
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550, #114551
2023-11-30 00:12:41 +00:00
850887b0de [executorch hash update] update the pinned executorch hash (#114717)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114717
Approved by: https://github.com/pytorchbot, https://github.com/malfet
2023-11-30 00:08:43 +00:00
ec4b59305b [AOT Refactor] logging utils (#114551)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114551
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114550
2023-11-30 00:06:34 +00:00
41c1090e48 [AOT Refactor] utils (#114550)
---

Part _ of https://github.com/pytorch/pytorch/issues/114548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114550
Approved by: https://github.com/bdhirsh
2023-11-30 00:02:40 +00:00
b5c4b1d9fe Make Float8 types serializable (#114662)
By finally breaking the FC promise on new dtypes: serialize via untyped
storage together with the tensor dtype.

- Add `_rebuild_tensor_v3` that takes an extra dtype argument
- In `Tensor.__reduce_ex__` serialize tensor using untyped storage for
  v3_dtypes (which are at the moment limited to float8 dtypes)

Test plan: `python -c "import torch;x=torch.arange(10).to(dtype=torch.float8_e4m3fn);torch.save(x, 'pt.pt');print(torch.load('pt.pt'))"`

Fixes https://github.com/pytorch/pytorch/issues/114634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114662
Approved by: https://github.com/ngimel
2023-11-29 23:23:23 +00:00
fe7b845c8d [tgif] preserve non-forward method during torch package serialization (#114702)
Reviewed By: terrycsy, sayitmemory

Differential Revision: D51607058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114702
Approved by: https://github.com/houseroad
2023-11-29 22:31:35 +00:00
f1286161a6 Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-11-29 21:50:46 +00:00
0ced55e06c Optimize inspect.stack() call in caffe2/torch/library.py (#114700)
Summary: Same optimization as https://github.com/pytorch/pytorch/pull/105940.

Test Plan:
Wait for tests

Verify that the new code extracts the same module in a simple test case:
```
import inspect
import sys

def inside_frame() -> None:
    frame = inspect.stack()[0]
    print(f"Via inspect.stack(): {inspect.getmodule(frame[0])}, extracted frame = {frame[0]}")

    frame = sys._getframe(0)
    print(f"Via sys._getframe: {inspect.getmodule(frame)}, extracted frame = {frame}")

if __name__ == "__main__":
    inside_frame()
```

Output:
```
[jsd115@devbig1161 /tmp/test]$ python3 ./getmodule.py
Via inspect.stack(): <module '__main__' from './getmodule.py'>, extracted frame = <frame at 0x7fc9db9c4dd0, file './getmodule.py', line 6, code inside_frame>
Via sys._getframe: <module '__main__' from './getmodule.py'>, extracted frame = <frame at 0x7fc9db9c4dd0, file './getmodule.py', line 9, code inside_frame>
```

Differential Revision: D51629733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114700
Approved by: https://github.com/zou3519
2023-11-29 20:54:02 +00:00
acdb278144 [BE]: Enable more ruff PLW checks. Disable one PLR that is preview. (#114759)
Enables a couple more `PLW` checks and disables one that was added that was still in preview mode `PLR6201`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114759
Approved by: https://github.com/jansel
2023-11-29 20:53:26 +00:00
7c1a5012f0 [BE][SparseAdam] cleaner way to verify no sparse params (#114425)
Context:

https://github.com/pytorch/pytorch/pull/47724 fixed the problem that SparseAdam could not handle generators by using the `list(...)` construct. However, this meant that SparseAdam deviated from other optimizers in that it could _accept_ a raw Tensor/Parameter vs requiring a container of them. This is not really a big deal.

So why this PR?

I do think this PR is cleaner. It uses the fact that the Optimizer parent class already containerizes parameters into parameter groups, so we can reuse that here by calling `super().__init__` first and then filtering the param_groups after. This change would also make SparseAdam consistent with the rest of our optimizers in that only containerized params are accepted, which technically is BC-breaking, SO I've added a deprecation warning that we should remove in May 2024.

(But is it really BC breaking when we've said in the docs that params should be an iterable this whole time? Maybe this is just a bug fix....😛)
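A sketch of the now-expected call pattern (the parameter is illustrative):

```python
import torch

p = torch.randn(4, 4, requires_grad=True)
opt = torch.optim.SparseAdam([p], lr=1e-3)  # pass a container, like other optimizers
# torch.optim.SparseAdam(p)                 # raw tensor: deprecated by this PR
```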

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114425
Approved by: https://github.com/drisspg
2023-11-29 19:47:03 +00:00
febbc48f43 [DeviceMesh] Make our mesh_dim kwarg naming consistent (#114707)
Changing `size(self, dim: Optional[int] = None)` to `size(self, mesh_dim: Optional[int] = None)` so it is consistent with the rest of our APIs.

We also update this API usage change in both PT and internal (pyper, APS).
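Usage after the rename (sketch; mesh construction omitted):

```python
# assuming `mesh` is a DeviceMesh
world = mesh.size()            # total number of ranks in the mesh
dim0 = mesh.size(mesh_dim=0)   # number of ranks along mesh dim 0
```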

Differential Revision: [D51602986](https://our.internmc.facebook.com/intern/diff/D51602986/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114707
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin
2023-11-29 19:43:23 +00:00
d197f5c72b Remove unused call to inspect.stack() in torch/_custom_op/impl.py (#114698)
Summary: Fetching the stack isn't free and this variable isn't used. Let's not do the work.

Test Plan: Wait for tests

Differential Revision: D51629732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114698
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2023-11-29 19:33:52 +00:00
a9d5133207 [ez][doc] Fix sample code in onnx_dynamo.rst (#114770)
By adding `import torch.nn as nn`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114770
Approved by: https://github.com/atalman, https://github.com/thiagocrepaldi
2023-11-29 19:27:52 +00:00
ffa974b940 [CI] Dump more detailed error msg in PT2 integration tests (#114683)
Summary: Sometimes a PT2 CI test shows as both pass and infra_error, e.g. https://github.com/pytorch/pytorch/actions/runs/7015184949/job/19086433407. Add more logging to investigate what has happened.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114683
Approved by: https://github.com/eellison
2023-11-29 18:44:23 +00:00
e38a3a6079 Revert "[dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)"
This reverts commit 3f574eadb4d8a4c9cf9eb2fcd91a2944f3555886.

Reverted https://github.com/pytorch/pytorch/pull/114154 on behalf of https://github.com/clee2000 due to reverted internally, broke internal builds, not sure why bot isn't working ([comment](https://github.com/pytorch/pytorch/pull/114154#issuecomment-1832496040))
2023-11-29 18:43:17 +00:00
83c0763dda [CI] Use linux.12xlarge for cpu_inductor integration tests (#114729)
Summary: use linux.12xlarge for larger memory to avoid OOM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114729
Approved by: https://github.com/huydhn
2023-11-29 18:39:53 +00:00
c1f7d4ad6a [Inductor][fx pass] Refactor code to easily add pointwise op to do the batch fusion (#113381)
Summary:
1. We refactor the code to have a unified API for adding pointwise ops

2. Add one more op, sigmoid, since we observed it in MC models

Test Plan:
# local reproduce for CMF

```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch -c
```
P876977403
P876996776

diffing: https://www.internalfb.com/intern/diffing/?paste_number=876999623

Differential Revision: D51142990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113381
Approved by: https://github.com/xuzhao9
2023-11-29 18:29:57 +00:00
ba4285bd9e Deprecate primTorch module, replace it with decompositions in module Owners (#114754)
Context: pt2 oncall is revamping its labeling system. One of the guidelines is to remove duplicate labeling in our system. Both the primTorch and decomposition labels refer to the same thing. primTorch was the legacy name (and we no longer have a primTorch project), so using decomposition as the label name makes more sense.

Right now, the only open issues that use "module: primTorch" are the ones generated by the DISABLED bots. Once we replace the label in the bot, we can safely remove the primTorch label.

Here is an example of an issue that has the primTorch label:
https://github.com/pytorch/pytorch/issues/112719

Torchbot uses following logic to auto extract module owners:
https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/flaky-tests/disable.ts#L391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114754
Approved by: https://github.com/huydhn
2023-11-29 18:27:20 +00:00
b6df841460 Fixed an issue where a user-specified default device clashed with the device placement of the RNG (#114560)
This PR now ignores the user-specified default device, allocates the tensor on the CPU, and then moves the tensor to the device of the input tensor. This was more or less already the standard procedure in case the default device wasn't set.

Fixes #114536.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114560
Approved by: https://github.com/soulitzer
2023-11-29 17:45:49 +00:00
b20330ef81 [CI] Test PyTorch on M1 using OpenMP (#114738)
Baby step towards https://github.com/pytorch/pytorch/issues/114721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114738
Approved by: https://github.com/DanilBaibak, https://github.com/atalman
2023-11-29 17:41:35 +00:00
e891a3bba9 [releng] Add release 2.2 to Release Compatibility Matrix for PyTorch releases (#114758)
Update RELEASE.md for release 2.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114758
Approved by: https://github.com/DanilBaibak
2023-11-29 16:27:59 +00:00
4a4c9fb0b8 [ROCm] Add ROCm AMDGPU support for inductor cpp codegen (#105141)
Follows from previous enablement attempt: https://github.com/pytorch/pytorch/pull/101797

Adds support for hsaco binaries in inductor's cpp_wrapper codegen and enables the CUDA tests in test_cpp_wrapper.

This PR also brings in additional required hipify mappings for the wrapper codegen file.

NOTE: we can unskip some of these tests once MI210 runners are enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105141
Approved by: https://github.com/jansel, https://github.com/malfet
2023-11-29 15:11:24 +00:00
a3bbf9ce3e [BE][RelEng] Remove dynamo extra (#114720)
As all dynamo dependencies are part of the default requirements, see
```
% curl -s https://pypi.org/pypi/torch/2.1.1/json | jq '.info.requires_dist'
[
  "filelock",
  "typing-extensions",
  "sympy",
  "networkx",
  "jinja2",
  "fsspec",
  "nvidia-cuda-nvrtc-cu12 (==12.1.105) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cuda-runtime-cu12 (==12.1.105) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cuda-cupti-cu12 (==12.1.105) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cudnn-cu12 (==8.9.2.26) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cublas-cu12 (==12.1.3.1) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cufft-cu12 (==11.0.2.54) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-curand-cu12 (==10.3.2.106) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cusolver-cu12 (==11.4.5.107) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cusparse-cu12 (==12.1.0.106) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-nccl-cu12 (==2.18.1) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-nvtx-cu12 (==12.1.105) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "triton (==2.1.0) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "jinja2 ; extra == 'dynamo'",
  "opt-einsum (>=3.3) ; extra == 'opt-einsum'"
]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114720
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-11-29 15:08:27 +00:00
b6a30bbfb6 [Dynamo] Forward fix dynamo trace rule test failure due to landing race (#114739)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114739
Approved by: https://github.com/janeyx99, https://github.com/huydhn
2023-11-29 09:31:12 +00:00
d2f4215dbb [quant][pt2e] Fix the order for implicit sharing code (#114704)
Summary:
Current order of implicit sharing breaks common annotation patterns of SharedQuantizationSpec, so we changed the order here.
But it's not going to work in all possible annotation cases, so quantizer implementors still need to be careful.
In general, if people only refer to nodes/edges that come before the current node/edge in SharedQuantizationSpec, it should work, I think.

Test Plan: CI; verified this fixes some internal tests

Differential Revision: D51605918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114704
Approved by: https://github.com/andrewor14
2023-11-29 08:58:28 +00:00
7692595834 Use different conv layout optimization heuristics for inference (#114600)
While many models regress in training when converted to channels last, in inference the results are quite different. Almost all of the models experienced a speedup when converted to channels last. There were a few big regressions in torchbench - `timm_regnet` from `1.4343 → 1.0573` and `timm_resnet` from `1.7484 → 1.2868`.

 I used a modified version of the operator benchmark script [here](https://gist.github.com/eellison/e11dc645412f52e8b45fb26ba6f9f6a1) to measure the average speedup of convolutions across all of the input shapes found in torchbench, according to the existing classifications that @shunting314 used: grouped convs, small-channel convs, and convolutions with more in-channels than out-channels. Only grouped convolutions benchmarked as a slowdown in inference.

I updated the inference heuristic to multiply the flops of each conv with its predicted speedup/slowdown in channels last. With this heuristic the two previously regressing models no longer regress.
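A hypothetical sketch of that weighted decision (the names and structure below are invented for illustration; `predicted_speedup` stands in for the benchmarked per-shape-class channels-last multiplier, where >1 means faster):

```python
def prefer_channels_last(convs, predicted_speedup):
    # weight each conv's flops by its benchmarked channels-last speedup
    weighted = sum(c.flops * predicted_speedup(c) for c in convs)
    return weighted > sum(c.flops for c in convs)
```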

Speeds up inference for torchbench ~8% and timm ~6%. The motivating model here was SDXL which now hits channels last and improves 10%.

There were some models that were sped up in training when forcing channels last (along with a number of regressions). It's possible there is some speedup in training to be had with additional heuristics. We could also have more granular classification/predictions which might benefit both training and inference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114600
Approved by: https://github.com/jansel, https://github.com/shunting314
2023-11-29 07:53:59 +00:00
cyy
4e38178bb8 [Reland] [1/N] Fixes clang-tidy warnings in header files (#114668)
Reland of #113608 after fixing the problematic parts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114668
Approved by: https://github.com/huydhn
2023-11-29 07:11:51 +00:00
c10893654e [export] Fix run_decomps to work with fake mode (#114714)
Fixes https://github.com/pytorch/pytorch/issues/114711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114714
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2023-11-29 06:52:13 +00:00
a076a74f11 [Nested Tensor] Add xpu device in assertion for nested tensor creation (#114664)
Add xpu device checking in nested tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114664
Approved by: https://github.com/jgong5, https://github.com/xunnanxu
2023-11-29 05:59:35 +00:00
69c4819f53 Add bsr_dense_addmm triton kernel (#114595)
As in the title.

The `bsr_dense_addmm` kernel implemented in this PR is a generalization of `bsr_dense_mm` in the following respects (in addition to having input, beta, and alpha parameters; see the semantics sketch below):
- it implements a `SPLIT_N` kernel parameter that enables efficient kernel launches in the case of wide inputs. For instance, the timing of nn.linear with 256x256 BSR weights having 16x16 blocks and 256x131072 strided input was reduced about 16x (this corresponds to the 94 % speed up value listed below)
- it supports rectangular blocks in sparse BSR tensor weights
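Semantics sketch of the new kernel, expressed via a dense reference (shapes and blocksize are illustrative):

```python
import torch

# bsr_dense_addmm(input, bsr, dense, beta=b, alpha=a) computes
#     b * input + a * (bsr @ dense)
# where `bsr` is a sparse BSR tensor. Dense-equivalent reference:
inp = torch.randn(16, 16)
bsr = torch.randn(16, 16).to_sparse_bsr(blocksize=(16, 16))
dense = torch.randn(16, 16)
ref = torch.addmm(inp, bsr.to_dense(), dense, beta=1.0, alpha=1.0)
```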

The performance increase of nn.linear is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is  55/94 %
- with 32x32 blocks, the average/maximal speed up is  33/63 %
- with 64x64 blocks, the average/maximal speed up is  23/42 %
- with 128x128 blocks, the average/maximal speed up is  15/39 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114595
Approved by: https://github.com/cpuhrsch
2023-11-29 05:29:25 +00:00
57a5a687b0 [Dynamo][6.2/N] Dump the in graph function list(~2600 ops) and add unit tests. (#114196)
This is the second PR according to https://github.com/pytorch/pytorch/pull/113009#issuecomment-1804417925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114196
Approved by: https://github.com/jansel
2023-11-29 05:09:48 +00:00
05f071d922 [export] Fix state dict device serialization (#114695)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/114000
Will check with SherlockNoMad on why we need to convert to cpu after his PTO

Test Plan: CI

Differential Revision: D51629068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114695
Approved by: https://github.com/ydwu4
2023-11-29 05:05:22 +00:00
7c8d3639cf Revert "[fx] log the node when it's get eliminated (#112684)"
This reverts commit 6256d3710e18f08af8588d1aae88c758bd9c6b30.

Reverted https://github.com/pytorch/pytorch/pull/112684 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112684#issuecomment-1831198778))
2023-11-29 04:31:15 +00:00
64ccdd4afb AOTAutograd: keep input mutations in the graph if they are under no_grad, even if they require_grad (#114646)
Quick recap of events:

(1) https://github.com/pytorch/pytorch/pull/111347, which fixed a perf regression in 2.1 compared to 2.0, introduced a correctness problem around input mutations on inputs that require grad that show up in an inference-only graph (the specific case where this can happen is rare and nobody reported the issue, but it was fixed a few weeks later)

(2) That fix happened here: https://github.com/pytorch/pytorch/pull/113584, which makes sure to keep input mutations outside of the graph, so the autograd engine can set metadata properly on them

That in turn caused a slight regression compared to (1), which is what this PR attempts to fix. In particular, for code like the below it is safe to keep the mutations in the graph:

```
@torch.compile
def f(x):
    x.mul_(2)

x = torch.ones(2, requires_grad=True).clone()
# x requires_grad, so the input mutation will change some autograd metadata, like the version counter
# However, the mutation is under no_grad, so we don't have to worry about e.g. aliases of x having their .grad_fn fields changed
with torch.no_grad():
    f(x)
```

This particular case is pretty important to the shampoo optimizer code, which is run under `torch.compile`, and mutates parameters (which require grad).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114646
Approved by: https://github.com/zou3519
2023-11-29 04:29:32 +00:00
ce00c8fb45 [PyTorch] Remove hardcoded device=cuda in test_aot_inductor (#112797)
All the other tests use self.device, so this seems like an oversight? Cost me a lot of time debugging the minimal arrayref interface, which is only intended for CPU.

Differential Revision: [D50949928](https://our.internmc.facebook.com/intern/diff/D50949928/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112797
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
ghstack dependencies: #113997
2023-11-29 03:12:33 +00:00
5b9add666f [PyTorch] AOTI: Emit CACHED_TORCH_TYPE only as needed (#113997)
Avoids potential compatibility issues where a new dtype is supported by the DSO but not the binary loading it.

Differential Revision: [D51434335](https://our.internmc.facebook.com/intern/diff/D51434335/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113997
Approved by: https://github.com/int3
2023-11-29 03:12:32 +00:00
73a661abf1 Stop using excess memory in generate_opcheck_tests, re-enable fbgemm TBE tests (#114641)
Summary:
1. We stop using excess memory in generate_opcheck_tests. This is safe because
   all the individual test utils already ensure that they do not modify the
   inputs.
2. We re-enable the fbgemm TBE tests (see internal diff, but all of this is open
   source). They were previously removed because they OOM'ed when run serially;
   (1) and (3) cut down the memory usage to ~20gb peak.
3. I needed to skip some newly failing generated tests and also some that had an
   impact on the memory usage.

Test Plan: - run tests

Reviewed By: sryap

Differential Revision: D51601964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114641
Approved by: https://github.com/williamwen42
2023-11-29 02:21:13 +00:00
6256d3710e [fx] log the node when it's get eliminated (#112684)
Summary: ATT

Test Plan: CI

Reviewed By: strisunshinewentingwang

Differential Revision: D50912413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112684
Approved by: https://github.com/zyan0
2023-11-29 01:43:04 +00:00
24f06c7783 [no ci] Add .watchman to .gitignore (#114718)
Followup after https://github.com/pytorch/pytorch/pull/114716

TODO: should the old filename be deleted, or does it just depend on the Atom/VSCode version?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114718
Approved by: https://github.com/kit1980
2023-11-29 01:37:40 +00:00
48820c928c Revert "[test] AOTAutograd: support mutations on buffers that happen during the bw (#112906)"
This reverts commit c8974d649d684a33a5c02a0b112a6e0743201d97.

Reverted https://github.com/pytorch/pytorch/pull/112906 on behalf of https://github.com/huydhn due to lots of failures after this change c8974d649d; this is probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/112906#issuecomment-1831016362))
2023-11-29 00:49:57 +00:00
4bfb19827e Cleanup .watchman file (#114716)
This seems to be an artifact from an fb tool that snuck into a commit (#113117)? CC @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114716
Approved by: https://github.com/mikaylagawarecki, https://github.com/yanboliang, https://github.com/malfet
2023-11-29 00:48:58 +00:00
ae593d0393 [sparse][semi-structured][inductor] meta registrations for _cslt_sparse_mm + additional stride checking in test. (#114685)

Summary:

This PR adds in meta registrations for _cslt_sparse_mm.

Based on the work @drisspg did
in #114370.

Additionally, it updates the tests by checking that the strides of the
sparse result and the result returned by sparse+compile are the same, to
avoid errors like those found in

https://github.com/pytorch/pytorch/pull/114477.
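
A minimal sketch of the stride check described above (names are hypothetical; the real tests live in `test/test_sparse_semi_structured.py`):

```python
import torch

def check_sparse_compile_strides(fn, x):
    # Compare the eager sparse result against the compiled one:
    # both the values and the output strides should match.
    out_eager = fn(x)
    out_compiled = torch.compile(fn, fullgraph=True)(x)
    assert out_eager.stride() == out_compiled.stride()
    torch.testing.assert_close(out_eager, out_compiled)
```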

Test Plan:
```
python test/test_sparse_semi_structured.py -k compile_cusparselt
python test/test_sparse_semi_structured.py -k compile_cutlass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114685
Approved by: https://github.com/alexsamardzic, https://github.com/drisspg
2023-11-29 00:31:52 +00:00
43d0659d74 [C10D] Fix DUMP_ON_TIMEOUT env (#114699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114699
Approved by: https://github.com/kwen2501, https://github.com/XilunWu, https://github.com/fduwjj
2023-11-29 00:15:45 +00:00
bc34f02c38 [BE][Easy]: Apply RUF019: remove duplicate checks for dict access (#114478)
Applies RUF019 nightly preview rule to the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114478
Approved by: https://github.com/mikaylagawarecki
2023-11-29 00:14:02 +00:00
c8974d649d [test] AOTAutograd: support mutations on buffers that happen during the bw (#112906)
I can hold off on reviews / landing until I talk to Driss and we confirm that we need this for FP8. This PR also needs testing and probably shouldn't land until Tugsuu's input mutation handling [PR](https://github.com/pytorch/pytorch/pull/111046) goes through.

What this PR tries to solve is the case where a model mutates some nn module state (a buffer) during the **backward** pass. It appears that this might be necessary for FP8's delayed scaling.

Today, AOTAutograd simply does not realize when you mutate graph inputs while running the backward pass: it functionalizes the mutations away without recognizing them as input mutations. This PR tries to:

(a) detect this situation (input mutations during the backward)

(b) put `copy_()`'s in the graph to properly handle the input mutation when we can. In cases where we can't keep the copy_() in the graph, we just error loudly (I imagine that these cases will be extremely rare, but we can fix them if they ever come up).

This is mostly a prototype for now, not ready for review.

I made this example locally to test out:
```
import torch

class MutatingAutogradFn(torch.autograd.Function):

    @staticmethod
    def forward(ctx, x, buf):
        ctx.save_for_backward(buf)
        return x

    @staticmethod
    def backward(ctx, x_grad):
        buf = ctx.saved_tensors[0]
        buf.add_(x_grad)
        return x_grad * 3, None

class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.buf = torch.ones(2)

    @torch._dynamo.allow_in_graph
    def backward_mutating_fn(self, x, buf):
        return MutatingAutogradFn.apply(x, buf)

    def forward(self, x):
        tmp = self.backward_mutating_fn(x, self.buf)
        return tmp + self.buf

m = Mod()

x = torch.ones(2, requires_grad=True)
out = m(x)
# After the fw, buf should not have been mutated
print(m.buf)
out.sum().backward()
# bw has run, so buf should now be mutated
print(m.buf)
print(x.grad)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112906
Approved by: https://github.com/ezyang
2023-11-28 23:59:21 +00:00
11277cc510 [CI] Remove an exception catching for Triton compiler error (#113064)
Summary: The workaround was there when Triton compiler was at its early stage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113064
Approved by: https://github.com/eellison
2023-11-28 23:46:30 +00:00
3fccc0446c Add dtensor and fsdp/2d tests to inductor_distributed CI (#114642)
Smuggle important and not too slow tests to run on this trunk job,
instead of just on the periodic job where they currently reside.
 - test_dtensor_compile took 70sec, test_fsdp_2d_parallel took 198sec
   locally

As a follow-up, organize the distributed-mgpu tests better and maybe
rename this job to reflect its more general 'dist mgpu' scope.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114642
Approved by: https://github.com/wanchaol, https://github.com/malfet
2023-11-28 23:06:18 +00:00
765d4599ee Give users control over packages in torch.utils.collect_env (#112993)
I'm looking to repurpose some logic in `torch.utils.collect_env` for the `geowatch` package. I'm mostly able to just use this script as a library, which is great because it reduces code in my package. However, the issue is that the package patterns that are relevant to torch are hard-coded inside of `get_conda_packages` and `get_pip_packages`.

The changes I made are simple. I defined the default package patterns as two global sets, and I added an argument to each function that lets the user customize exactly what package patterns are relevant. If they are not specified the defaults are used.

I was considering extending the power of the patterns by utilizing `fnmatch`, `re` (or [xdev.pattern](https://github.com/Erotemic/xdev/blob/main/xdev/patterns.py) which abstracts them both), but instead I opted to just use the existing `__contains__` test to keep things simple.

From torch's perspective this should make maintaining this file slightly easier: to update the relevant packages, the developer now updates two neighboring top-level globals instead of two separate local variables. However, it does add an argument to two functions, and that argument isn't used in torch itself, so there is an argument for removing it; users *could* still have some control by modifying the globals. I think the way I did it balances the tradeoffs well.
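
A hedged sketch of the resulting usage (the exact parameter name and return shape are assumptions based on the description above):

```python
from torch.utils.collect_env import get_pip_packages, run

# Override the default package patterns with ones relevant to another project;
# membership is still checked with the plain `__contains__` test described above.
pip_version, pip_list = get_pip_packages(run, patterns={"torch", "numpy", "geowatch"})
print(pip_version, pip_list)
```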
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112993
Approved by: https://github.com/zou3519
2023-11-28 22:35:25 +00:00
ce4bff4013 [dynamo] fix functools.wraps on nested functions (#114279)
Updated version of #108885 addressing the review. In this PR:
- We add a VT.can_reconstruct utility that checks if VT.reconstruct()
  does something.
- If functools.wraps(fn) is passed a `fn` that either has a source or
  has .can_reconstruct() == True, then we stash the source (or the VT)
- Later on, we use the source (or VT.reconstruct) to actually
  reconstruct the object in codegen.
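
A minimal sketch of the kind of user code these changes enable (hedged; the backend and flags are illustrative):

```python
import functools
import torch

def wrap(fn):
    @functools.wraps(fn)  # fn is a nested function: no source, only a VT
    def inner(*args, **kwargs):
        return fn(*args, **kwargs)
    return inner

@torch.compile(backend="eager", fullgraph=True)
def f(x):
    def nested(y):
        return y.sin()
    return wrap(nested)(x)  # previously functools.wraps on a nested fn could fail

f(torch.randn(3))
```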

Test Plan:
- New tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114279
Approved by: https://github.com/voznesenskym
2023-11-28 22:34:59 +00:00
a26d747615 [PyTorch][Vulkan] Fix matrix multiplication performance test binary (#114624)
Summary:
Due to recent changes in D51421256 and D51379737,
- shaders of `mm`, `addmm`, `bmm`, `baddbmm` are reduced into just `mm`,
- height and width packing logic is applied to linear operations

so the current perf tests of `addmm`, `create_linear_context`, and `run_linear_context` are no longer valid (0 latency will be printed; see test plan). Specifically, the original test extracts the latency of `vulkan.addmm`, which doesn't exist anymore. Instead, the current implementation of `addmm` invokes
```
vulkan.convert_channels_to_height_packed
vulkan.convert_channels_to_width_packed
vulkan.mm
vulkan.mul_scalar
vulkan.add
```
To deal with this:
- for `addmm` and `run_linear_context`, we apply a new function `extractTotalShaderResultsAndSetState` which aggregates the latency of all invoked shaders except `nchw_to_image` and `image_to_nchw`;
- for `create_linear_context`, besides `nchw_to_image` and `image_to_nchw`, we also aggregate `vulkan.convert_channels_to_height_packed`

Test Plan:
- build binary, at `fbsource`
```
buck2 build  -c ndk.debug_info_level=0  -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_mm_perf_test_binAndroid  --show-output  -c pt.vulkan_full_precision=1
```
- test on android device
```
adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_mm_perf_test_binAndroid__/pt_vulkan_mm_perf_test_binAndroid /data/local/tmp
adb shell /data/local/tmp/pt_vulkan_mm_perf_test_binAndroid
```
## Before
addmm_benchmark
```
(base) luwei@luwei-mbp ~ % adb shell /data/local/tmp/pt_vulkan_mm_perf_test_binAndroid
2023-11-16T06:48:18+00:00
Running /data/local/tmp/pt_vulkan_mm_perf_test_binAndroid
Run on (4 X 1708.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
...
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4334408
vulkan.nchw_to_image     {500, 500, 1}                    4327648
vulkan.nchw_to_image     {500, 500, 1}                    4322760
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1233960
vulkan.convert_channels_to_width_packed{125, 500, 1}                    1286896
vulkan.mm                {125, 125, 1}                   76186084
vulkan.mul_scalar        {500, 500, 1}                    1132924
vulkan.mul_scalar        {500, 500, 1}                    1128556
vulkan.add               {500, 500, 1}                    4285788
vulkan.image_to_nchw     {500, 500, 1}                    1421576
...
addmm_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1                      0.000 ms         77.2 ms            5
```
create_linear_context_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4336696
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1229384
...
create_linear_context_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1       8.57 ms         32.9 ms            5
```
run_linear_context_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4305548
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1196104
...
run_linear_context_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1         0.000 ms         86.2 ms            5
```

## After
addmm_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4332016
vulkan.nchw_to_image     {500, 500, 1}                    4321356
vulkan.nchw_to_image     {500, 500, 1}                    4314908
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1195896
vulkan.convert_channels_to_width_packed{125, 500, 1}                    1273428
vulkan.mm                {125, 125, 1}                   77055680
vulkan.mul_scalar        {500, 500, 1}                    1111708
vulkan.mul_scalar        {500, 500, 1}                    1111032
vulkan.add               {500, 500, 1}                    4236024
vulkan.image_to_nchw     {500, 500, 1}                    1429480
...
addmm_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1                       51.1 ms         76.0 ms            5
```
create_linear_context_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4332432
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1235884
...
create_linear_context_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1       9.74 ms         30.6 ms            5
```
run_linear_context_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4289740
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1227928
...
run_linear_context_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1          50.4 ms         86.0 ms            5
```
full result in P887658084

Reviewed By: liuk22

Differential Revision: D51506293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114624
Approved by: https://github.com/yipjustin
2023-11-28 22:27:26 +00:00
d114f31b30 add testcase when bytecode hook changes the bytecode; fix code map (#114487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114487
Approved by: https://github.com/jansel
2023-11-28 22:14:57 +00:00
47e6cc4d22 Remove yet more type-ignores in dynamo/inductor (#114684)
Probably the last big batch for a while

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114684
Approved by: https://github.com/Skylion007
2023-11-28 22:09:38 +00:00
9f073ae304 [BE][Easy]: add some PLR pylint checks and exclusions to ruff (#114519)
Add a couple of additional checks and exclusions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114519
Approved by: https://github.com/jansel
2023-11-28 20:49:03 +00:00
74e10f0f60 [inductor] Fix torch.split bug on unbacked symint (#113406)
torch.split(x, l) fails when the sizes in l are unbacked symints.

E.g. l = y.tolist() makes the entries of l unbacked, because they
depend on the data of y. The downstream call `SliceView.create()`
evaluates the shape even when the input shape is an unbacked symint,
which triggers the bug.
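
A hedged repro sketch of the pattern described above (capturing `.tolist()` as unbacked symints is assumed to require the config flag shown):

```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # lets .tolist() produce unbacked symints

@torch.compile(fullgraph=True)
def f(x, y):
    l = y.tolist()            # entries of l become unbacked symints
    return torch.split(x, l)  # previously hit the SliceView.create() bug

x = torch.randn(10)
y = torch.tensor([3, 3, 4])
print([t.shape for t in f(x, y)])
```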

Test Plan:
python test/inductor/test_unbacked_symints.py -k test_split_with_sizes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113406
Approved by: https://github.com/aakhundov, https://github.com/ezyang
2023-11-28 20:45:13 +00:00
4aa2c51a09 [doc] fix typo on graph 3 that is recorded (#114666)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114666
Approved by: https://github.com/eellison
2023-11-28 20:40:13 +00:00
4a35ec3c0e [docs] correct the code for cudagraph trees integration (#114583)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114583
Approved by: https://github.com/eellison
2023-11-28 20:28:52 +00:00
44c9e4cbf0 [C10D] Decouple PGNCCL desync from dbg dump (#114614)
Add a TORCH_NCCL_DUMP_DEBUG_INFO env var to control dumping
independently of the desync debug feature.

Currently this defaults to disabled (so there is no behavior change by
default), but we plan to default it to true after validation.

Moves the 'sleep for 30 sec' that used to be after desync debug to
before it. In my view, sleeping before desync is equivalent since we
always sleep the same duration, and it keeps the code simpler.

Fixes #114433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114614
Approved by: https://github.com/zdevito
ghstack dependencies: #114651
2023-11-28 19:46:10 +00:00
cef79c0df4 [inductor] _sparse_semi_structured_linear fallback - no meta registration; not on testing path (#114477)
The test was wrong in the original PR, and the merged changes were never tested. Further, the sparse op was never actually compiled due to a missing `fullgraph=True` and a missing meta registration.

When meta is added as per this PR, it gives wrong answers when input needs to be padded and when input needs to be reshaped.

Is this something to do with the generated inductor code for:
```
 constant_pad_nd: "f16[32, 128]" = torch.ops.aten.constant_pad_nd.default(primals_3, [0, 0, 0, 31], 0.0)
...
slice_1: "f16[1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 0, 0, 1);  _sparse_semi_structured_linear = None
```
and

```
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         mul: "Sym(s0*s1)" = primals_4 * primals_5
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view: "f16[s0*s1, 128]" = torch.ops.aten.view.default(primals_6, [mul, 128]);  primals_6 = mul = None
...
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view_1: "f16[s0, s1, 128]" = torch.ops.aten.view.default(slice_1, [primals_4, primals_5, 128]);  slice_1 = None
```

Failing graphs:
Padded:
```
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]  ===== Forward graph 5 =====
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]  <eval_with_key>.66 class GraphModule(torch.nn.Module):
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]     def forward(self, primals_1: "f16[128, 64]", primals_2: "i16[128, 8]", primals_3: "f16[1, 128]"):
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         constant_pad_nd: "f16[32, 128]" = torch.ops.aten.constant_pad_nd.default(primals_3, [0, 0, 0, 31], 0.0)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         _sparse_semi_structured_linear: "f16[32, 128]" = torch.ops.aten._sparse_semi_structured_linear.default(constant_pad_nd, primals_1, primals_2);  constant_pad_nd = primals_1 = primals_2 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         slice_1: "f16[1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 0, 0, 1);  _sparse_semi_structured_linear = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         slice_2: "f16[1, 128]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 9223372036854775807);  slice_1 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:147, code: return torch.nn.functional.relu(x)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         relu: "f16[1, 128]" = torch.ops.aten.relu.default(slice_2);  slice_2 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias: "f16[1, 128]" = torch.ops.aten.alias.default(relu)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias_1: "f16[1, 128]" = torch.ops.aten.alias.default(alias);  alias = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         le: "b8[1, 128]" = torch.ops.aten.le.Scalar(alias_1, 0);  alias_1 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         permute: "f16[128, 1]" = torch.ops.aten.permute.default(primals_3, [1, 0]);  primals_3 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         return [relu, le, permute]

```

Reshape:

```
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]  <eval_with_key>.69 class GraphModule(torch.nn.Module):
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]     def forward(self, primals_1: "f16[128, 64]", primals_2: "i16[128, 8]", primals_3: "f16[128]", primals_4: "Sym(s0)", primals_5: "Sym(s1)", primals_6: "f16[s0, s1, 128]"):
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x)
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         mul: "Sym(s0*s1)" = primals_4 * primals_5
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view: "f16[s0*s1, 128]" = torch.ops.aten.view.default(primals_6, [mul, 128]);  primals_6 = mul = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         _sparse_semi_structured_linear: "f16[s0*s1, 128]" = torch.ops.aten._sparse_semi_structured_linear.default(view, primals_1, primals_2, bias = primals_3);  primals_1 = primals_2 = primals_3 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         slice_1: "f16[s0*s1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 1, 0, 9223372036854775807);  _sparse_semi_structured_linear = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view_1: "f16[s0, s1, 128]" = torch.ops.aten.view.default(slice_1, [primals_4, primals_5, 128]);  slice_1 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:147, code: return torch.nn.functional.relu(x)
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         relu: "f16[s0, s1, 128]" = torch.ops.aten.relu.default(view_1);  view_1 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias: "f16[s0, s1, 128]" = torch.ops.aten.alias.default(relu)
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias_1: "f16[s0, s1, 128]" = torch.ops.aten.alias.default(alias);  alias = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         le: "b8[s0, s1, 128]" = torch.ops.aten.le.Scalar(alias_1, 0);  alias_1 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         return [relu, view, le, primals_4, primals_5]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114477
Approved by: https://github.com/jcaip
2023-11-28 19:35:05 +00:00
ddf1cb7870 AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.

(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_data_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation

**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
2023-11-28 19:33:35 +00:00
e83c05c833 [ONNX] Add ONNX ExportedProgram tests (#114633)
Fix #114166
Fix #113705

This PR references tests from `test_export.py` to make sure the exported programs from PyTorch can all be successfully exported into ONNX models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114633
Approved by: https://github.com/thiagocrepaldi
2023-11-28 19:03:13 +00:00
39f16c221e Adding event_tracer evalue logging calls in codegen (#114584)
Summary:
This diff adds support in the ExecuTorch codegen layer to log the outputs of kernels to event_tracer. It does this by calling the `event_tracer_log_evalue` API.

When the `ET_EVENT_TRACER_ENABLED` flag is disabled this is essentially a no-op and will add no overhead.

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D51534590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114584
Approved by: https://github.com/larryliu0820
2023-11-28 18:32:05 +00:00
e6a8052051 [C10D] Flight recorder - disable c++ stacktrace by default (#114651)
CPP Stacktrace processing (symbolizer) takes a long time on some systems
using a particular version of addr2line.  On slow systems, this makes
flight-recorder dumping slow enough to time out on even toy programs.

TORCH_NCCL_TRACE_CPP_STACK=True will re-enable CPP stacktrace collection
as part of the flight recorder.

CPP stacktrace is fast enough for use on certain combinations of OS. We
can investigate moving to llvm's symbolizer as a replacement.

On devserver with C++ stacktraces disabled/enabled:
```
python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 12.175s

TORCH_NCCL_TRACE_CPP_STACK=1 python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 53.338s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114651
Approved by: https://github.com/zdevito
2023-11-28 16:49:20 +00:00
b060694088 Add bits dtypes to torch._C stubs (#114661)
As defined 6ae0554d11/c10/core/ScalarType.h (L54-L58)
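
For illustration, these dtypes already exist at runtime; the change only adds them to the `torch._C` type stubs (a sketch, assuming the attribute names mirror ScalarType.h):

```python
import torch

# The bits dtypes from c10/core/ScalarType.h
print(torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16)
```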

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114661
Approved by: https://github.com/ngimel
2023-11-28 15:21:58 +00:00
0bef97fac3 [dynamo] Support itertools.groupby (#114192)
Summary: for https://github.com/pytorch/pytorch/issues/108698
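
A minimal sketch of the now-traceable pattern (hedged; the backend and flags are illustrative):

```python
import itertools
import torch

@torch.compile(backend="eager", fullgraph=True)
def f(x, keys):
    out = x.clone()
    # itertools.groupby over a plain Python list, inside the compiled region
    for key, group in itertools.groupby(keys):
        out = out + key * len(list(group))
    return out

print(f(torch.zeros(3), [1, 1, 2, 3, 3, 3]))
```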

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114192
Approved by: https://github.com/jansel
2023-11-28 14:58:59 +00:00
cc7a969bb3 [FSDP] Added test for ignored_states + auto wrap (#114612)
This adds some unit testing for the `ignored_states` argument and auto wrapping. There is some ongoing discussion with @erhoo82 about his particular use case, but it should not block this PR. (We can land a separate PR if needed.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114612
Approved by: https://github.com/wanchaol
ghstack dependencies: #114611
2023-11-28 14:36:34 +00:00
79ee99e6d2 [easy] Dispatch torch.from_numpy to torch.as_tensor (#114609)
...rather than detaching the tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114609
Approved by: https://github.com/larryliu0820, https://github.com/voznesenskym
ghstack dependencies: #114608
2023-11-28 12:04:37 +00:00
0bb2600c28 Allow to differentiate through NumPy code (#114608)
With this PR it is possible to differentiate through NumPy code modulo
the usual caveats that apply to differentiation:
- That there are no graphbreaks
- That the decomposition in `torch._numpy` is differentiable

@ev-br and I were somewhat careful to achieve the second point, but
it is not tested through and through, so YMMV
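
A minimal sketch of the pattern this enables, mirroring the documented torch.compile NumPy interop (treat the details as illustrative):

```python
import numpy as np
import torch

@torch.compile
def f(x):
    y = np.sin(x.numpy())   # traced into torch._numpy, so it stays differentiable
    return torch.from_numpy(y)

x = torch.randn(3, requires_grad=True)
f(x).sum().backward()
print(x.grad)               # should equal torch.cos(x)
```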

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114608
Approved by: https://github.com/voznesenskym
2023-11-28 12:04:37 +00:00
89a1fe6966 [pytree] register pytree node type in both C++ pytree and Python pytree (#112111)
Changes:

1. Add `_private_register_pytree_node` API in both C++ and Python pytree. In C++ pytree, the API will only register pytree node for C++ pytree. In Python pytree, the API will only register pytree node for Python pytree.
2. Do not allow registering a type as pytree node twice in the Python pytree.
3. Add thread lock to the Python pytree node register API.
4. The old `_register_pytree_node` API will call the `_private_register_pytree_node` API and raise a deprecation warning.
5. Add a new `register_pytree_node` API to register node type in both C++ and Python implementations.
6. Add tests to ensure a warning will be raised when the old private function is called.
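
A minimal sketch of registering a custom node with the new unified API from item 5, assuming the usual `(children, context)` flatten contract:

```python
import torch.utils._pytree as pytree

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

pytree.register_pytree_node(
    Point,
    lambda p: ((p.x, p.y), None),            # flatten: (children, context)
    lambda children, ctx: Point(*children),  # unflatten
)
```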

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112111
Approved by: https://github.com/zou3519
2023-11-28 11:41:38 +00:00
088fc7779e Eliminate unnecessary copy in CUDA addmm with sparse compressed block operand (#114484)
As in the title.

As a result, `nn.functional.linear(<strided tensor>, <BSR tensor>, bias=<strided tensor>)` performance increases as follows (`float16`, `NVIDIA A100-SXM4-80GB`; a usage sketch follows the list):
- 256x256 weights, speed up is 14..27 %
- 512x512 weights, speed up is 9..25 %
- 1024x1024 weights, speed up is 5..20 %
- 2048x2048 weights, speed up is 3..16 %
- 4092x4092 weights, speed up is 2..9 %
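
A hedged sketch of the call path being exercised (shapes and blocksize are illustrative; requires a CUDA device):

```python
import torch
import torch.nn.functional as F

w = torch.randn(512, 512, dtype=torch.float16, device="cuda")
w_bsr = w.to_sparse_bsr((32, 32))  # block-sparse (BSR) weight
x = torch.randn(8, 512, dtype=torch.float16, device="cuda")
bias = torch.randn(512, dtype=torch.float16, device="cuda")

# strided input, BSR weight, strided bias: the path sped up by this PR
out = F.linear(x, w_bsr, bias)
```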

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114484
Approved by: https://github.com/cpuhrsch
2023-11-28 11:35:55 +00:00
00412e6dfa [export] Add meta to params (#114622)
The graph from `capture_pre_autograd_graph` doesn't have `meta["val"]` on the param nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114622
Approved by: https://github.com/frank-wei, https://github.com/zhxchen17, https://github.com/khabinov
2023-11-28 07:40:15 +00:00
95aec251aa [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op hardtanh (#114580)
**Summary**
Enable lowering of the `QConv2d -> hardtanh` fusion pattern, fusing `hardtanh` as a `QConv2d` post operator.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_relu6_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardtanh_cpu

python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_relu6
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_hardtanh
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114580
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #114578, #114579
2023-11-28 07:21:30 +00:00
8c1f65dc2b [Quant] [PT2] Add Hardtanh and ReLU6 into X86InductorQuantizer Conv2d Unary Annotation (#114579)
**Summary**
Add `Hardtanh` and `ReLU6` into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114579
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #114578
2023-11-28 07:18:00 +00:00
8a35a68bb7 [Quant] Enable QConv2d with hardtanh post op (#114578)
**Summary**
Enable QConv2d implementation with post op `hardtanh`

**Test Plan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_hardtanh_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114578
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-11-28 07:13:01 +00:00
06abac971a [FSDP] Simplified FSDP wrapping in ignored module test (#114611)
This saves some verbosity. There is no change to functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114611
Approved by: https://github.com/wanchaol
2023-11-28 07:07:37 +00:00
5cfa0647a7 Update mypy to 1.7.0 (#114160)
It appears that `mypy` is now checking a few more previously-unchecked files; these files
are being found via import-following. Not sure exactly why they weren't being checked before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114160
Approved by: https://github.com/eellison
ghstack dependencies: #114162
2023-11-28 06:45:55 +00:00
71b742b42c [inductor] Remove more type: ignore comments (#114162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114162
Approved by: https://github.com/Skylion007, https://github.com/eellison
2023-11-28 06:45:55 +00:00
3f574eadb4 [dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)
Fixes https://github.com/pytorch/pytorch/issues/113812, https://github.com/pytorch/pytorch/issues/102591. Probably fixes: https://github.com/pytorch/pytorch/issues/113740, https://github.com/pytorch/pytorch/issues/113786, https://github.com/pytorch/pytorch/issues/113788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114154
Approved by: https://github.com/wconstab
2023-11-28 06:29:43 +00:00
6636c2b178 [executorch hash update] update the pinned executorch hash (#114648)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114648
Approved by: https://github.com/pytorchbot
2023-11-28 05:41:36 +00:00
cyy
8933ff3595 Make torch::jit::module movable (#114041)
This PR makes torch::jit::module movable to improve performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114041
Approved by: https://github.com/huydhn
2023-11-28 05:03:37 +00:00
2f875c74bf Print ghcr docker pull during build/test (#114510)
To make debugging easier for external devs

Test plan: Copy and run command from [`Use the following to pull public copy of the image`](https://github.com/pytorch/pytorch/actions/runs/7012511180/job/19077533416?pr=114510#step:6:9):
```
docker pull ghcr.io/pytorch/ci-image:pytorch-linux-jammy-py3.8-gcc11-0d0042fd2e432ea07301ad6f6a474d36a581f0dc

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114510
Approved by: https://github.com/atalman, https://github.com/huydhn
2023-11-28 04:38:17 +00:00
0de67e7949 [cpu] Modify inductor opt flag (#113347)
Fixes https://github.com/pytorch/pytorch/issues/113014, https://github.com/pytorch/pytorch/issues/113012, https://github.com/pytorch/pytorch/issues/93598.

For CPU inductor path, remove `-funsafe-math-optimizations` from optimization flags to fix functional issues.

### Validation on 3 benchmark suites

**FP32**
<img width="582" alt="image" src="https://github.com/pytorch/pytorch/assets/23010269/5a648497-a8e2-4057-8dd4-b322e9334456">

- No accuracy problem
- Slight geomean perf drop
- 3 outlier models (speed up < 0.8). Could be solved by adding vectorizations later.

**BF16**
<img width="583" alt="image" src="https://github.com/pytorch/pytorch/assets/23010269/ca1cbd34-5712-4d79-9238-0cc11dd279b1">

- No accuracy problem
- Slight geomean perf drop
- 4 outlier models (speed up < 0.8). Could be solved by adding vectorizations later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113347
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-11-28 04:03:24 +00:00
11f11e95df [Quant] [Inductor] Fix an issue in QConv Binary Pattern Match (#114541)
**Summary**
Add an `extra_check` in `_register_quantized_conv_binary_lowering` to skip patterns that match unexpectedly. To match a Conv-Binary pattern, we expect the extra input of the binary node to come from a dequant pattern instead of a constant scalar.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_add_2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114541
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #114540
2023-11-28 02:59:20 +00:00
8556a09d44 Require less alignment for attn bias (#114173)
# Summary
Improved Fix for Attention Mask Alignment Issue (#112577)

This PR addresses Issue #112577 by refining the previously implemented fix, which was found to be incorrect and to cause unneeded memory regressions. The update simplifies the approach to handling the alignment of the attention mask for mem eff attention.

## Changes
Alignment Check and Padding: Initially, the alignment of the attention mask is checked. If misalignment is detected, padding is applied, followed by slicing. During this process, a warning is raised to alert users.
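
A hedged sketch of the pad-then-slice trick described above (the 8-element alignment requirement is an assumption):

```python
import torch
import torch.nn.functional as F

def align_last_dim(attn_bias: torch.Tensor, align: int = 8) -> torch.Tensor:
    last = attn_bias.size(-1)
    if last % align == 0:
        return attn_bias
    # Pad the last dim up to the alignment boundary, then slice back:
    # the logical shape is unchanged but the underlying storage is aligned.
    pad = align - last % align
    return F.pad(attn_bias, (0, pad))[..., :last]
```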

Should this be warn_once?

We only call expand once, on the aligned mask.

Reference
https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/cutlass.py#L115

@albanD, @mruberry, @jbschlosser, @walterddr, and @mikaylagawarecki.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114173
Approved by: https://github.com/danthe3rd
2023-11-28 02:40:41 +00:00
4abf2b2261 [dynamo] fixed record_replayer issue when TORCH_COMPILE_DEBUG=1 (#114623)
In https://github.com/pytorch/pytorch/pull/113432, we changed
the behavior of _is_allowed_module_prefix, where we removed the '.'
from the module prefixes. Consequently, 'LOAD_ATTR submodule'
(e.g. LOAD_ATTR fx) is turned into PythonModuleVariable instead
of TorchVariable. This caused an issue for
record_replayer.record_module_access, which is enabled by setting
TORCH_COMPILE_DEBUG=1, because 'torch.fx' doesn't exist in
record_replayer's name_to_modrec dictionary when
record_module_access is called.

This PR fixed the issue by adding "torch.fx" into record_replayer's
EXCLUDES list. The fix is likely to be a workaround to unblock
internal workflow. There might be some fundamental changes
to the relevant pieces along with Yanbo's refactoring PRs for
tracing in-graph functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114623
Approved by: https://github.com/mlazos, https://github.com/yanboliang
2023-11-28 02:40:07 +00:00
2333d381b2 Make 'distributed' TORCH_LOGS include ddpoptimizer (#114376)
There are now 3 ways to see logs from ddpoptimizer.
1) TORCH_LOGS="distributed"
2) TORCH_LOGS="dynamo"
3) TORCH_LOGS="torch._dynamo.backends.distributed"

(1 and 2 are different supersets of 3 that also include other content)

Note: ddp_graphs is still a separate 'artifact' logger, which just
includes graph dumps from the graph-splitting process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114376
Approved by: https://github.com/wanchaol
2023-11-28 02:39:28 +00:00
ae40a3ebcf [inductor] added a config to dump profiling results to a file (#114587)
Currently, we print the profile bandwidth result for each Triton
kernel to stdout after each profiling run finishes. Consequently,
the profiling results are mixed with other debug output.

This PR adds a config, profile_bandwidth_output, to specify a file
where we can dump the results in sorted order. The new config can
be set via the "TORCHINDUCTOR_PROFILE_OUTPUT" environment variable.
Hopefully it offers a slightly better way to navigate the profiling
results.
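
A hedged sketch of setting the new config from code (the `profile_bandwidth` companion flag is assumed from the existing TORCHINDUCTOR_PROFILE behavior):

```python
import torch._inductor.config as inductor_config

inductor_config.profile_bandwidth = True                        # enable per-kernel bandwidth profiling
inductor_config.profile_bandwidth_output = "/tmp/profile.txt"   # new: dump sorted results to this file
```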

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114587
Approved by: https://github.com/Chillee
2023-11-28 02:21:11 +00:00
6ae0554d11 Enable the lowering of quantized reshape (#114443)
**Summary**
Enable the lowering of `dq->reshape->q` into a `qreshape`

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qflatten
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114443
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/jerryzh168
ghstack dependencies: #114442
2023-11-28 01:43:54 +00:00
4ba3e6758d Canonicalize runtime asserts (#114509)
This allows us to remove quite a few redundant runtime asserts, and potentially a number of guards as well.

On
```
python test/dynamo/test_subclasses.py -k test_unbind
```
we go from
```
inserting runtime assert i0 <= s0
inserting runtime assert 0 <= -i0 + s0
inserting runtime assert i0 + i1 <= s0
inserting runtime assert i0 <= -i1 + s0
inserting runtime assert i0 + i1 + i2 <= s0
inserting runtime assert i0 + i1 <= -i2 + s0
inserting runtime assert Eq(i0 + i1 + i2 + i3, s0)
inserting runtime assert i0 + i1 + i2 + i3 <= s0
inserting runtime assert i0 + i1 + i2 <= -i3 + s0
```
to
```
inserting runtime assert i0 - s0 <= 0
inserting runtime assert i0 + i1 - s0 <= 0
inserting runtime assert i0 + i1 + i2 - s0 <= 0
inserting runtime assert Eq(i0 + i1 + i2 + i3, s0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114509
Approved by: https://github.com/voznesenskym
2023-11-28 01:38:47 +00:00
74370a8a9d Add adaptive_avg_pool2d and flatten into x86 Inductor Quantizer recipe (#114442)
**Summary**
Add adaptive_avg_pool2d and flatten into x86 Inductor Quantizer recipe

**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_adaptive_avg_pool2d_recipe
python -m pytest test_x86inductor_quantizer.py -k test_flatten_recipe
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114442
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-11-28 01:35:57 +00:00
e25b146b8c [BE][Easy]: Enable flake8-exe rules in ruff too. (#114521)
Enable flake8-exe rules in ruff too. RUFF requires EXE rules to be enabled separately from the E prefix. This fixes a parity bug between flake8 and ruff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114521
Approved by: https://github.com/kit1980
2023-11-28 01:27:55 +00:00
304ea761f5 [executorch][be] update test_emit to use export (#114294)
Summary: exir.capture is deprecated. Switch to blessed path

Test Plan: fbsource/fbcode/executorch/exir/emit/test (c40a7a0d2)]$ buck test :

Differential Revision: D51503120

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114294
Approved by: https://github.com/zhxchen17
2023-11-28 01:25:46 +00:00
cf9f3ae8d8 Skip an example of test_instance_norm when running internally due to its size (#114452)
After https://github.com/pytorch/pytorch/pull/113420, `torch.unique` now includes a call to `torch.sort` and that call is slow when running in dev mode, i.e. `@fbcode//mode/dev`.  This causes the test to take more than 10 minutes and time out internally [T170720856](https://www.internalfb.com/intern/tasks/?t=170720856).  Running the test in `@fbcode//mode/opt` is fine, so please let me know if there is a way to set that.  Otherwise, this change will skip the largest example when running in sandcastle internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114452
Approved by: https://github.com/malfet
2023-11-28 01:11:19 +00:00
e592b9a469 [Quant] [PT2] Fix an issue in Conv Binary Quantization Annotation (#114540)
**Summary**
To annotate a conv-binary pattern, we should skip the pattern if the conv node has more than one user.

**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary2
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_binary2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114540
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-11-28 01:06:48 +00:00
b1fb591272 [replicate] Simplify replicate() init logic and remove unnecessary variables in _ReplicateState (#113679)
Many variables in _ReplicateState were created only because replicate() was lazily initialized. This PR removes these variables and simplifies the logic.

Differential Revision: [D51317874](https://our.internmc.facebook.com/intern/diff/D51317874/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113679
Approved by: https://github.com/awgu
2023-11-28 00:55:36 +00:00
dffa5f3f23 [dynamo][reland] ExecutorchCallDelegateHigherOrderVariable - add sanity check that input and output tensors are disjoint (#114167)
Summary: Reland of https://github.com/pytorch/pytorch/pull/111960, Fixes https://github.com/pytorch/pytorch/issues/111917

Original PR broke some internal tests which the current diff has resolved.

Test Plan: CI

Differential Revision: D51473196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114167
Approved by: https://github.com/jon-chuang, https://github.com/zou3519
2023-11-28 00:27:23 +00:00
a0be4b7ea7 [fx] Update symbolic_trace nn_module_stack (#114422)
Summary:
Fixed the nn_module_stack produced by symbolic trace to align with the nn_module_stack metadata produced by dynamo. The key should be the module path, with the value being a unique name and the type. Something like: `{'L__self___one_module': ("L['self'].one_module", <class 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'>)}`

This was causing some tests to fail when using export + the old quantization flow (prepare_fx calls symbolic_trace).

Test Plan: D51534471 `buck2 run @//mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_quantized -- -r "test_xnnpack_leaky_relu"`

Differential Revision: D51539118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114422
Approved by: https://github.com/JacobSzwejbka, https://github.com/jerryzh168
2023-11-28 00:18:41 +00:00
f505d76462 Bug fixes to DDP _update_process_group API. (#114194)
https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular, if an error occurred during backward, DDP would end up in an incorrect state.

As a result, in this PR I've enhanced the unit test to test for this case and also appropriately fixed resetting Reducer state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194
Approved by: https://github.com/rohan-varma
2023-11-27 23:52:40 +00:00
7c98bac4a0 [BE] Speedup register schema compilation (#114438)
For some reason, inlining an initializer list into a std::vector takes a lot of time with clang-15. But considering that there are only a dozen or so distinct tags, creating them once and passing them as a def argument should not affect runtime speed at all, while significantly improving compilation time. On Mac M1 it reduces the time needed to compile RegisterSchema.cpp from 50 to 3 seconds.

Special-case empty tags to keep torchgen tests happy.

Before
```
% /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -ftime-report -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/Users/nshulga/git/pytorch/pytorch/build/aten/src -I/Users/nshulga/git/pytorch/pytorch/aten/src -I/Users/nshulga/git/pytorch/pytorch/build -I/Users/nshulga/git/pytorch/pytorch -I/Users/nshulga/git/pytorch/pytorch/cmake/../third_party/benchmark/include -I/Users/nshulga/git/pytorch/pytorch/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/build/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/build/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include -I/Users/nshulga/git/pytorch/pytorch/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/../aten/src -I/Users/nshulga/git/pytorch/pytorch/torch/csrc -I/Users/nshulga/git/pytorch/pytorch/third_party/miniz-2.1.0 -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/include -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/src -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/FXdiv/include -I/Users/nshulga/git/pytorch/pytorch/c10/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/pthreadpool/include -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/deps/clog/include -I/Users/nshulga/git/pytorch/pytorch/third_party/NNPACK/include -I/Users/nshulga/git/pytorch/pytorch/third_party/FP16/include -I/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include -I/Users/nshulga/git/pytorch/pytorch/third_party/flatbuffers/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googletest/include -isystem /Users/nshulga/git/pytorch/pytorch/third_party/protobuf/src -isystem /Users/nshulga/git/pytorch/pytorch/third_party/XNNPACK/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/eigen -isystem /Users/nshulga/git/pytorch/pytorch/build/include  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=braced-scalar-init -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wvla-extension -Wsuggest-override -Wnewline-eof -Winconsistent-missing-override -Winconsistent-missing-destructor-override -Wno-pass-failed -Wno-error=pedantic 
-Wno-error=old-style-cast -Wno-error=inconsistent-missing-override -Wno-error=inconsistent-missing-destructor-override -Wconstant-conversion -Wno-invalid-partial-specialization -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -faligned-new -Werror -Wno-unused-but-set-variable -fno-math-errno -fno-trapping-math -Werror=format -DUSE_MPS -Wno-unused-private-field -Wno-missing-braces -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk -fPIC -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-unused-function -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -fvisibility=hidden -O2 -Wmissing-prototypes -Werror=missing-prototypes -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -std=gnu++17 -Wno-missing-prototypes -Wno-error=missing-prototypes -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/RegisterSchema.cpp.o -c /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/RegisterSchema.cpp
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 131.8054 seconds (132.5540 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  ---Instr---  --- Name ---
  43.6364 ( 33.2%)   0.0919 ( 30.1%)  43.7282 ( 33.2%)  43.9658 ( 33.2%)  536345245380  ModuleInlinerWrapperPass
  43.6291 ( 33.2%)   0.0891 ( 29.2%)  43.7182 ( 33.2%)  43.9549 ( 33.2%)  536264096394  DevirtSCCRepeatedPass
  42.3766 ( 32.2%)   0.0185 (  6.1%)  42.3951 ( 32.2%)  42.6198 ( 32.2%)  523040901767  GVNPass
   0.4085 (  0.3%)   0.0040 (  1.3%)   0.4125 (  0.3%)   0.4195 (  0.3%)  4106085945  SimplifyCFGPass
   0.3611 (  0.3%)   0.0115 (  3.8%)   0.3726 (  0.3%)   0.3779 (  0.3%)  4864696407  InstCombinePass
   0.1607 (  0.1%)   0.0088 (  2.9%)   0.1695 (  0.1%)   0.1720 (  0.1%)  1780986175  InlinerPass
   0.0865 (  0.1%)   0.0024 (  0.8%)   0.0889 (  0.1%)   0.0914 (  0.1%)  1489982961  SROAPass
   0.0750 (  0.1%)   0.0013 (  0.4%)   0.0763 (  0.1%)   0.0764 (  0.1%)  620016338  SCCPPass
   0.0661 (  0.1%)   0.0040 (  1.3%)   0.0701 (  0.1%)   0.0735 (  0.1%)  592027163  EarlyCSEPass
...
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 48.2802 seconds (48.8638 wall clock)
...
 ```

After
```
% /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -ftime-report -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/Users/nshulga/git/pytorch/pytorch/build/aten/src -I/Users/nshulga/git/pytorch/pytorch/aten/src -I/Users/nshulga/git/pytorch/pytorch/build -I/Users/nshulga/git/pytorch/pytorch -I/Users/nshulga/git/pytorch/pytorch/cmake/../third_party/benchmark/include -I/Users/nshulga/git/pytorch/pytorch/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/build/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/build/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include -I/Users/nshulga/git/pytorch/pytorch/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/../aten/src -I/Users/nshulga/git/pytorch/pytorch/torch/csrc -I/Users/nshulga/git/pytorch/pytorch/third_party/miniz-2.1.0 -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/include -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/src -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/FXdiv/include -I/Users/nshulga/git/pytorch/pytorch/c10/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/pthreadpool/include -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/deps/clog/include -I/Users/nshulga/git/pytorch/pytorch/third_party/NNPACK/include -I/Users/nshulga/git/pytorch/pytorch/third_party/FP16/include -I/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include -I/Users/nshulga/git/pytorch/pytorch/third_party/flatbuffers/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googletest/include -isystem /Users/nshulga/git/pytorch/pytorch/third_party/protobuf/src -isystem /Users/nshulga/git/pytorch/pytorch/third_party/XNNPACK/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/eigen -isystem /Users/nshulga/git/pytorch/pytorch/build/include  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=braced-scalar-init -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wvla-extension -Wsuggest-override -Wnewline-eof -Winconsistent-missing-override -Winconsistent-missing-destructor-override -Wno-pass-failed -Wno-error=pedantic 
-Wno-error=old-style-cast -Wno-error=inconsistent-missing-override -Wno-error=inconsistent-missing-destructor-override -Wconstant-conversion -Wno-invalid-partial-specialization -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -faligned-new -Werror -Wno-unused-but-set-variable -fno-math-errno -fno-trapping-math -Werror=format -DUSE_MPS -Wno-unused-private-field -Wno-missing-braces -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk -fPIC -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-unused-function -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -fvisibility=hidden -O2 -Wmissing-prototypes -Werror=missing-prototypes -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -std=gnu++17 -Wno-missing-prototypes -Wno-error=missing-prototypes -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/RegisterSchema.cpp.o -c /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/RegisterSchema.cpp
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 1.2920 seconds (1.3187 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  ---Instr---  --- Name ---
   0.3070 ( 27.6%)   0.0547 ( 30.2%)   0.3617 ( 28.0%)   0.3654 ( 27.7%)  3719690895  ModuleInlinerWrapperPass
   0.3024 ( 27.2%)   0.0525 ( 29.0%)   0.3549 ( 27.5%)   0.3585 ( 27.2%)  3653363330  DevirtSCCRepeatedPass
   0.0619 (  5.6%)   0.0073 (  4.0%)   0.0692 (  5.4%)   0.0711 (  5.4%)  868136227  InstCombinePass
   0.0601 (  5.4%)   0.0065 (  3.6%)   0.0666 (  5.2%)   0.0679 (  5.1%)  696430647  InlinerPass
   0.0363 (  3.3%)   0.0033 (  1.8%)   0.0396 (  3.1%)   0.0425 (  3.2%)  535426974  SimplifyCFGPass
   0.0280 (  2.5%)   0.0069 (  3.8%)   0.0348 (  2.7%)   0.0358 (  2.7%)  378716394  BlockFrequencyAnalysis
   0.0208 (  1.9%)   0.0049 (  2.7%)   0.0257 (  2.0%)   0.0262 (  2.0%)  283689627  BranchProbabilityAnalysis
   0.0239 (  2.1%)   0.0002 (  0.1%)   0.0241 (  1.9%)   0.0241 (  1.8%)  219122704  OpenMPOptCGSCCPass
   0.0174 (  1.6%)   0.0015 (  0.8%)   0.0189 (  1.5%)   0.0192 (  1.5%)  215583965  GVNPass
   0.0153 (  1.4%)   0.0025 (  1.4%)   0.0178 (  1.4%)   0.0187 (  1.4%)  184232295  EarlyCSEPass
...
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.9128 seconds (3.1027 wall clock)
...
```

And the generated schema file looks as follows:
```cpp
TORCH_LIBRARY(aten, m) {
  const std::vector<at::Tag> tags_0 = {at::Tag::pt2_compliant_tag};
  m.def("_cast_Byte(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Char(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Double(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Float(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Int(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Long(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Short(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Half(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_backward(Tensor self, Tensor[] inputs, Tensor? gradient=None, bool? retain_graph=None, bool create_graph=False) -> ()", tags_0);
  m.def("set_data(Tensor(a!) self, Tensor new_data) -> ()", tags_0);
  m.def("data(Tensor self) -> Tensor", tags_0);
  m.def("is_leaf(Tensor self) -> bool", tags_0);
  m.def("output_nr(Tensor self) -> int", tags_0);
  m.def("_version(Tensor self) -> int", tags_0);
  m.def("requires_grad_(Tensor(a!) self, bool requires_grad=True) -> Tensor(a!)", tags_0);
  m.def("retain_grad(Tensor(a!) self) -> ()", tags_0);
  m.def("retains_grad(Tensor self) -> bool", tags_0);
  m.def("_fw_primal(Tensor(a) self, int level) -> Tensor(a)", tags_0);
  m.def("_make_dual(Tensor(a) primal, Tensor tangent, int level) -> Tensor(a)", tags_0);
  m.def("_unpack_dual(Tensor(a) dual, int level) -> (Tensor(a) primal, Tensor tangent)", tags_0);
  m.def("_new_zeros_with_same_feature_meta(Tensor self, Tensor other, *, int self_num_batch_dims=0) -> Tensor", tags_0);
  m.def("_has_same_storage_numel(Tensor self, Tensor other) -> bool", tags_0);
  const std::vector<at::Tag> tags_1 = {at::Tag::inplace_view, at::Tag::pt2_compliant_tag};
  m.def("rename_(Tensor(a!) self, Dimname[]? names) -> Tensor(a!)", tags_1);
  m.def("rename(Tensor(a) self, Dimname[]? names) -> Tensor(a)", tags_0);
  m.def("align_to(Tensor(a) self, Dimname[] names) -> Tensor(a)", tags_0);
  m.def("align_to.ellipsis_idx(Tensor(a) self, Dimname[] order, int ellipsis_idx) -> Tensor(a)", tags_0);
  m.def("align_as(Tensor self, Tensor other) -> Tensor", tags_0);
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114438
Approved by: https://github.com/zou3519
2023-11-27 23:33:04 +00:00
e4b1378a92 Fix dynamo test_logging handling of partial qnames (#114429)
If logger_qname is `a.b.c` and dynamo_qnames contains `a.b`, it still
matches dynamo's INFO setting.

Concretely, `torch._dynamo.backends.distributed` is implicitly part of
the dynamo namespace since it is covered by `torch._dynamo`, which is
one of dynamo_qnames. However, it is not an exact match for any
of dynamo_qnames, which made this test fail when adding a specific
qname for backends.distributed in the subsequent PR.
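
As a toy illustration of the prefix-matching semantics involved (the function here is made up for illustration, not the test's code):

```python
# hypothetical sketch: "a.b" covers "a.b.c", mirroring how the logging
# module treats dotted names hierarchically
def covered_by(logger_qname: str, registered_qnames: list[str]) -> bool:
    return any(
        logger_qname == q or logger_qname.startswith(q + ".")
        for q in registered_qnames
    )

assert covered_by("torch._dynamo.backends.distributed", ["torch._dynamo"])
assert not covered_by("torch._inductor", ["torch._dynamo"])
```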

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114429
Approved by: https://github.com/Skylion007
ghstack dependencies: #114428
2023-11-27 22:52:11 +00:00
2ea2421b44 Skip unit tests that fail on MI210 runners (#114613)
Taken from https://github.com/pytorch/pytorch/pull/105980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114613
Approved by: https://github.com/malfet
2023-11-27 22:25:35 +00:00
2ac0b61e60 [HigherOrderOp] dedup repeated get_attr placeholders in branches of cond (#112874)
We further de-duplicate the duplicated get_attr nodes.

For code below:
```python
def test_cond_free_variable_in_both_branches(self):
    backend = EagerAndRecordGraphs()
    cnt = CompileCounterWithBackend(backend)

    z = torch.ones(4, 4)

    class Foo(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.register_buffer("buffer", torch.ones(6, 4))

        def forward(self, x, y):
            def true_fn(x):
                return x.sum() + self.buffer.sum() + z.sum()

            def false_fn(x):
                return x.sum() - z.sum() - self.buffer.sum()

            return control_flow.cond(y, true_fn, false_fn, [x])

    mod_for_compile = torch.compile(
        Foo(), backend=cnt, dynamic=True, fullgraph=True
    )
```

Before de-duplication, we have the following graph module:
```python
class GraphModule(torch.nn.Module):
    def forward(self, L_y_ : torch.Tensor, L_x_ : torch.Tensor, s0 : torch.SymInt, L_z_ : torch.Tensor):
        l_y_ = L_y_
        l_x_ = L_x_
        l_z_ = L_z_

        # File: /home/yidi/local/pytorch/test/dynamo/test_higher_order_ops.py:1243, code: return x.sum() + self.buffer.sum() + z.sum()
        l__self___buffer = self.L__self___buffer

        # File: /home/yidi/local/pytorch/test/dynamo/test_higher_order_ops.py:1246, code: return x.sum() - z.sum() - self.buffer.sum()
        l__self___buffer_1 = self.L__self___buffer

        # File: /home/yidi/local/pytorch/torch/_higher_order_ops/cond.py:118, code: return cond_op(pred, true_fn, false_fn, operands)
        cond_true_0 = self.cond_true_0
        cond_false_0 = self.cond_false_0
        cond = torch.ops.higher_order.cond(l_y_, cond_true_0, cond_false_0, [l_x_, l_z_, l__self___buffer, l__self___buffer_1]);  l_y_ = cond_true_0 = cond_false_0 = l_x_ = l_z_ = l__self___buffer = l__self___buffer_1 = None
        return (cond,)

    class GraphModule(torch.nn.Module):
        def forward(self, l_x_, l_z_, l__self___buffer_true_branch, l__self___buffer_1_false_branch):
            l_x__1 = l_x_
            l_z__1 = l_z_

            # File: /home/yidi/local/pytorch/test/dynamo/test_higher_order_ops.py:1243, code: return x.sum() + self.buffer.sum() + z.sum()
            sum_1 = l_x__1.sum();  l_x__1 = None
            sum_2 = l__self___buffer_true_branch.sum();  l__self___buffer_true_branch = None
            add = sum_1 + sum_2;  sum_1 = sum_2 = None
            sum_3 = l_z__1.sum();  l_z__1 = None
            add_1 = add + sum_3;  add = sum_3 = None
            return add_1

    class GraphModule(torch.nn.Module):
        def forward(self, l_x_, l_z_, l__self___buffer_true_branch, l__self___buffer_1_false_branch):
            l_x__1 = l_x_
            l_z__1 = l_z_

            # File: /home/yidi/local/pytorch/test/dynamo/test_higher_order_ops.py:1246, code: return x.sum() - z.sum() - self.buffer.sum()
            sum_1 = l_x__1.sum();  l_x__1 = None
            sum_2 = l_z__1.sum();  l_z__1 = None
            sub = sum_1 - sum_2;  sum_1 = sum_2 = None
            sum_3 = l__self___buffer_1_false_branch.sum();  l__self___buffer_1_false_branch = None
            sub_1 = sub - sum_3;  sub = sum_3 = None
            return sub_1
```

After de-duplication, we have the following graph module:
```python
class GraphModule(torch.nn.Module):
    def forward(self, L_x_ : torch.Tensor, L_y_ : torch.Tensor, s0 : torch.SymInt, L_z_ : torch.Tensor):
        l_x_ = L_x_
        l_y_ = L_y_
        l_z_ = L_z_

        # File: /home/yidi/local/pytorch/test/dynamo/test_higher_order_ops.py:1232, code: return x.sum() + self.buffer.sum() + z.sum()
        l__self___buffer = self.L__self___buffer

        # File: /home/yidi/local/pytorch/torch/_higher_order_ops/cond.py:118, code: return cond_op(pred, true_fn, false_fn, operands)
        cond_true_0 = self.cond_true_0
        cond_false_0 = self.cond_false_0
        cond = torch.ops.higher_order.cond(l_y_, cond_true_0, cond_false_0, [l__self___buffer, l_x_, l_z_]);  l_y_ = cond_true_0 = cond_false_0 = l__self___buffer = l_x_ = l_z_ = None
        return (cond,)

    class GraphModule(torch.nn.Module):
        def forward(self, l__self___buffer, l_x_, l_z_):
            l__self___buffer_1 = l__self___buffer
            l_x__1 = l_x_
            l_z__1 = l_z_

            # File: /home/yidi/local/pytorch/test/dynamo/test_higher_order_ops.py:1232, code: return x.sum() + self.buffer.sum() + z.sum()
            sum_1 = l_x__1.sum();  l_x__1 = None
            sum_2 = l__self___buffer_1.sum();  l__self___buffer_1 = None
            add = sum_1 + sum_2;  sum_1 = sum_2 = None
            sum_3 = l_z__1.sum();  l_z__1 = None
            add_1 = add + sum_3;  add = sum_3 = None
            return add_1

    class GraphModule(torch.nn.Module):
        def forward(self, l__self___buffer_1, l_x_, l_z_):
            l__self___buffer_2 = l__self___buffer_1
            l_x__1 = l_x_
            l_z__1 = l_z_

            # File: /home/yidi/local/pytorch/test/dynamo/test_higher_order_ops.py:1235, code: return x.sum() - z.sum() - self.buffer.sum()
            sum_1 = l_x__1.sum();  l_x__1 = None
            sum_2 = l_z__1.sum();  l_z__1 = None
            sub = sum_1 - sum_2;  sum_1 = sum_2 = None
            sum_3 = l__self___buffer_2.sum();  l__self___buffer_2 = None
            sub_1 = sub - sum_3;  sub = sum_3 = None
            return sub_1

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112874
Approved by: https://github.com/zou3519
2023-11-27 22:07:42 +00:00
4c794f2ef1 Reinplace foreach when safe and allow aliasing during lowering (#112440)
This reduces compile time of Adam on 1k parameters from 180s to 140s (28%), the main reason being that thousands of buffers no longer get sent to the scheduler.

The idea behind this is that if a destination buffer (from a copy_) has no users, it shouldn't matter if dst aliases src.

This is implemented by reinplacing copy_ nodes when safe.
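
As a rough illustration of the rewrite (expressed in eager terms, not Inductor IR):

```python
# if the copy_ destination has no other users, the intermediate buffer
# can be eliminated by writing into the destination directly
import torch

def before(src: torch.Tensor, dst: torch.Tensor) -> None:
    tmp = src.mul(2)   # allocates an intermediate buffer
    dst.copy_(tmp)     # dst is only written here, never read elsewhere

def after(src: torch.Tensor, dst: torch.Tensor) -> None:
    torch.mul(src, 2, out=dst)  # reinplaced: no intermediate buffer
```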

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112440
Approved by: https://github.com/jansel
2023-11-27 21:32:42 +00:00
e0d2a24967 Reland "[export] Support user input mutation. [1/2]" (#114496) (#114596)
Summary:

Serialization not implemented yet. Will do in the next diff.

Resolving Github issues:
https://github.com/pytorch/pytorch/issues/112429
https://github.com/pytorch/pytorch/issues/114142

Test Plan:
onnx doc test
```
python -m xdoctest /opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/onnx/_internal/exporter.py ONNXProgram.model_signature:0
```

Differential Revision: D51588558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114596
Approved by: https://github.com/angelayi
2023-11-27 20:19:04 +00:00
800cf5f7cb Add USE_C10D_NCCL around NCCL trace utils (#114597)
Fixes #114575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114597
Approved by: https://github.com/malfet
2023-11-27 19:55:31 +00:00
69024883fb Make dynamo's test_logging print helpful error (#114428)
BEFORE
```
expected torch._dynamo.backends.distributed is DEBUG, got 0
```
(0 is both unhelpful and numerically wrong;
getEffectiveLevel() returns 20, not 0, for this particular case)

AFTER
```
expected torch._dynamo.backends.distributed is DEBUG, got INFO
```
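
For reference, the friendlier form comes straight from the stdlib; a snippet like this (an illustration, not the test's code) maps the numeric level back to its name:

```python
import logging

logger = logging.getLogger("torch._dynamo.backends.distributed")
level = logger.getEffectiveLevel()   # e.g. 20 when INFO is inherited
print(logging.getLevelName(level))   # "INFO" instead of a raw number
```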

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114428
Approved by: https://github.com/Skylion007
2023-11-27 19:18:53 +00:00
7fa1251080 [BE][Easy]: Enable NPY lint rules for ruff (#114476)
Enable NPY lint rules for ruff
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114476
Approved by: https://github.com/justinchuby, https://github.com/malfet
2023-11-27 18:56:10 +00:00
1793ef77c6 [BC-breaking] conv1d & conv3d (#114594)
As discussed here: https://github.com/pytorch/pytorch/pull/113885#discussion_r1404573875

#### TODO
- [x] add error inputs after #114589 is merged

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114594
Approved by: https://github.com/lezcano
2023-11-27 18:30:59 +00:00
4bb3a02d02 [BE]: Enable Ruff + Flake8 G201,G202 logging format rule. (#114474)
Standardizes logging calls to always use logging.exception instead of logging.error where appropriate and enforces it with a lint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114474
Approved by: https://github.com/jansel, https://github.com/malfet
2023-11-27 17:38:08 +00:00
3a4dea99df ROCm triton commit pin update (#114348)
Small bump in rocm triton commit pin to resolve reported issue on 7900XTX
> RuntimeError: Triton Error [HIP]: Code: 719, Messsage: unspecified launch failure
https://github.com/ROCmSoftwarePlatform/triton/issues/396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114348
Approved by: https://github.com/jeffdaily
2023-11-27 17:29:23 +00:00
bcfca41a2a [Inductor] fix wrong Inductor UTs (#114504)
# Motivation
These UTs seem wrong. Fix them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114504
Approved by: https://github.com/aakhundov
2023-11-27 17:12:03 +00:00
9fd447c346 [CI] Bump up the graph break count for DALLE2_pytorch temporarily (#114598)
Summary: rotary-embedding-torch's version changing from 0.3.3 to 0.3.6 caused some new graph breaks for DALLE2_pytorch. A proper fix is to pin down rotary-embedding-torch's version in torchbench, and then update our torchbench pin to pick up that change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114598
Approved by: https://github.com/seemethere, https://github.com/aakhundov
2023-11-27 16:43:28 +00:00
56a95afb42 [RelEng] Pin disabled and slow test for release (#114515)
Follow up for https://github.com/pytorch/pytorch/pull/114355
Pin disabled and slow tests when applying release only changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114515
Approved by: https://github.com/DanilBaibak
2023-11-27 15:15:19 +00:00
cff84871ce [reland][opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#114589)
Previous PR: #113885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114589
Approved by: https://github.com/lezcano
2023-11-27 14:45:44 +00:00
ccb1de3595 Revert "[inductor] Fix torch.split bug on unbacked symint (#113406)"
This reverts commit cd7d6938c18d90870356553d4631f1388d2bb699.

Reverted https://github.com/pytorch/pytorch/pull/113406 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113406#issuecomment-1827727411))
2023-11-27 12:20:52 +00:00
fa1ccc34c4 Revert "[export] Support user input mutation. [1/2] (#114496)"
This reverts commit b62c0d96bcbe5f354ddce930fbdcd992dbaf1ce8.

Reverted https://github.com/pytorch/pytorch/pull/114496 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114496#issuecomment-1827289635))
2023-11-27 07:52:21 +00:00
8232d4d1c3 Revert "[BE]: Enable Ruff + Flake8 G201,G202 logging format rule. (#114474)"
This reverts commit d30497f6b62007c9d1e3c38179528e9d25ac1292.

Reverted https://github.com/pytorch/pytorch/pull/114474 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I see a bunch of inductor failure after the commit d30497f6b6, trying to revert to see if it helps fix the issues ([comment](https://github.com/pytorch/pytorch/pull/114474#issuecomment-1827271887))
2023-11-27 07:36:08 +00:00
150aaf46ca Revert "[opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#113885)"
This reverts commit 4fa1ff8404b6c26c076288aa2a0aa77f0c24916a.

Reverted https://github.com/pytorch/pytorch/pull/113885 on behalf of https://github.com/huydhn due to Sorry for reverting you change but its TestCommonCUDA::test_compare_cpu_nn_functional_conv3d test failing in trunk 4fa1ff8404 ([comment](https://github.com/pytorch/pytorch/pull/113885#issuecomment-1827268473))
2023-11-27 07:33:00 +00:00
68a36d2faa [dtensor] refactor some existing test util to use comm mode (#114404)
As titled, this is just a test util refactor: the redistributed
profiler is not good to use, and we should use comm mode going
forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114404
Approved by: https://github.com/wconstab
ghstack dependencies: #113592
2023-11-27 06:43:09 +00:00
b62c0d96bc [export] Support user input mutation. [1/2] (#114496)
Summary:
Serialization not implemented yet. Will do in the next diff.

Resolving Github issues:
https://github.com/pytorch/pytorch/issues/112429
https://github.com/pytorch/pytorch/issues/114142

Test Plan:
buck2 run mode/opt caffe2/test:test_export -- -r test_export_
input_mutation

Differential Revision: D51556962

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114496
Approved by: https://github.com/tugsbayasgalan
2023-11-27 04:53:38 +00:00
624f202522 [dtensor] add CommDebugMode for debugging (#113592)
This PR adds a CommDebugMode debugging tool to record the number of
distributed collectives, utilizing TorchDispatchMode. The idea borrows
from FlopCounterMode, and we can expand this later to make it similarly
feature complete.

This is useful for debugging and testing with DTensor. In general it
fits any complex distributed algorithm where it's non-trivial to
understand what happened under the hood; we can later cover c10d
collectives directly.

Not sure if it would be a good general distributed debug tool yet,
so adding to the dtensor package first.
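
A minimal sketch of the underlying mechanism, as a generic op counter (the real CommDebugMode filters for collectives; the class here is illustrative):

```python
from collections import Counter

import torch
from torch.utils._python_dispatch import TorchDispatchMode

class OpCounterMode(TorchDispatchMode):
    """Counts every dispatched op; CommDebugMode applies the same idea
    to distributed collectives."""
    def __init__(self):
        self.counts = Counter()

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.counts[str(func)] += 1
        return func(*args, **(kwargs or {}))

with OpCounterMode() as mode:
    torch.ones(4) + torch.ones(4)
print(mode.counts)
```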

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113592
Approved by: https://github.com/wconstab
2023-11-27 02:40:28 +00:00
081c5b3adc Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926) (#114526)
Summary:

The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors *at the end* of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor.

This PR is the result of *a lot* of back and forth with ezyang and eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same:

1) We cache source->symbol in shape_env
2) We pass policy objects around, stored at dynamo fakification time, and reused for later fakification
3) We create a new fake mode for backends
(from https://github.com/pytorch/pytorch/pull/113605/files)

This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't).

We went back to the drawing board here, but with a few concessions:
1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons
2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (ezyang did this)

cc penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 aakhundov kadeng

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: huydhn, Chillee

Differential Revision: D51566250

Pulled By: voznesenskym

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114526
Approved by: https://github.com/Chillee, https://github.com/huydhn
2023-11-26 23:40:32 +00:00
4fa1ff8404 [opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#113885)
Previous PR: https://github.com/pytorch/pytorch/pull/85202

Also, cc'ing @lezcano @kshitij12345 @zou3519, who reviewed my previous PR. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113885
Approved by: https://github.com/lezcano
2023-11-26 13:44:30 +00:00
028071c4a1 Fix test assertions in test_min_max_nodes_parse. (#114537)
Calls to `assertTrue` are corrected to `assertEqual` in `ElasticLaunchTest.test_min_max_nodes_parse`.

As originally written, the `assertTrue` statements will always pass, not actually asserting anything of value for the test.
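
A self-contained illustration of the trap (values are made up):

```python
import unittest

class Demo(unittest.TestCase):
    def test_always_passes(self):
        # assertTrue(expr, msg): the second argument is only a failure
        # message, so nothing is compared and any truthy expr passes
        self.assertTrue(1, 2)

    def test_actually_compares(self):
        self.assertEqual(1, 2)  # fails, as a real comparison should
```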

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114537
Approved by: https://github.com/Skylion007
2023-11-26 09:25:41 +00:00
bbdd9b059f [executorch hash update] update the pinned executorch hash (#114486)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114486
Approved by: https://github.com/pytorchbot
2023-11-26 03:50:54 +00:00
d37c4c6995 Update torch.compiler_troubleshooting.rst (#114530)
If you copy and paste the env var in the docs:
```console
TORCHDYNAMO_REPRO_AFTER=“aot”
```
it leads to this error:
```python
    @functools.wraps(unconfigured_compiler_fn)
    def debug_wrapper(gm, example_inputs, **kwargs):
        compiler_fn = functools.partial(unconfigured_compiler_fn, **kwargs)
>       assert config.repro_after in ("dynamo", "aot", None)
E       torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
E       AssertionError:
```
because `config.repro_after` ends up being `'“aot”'` rather than `'aot'`.

---

It would've saved a few minutes of my time 😄
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114530
Approved by: https://github.com/Chillee
2023-11-25 23:15:47 +00:00
0f5e24bda9 Properly type CachedFunction & rename to CachedMethod (#114161)
Previously, I was unsure how to properly type the parameters of a decorated method.
Then I found https://github.com/python/mypy/issues/13222#issuecomment-1193073470
which explains how to use `Concatenate` to hackily achieve it. Not entirely sure why
we can't write a user-defined version of `Callable` that works seamlessly for both functions
and methods...
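
A minimal sketch of the `Concatenate` pattern (Python 3.10+; names here are illustrative, not the PR's actual `CachedMethod` code):

```python
from typing import Callable, Concatenate, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")
S = TypeVar("S")

def cached_method(fn: Callable[Concatenate[S, P], R]) -> Callable[Concatenate[S, P], R]:
    cache: dict = {}

    def wrapper(self: S, /, *args: P.args, **kwargs: P.kwargs) -> R:
        key = (id(self), args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = fn(self, *args, **kwargs)
        return cache[key]

    return wrapper
```

The `Concatenate[S, P]` spelling is what lets the checker see `self` as the first positional parameter while preserving the rest of the method's signature.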

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114161
Approved by: https://github.com/Skylion007
2023-11-25 01:30:23 +00:00
d30497f6b6 [BE]: Enable Ruff + Flake8 G201,G202 logging format rule. (#114474)
Standardizes logging calls to always use logging.exception instead of logging.error where appropriate and enforces it with a lint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114474
Approved by: https://github.com/jansel
2023-11-24 23:29:51 +00:00
c6d88604d5 [Inductor] Fix mutation tracking of ConvolutionBinaryInplace (#114501)
The init function reorders the arguments, so the mutation actually
happens on argument `input[0]`.

I am not sure if there's a good way to test this, unfortunately. Added
tests in https://github.com/pytorch/pytorch/pull/114436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114501
Approved by: https://github.com/leslie-fang-intel, https://github.com/aakhundov
2023-11-24 19:32:41 +00:00
0a063ad2c0 [inductor] Pass None and skip constexpr in custom Triton kernel calls from C++ (#114475)
Summary: `None` arguments are codegened as `*i8` in the `triton_meta` of the generated or user-defined Triton kernels:

85aa372374/torch/_inductor/codegen/triton_utils.py (L33-L36)

Due to this, contrary to conventional Triton, we actually should pass `nullptr` to the Triton kernels in the C++ wrapper codegen instead of passing nothing (as normally `None` doesn't make it to the generated PTX parameters, just like `tl.constexpr` args).

This PR adds two things:

1. Proper C++ wrapper codegening (ABI and non-ABI) of `nullptr` and `c10::nullopt`, as the prior way codegening `c10::nullopt` as tensor breaks (also `c10` breaks in the ABI mode).

2. Skipping `tl.constexpr` args when calling the loaded-from-cubin compiled Triton kernel in the C++ wrapper codegen. As a side effect, this also resolves an issue with string arguments: now they are simply omitted in the C++ wrapper codegen.
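
For context, a user-defined Triton kernel with an optional pointer argument looks roughly like this (a sketch, assuming Triton's usual compile-time specialization of `None` arguments):

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, out_ptr, bias_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    if bias_ptr is not None:   # resolved at compile time when None is passed
        x += tl.load(bias_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x, mask=mask)
```

Conventionally a `None` argument is specialized away and never reaches the PTX parameter list; this PR handles the Inductor case where the signature keeps a `*i8` slot that must be filled with `nullptr`.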

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_with_none_input
...
----------------------------------------------------------------------
Ran 4 tests in 40.364s

OK (skipped=2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114475
Approved by: https://github.com/oulgen
2023-11-24 12:51:56 +00:00
cd7d6938c1 [inductor] Fix torch.split bug on unbacked symint (#113406)
torch.split(x, l) fails when l's values are unbacked symints.

E.g. l = y.tolist() makes l unbacked, because l depends on the
data access of y. The downstream call `SliceView.create()`
evaluates the shape even if the input shape is an unbacked symint,
which brings up the bug.
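
A hypothetical repro of the failure mode (the config flag is real, but this is an illustration, not the PR's test):

```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True  # keep .tolist() in-graph

@torch.compile(fullgraph=True)
def f(x, y):
    sizes = y.tolist()            # data-dependent -> unbacked symints
    return torch.split(x, sizes)  # previously failed under Inductor

# f(torch.randn(10), torch.tensor([3, 7]))
```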

Test Plan:
python test/inductor/test_unbacked_symints.py -k test_split_with_sizes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113406
Approved by: https://github.com/aakhundov, https://github.com/ezyang
2023-11-24 07:21:00 +00:00
51390722e9 Fix ConvolutionBinaryInplace using target node (#114436)
This IR node mutates in place; it needs to use the argument, not the
target.

Fixes #113440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114436
Approved by: https://github.com/jansel
ghstack dependencies: #114169
2023-11-24 06:25:11 +00:00
cyy
07e00de8d7 Add missing member initialization in c10::ExtraMeta constructor (#114448)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114448
Approved by: https://github.com/Skylion007
2023-11-24 03:54:11 +00:00
dad3cc4d02 Fix type for keep_inference_mutation flag (#114482)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114482
Approved by: https://github.com/Skylion007
ghstack dependencies: #114421, #114479, #114481
2023-11-24 00:04:31 +00:00
fa71f5efdc [BE][aot_autograd] Remove unnecessary fields from ViewMutationData (#114481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114481
Approved by: https://github.com/zhxchen17
ghstack dependencies: #114421, #114479
2023-11-24 00:04:26 +00:00
e6e650d5eb [BE][aot_autograd] Remove num_mutated_inputs (#114479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114479
Approved by: https://github.com/zhxchen17
ghstack dependencies: #114421
2023-11-24 00:04:25 +00:00
a378ae33e9 [BE][aot_autograd] Remove mutated_inp_indices (#114421)
We should use mutated_inp_runtime_indices moving forward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114421
Approved by: https://github.com/zhxchen17
2023-11-23 22:41:38 +00:00
cyy
b76e2949f7 Fix pool_size type in TaskThreadPool (#114063)
Negative values of pool_size mean defaultNumThreads() is called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114063
Approved by: https://github.com/Skylion007
2023-11-23 21:20:45 +00:00
a28876832c Fixed an export problem when moving tensors to CPU during torch.export.save (#114029)
For whatever reason, calling `.cpu()` on an `nn.Parameter` wrapping a CUDA tensor will return a plain (non-parameter) tensor. This PR fixes the symptom in the linked issue, but not the underlying issue.
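
A quick illustration of the behavior (requires a CUDA device):

```python
import torch

p = torch.nn.Parameter(torch.randn(2, device="cuda"))
print(type(p))        # <class 'torch.nn.parameter.Parameter'>
print(type(p.cpu()))  # <class 'torch.Tensor'> -- the Parameter wrapper is lost
```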

Fixes #113999.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114029
Approved by: https://github.com/zhxchen17
2023-11-23 21:17:43 +00:00
fd1a01a393 Set default LR value of SGD to 1e-3 (#114467)
Fixes https://github.com/pytorch/pytorch/issues/114089

Set the default lr of SGD to 1e-3 to increase the consistency of the input signatures of the optimizers.

@janeyx99
This should be the redacted PR #114434, sincerely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114467
Approved by: https://github.com/janeyx99
2023-11-23 19:07:38 +00:00
85aa372374 [inductor] Fixed conv issue with dynamic shapes (#114351)
EDIT: fixes https://github.com/pytorch/pytorch/issues/114354

Description:
The following code is failing:
```python
import torch

def func(x, w):
    return torch.nn.functional.conv2d(x, w, groups=int(w.shape[0]))

x = torch.rand(1, 3, 64, 64)
w = torch.rand(3, 1, 3, 3)
y1 = func(x, w)
cfunc = torch.compile(func, fullgraph=True, dynamic=True)
y2 = cfunc(x, w)

torch.testing.assert_close(y1, y2)
```
with the error:
```
  File "/pytorch/torch/_inductor/kernel/conv.py", line 315, in convolution
    assert isinstance(groups, int)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AssertionError:
  target: aten.convolution.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg3_1', layout=FixedLayout('cpu', torch.float32, size=[1, s0, s1, s1], stride=[s0*s1**2, s1**2, s1, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cpu', torch.float32, size=[s0, 1, s0, s0], stride=[s0**2, s0**2, s0, 1]))
  ))
  args[2]: None
  args[3]: [1, 1]
  args[4]: [0, 0]
  args[5]: [1, 1]
  args[6]: False
  args[7]: [0, 0]
  args[8]: s0
```
where `groups` argument is a symbol but expected to be `int`.

This PR specializes `groups` to its int value and fixes the problem.

Context: Failing tests in torchvision with gaussian blur and adjust_sharpness ops
- https://github.com/pytorch/vision/actions/runs/6955843968/job/18926393710?pr=8127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114351
Approved by: https://github.com/ezyang
2023-11-23 13:13:06 +00:00
01366efcc9 Revert "[pytree] register pytree node type in both C++ pytree and Python pytree (#112111)"
This reverts commit 4e4a6ad6ecd71a1aefde3992ecf7f77e37d2e264.

Reverted https://github.com/pytorch/pytorch/pull/112111 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/112111#issuecomment-1824099658))
2023-11-23 09:59:32 +00:00
a76bb5d84d Add support for models with mutated buffer on torch.onnx.dynamo_export (#112272)
This PR adds a unit test that leverages `torch.export.ExportedProgram` models that mutates registered buffers. Although the exporter already works out of the box in such scenario, the GraphModule and the exported ONNX model have extra outputs containing the mutated buffers. On future runs of the ONNX model, the mutated buffers are used as input to the model.

The aforementioned extra inputs and outputs are by design and the `ONNXProgram.model_signature` can be used to fetch detailed input/output schema for the exported model.

However, when we want to compare pytorch output to ONNX's, there is a mismatch between the schema because pytorch output does not include the mutated buffers present on the ONNX output.

This PR extends `onnx_program.adapt_torch_outputs_to_onnx(torch_outputs)` so that the mutated buffers are prepended to the Pytorch output, matching the ONNX schema.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112272
Approved by: https://github.com/titaiwangms, https://github.com/BowenBao
2023-11-23 09:59:02 +00:00
7daeb6509f Update audio pinned commit nightly (#114426)
I think we could have this pinned commit updated nightly like what we have with vision.  This will avoid having an outdated audio pinned commit that needs to be updated manually, i.e. https://github.com/pytorch/pytorch/pull/114393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114426
Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/malfet
2023-11-23 07:36:55 +00:00
6f340c6f30 Handle the case when opening a reverted PR with deleted head branch (#114423)
When reopening a reverted PR, `422: Unprocessable Entity` is returned if the head branch has been deleted, for example https://github.com/pytorch/pytorch/pull/112889#issuecomment-1823216686

```
{
  "message": "Validation Failed",
  "errors": [
    {
      "resource": "PullRequest",
      "code": "custom",
      "field": "state",
      "message": "state cannot be changed. The commsplit branch has been deleted."
    }
  ],
  "documentation_url": "https://docs.github.com/rest/pulls/pulls#update-a-pull-request"
}
```

The revert still happens though; only reopening the PR fails. I think that is OK to ignore in this case, instead of going the complicated route of having the merge bot try to restore the deleted branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114423
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-11-23 07:32:46 +00:00
a43edd836c Revert "Add support for models with mutated buffer on torch.onnx.dynamo_export (#112272)"
This reverts commit c4a22d6918b7ca218f2712d7e7e147aca7127fa3.

Reverted https://github.com/pytorch/pytorch/pull/112272 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing dynamo test in trunk c4a22d6918 ([comment](https://github.com/pytorch/pytorch/pull/112272#issuecomment-1823897964))
2023-11-23 07:07:56 +00:00
066e072524 Retry #112889 (Opportunistically use ncclCommSplit when creating new NCCL groups) (#114385)
- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`

Fixes cause of revert of original PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385
Approved by: https://github.com/huydhn
2023-11-23 07:00:00 +00:00
ed05af278c [DTensor] Passed dynamic=False for compile tests (#114390)
Test Plan:
```
python test/distributed/_tensor/test_dtensor_compile.py
```

We found that after https://github.com/pytorch/pytorch/pull/114236 landed, DTensor + `torch.compile` tests were breaking (which was confounded with `DTensorSpec` hash changes). The temporary solution is to pass `dynamic=False`.

Otherwise, we see errors like:
<details>

```
======================================================================
ERROR: test_2d_fsdp_tp_ac_compile (__main__.TestDTensorCompileE2E)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 533, in wrapper
    self._join_processes(fn)
  File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 752, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 802, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 649, in run_test
    getattr(self, test_name)()
  File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 535, in wrapper
    fn()
  File "/data/users/andgu/pytorch/torch/testing/_internal/common_utils.py", line 2652, in wrapper
    method(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 193, in wrapper
    func(self, *args, **kwargs)  # type: ignore[misc]
  File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
    return func(*args, **kwargs)
  File "/data/users/andgu/pytorch/test/distributed/_tensor/test_dtensor_compile.py", line 328, in test_2d_fsdp_tp_ac_compile
    compiled_output = compiled_2d(inp)
  File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 848, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/eval_frame.py", line 655, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 721, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 383, in _convert_frame_assert
    compiled_product = _compile(
  File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 645, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/data/users/andgu/pytorch/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 562, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/data/users/andgu/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 151, in _fn
    return fn(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 527, in transform
    tracer.run()
  File "/data/users/andgu/pytorch/torch/_dynamo/symbolic_convert.py", line 2123, in run
    super().run()
  File "/data/users/andgu/pytorch/torch/_dynamo/symbolic_convert.py", line 818, in run
    and self.step()
  File "/data/users/andgu/pytorch/torch/_dynamo/symbolic_convert.py", line 781, in step
    getattr(self, inst.opname)(inst)
  File "/data/users/andgu/pytorch/torch/_dynamo/symbolic_convert.py", line 2238, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/data/users/andgu/pytorch/torch/_dynamo/output_graph.py", line 912, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/home/andgu/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/andgu/pytorch/torch/_dynamo/output_graph.py", line 1069, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/data/users/andgu/pytorch/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/output_graph.py", line 1141, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/data/users/andgu/pytorch/torch/_dynamo/output_graph.py", line 1122, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/data/users/andgu/pytorch/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/data/users/andgu/pytorch/torch/__init__.py", line 1696, in __call__
    return self.compiler_fn(model_, inputs_, **self.kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/backends/common.py", line 55, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 4946, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/data/users/andgu/pytorch/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 4486, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 2825, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3011, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3714, in aot_dispatch_autograd
    fx_g, joint_inputs, maybe_subclass_meta = aot_dispatch_autograd_graph(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3694, in aot_dispatch_autograd_graph
    fx_g = create_graph(joint_fn_to_trace, updated_joint_inputs, aot_config=aot_config)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1955, in create_graph
    fx_g = make_fx(f, decomposition_table=aot_config.decompositions)(*args)
  File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 869, in wrapped
    t = dispatch_trace(wrap_key(func, args, fx_tracer, pre_dispatch), tracer=fx_tracer, concrete_args=tuple(phs))
  File "/data/users/andgu/pytorch/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 481, in dispatch_trace
    graph = tracer.trace(root, concrete_args)
  File "/data/users/andgu/pytorch/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/fx/_symbolic_trace.py", line 821, in trace
    (self.create_arg(fn(*args)),),
  File "/data/users/andgu/pytorch/torch/fx/_symbolic_trace.py", line 688, in flatten_fn
    tree_out = root_fn(*tree_args)
  File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 517, in wrapped
    out = f(*tensors)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3607, in joint_fn
    return inner_fn(flat_fn_maybe_joint, (primals, tangents), use_trace_joint=True)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3591, in inner_fn
    wrapped_outs = fn(*all_args)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1941, in joint_helper
    return functionalized_f_helper(primals, tangents)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1894, in functionalized_f_helper
    f_outs = fn(*f_args)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1862, in inner_fn_with_anomaly
    return inner_fn(*args)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1796, in inner_fn
    outs, tangent_mask = fn(*primals)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1724, in inner_fn
    outs = fn(*args_maybe_cloned)
  File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 4552, in functional_call
    out = Interpreter(mod).run(*args[params_len:], **kwargs)
  File "/data/users/andgu/pytorch/torch/fx/interpreter.py", line 138, in run
    self.env[node] = self.run_node(node)
  File "/data/users/andgu/pytorch/torch/fx/interpreter.py", line 195, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/data/users/andgu/pytorch/torch/fx/interpreter.py", line 267, in call_function
    return target(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/distributed/_tensor/api.py", line 280, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
  File "/data/users/andgu/pytorch/torch/distributed/_tensor/dispatch.py", line 106, in dispatch
    self.sharding_propagator.propagate(op_info)
  File "/data/users/andgu/pytorch/torch/distributed/_tensor/sharding_prop.py", line 161, in propagate
    output_sharding = self.propagate_op_sharding_non_cached(op_info.schema)
  File "/data/users/andgu/pytorch/torch/distributed/_tensor/sharding_prop.py", line 175, in propagate_op_sharding_non_cached
    out_tensor_meta = self._propagate_tensor_meta(op_schema)
  File "/data/users/andgu/pytorch/torch/distributed/_tensor/sharding_prop.py", line 85, in _propagate_tensor_meta
    fake_args = op_schema.gen_fake_args()
  File "/data/users/andgu/pytorch/torch/distributed/_tensor/op_schema.py", line 332, in gen_fake_args
    return tree_map_only(
  File "/data/users/andgu/pytorch/torch/utils/_cxx_pytree.py", line 765, in tree_map_only
    return tree_map(
  File "/data/users/andgu/pytorch/torch/utils/_cxx_pytree.py", line 607, in tree_map
    return optree.tree_map(
  File "/home/andgu/local/miniconda3/envs/pytorch-3.10/lib/python3.10/site-packages/optree/ops.py", line 473, in tree_map
    return treespec.unflatten(flat_results)
  File "/data/users/andgu/pytorch/torch/utils/_cxx_pytree.py", line 713, in wrapped
    return func(x)
  File "/data/users/andgu/pytorch/torch/distributed/_tensor/op_schema.py", line 31, in _rebuild_tensor_from_dtensor_meta
    return torch.empty_strided(
  File "/data/users/andgu/pytorch/torch/_subclasses/functional_tensor.py", line 297, in __torch_dispatch__
    outs_unwrapped = func(*args_unwrapped, **kwargs_unwrapped)
  File "/data/users/andgu/pytorch/torch/_ops.py", line 509, in __call__
    return self._op(*args, **kwargs or {})
  File "/data/users/andgu/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 594, in __torch_dispatch__
    return self.inner_torch_dispatch(func, types, args, kwargs)
  File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 629, in inner_torch_dispatch
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 317, in proxy_call
    proxy_args, proxy_kwargs = pytree.tree_map_only(
  File "/data/users/andgu/pytorch/torch/utils/_pytree.py", line 631, in tree_map_only
    return tree_map(map_only(__type_or_types)(func), tree)
  File "/data/users/andgu/pytorch/torch/utils/_pytree.py", line 523, in tree_map
    return tree_unflatten([func(i) for i in flat_args], spec)
  File "/data/users/andgu/pytorch/torch/utils/_pytree.py", line 523, in <listcomp>
    return tree_unflatten([func(i) for i in flat_args], spec)
  File "/data/users/andgu/pytorch/torch/utils/_pytree.py", line 591, in wrapped
    return func(x)
  File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 247, in inner
    return get_proxy_slot(n, tracer)()
  File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 110, in get_proxy_slot
    raise RuntimeError(f"{obj} is not tracked with proxy for {tracer}")
torch._dynamo.exc.BackendCompilerFailed: backend='aot_eager' raised:
RuntimeError: s0 is not tracked with proxy for <torch.fx.experimental.proxy_tensor.PythonKeyTracer object at 0x7fae60366c50>

While executing %result_2 : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%prim_redistribute_2, %l_self_mlp_0_net2_weight, %l_self_mlp_0_net2_bias), kwargs = {})
Original traceback:
  File "/data/users/andgu/pytorch/test/distributed/_tensor/test_dtensor_compile.py", line 51, in forward
    return self.mlp_1(self.mlp_0(input))
  File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 64, in forward
    return self.net2(self.relu(self.net1(x)))
  File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/data/users/andgu/pytorch/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114390
Approved by: https://github.com/wanchaol, https://github.com/huydhn
ghstack dependencies: #114379
2023-11-23 05:47:38 +00:00
34326e43eb [DTensor] Made DTensorSpec hash recomputation lazy (#114379)
If we assign `spec.tensor_meta = ...`, we do not have to recompute the hash eagerly. We just need to clear the existing hash so that the next call to `__hash__` recomputes it.
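
A minimal sketch of the lazy invalidation (attribute names are illustrative, not `DTensorSpec`'s actual fields):

```python
class Spec:
    def __init__(self, tensor_meta):
        self._tensor_meta = tensor_meta
        self._hash = None                  # computed on first __hash__ call

    @property
    def tensor_meta(self):
        return self._tensor_meta

    @tensor_meta.setter
    def tensor_meta(self, value):
        self._tensor_meta = value
        self._hash = None                  # invalidate; recompute lazily

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self._tensor_meta)
        return self._hash
```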

We found that the breakage of the DTensor + `torch.compile` tests comes from https://github.com/pytorch/pytorch/pull/114236 and are not directly related to the `DTensorSpec` hashing changes. We fix that in the following PR temporarily by passing `dynamic=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114379
Approved by: https://github.com/wanchaol
2023-11-23 05:45:18 +00:00
36763d3135 [ProcessGroupNCCL] Move new trace utils (#114367)
to TraceUtils.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114367
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2023-11-23 05:07:41 +00:00
c340db56d5 [executorch hash update] update the pinned executorch hash (#114427)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114427
Approved by: https://github.com/pytorchbot
2023-11-23 04:54:06 +00:00
088043fc49 [FSDP] Passed TORCH_NCCL_DESYNC_DEBUG instead of NCCL_DESYNC_DEBUG (#114432)
This is to silence some warnings like:
```
[rank0]:[W Utils.hpp:164] Warning: Environment variable NCCL_DESYNC_DEBUG is deprecated; use TORCH_NCCL_DESYNC_DEBUG instead (function getCvarBool)
[rank3]:[W Utils.hpp:164] Warning: Environment variable NCCL_DESYNC_DEBUG is deprecated; use TORCH_NCCL_DESYNC_DEBUG instead (function getCvarBool)
[rank1]:[W Utils.hpp:164] Warning: Environment variable NCCL_DESYNC_DEBUG is deprecated; use TORCH_NCCL_DESYNC_DEBUG instead (function getCvarBool)
[rank2]:[W Utils.hpp:164] Warning: Environment variable NCCL_DESYNC_DEBUG is deprecated; use TORCH_NCCL_DESYNC_DEBUG instead (function getCvarBool)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114432
Approved by: https://github.com/fegin
2023-11-23 04:53:12 +00:00
d18e6b07aa Overload vec::dequantize to eliminate rounding error for quantized sigmoid (#114098)
**Description**
Fix #107030
Dequantize X by `(x_val - zp) * scale` instead of `x_val * scale + (-zp * scale)` to eliminate rounding error.
Now this overload is used for sigmoid only.
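
A plain-Python illustration of why the two algebraically equal forms can differ (values are arbitrary):

```python
scale, zp, x_val = 0.1, 1, 3
a = (x_val - zp) * scale           # 0.2
b = x_val * scale + (-zp * scale)  # 0.20000000000000004
print(a == b)                      # False: the second form accumulates more rounding error
```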

Performance impact:
![image](https://github.com/pytorch/pytorch/assets/12522207/655abd16-7d9d-4a9a-8c59-327ebf39157a)
Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake)

**Test plan**
`python test_quantization.py TestQuantizedOps.test_sigmoid_dequantize_rounding_error`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114098
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-11-23 04:33:57 +00:00
c4a22d6918 Add support for models with mutated buffer on torch.onnx.dynamo_export (#112272)
This PR adds a unit test that leverages `torch.export.ExportedProgram` models that mutates registered buffers. Although the exporter already works out of the box in such scenario, the GraphModule and the exported ONNX model have extra outputs containing the mutated buffers. On future runs of the ONNX model, the mutated buffers are used as input to the model.

The aforementioned extra inputs and outputs are by design and the `ONNXProgram.model_signature` can be used to fetch detailed input/output schema for the exported model.

However, when we want to compare pytorch output to ONNX's, there is a mismatch between the schema because pytorch output does not include the mutated buffers present on the ONNX output.

This PR extends `onnx_program.adapt_torch_outputs_to_onnx(torch_outputs)` so that the mutated buffers are prepended to the Pytorch output, matching the ONNX schema.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112272
Approved by: https://github.com/titaiwangms, https://github.com/BowenBao
2023-11-23 03:39:18 +00:00
b27565ad7d Forward fix D51468211 (#114381)
Summary:
Forward fix test failures caused by D51468211.

The root cause is that when converting the param_buffer into fake_tensor, we didn't set `static_shapes=True`, which causes the shape_env to have more symbols than expected. The current status is that we assume all params and buffers have constant sizes.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests:export_test_cpu -- --exact 'aps_models/ads/icvr/tests:export_test_cpu - test_20x_icvr_export (aps_models.ads.icvr.tests.export_test.ExportTest)'

Reviewed By: hongtansun-meta

Differential Revision: D51531279

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114381
Approved by: https://github.com/angelayi
2023-11-23 02:58:52 +00:00
7a697c4683 [RelEng] Tag docker images for release, pin unstable and disabled jobs, apply release only changes (#114355)
1. This tags docker images using docker pull/tag/push for current release
2. Sets RELEASE_VERSION_TAG var and regenerates the workflows using the new docker tag
3. Remove conda token setting and binary tests release changes; these are already automated
4. Pin unstable and disabled jobs; automates: https://github.com/pytorch/pytorch/pull/111675

Test:
```
RELEASE_VERSION=2.2 ./scripts/release/apply-release-changes.sh
Tagging pytorch/manylinux-builder:cuda11.8-main to pytorch/manylinux-builder:cuda11.8-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:cuda12.1-main to pytorch/manylinux-builder:cuda12.1-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:cuda11.8-main to pytorch/libtorch-cxx11-builder:cuda11.8-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:cuda12.1-main to pytorch/libtorch-cxx11-builder:cuda12.1-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:rocm5.6-main to pytorch/manylinux-builder:rocm5.6-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:rocm5.7-main to pytorch/manylinux-builder:rocm5.7-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:rocm5.6-main to pytorch/libtorch-cxx11-builder:rocm5.6-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:rocm5.7-main to pytorch/libtorch-cxx11-builder:rocm5.7-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:cpu-main to pytorch/manylinux-builder:cpu-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:cpu-main to pytorch/libtorch-cxx11-builder:cpu-2.2 , dry_run: enabled
Tagging pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main to pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-2.2 , dry_run: enabled
Tagging pytorch/manylinuxaarch64-builder:cpu-aarch64-main to pytorch/manylinuxaarch64-builder:cpu-aarch64-2.2 , dry_run: enabled
Tagging pytorch/conda-builder:cuda11.8-main to pytorch/conda-builder:cuda11.8-2.2 , dry_run: enabled
Tagging pytorch/conda-builder:cuda12.1-main to pytorch/conda-builder:cuda12.1-2.2 , dry_run: enabled
Tagging pytorch/conda-builder:cpu-main to pytorch/conda-builder:cpu-2.2 , dry_run: enabled
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-manywheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-conda-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-manywheel-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-wheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-conda-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-release-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-debug-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-release-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-debug-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-binary-wheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-binary-conda-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-arm64-binary-libtorch-cxx11-abi-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-arm64-binary-wheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-arm64-binary-conda-nightly.yml
```

Result of pinning unstable and disabled jobs:
```
# The link to the published list of disabled jobs
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json?versionid=kKJlAXdrUbk3CilXbKu.6OwNTGQB8a.B"
# and unstable jobs
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json?versionid=vzaicOxSsh55iXBXwgGrW6dFeVtPfrhr"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114355
Approved by: https://github.com/malfet
2023-11-23 02:14:22 +00:00
2bae888f65 Automated submodule update: FBGEMM (#113977)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: a142e2064d

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113977
Approved by: https://github.com/malfet
2023-11-23 01:46:34 +00:00
272b40aee5 Revert "deprecate PairwiseParallel from test (#114314)"
This reverts commit 07b6f377b401933e69a605037b8a5c2fba627601.

Reverted https://github.com/pytorch/pytorch/pull/114314 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but this seems to fail periodic multigpu tests ([comment](https://github.com/pytorch/pytorch/pull/114314#issuecomment-1823727818))
2023-11-23 01:43:32 +00:00
f961bda939 [export] Move serialized custom class objs to toplevel (#114371)
Summary:
Move the serialized CustomClassHolder objects to the toplevel SerializedArtifact instead of embedding the bytes in the graph.

Currently the CustomClassHolder objects are embedded in the graph instead of being lifted to the ExportedProgram, so some logic is introduced to lift them to the top level of the serialized ExportedProgram. However, once the CustomClassHolder objects get lifted, we can remove the TODOs I added.

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D51479125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114371
Approved by: https://github.com/ydwu4
2023-11-22 23:44:20 +00:00
eqy
6a86cf00ad [CUDA][cuBLAS] Remove explicit cuBLAS workspace allocation for CUDA 12.2+ (#113994)
cuBLAS should be using `cudaMallocAsync` in CUDA 12.2+, which removes the need for explicit workspace allocation to avoid increasing memory usage with multiple graph captures.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113994
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-22 23:23:51 +00:00
5f504d1de7 Check for boolean values as argument on pow function. (#114133)
Hello everyone! 😄
Also @lezcano , nice to meet you! :)

Sorry if I miss anything, this is my first time around here. 🙃

This PR basically makes `torch.pow` behave the same way on CUDA. Python considers True as 1 and False as 0, and I just added this check to the `pow` function. From what I understood, when I call `.equal` on a `Scalar` that is boolean, the types are guaranteed to match, so that won't cause more trouble.

I know the issue suggests disabling this case, but that could be a little more complicated, in my humble opinion, and it could create some compatibility problems too, I guess.

My argument is that the code below is valid native Python, so I guess it does make sense to send booleans as Scalar.

```
>>> x = True
>>> x + x
2
```

This was my first test:
```
Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct  2 2023, 17:29:18) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.pow(torch.tensor([1, 2], device='cuda'), True)
tensor([1, 2], device='cuda:0')
>>> torch.pow(torch.tensor([1, 2]), True)
tensor([1, 2])
>>> torch.pow(torch.tensor([1, 2]), False)
tensor([1, 1])
>>> torch.pow(torch.tensor([1, 2], device='cuda'), False)
tensor([1, 1], device='cuda:0')
```

I've run `test_torch.py` and got the following results, so my guess is that I didn't break anything. I was just looking for a test that uses linear regression, as suggested.

```
Ran 1619 tests in 52.363s

OK (skipped=111)
[TORCH_VITAL] Dataloader.enabled		 True
[TORCH_VITAL] Dataloader.basic_unit_test		 TEST_VALUE_STRING
[TORCH_VITAL] CUDA.used		 true

```
(I can paste whole log, if necessary)

If this is a bad idea overall, don't worry about it. It's not a big deal; it's actually a two-line change 😅, so we can talk about how to do things with a different strategy.

For the record, I've signed the agreement already. And I didn't run the linter because it's not working 😞. It looks like PyYAML 6.0 is broken and there's a 6.0.1 fix already, but I have no idea how to update that 😅

Fixes #113198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114133
Approved by: https://github.com/lezcano
2023-11-22 22:57:36 +00:00
aca6446a6e [executorch hash update] update the pinned executorch hash (#114325)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114325
Approved by: https://github.com/pytorchbot
2023-11-22 22:38:40 +00:00
6f3cd046ab [BE] remove skipIfDynamo for some module hook tests (#114387)
As titled.

Test Plan:
Existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114387
Approved by: https://github.com/ezyang
2023-11-22 22:15:34 +00:00
2f536ff92c Refactor values kwarg in foreach tests (#112781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112781
Approved by: https://github.com/lezcano
ghstack dependencies: #112778
2023-11-22 22:10:54 +00:00
ea7d70aecc [BE]: ruff FURB136: replace ternary with min/max (preview) (#114382)
Replaces ternary if else statements with simple min max when appropriate.
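For example, a hypothetical instance of the rewrite:

```python
a, b = 3, 7
highest = a if a > b else b  # ternary form flagged by FURB136
highest = max(a, b)          # equivalent builtin form
```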
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114382
Approved by: https://github.com/albanD
2023-11-22 22:10:01 +00:00
88a8a0daa4 Revert "Require less alignment for masking (#114173)"
This reverts commit f882c175d8e9731238c3f29ca10821f2fe9f0797.

Reverted https://github.com/pytorch/pytorch/pull/114173 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing some inductor tests f882c175d8 ([comment](https://github.com/pytorch/pytorch/pull/114173#issuecomment-1823552362))
2023-11-22 21:49:31 +00:00
e7726b596e [FSDP] Added DDP parity test for CPU training (#114372)
This is a follow-up to https://github.com/pytorch/pytorch/pull/112145/ to include a numerical parity test with DDP for CPU training.
```
python -m pytest test/distributed/fsdp/test_fsdp_misc.py -k test_fsdp_cpu_training -s
```

We should follow-up on https://github.com/pytorch/pytorch/pull/112145/files#r1375102283 at some point too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114372
Approved by: https://github.com/XilunWu
2023-11-22 21:46:57 +00:00
1b66701379 ci: Bump TorchAudio, less third_party deps (#114393)
Installing the current pinned version of TorchAudio can be problematic because it
expects to be able to download a file from sourceware.org (see [ref](a8f4e97bd5/third_party/bzip2/CMakeLists.txt (L14))) and that does
not have any guarantees of uptime.

This bumps the pin to the latest v2.1.1 commit (https://github.com/pytorch/audio/releases/tag/v2.1.1), which should have fewer third_party dependencies and thus be less flaky.

<details>

<summary> Should help with errors like: </summary>

logs link: https://github.com/pytorch/pytorch/actions/runs/6959510046/job/18942955523#step:15:592

```
5h+ pip install --progress-bar off --no-use-pep517 --user git+https://github.com/pytorch/audio.git@a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
Collecting git+https://github.com/pytorch/audio.git@a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
  Cloning https://github.com/pytorch/audio.git (to revision a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602) to /tmp/pip-req-build-6b5hkzmq
  Running command git clone --filter=blob:none --quiet https://github.com/pytorch/audio.git /tmp/pip-req-build-6b5hkzmq
  Running command git rev-parse -q --verify 'sha^a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602'
  Running command git fetch -q https://github.com/pytorch/audio.git a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
  Running command git checkout -q a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
  Resolved https://github.com/pytorch/audio.git to commit a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [60 lines of output]
      Traceback (most recent call last):
        File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 1348, in do_open
          h.request(req.get_method(), req.selector, req.data, headers,
        File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1283, in request
          self._send_request(method, url, body, headers, encode_chunked)
        File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1329, in _send_request
          self.endheaders(body, encode_chunked=encode_chunked)
        File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1278, in endheaders
          self._send_output(message_body, encode_chunked=encode_chunked)
        File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1038, in _send_output
          self.send(msg)
        File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 976, in send
          self.connect()
        File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1448, in connect
          super().connect()
        File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 942, in connect
          self.sock = self._create_connection(
        File "/opt/conda/envs/py_3.10/lib/python3.10/socket.py", line 845, in create_connection
          raise err
        File "/opt/conda/envs/py_3.10/lib/python3.10/socket.py", line 833, in create_connection
          sock.connect(sa)
      OSError: [Errno 99] Cannot assign requested address

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-req-build-6b5hkzmq/setup.py", line 184, in <module>
          _main()
        File "/tmp/pip-req-build-6b5hkzmq/setup.py", line 145, in _main
          _fetch_third_party_libraries()
        File "/tmp/pip-req-build-6b5hkzmq/setup.py", line 129, in _fetch_third_party_libraries
          _fetch_archives(_parse_sources())
        File "/tmp/pip-req-build-6b5hkzmq/setup.py", line 123, in _fetch_archives
          torch.hub.download_url_to_file(url, dest, progress=False)
        File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py", line 620, in download_url_to_file
          u = urlopen(req)
        File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
          return opener.open(url, data, timeout)
        File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 519, in open
          response = self._open(req, data)
        File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 536, in _open
          result = self._call_chain(self.handle_open, protocol, protocol +
        File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
          result = func(*args)
        File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 1391, in https_open
          return self.do_open(http.client.HTTPSConnection, req,
        File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 1351, in do_open
          raise URLError(err)
      urllib.error.URLError: <urlopen error [Errno 99] Cannot assign requested address>
      -- Git branch: HEAD
      -- Git SHA: a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
      -- Git tag: None
      -- PyTorch dependency: torch
      -- Building version 2.0.0a0+a8f4e97
       --- Initializing submodules
       --- Initialized submodule
       --- Fetching v1.2.12.tar.gz
       --- Fetching bzip2-1.0.8.tar.gz
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
```

</details>

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114393
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/huydhn
2023-11-22 21:05:20 +00:00
d416e5b34f [torchrun] fix incorrect warning for non static backend (#114335)
This PR fixes an incorrect warning for non-static rdzv backends; the
warning should only be thrown when the rdzv endpoint is not specified.

error repro from @stas00

```
$ cat test.py
import torch

$ python -u -m torch.distributed.run --nproc_per_node=1 --rdzv_endpoint localhost:6000  --rdzv_backend c10d test.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114335
Approved by: https://github.com/H-Huang
2023-11-22 20:09:14 +00:00
f882c175d8 Require less alignment for masking (#114173)
# Summary
Improved Fix for Attention Mask Alignment Issue (#112577)

This PR addresses Issue #112577 by refining the previously implemented fix, which was found to be incorrect and caused unneeded memory regressions. The update simplifies the approach to handling the alignment of the attention mask for mem-efficient attention.

## Changes
Alignment Check and Padding: Initially, the alignment of the attention mask is checked. If misalignment is detected, padding is applied, followed by slicing. During this process, a warning is raised to alert users.

Should this be warn_once?

We only call expand once, on the aligned mask.

Reference
https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/cutlass.py#L115
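A minimal sketch of the pad-then-slice approach, assuming a 16-element alignment requirement (illustrative names, not the ATen implementation):

```python
import torch

def align_last_dim(mask: torch.Tensor, alignment: int = 16) -> torch.Tensor:
    last = mask.size(-1)
    if last % alignment == 0:
        return mask  # already aligned, no copy needed
    pad = alignment - last % alignment
    # Pad the last dim so the underlying storage is aligned, then slice
    # back to the logical size; the slice is a view over aligned storage.
    return torch.nn.functional.pad(mask, (0, pad))[..., :last]
```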

@albanD, @mruberry, @jbschlosser, @walterddr, and @mikaylagawarecki.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114173
Approved by: https://github.com/danthe3rd
2023-11-22 20:02:51 +00:00
07b6f377b4 deprecate PairwiseParallel from test (#114314)
**Summary**
To solve issue #113706:
1. replace `PairwiseParallel` with `ColwiseParallel` and `RowwiseParallel`.
2. replace the `make_input_replicate_1d` and `make_output_replicate_1d` inputs of `ColwiseParallel` with `input_layouts` and `output_layouts`.
3. deprecate the tests for `_parallelize_mlp` as it only supports `PairwiseParallel`.

**Test Plan**
`pytest pytorch/test/distributed/tensor/parallel/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114314
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2023-11-22 19:45:40 +00:00
9d68cfee0d [sparse][semi-structured] Make cusparseLt handle + flag thread_local (#114273)
Summary:

As raised in this issue: https://github.com/pytorch/pytorch/issues/113776

cuSPARSELt does not support sharing handles across different threads.
Ideally we would use something like CuSparseHandlePool to do this, but
since cuSPARSELt handle creation is inconsistent with the rest of CUDA,
we have to make these variables thread_local instead.
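The fix itself is C++ `thread_local` storage; as a rough Python analogue (illustrative only, not the actual code), the same lazy per-thread caching pattern looks like:

```python
import threading

_tls = threading.local()

def get_handle():
    # Each thread lazily creates and caches its own handle, mirroring the
    # thread_local cuSPARSELt handle in the C++ change.
    if not hasattr(_tls, "handle"):
        _tls.handle = object()  # stand-in for a cusparseLtInit-created handle
    return _tls.handle
```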

Test Plan:

`python test/test_sparse_semi_structured.py`

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114273
Approved by: https://github.com/danthe3rd
2023-11-22 18:55:52 +00:00
84909fef52 Add meta registration for aten.linear_backward (#114359)
Fixes #114358
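Roughly, a meta kernel for this op only computes output shapes; a sketch of the logic (illustrative, not the exact registration added here):

```python
import torch

def linear_backward_meta(self, grad_output, weight, output_mask):
    # Meta tensors carry shapes and dtypes but no data, which is all that
    # fake-tensor tracing needs from aten.linear_backward.
    grad_input = torch.empty_like(self) if output_mask[0] else None
    grad_weight = torch.empty_like(weight) if output_mask[1] else None
    grad_bias = grad_output.new_empty(grad_output.shape[-1]) if output_mask[2] else None
    return grad_input, grad_weight, grad_bias
```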

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114359
Approved by: https://github.com/ezyang
2023-11-22 18:24:24 +00:00
0f887a6d1a limit fused kernel num args. (#113131)
Fixes #97361

When a fused kernel has more than 1024 parameters, ctypes throws an error.
Limiting the number of args is a mechanism to protect stack memory: as we know, C++ passes args via the stack, and stack memory has a size limit.

Code change:

1. The cpp backend checks the fused nodes' arg count; once it reaches the limit, the backend flushes its status to ready.
2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen.
3. Adds a `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False since it does not support it yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
2023-11-22 18:05:33 +00:00
1f1ff629a8 Use parent class attribute supports_out for foreach_zero opinfo (#112778)
Instead of introducing a new has_no_out_of_place attribute
Also fixes foreach_copy tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112778
Approved by: https://github.com/lezcano
2023-11-22 18:00:44 +00:00
d6578b3678 [quant][pt2e] Refactor some internal code for observer insertion (#113500)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113500
Approved by: https://github.com/kimishpatel
2023-11-22 17:44:46 +00:00
b927a4e2ca Revert "Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)"
This reverts commit 64a5372e6ce9b6ca0ee5c7482b27e24561725b28.

Reverted https://github.com/pytorch/pytorch/pull/112889 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing ROCm distributed jobs in trunk 4d07428ede ([comment](https://github.com/pytorch/pytorch/pull/112889#issuecomment-1823214376))
2023-11-22 17:43:51 +00:00
00ae299016 [c10d] Remove unused function (#114341)
Summary: As the title suggests

Test Plan: OSS CI

Differential Revision: D51386619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114341
Approved by: https://github.com/Skylion007
2023-11-22 17:31:20 +00:00
9fcf1f9632 [export] Update schema (#114172)
Summary: Will update CustomClassHolder in a followup

Test Plan: CI

Differential Revision: D51343522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114172
Approved by: https://github.com/zhxchen17
2023-11-22 16:43:43 +00:00
9bab96c78c [ONNX] Consider negative dim in _index_fill_reshape_helper (#114050)
Fix export issue of index_copy op with negative dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114050
Approved by: https://github.com/thiagocrepaldi
2023-11-22 15:40:57 +00:00
f2ca07b680 [ProcessGroupNCCL] Remove jumper to UCC (#114170)
The "jumper" to UCC lib in ProcessGroupNCCL was a temporary solution a while back. Cleaning it now that UCC has its own "PG" representation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114170
Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/Aidyn-A
2023-11-22 15:35:06 +00:00
d7f698102e Disable MPS tests on macos-m1-13 runners (#114360)
As all of them are down at the moment, see screenshot below from [HUD](https://hud.pytorch.org/metrics)
<img width="669" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/6c400791-ae7e-460a-9e77-55d454b587f3">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114360
Approved by: https://github.com/atalman
2023-11-22 15:08:15 +00:00
324cde59b2 [MPS] Fix test_copy_cast_no_leak (#114313)
When running on macOS 13.2, the test always fails on the first run but succeeds on the second, as presumably it reserves some memory to cache the f32->f16 graph. Make it resilient against such failures by adding a warmup step in which one conversion is performed before recording driver memory utilization.

Fixes https://github.com/pytorch/pytorch/issues/114305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114313
Approved by: https://github.com/huydhn
2023-11-22 14:48:24 +00:00
33fad1c0d4 [AOTI] Fix a weight loading issue when the weight size can be 0 (#114280)
Summary: When a weight tensor is 0-size, no device memory should be allocated for it. This PR fixes the weight loading logic for such a case. This problem was found when running the 14K model test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114280
Approved by: https://github.com/chenyang78
2023-11-22 14:03:51 +00:00
2f3beb715c Revert "Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926)"
This reverts commit 2ca1119d532af0ba385c7b5944b954c9385b4901.

Reverted https://github.com/pytorch/pytorch/pull/113926 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113926#issuecomment-1822713852))
2023-11-22 12:52:33 +00:00
e239a2b2d7 Revert "[dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)"
This reverts commit 266054c3cac0f800f37348aea1409c4759dd2315.

Reverted https://github.com/pytorch/pytorch/pull/114154 on behalf of https://github.com/DanilBaibak due to The lower PR in the stack https://github.com/pytorch/pytorch/pull/113926 breaks the internal build ([comment](https://github.com/pytorch/pytorch/pull/114154#issuecomment-1822704476))
2023-11-22 12:46:15 +00:00
b4faa6bfa4 [dynamo] report guard failure user stack, fix incorrectly skipping interesting files (#114053)
Fixes https://github.com/pytorch/pytorch/issues/114015

Before:
```
test/dynamo/test_functions.py::DefaultsTests::test_zip_strict [2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'], 94696321555200)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] len(L['ys']) == 3
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'], 94696321555200)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] len(L['zs']) == 3
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][0], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][0] == 1.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][1], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][1] == 2.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][2], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][2] == 3.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][0], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][0] == 2.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][1], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][1] == 5.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][2], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][2] == 8.0
[2023-11-18 23:11:09,317] [0/0] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:365 in init_ambient_guards
[2023-11-18 23:11:09,317] [0/0] torch._dynamo.guards.__guards: [DEBUG] (___skip_backend_check() or ___current_backend() == ___lookup_backend(140084534469552))  # _dynamo/output_graph.py:371 in init_ambient_guards
[2023-11-18 23:11:09,317] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['x'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[3], stride=[1])
[2023-11-18 23:11:09,320] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function fn in /home/jonch/Desktop/Programming/mlsys/pytorch/test/dynamo/test_functions.py:2539
[2023-11-18 23:11:09,320] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-18 23:11:09,320] torch._dynamo.guards.__recompiles: [DEBUG]     - L['zs'][2] == 8.0

```

After:
```
test/dynamo/test_functions.py::DefaultsTests::test_zip_strict [2023-11-18 23:07:33,341] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2023-11-18 23:07:33,341] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # x = x.clone()  # test/dynamo/test_functions.py:2540 in fn
[2023-11-18 23:07:33,341] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'], 94568804551424)                     # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] len(L['ys']) == 3                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'], 94568804551424)                     # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] len(L['zs']) == 3                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][0], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][0] == 1.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][1], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][1] == 2.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][2], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][2] == 3.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][0], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][0] == 2.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][1], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][1] == 5.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][2], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][2] == 8.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:365 in init_ambient_guards
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] (___skip_backend_check() or ___current_backend() == ___lookup_backend(140370726823264))  # _dynamo/output_graph.py:371 in init_ambient_guards
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['x'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[3], stride=[1])  # x = x.clone()  # test/dynamo/test_functions.py:2540 in fn
[2023-11-18 23:07:33,346] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function fn in /home/jonch/Desktop/Programming/mlsys/pytorch/test/dynamo/test_functions.py:2539
[2023-11-18 23:07:33,346] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-18 23:07:33,346] torch._dynamo.guards.__recompiles: [DEBUG]     - L['zs'][2] == 8.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114053
Approved by: https://github.com/ezyang
2023-11-22 12:26:41 +00:00
2b72543f36 Solving pickle error when saving CyclicLR state_dict (#110931)
## How to reproduce:
```py
import os
import tempfile

import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CyclicLR

model = nn.Linear(100, 100)
opt = SGD(model.parameters(), lr=1.)
scheduler = CyclicLR(opt, base_lr=0.1, max_lr=0.2, scale_fn=lambda x: 0.99)

tmp = tempfile.NamedTemporaryFile(delete=False)
try:
    torch.save(scheduler.state_dict(), tmp.name)
    scheduler.load_state_dict(torch.load(tmp.name))
finally:
    tmp.close()
    os.unlink(tmp.name)
```
Error:
```
_pickle.PicklingError: Can't pickle <function <lambda> at 0x000001A51DF67600>: attribute lookup <lambda> on __main__ failed
```
## Fix:
Save `scale_fn` to the state dict only if it is a callable object, and not if it is a function or lambda.
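The distinction matters because instances of callable classes pickle by reference to their (importable) class, while lambdas do not; a quick illustration:

```python
import pickle

class ScalePolicy:
    def __call__(self, x):
        return 0.99

pickle.dumps(ScalePolicy())  # fine: pickled by reference to the class

try:
    pickle.dumps(lambda x: 0.99)
except pickle.PicklingError as e:
    print("lambdas are not picklable:", e)
```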

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110931
Approved by: https://github.com/janeyx99
2023-11-22 11:38:35 +00:00
a0e3321f0c [inductor cpp] vectorize embedding lookup (#114062)
For embedding lookup, there is indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar.

This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors, so that `CppVecKernel` only generates vector operations when any of the inputs are vectors; otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable`, which bookkeeps the itervars relevant to the variable and has a flag marking whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed.

For the added UT `test_embedding_vec`, the generated code before this PR is:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
                    auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
                    out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
                }
            }
        }
    }
}
```

After this PR, we have:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
                    auto tmp6 = tmp4 + tmp5;
                    tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062
Approved by: https://github.com/jansel
2023-11-22 11:19:42 +00:00
3e1abde46d Revert "AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)"
This reverts commit a911b4db9d82238a1d423e2b4c0a3d700217f0c1.

Reverted https://github.com/pytorch/pytorch/pull/111554 on behalf of https://github.com/DanilBaibak due to The lower PR in the stack #113926 breaks the internal build ([comment](https://github.com/pytorch/pytorch/pull/111554#issuecomment-1822472206))
2023-11-22 10:13:48 +00:00
172a103857 [dynamo] strict=True kwarg for zip (#114047)
Fixes https://github.com/pytorch/pytorch/issues/113894
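An illustrative sketch (not the PR's test) of the kwarg under torch.compile; note zip's strict kwarg requires Python 3.10+:

```python
import torch

@torch.compile(fullgraph=True)
def fn(xs, ys):
    return [x + y for x, y in zip(xs, ys, strict=True)]

out = fn([torch.ones(2)] * 3, [torch.ones(2)] * 3)  # OK: equal lengths
# With mismatched lengths, strict=True raises ValueError, matching eager.
```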

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114047
Approved by: https://github.com/ezyang
2023-11-22 08:48:51 +00:00
c77a4a4096 Fix compiling add with torch.int32 and scalars (#113965)
Fixes #113944

When `b` and `alpha` are both scalars, using `prims.mul` will create a tensor with dtype `int64` resulting in wrong dtype.
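A repro sketch of the described dtype mismatch, with eager as the reference behaviour:

```python
import torch

def f(x):
    return torch.add(x, 2, alpha=3)  # b and alpha are both Python scalars

x = torch.arange(4, dtype=torch.int32)
print(f(x).dtype)                 # torch.int32 in eager
print(torch.compile(f)(x).dtype)  # was torch.int64 before this fix
```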

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113965
Approved by: https://github.com/ezyang
2023-11-22 07:32:19 +00:00
9f0deb132b [Inductor] Refactor group/batch fusion to support user defined execution order and configs (#113738)
Meta internal customers need more flexible configs for these group/batch fusions' execution order and parameters, so I'd like to provide a new inductor config with which users can fine-tune and auto-tune these group/batch fusions for different models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113738
Approved by: https://github.com/xuzhao9
2023-11-22 05:46:23 +00:00
bebe66e262 [ONNX] Benchmark to save sample inputs to disk before running (#114163)
Such that even if failures occur during model run, the sample inputs
are accessible for later investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114163
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #113780
2023-11-22 05:39:00 +00:00
bd44bdb675 [ONNX][dynamo_export] Turn off opmath for type promotion (#113780)
Although opmath is the right thing to do to retain on-par precision, it inserts
upcasts everywhere in the graph. This is particularly hard for backends to optimize, since there is no way to differentiate between inserted upcasts and casts from the model code. Hence we consolidate the input dtype to the result dtype to avoid this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113780
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
2023-11-22 05:39:00 +00:00
e7326ec295 [DTensor] Computed DTensorSpec hash lazily (#114322)
This is a forward fix for https://github.com/pytorch/pytorch/issues/113781.

We lazily compute the hash so that we do not try to compute the hash on `SymInt`s (for the stride) during Dynamo tracing.
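A minimal sketch of the lazy-hash pattern (illustrative names, not the actual DTensorSpec code):

```python
class SpecLike:
    def __init__(self, mesh, placements, stride):
        self.mesh, self.placements, self.stride = mesh, placements, stride
        self._hash = None  # deliberately not computed at construction time

    def __hash__(self):
        # Compute and cache on first use, so tracing never hashes SymInt
        # strides eagerly while the spec is being constructed.
        if self._hash is None:
            self._hash = hash((self.mesh, tuple(self.placements), tuple(self.stride)))
        return self._hash
```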

Tested via:
```
python test/distributed/_tensor/test_dtensor_compile.py -k test_2d_fsdp_tp_ac_compile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114322
Approved by: https://github.com/wanchaol
ghstack dependencies: #113919, #113924, #114134, #113925, #113930, #114141, #113915, #114140
2023-11-22 04:13:11 +00:00
c5ddfa79b3 [HigherOrderOp] add output tensor meta check for cond (#113900)
This PR checks the tensor meta of the outputs of cond's branches. This helps us identify several tests that return outputs with different requires_grad. It also fixes the error messages, which previously were raised in torch.ops.higher_order.cond and now are raised in dynamo's CondHigherOrder.

Test Plan:
Existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113900
Approved by: https://github.com/zou3519
ghstack dependencies: #113819
2023-11-22 04:06:30 +00:00
9e657ce2ed [HigherOrderOp] set should_flatten_output=True for cond (#113819)
This PR adds should_flatten_output=True for cond. This effectively allows cond to support pytree outputs, with the output being flattened. Note: a single tensor output will be automatically cast to a tuple for torch.ops.higher_order.cond.

This PR also adds support for comparing BuiltinVariables (e.g. tuple); this lets dynamo inline the comparison of two tree_specs to make sure both branches return the same tree_spec.

Test Plan:
Existing tests. Will add more pytree tests and modify the documentations in the follow-up prs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113819
Approved by: https://github.com/zou3519
2023-11-22 04:06:30 +00:00
e0ec71deab Fix module: distributed labeler (#114324)
Removes preceding `/` which was preventing labeler from working.  (looks like a typo in the original PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114324
Approved by: https://github.com/XilunWu, https://github.com/fegin
2023-11-22 03:43:14 +00:00
0a33cf95c6 Add python-3.12 to triton wheels build matrix (#114327)
Not sure if it will work, but perhaps worth a try

Inspired by [following comment](56556d0aac/manywheel/build_cuda.sh (L266)):
```
# No triton dependency for now on 3.12 since we don't have binaries for it
# and torch.compile doesn't work.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114327
Approved by: https://github.com/kit1980, https://github.com/PaliC
2023-11-22 03:26:32 +00:00
2c4930a91d Revert "[fx/DDP] add nested ctx_manager test for DDP Dynamo (#114056)"
This reverts commit d5d62e85615fdf345e0556a9d8edbee2d3c64ae2.

Reverted https://github.com/pytorch/pytorch/pull/114056 on behalf of https://github.com/malfet due to Breaks inductor_distributed, see d5d62e8561 ([comment](https://github.com/pytorch/pytorch/pull/114056#issuecomment-1822006423))
2023-11-22 02:52:31 +00:00
db8f9686a7 [cmake] set 'mcpu=generic' as the default build flag for mkldnn on aarch64 (#113820)
This removes the dependency on mkldnn's default cmake definitions.

Fixes https://github.com/pytorch/pytorch/issues/109312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113820
Approved by: https://github.com/malfet
2023-11-22 02:49:33 +00:00
6187153753 Consolidate sym/non-sym overloads for _make_wrapper_subclass (#114236)
I'm not sure why we needed two overloads previously, let's find out! Removing the int overload is load bearing because it now forces specialization on SymInt arguments instead of falling through to the SymInt overload, see new test.

I decided NOT to allow storage offset simultaneously with None strides.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114236
Approved by: https://github.com/albanD
2023-11-22 02:03:29 +00:00
a785fbe513 [reland][quant][pt2e] Refactor insert observer to do sharing checking in the same place (#113458) (#113920)
Summary:
Previously this logic was scattered in two different places: before inserting observers and during observer insertion; this PR moves everything to before we insert observers.

* Next: refactor QuantizationSpec and check more fields for sharing

Test Plan:
CI (regression tests)

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D51420029](https://our.internmc.facebook.com/intern/diff/D51420029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113920
Approved by: https://github.com/andrewor14
2023-11-22 01:48:51 +00:00
3f736c2d77 Add ONNXProgram.__call__ API to run model with ONNX Runtime (#113495)
Currently the user can use torch.onnx.dynamo_export to export the model.
to ONNX.

```python
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return x + 1.0

onnx_program = torch.onnx.dynamo_export(
    Model(),
    torch.randn(1, 1, 2, dtype=torch.float),
)
```

The next step would be instantiating an ONNX Runtime session to execute it.

```python
import onnxruntime  # type: ignore[import]

onnx_input = self.adapt_torch_inputs_to_onnx(*args, **kwargs)
options = options or {}
providers = options.get("providers", onnxruntime.get_available_providers())
onnx_model = self.model_proto.SerializeToString()
ort_session = onnxruntime.InferenceSession(onnx_model, providers=providers)

def to_numpy(tensor):
    return (
        tensor.detach().cpu().numpy()
        if tensor.requires_grad
        else tensor.cpu().numpy()
    )

onnxruntime_input = {
    k.name: to_numpy(v) for k, v in zip(ort_session.get_inputs(), onnx_input)
}

return ort_session.run(None, onnxruntime_input)
```

This PR provides the `ONNXProgram.__call__` method as a facilitator that runs the model with ONNX Runtime under the hood, similar to how `torch.export.ExportedProgram.__call__` allows the underlying `torch.fx.GraphModule` to be executed.
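With this change, the boilerplate above collapses to a direct call (a usage sketch reusing `onnx_program` from the first snippet):

```python
onnx_output = onnx_program(torch.randn(1, 1, 2, dtype=torch.float))
```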
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113495
Approved by: https://github.com/titaiwangms
2023-11-22 01:48:45 +00:00
044cd56dcc [Easy] make @markDynamoStrictTest set nopython=True (#114308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114308
Approved by: https://github.com/zou3519, https://github.com/oulgen
2023-11-22 01:36:29 +00:00
d5d62e8561 [fx/DDP] add nested ctx_manager test for DDP Dynamo (#114056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114056
Approved by: https://github.com/wconstab
2023-11-22 01:08:25 +00:00
4d07428ede Fix for out of bounds read in mobile interpreter FORMAT opcode handler (#110303)
Summary:
The FORMAT opcode for the mobile TorchScript interpreter contained an out of bounds read issue leading to memory corruption.

This change adds an explicit check that the number of inputs passed to the format method called when handling the FORMAT opcode is valid and within the bounds of the stack.

Test Plan: contbuild + OSS signals

Differential Revision: D49739095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110303
Approved by: https://github.com/malfet
2023-11-22 01:05:42 +00:00
9cbee4757e [Autotune] Reduce XBLOCK for outer reduction (#114284)
I have observed that quite a few Reduction.Outer kernels have potential for performance improvement by reducing register pressure. This is because our current register-pressure reduction logic, which only reduces RBLOCK, doesn't work for outer reductions. While we could tighten things up there, which would likely increase compile time, I found a better workaround: tune down XBLOCK in the first place.

Perf job: main 9efbb4ea73 (11/16) vs hoy/autotune/reduction
Slight compile time and perf improvement seen.
I also saw perf improvement locally for the few kernels being investigated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114284
Approved by: https://github.com/jansel
2023-11-22 00:28:08 +00:00
995fae6060 Move small pypi build as default for linux cuda 12.1 (#114281)
This is the first PR to resolve https://github.com/pytorch/pytorch/issues/113972:
move our small-wheel build to be the default.
Test:
```
pip3 install --no-cache-dir --pre torch-2.2.0.dev20231121%2Bcu121-cp310-cp310-linux_x86_64.whl  --index-url https://download.pytorch.org/whl/nightly/cu121
Looking in indexes: https://download.pytorch.org/whl/nightly/cu121
Processing ./torch-2.2.0.dev20231121%2Bcu121-cp310-cp310-linux_x86_64.whl
Collecting filelock (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/filelock-3.9.0-py3-none-any.whl (9.7 kB)
Collecting typing-extensions>=4.8.0 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Collecting sympy (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/sympy-1.11.1-py3-none-any.whl (6.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 253.4 MB/s eta 0:00:00
Collecting networkx (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/networkx-3.0rc1-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 387.1 MB/s eta 0:00:00
Collecting jinja2 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/Jinja2-3.1.2-py3-none-any.whl (133 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.1/133.1 kB 365.3 MB/s eta 0:00:00
Collecting fsspec (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/fsspec-2023.4.0-py3-none-any.whl (153 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.0/154.0 kB 370.6 MB/s eta 0:00:00
Collecting pytorch-triton==2.1.0+6e4932cda8 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-2.1.0%2B6e4932cda8-cp310-cp310-linux_x86_64.whl (125.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 125.4/125.4 MB 384.1 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 404.9 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 kB 402.5 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 383.9 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 731.7/731.7 MB 406.9 MB/s eta 0:00:00
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 388.2 MB/s eta 0:00:00
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 410.5 MB/s eta 0:00:00
Collecting nvidia-curand-cu12==10.3.2.106 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 272.9 MB/s eta 0:00:00
Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 381.5 MB/s eta 0:00:00
Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 394.6 MB/s eta 0:00:00
Collecting nvidia-nccl-cu12==2.19.3 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (166.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.0/166.0 MB 384.7 MB/s eta 0:00:00
Collecting nvidia-nvtx-cu12==12.1.105 (from torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 kB 281.8 MB/s eta 0:00:00
Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/cu121/nvidia_nvjitlink_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (19.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.8/19.8 MB 367.3 MB/s eta 0:00:00
Collecting MarkupSafe>=2.0 (from jinja2->torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Collecting mpmath>=0.19 (from sympy->torch==2.2.0.dev20231121+cu121)
  Downloading https://download.pytorch.org/whl/nightly/mpmath-1.2.1-py3-none-any.whl (532 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 532.6/532.6 kB 391.3 MB/s eta 0:00:00
Installing collected packages: mpmath, typing-extensions, sympy, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, pytorch-triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114281
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-11-22 00:10:03 +00:00
628586606e [test] fix broken test, enable test (#114235)
Fixes root cause of https://github.com/pytorch/pytorch/pull/114053#issuecomment-1820632457

This test was not running on OSS CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114235
Approved by: https://github.com/ezyang
2023-11-22 00:04:39 +00:00
066ac56e02 ci: Clean up logic for merge -r (#114295)
Rely on built-in bash conditionals for the if statement rather than relying on $?

To avoid issues observed in https://github.com/pytorch/pytorch/pull/111008#issuecomment-1821547141

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114295
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-11-21 23:36:34 +00:00
afdc528520 Print the index and summary of the SampleInput that failed an OpInfo test (#99444)
Related to the Reproducible Testing BE project. Goal is to print out the sample input that failed an OpInfo test.

Crazy idea: to avoid requiring widespread changes across tests that use OpInfo sample inputs, return a new special iterator type from `OpInfo.sample_inputs()`, etc. that tracks the most recent item seen. If a test fails later on, print out this info to identify the sample that failed the test.

This solves the problem that the test framework currently has no concept of which sample input is being operated on.

This PR contains the following changes:
* New `TrackedInputIter` that wraps a sample inputs func iterator and tracks the most recent input seen in a `TrackedInput` structure
    * The information is stored in a dictionary on the test function itself, mapping `full test ID -> most recent TrackedInput`
* To determine the test function that is being run, we do some stack crawling hackery in `extract_test_fn_and_id()`
* Above applies only when one of the following is called: `OpInfo.sample_inputs()`, `OpInfo.error_inputs()`, `OpInfo.reference_inputs()`, and `OpInfo.conjugate_sample_inputs()`. This could easily be extended to `ModuleInfo`s and the sparse sample input funcs as well

Example output when a sample input causes a failure:
```
======================================================================
ERROR: test_foo_add_cpu_uint8 (__main__.TestFakeTensorCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 911, in test_wrapper
    return test(*args, **kwargs)
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 1097, in only_fn
    return fn(slf, *args, **kwargs)
  File "/home/jbschlosser/branches/reproducible_testing/test/test_ops.py", line 2211, in test_foo
    self.fail('Example failure')
AssertionError: Example failure

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_utils.py", line 2436, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 414, in instantiated_test
    result = test(self, **param_kwargs)
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 917, in test_wrapper
    raise Exception(
Exception: Caused by sample input at index 2: SampleInput(input=Tensor[size=(5, 1), device="cpu", dtype=torch.uint8], args=TensorList[Tensor[size=(5,), device="cpu", dtype=torch.uint8]], kwargs={}, broadcasts_input=True, name='')

To execute this test, run the following from the base repo dir:
     python test/test_ops.py -k test_foo_add_cpu_uint8

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
```

This notably doesn't print the actual `SampleInput` values, as that's hard without fully reproducible random sample generation. I went down this path for a while and it seems infeasible without adding an untenable amount of overhead to set the random seed per SampleInput (see https://github.com/pytorch/pytorch/issues/86694#issuecomment-1614943708 for more details). For now, I am settling for at least spitting out the index and some metadata of the `SampleInput`, as it seems better than nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99444
Approved by: https://github.com/janeyx99
2023-11-21 23:08:35 +00:00
7fc292930c Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-21 23:07:21 +00:00
b88abb1674 [ONNX] Fix export issue of aten::layer_norm in opset 17 (#114058)
For torch.nn.LayerNorm, weight and bias can be None (when the parameter elementwise_affine is False or bias is False), but for the ONNX op LayerNormalization from opset 17, weight and bias cannot be None.

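A minimal sketch of the affected configuration (the file name is arbitrary): a `LayerNorm` without affine parameters exported at opset 17.

```python
import torch

# LayerNorm with elementwise_affine=False has weight=None and bias=None,
# which opset 17's LayerNormalization cannot represent directly.
m = torch.nn.LayerNorm(8, elementwise_affine=False)
x = torch.randn(2, 8)
torch.onnx.export(m, (x,), "layer_norm.onnx", opset_version=17)
```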
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114058
Approved by: https://github.com/thiagocrepaldi
2023-11-21 22:45:50 +00:00
62de29d06f [optim] be explicit about CPU scalar tensor dtypes (#111008)
Fixes https://github.com/pytorch/pytorch/issues/110940

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111008
Approved by: https://github.com/janeyx99
2023-11-21 22:44:50 +00:00
266054c3ca [dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)
Fixes https://github.com/pytorch/pytorch/issues/113812, https://github.com/pytorch/pytorch/issues/102591, Probably fixes: https://github.com/pytorch/pytorch/issues/113740, https://github.com/pytorch/pytorch/issues/113786, https://github.com/pytorch/pytorch/issues/113788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114154
Approved by: https://github.com/wconstab
2023-11-21 22:40:08 +00:00
54d04553ea [fx, DDP] fx.split_module will setup/unwind autocast & grad_mode (#113374)
---

Replaces: https://github.com/pytorch/pytorch/pull/112231
Fixes: https://github.com/pytorch/pytorch/issues/111794

DDPOptimizer splits modules. We need to set up/unwind global states (autocast, grad_enabled) for each split, as this affects downstream compilation.

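For reference, a minimal sketch of the `split_module` API involved (the counter-based callback is illustrative); after this change, each produced submodule re-establishes the autocast/grad state that was active at trace time:

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.split_module import split_module

class M(torch.nn.Module):
    def forward(self, x):
        return (x + 1).relu() * 2

m = M()
gm = symbolic_trace(m)
seen = {"count": 0}

def split_callback(node: torch.fx.Node) -> int:
    # Assign the first two ops to submodule 0 and the rest to submodule 1.
    seen["count"] += 1
    return 0 if seen["count"] <= 2 else 1

split_gm = split_module(gm, m, split_callback)
print(split_gm.graph)  # calls submod_0, then submod_1
```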
---

See before and after this PR for the split fx modules here (for autocast mode): https://github.com/pytorch/pytorch/pull/112231#issuecomment-1804274605

---

### Discussion
We don't actually have to do this for grad mode: https://github.com/pytorch/pytorch/pull/112231#issuecomment-1804280031. It's not wrong to do it anyway, but maybe unnecessary? But may still be better to keep this PR's changes so we're sure what the grad mode state ought to be for each subgraph.

It may come in handy in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113374
Approved by: https://github.com/wconstab
2023-11-21 21:29:59 +00:00
6ff7260700 [CI] Switch to check against expected result files for cpu inductor integration tests (#113668)
Summary: With this, we can completely remove CI_SKIP from common.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113668
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #113574, #113575, #113446, #113559
2023-11-21 21:20:47 +00:00
a9f9f98e2f [CI] Switch to check against expected result files for dynamo_eager and aot_eager benchmark tests (#113559)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113559
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #113574, #113575, #113446
2023-11-21 21:20:47 +00:00
212f668408 [CI] Remove CI skip list for inductor integration tests (#113446)
Summary: Switch to completely rely on checking against expected result files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113446
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/jansel
ghstack dependencies: #113574, #113575
2023-11-21 21:20:41 +00:00
3c8a4f01b9 [CI] Increase the shard numbers for torchbench tests (#113575)
Summary: torchbench tests are always the lagging shards compared to other integration test shards, so let's bump up the corresponding shard numbers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113575
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/jansel
ghstack dependencies: #113574
2023-11-21 21:20:34 +00:00
799d8c3035 [CI] Rename the inductor test config names for dynamic shapes tests (#113574)
Summary: To make the naming consistent with tests in inductor-periodic and simplify update_expected.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113574
Approved by: https://github.com/eellison, https://github.com/malfet, https://github.com/jansel
2023-11-21 21:20:27 +00:00
ebeaec71bf [aotinductor] don't generate python profiling code in the cpp world (#114182)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114182
Approved by: https://github.com/aakhundov, https://github.com/desertfire
2023-11-21 21:11:58 +00:00
64a5372e6c Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups. This is wasteful, as it creates non-shared pairs
of endpoint queues and costs time to re-establish communication.

This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.

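A minimal usage sketch (assuming a torchrun-style launch); nothing changes on the user side, the split happens under the hood:

```python
import torch.distributed as dist

dist.init_process_group("nccl")
# With this change, new_group can split the healthy default communicator
# instead of calling ncclCommInitRankConfig from scratch.
even_ranks = [r for r in range(dist.get_world_size()) if r % 2 == 0]
even_group = dist.new_group(ranks=even_ranks)
```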
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889
Approved by: https://github.com/kwen2501
2023-11-21 21:03:52 +00:00
3b108a150a A fix for reduction + pointwise + multi-level reduction optimization (#112935)
ATT: for cases like reduction + multiple pointwises + multi-level reduction, previously, to decide num_splits for the multi-level reduction, we only checked whether the input of the multi-level reduction, or the input of that input, was a reduction node (i.e., the max search depth was 2). This PR changes the behavior to search recursively for a reduction input node as long as the intervening input nodes are pointwise.

Performance-wise it looks fine.
![Screenshot 2023-11-15 at 11 52 28 PM](https://github.com/pytorch/pytorch/assets/10527447/e726948c-0c00-4839-87a4-bcf9044c66d7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112935
Approved by: https://github.com/chenyang78
2023-11-21 20:34:07 +00:00
2abfb8ec7d Correctly codegen math.inf in Inductor (#114159)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114159
Approved by: https://github.com/lezcano
2023-11-21 20:16:48 +00:00
c47d2b8035 Add Half support for CPU autocast on eager mode (#112484)
Add Half support for CPU autocast in eager mode, since common operators have Half support on CPU.
https://github.com/pytorch/pytorch/issues/96093.

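A minimal sketch of what this enables in eager mode:

```python
import torch

x = torch.randn(4, 4)
# Before this change, eager CPU autocast only accepted bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.float16):
    y = torch.mm(x, x)
print(y.dtype)  # torch.float16
```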
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112484
Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang
2023-11-21 20:08:28 +00:00
4e4a6ad6ec [pytree] register pytree node type in both C++ pytree and Python pytree (#112111)
Changes:

1. Add `_private_register_pytree_node` API in both C++ and Python pytree. In C++ pytree, the API will only register pytree node for C++ pytree. In Python pytree, the API will only register pytree node for Python pytree.
2. Do not allow registering a type as pytree node twice in the Python pytree.
3. Add thread lock to the Python pytree node register API.
4. The old `_register_pytree_node` API will call the `_private_register_pytree_node` API and raise a deprecation warning.
5. Add a new `register_pytree_node` API to register a node type in both the C++ and Python implementations (a usage sketch follows this list).
6. Add tests to ensure a warning will be raised when the old private function is called.

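A minimal sketch of the new unified API from item 5 (the `Pair` container is a hypothetical example type; keyword-only serialization args omitted):

```python
import torch.utils._pytree as pytree

class Pair:
    def __init__(self, first, second):
        self.first, self.second = first, second

pytree.register_pytree_node(
    Pair,
    lambda p: ((p.first, p.second), None),  # flatten: (children, context)
    lambda children, ctx: Pair(*children),  # unflatten
)

leaves, spec = pytree.tree_flatten(Pair(1, 2))
print(leaves)  # [1, 2]
```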
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112111
Approved by: https://github.com/zou3519
2023-11-21 19:53:13 +00:00
85b97605ab Enable set sequence nr (#114120)
Summary:
In some cases (especially those involving collective calls) we would want to always kick off a collective call first before going down another path.

For  example:

```
tbe lookup -> a2a ->
                     overarch
dense ------------->
```

if the forward code is written as
a2a_out = a2a
dense = dense_net
out = overarch(a2a_out, dense)
out.backward()

The current default is to run backward in the opposite order of the forward calls. However, there is no data dependency between a2a and dense, so in reality either of them could run first. We would like the a2a to run first because it provides optimal (on average) overlap.

Changing the seq_nr of a2a_out to something large enough would allow autograd engine to kick it off first.

Test Plan: Tests incoming

Differential Revision: D51445261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114120
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-11-21 19:47:28 +00:00
1a3dbf57ca vmap: simple inplace batch rule (#113513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113513
Approved by: https://github.com/zou3519
2023-11-21 18:55:54 +00:00
f66add9b85 [dynamo] graph break on np.ndarray.tobytes (#114208)
We can't model this accurately across np and tnp https://github.com/pytorch/pytorch/issues/114204#issuecomment-1820269949

So let's not even try. Just graph break.

Fixes: https://github.com/pytorch/pytorch/issues/114204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114208
Approved by: https://github.com/lezcano
2023-11-21 18:19:37 +00:00
7694b05416 [DTensor] Reduced to one isinstance call in is_shard (#114140)
This is a nit change to save one `isinstance` call for when `dim` is not `None` but the placement is not `Shard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114140
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
ghstack dependencies: #113919, #113924, #114134, #113925, #113930, #114141, #113915
2023-11-21 17:31:02 +00:00
ef90508f75 [AOTI] Support ReinterpretView in abi mode (#114169)
https://github.com/pytorch/pytorch/pull/113967 added support for
ReinterpretView, but it turns out we codegen it differently in ABI
compat mode. This PR adds support for ABI compat mode as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114169
Approved by: https://github.com/aakhundov
2023-11-21 17:08:00 +00:00
b5dd37f23e [MPS] Fix memory leak in copy_from_mps_ (#114197)
By always calling `[destBuffer release]` before leaving the scope in which it was allocated.
Leak was introduced by https://github.com/pytorch/pytorch/pull/84928
Add regression test.
Before the change:
```
% python ../test/test_mps.py -v -k test_copy_cast_no_leak --repeat 10
test_copy_cast_no_leak (__main__.TestMemoryLeak) ... FAIL

======================================================================
FAIL: test_copy_cast_no_leak (__main__.TestMemoryLeak)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2554, in wrapper
    method(*args, **kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/build/../test/test_mps.py", line 1064, in test_copy_cast_no_leak
    self.assertTrue(driver_before == driver_after, f"Detected {driver_after-driver_before} bytes leak of GPU memory")
AssertionError: False is not true : Detected 65536 bytes leak of GPU memory

To execute this test, run the following from the base repo dir:
     python test/test_mps.py -k test_copy_cast_no_leak

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 1.102s

FAILED (failures=1)
```
After:
```
% python ../test/test_mps.py -k test_copy_cast_no_leak --repeat 10
.
----------------------------------------------------------------------
Ran 1 test in 0.819s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK
...
```

Fixes https://github.com/pytorch/pytorch/issues/114096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114197
Approved by: https://github.com/kit1980
2023-11-21 14:52:55 +00:00
4b7f9fa436 Meta register all foreach ops (#112281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112281
Approved by: https://github.com/lezcano
2023-11-21 14:23:09 +00:00
1f8d00c5a3 [inductor] Added decomposition for upsample_nearest_exact Nd (#113749)
Description:
- Added decomposition for upsample_nearest_exact: 1d, 2d, 3d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113749
Approved by: https://github.com/lezcano
2023-11-21 13:03:47 +00:00
7733599b2e update pthreadpool to 4fe0e1e183925bf8cfa6aae24237e724a96479b (#113904)
Submodule update: bump pthreadpool to this revision.

This is in preparation for upgrading XNNPACK, as the new XNNPACK version uses some of the new pthreadpool APIs introduced in this revision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113904
Approved by: https://github.com/Skylion007
2023-11-21 12:45:16 +00:00
2aa486de9b vendor packaging.version (#114108)
Fixes #113940. This vendors the relevant parts of `packaging==23.2.0` to have access to `Version` and `InvalidVersion` without taking a runtime dependency on `setuptools` or `packaging`.

I didn't find any vendoring policy so I put it under `torch._vendor.packaging`. While I have only vendored the files we need, I have not touched or trimmed the files otherwise.

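A usage sketch, assuming the vendored module keeps packaging's layout:

```python
from torch._vendor.packaging.version import InvalidVersion, Version

v = Version("2.2.0.dev20231121+cu121")
print(v.release, v.local)  # (2, 2, 0) cu121

try:
    Version("not-a-version")
except InvalidVersion as e:
    print("rejected:", e)
```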
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114108
Approved by: https://github.com/malfet, https://github.com/albanD
2023-11-21 11:51:23 +00:00
8ec59d3553 Revert "[dynamo] report guard failure user stack, fix incorrectly skipping interesting files (#114053)"
This reverts commit 826ab0e32d558415d5d682842417fd16b2223739.

Reverted https://github.com/pytorch/pytorch/pull/114053 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/114053#issuecomment-1820584281))
2023-11-21 10:05:15 +00:00
dd6ef0877e Revert "[inductor cpp] vectorize embedding lookup (#114062)"
This reverts commit 2c0474c02d3ac04a429504225d7f1a6536d3b9e6.

Reverted https://github.com/pytorch/pytorch/pull/114062 on behalf of https://github.com/huydhn due to Sorry for reverting your change, please help fix lint and reland it 2c0474c02d ([comment](https://github.com/pytorch/pytorch/pull/114062#issuecomment-1820526515))
2023-11-21 09:21:20 +00:00
1efff12a88 [pytorch-vulkan] BinaryOps auto convert int tensors into float (#114145)
Summary: Some models have hardcoded int constant tensors for some binary operations.

Test Plan:
```
yipjustin@yipjustin-mbp fbsource % buck2 run -c pt.has_backtraces=1   --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="*"
...
[       OK ] VulkanAPITest.linear_3d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (0 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (0 ms)
[ RUN      ] VulkanAPITest.lstm_success
[       OK ] VulkanAPITest.lstm_success (5 ms)
[ RUN      ] VulkanAPITest.lstm_mclareninputs_success
[       OK ] VulkanAPITest.lstm_mclareninputs_success (21 ms)
[ RUN      ] VulkanAPITest.lstm_prepack_success
[       OK ] VulkanAPITest.lstm_prepack_success (8 ms)
[ RUN      ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:8108: Skipped
QueryPool is not available

[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 414 tests from VulkanAPITest (5690 ms total)

[----------] Global test environment tear-down
[==========] 414 tests from 1 test suite ran. (5690 ms total)
[  PASSED  ] 413 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 9 DISABLED TESTS

```

Full Paste: P885827407

Differential Revision: D51452935

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114145
Approved by: https://github.com/SS-JIA
2023-11-21 09:06:33 +00:00
5f0d72124e Revert "Print the index and summary of the SampleInput that failed an OpInfo test (#99444)"
This reverts commit e7f12b1eb0cedfd20dcb41ea35e21e9a71e3390a.

Reverted https://github.com/pytorch/pytorch/pull/99444 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause memory leak on CUDA job e7f12b1eb0 ([comment](https://github.com/pytorch/pytorch/pull/99444#issuecomment-1820491298))
2023-11-21 08:58:54 +00:00
6c597ef015 [PyTorch] Fix attr cleanup after constant folding (#113957)
Summary:
Two nodes can point to the same attribute via node.target.

This makes sure that:
    - we don't try to delete an already-deleted attribute, i.e. the attr is deleted only once
    - we do delete all the nodes pointing to the attribute

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:test_xnnpack_passes -- executorch.backends.xnnpack.test.passes.test_batch_norm_fusion.TestBatchNormFusion.test_q8_batch_norm_fusion
```

Differential Revision: D51419442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113957
Approved by: https://github.com/Skylion007
2023-11-21 07:48:15 +00:00
2c0474c02d [inductor cpp] vectorize embedding lookup (#114062)
For embedding lookups, there is indirect indexing with indices that are invariant to the vectorized itervar. To vectorize such a kernel, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar.

This PR adds that support by lazily broadcasting scalar values (index_expr and constant) to vectors, so that `CppVecKernel` generates vector operations only when any of the inputs are vectors; otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable`, which bookkeeps the itervars relevant to the variable and has a flag marking whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed.

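A minimal sketch of the pattern being vectorized (shapes inferred from the generated kernels below): an indirect embedding load followed by a pointwise add.

```python
import torch

weight = torch.randn(64, 128)       # embedding table (in_ptr1)
idx = torch.randint(0, 64, (128,))  # lookup indices (in_ptr0)
other = torch.randn(128, 128)       # pointwise operand (in_ptr2)

@torch.compile
def f(weight, idx, other):
    return weight[idx] + other      # indirect load, then pointwise add

out = f(weight, idx, other)
```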
For the added UT `test_embedding_vec`, the generated code before this PR is:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
                    auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
                    out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
                }
            }
        }
    }
}
```

After this PR, we have:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
                    auto tmp6 = tmp4 + tmp5;
                    tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062
Approved by: https://github.com/jansel
ghstack dependencies: #113950
2023-11-21 07:37:15 +00:00
8f8722e3f1 [nccl-pg] Avoid using NCCL_ prefix for non-NCCL env variables (#114077)
The NCCL_ prefix should only be used for the NCCL library's environment variables. We currently use a few environment variables in PyTorch with the NCCL_ prefix that the NCCL library does not understand.

This patch renames such environment variables to use the TORCH_NCCL_ prefix instead.  We still maintain the old NCCL_ variables, but throw a warning when they are used.

The following env changes have been made:

`NCCL_BLOCKING_WAIT` -> `TORCH_NCCL_BLOCKING_WAIT`
`NCCL_ENABLE_TIMING` -> `TORCH_NCCL_ENABLE_TIMING`
`NCCL_DESYNC_DEBUG` -> `TORCH_NCCL_DESYNC_DEBUG`
`NCCL_ASYNC_ERROR_HANDLING` -> `TORCH_NCCL_ASYNC_ERROR_HANDLING`
`ENABLE_NCCL_HEALTH_CHECK` -> `TORCH_ENABLE_NCCL_HEALTH_CHECK`
`NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK` -> `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`

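For example, launcher scripts or user code would switch to the new spellings (a sketch; the old names keep working but warn):

```python
import os

os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"         # was NCCL_BLOCKING_WAIT
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"  # was NCCL_ASYNC_ERROR_HANDLING
```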

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114077
Approved by: https://github.com/fduwjj
2023-11-21 07:23:42 +00:00
e122c90d3c [executorch hash update] update the pinned executorch hash (#114008)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114008
Approved by: https://github.com/pytorchbot, https://github.com/huydhn
2023-11-21 06:31:14 +00:00
99af534e93 [docs][jit] Mention dynamic-shapes settings in jit/OVERVIEW.md (#113964)
Document torch._C._jit_set_fusion_strategy, which controls how many static-shape compilation attempts are made before falling back to dynamic shapes, and then to uncompiled graph execution.

Would be good to keep all the graph executor settings documented in one place.
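A usage sketch (the same setting is also exposed as `torch.jit.set_fusion_strategy`):

```python
import torch

# Try static-shape specialization twice, then dynamic-shape compilation ten
# times, then fall back to running the uncompiled graph.
torch._C._jit_set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 10)])
```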
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113964
Approved by: https://github.com/eellison
2023-11-21 06:21:38 +00:00
7ea184d7e3 Handle item() on boolean tensor (#114157)
This needs some special handling because we don't actually allocate
boolean symbols in sympy; we allocate an integer indicator variable.
See comment for more details.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114157
Approved by: https://github.com/ydwu4
2023-11-21 04:34:58 +00:00
18e1a37c4e [ao] updating embedding_bag support for fx and eager (#107623)
Summary: our docs said dynamic embedding bag wasn't supported, but it actually is (at least at the same level as embeddings were); it just wasn't previously tested/listed.

Test Plan: python test/test_quantization.py -k "test_embedding"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107623
Approved by: https://github.com/jerryzh168
2023-11-21 03:54:00 +00:00
dc65f6c601 [c10d] Remove deprecated multi-gpu-per-thread APIs (#114156)
As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its documentation. The multi-GPU functions (which stand for multiple GPUs per CPU thread) have been deprecated for three versions. Removing them now, before the 2.2 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156
Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang
2023-11-21 03:50:23 +00:00
f67696f45e Update TorchFix to 0.2.0 (#114190)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114190
Approved by: https://github.com/malfet
2023-11-21 03:46:28 +00:00
e76c54bd87 [vision hash update] update the pinned vision hash (#113217)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113217
Approved by: https://github.com/pytorchbot
2023-11-21 03:39:45 +00:00
bbc39b7bb4 [dtensor] enable RMSprop optimizer foreach support (#114152)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114152
Approved by: https://github.com/XilunWu
ghstack dependencies: #114149, #114150, #114151
2023-11-21 03:23:40 +00:00
bcd310a7ad [dtensor] enable adagrad foreach support (#114151)
This PR enables the adagrad foreach mode support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114151
Approved by: https://github.com/XilunWu
ghstack dependencies: #114149, #114150
2023-11-21 03:23:40 +00:00
9b50611002 [dtensor] add test for SGD optimizer (#114150)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114150
Approved by: https://github.com/XilunWu
ghstack dependencies: #114149
2023-11-21 03:23:35 +00:00
b09bd36402 [dtensor] add test for adamw (#114149)
This PR add tests for adamw optimizers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114149
Approved by: https://github.com/XilunWu
2023-11-21 03:23:28 +00:00
36869463e0 [DTensor] add forward layer norm test (#114174)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114174
Approved by: https://github.com/fduwjj, https://github.com/wanchaol
2023-11-21 03:01:21 +00:00
87925789ae Make V.graph properly typed (#114025)
Previously it lacked a type hint and so was treated as an Any type. This
resulted in a lot of untyped code downstream as V.graph is referenced in
many places in inductor code. I've typed it properly now as
GraphLowering, and fixed the numerous type errors this surfaced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025
Approved by: https://github.com/eellison
ghstack dependencies: #114013
2023-11-21 02:14:29 +00:00
4812a62ca0 [inductor] Delete more type-ignores in dependencies.py (#114013)
A couple of type hints were wrong

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114013
Approved by: https://github.com/eellison
2023-11-21 02:14:29 +00:00
a911b4db9d AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.

(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_data_mutation()`, which can more accurately distinguish between data mutations vs. `set_()` calls vs. metadata mutations

**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
2023-11-21 01:52:46 +00:00
81f93991d3 Update merge rule to allow pytorchbot to land ExecuTorch hash update (#114180)
The bot cannot merge the hash update PR otherwise, for example https://github.com/pytorch/pytorch/pull/114008#issuecomment-1818032181.  I also need to move ExecuTorch jobs in trunk to pull to match the rule without the need to add `ciflow/trunk` label.  The test job takes less than 20 minutes to finish atm on `2xlarge`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114180
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/malfet
2023-11-21 01:36:52 +00:00
e8996055a9 [iOS][PTMCoreMLCompiler] update other deprecated function (#114177)
Summary: old way was deprecated

Test Plan: ci

Reviewed By: kirklandsign

Differential Revision: D51172622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114177
Approved by: https://github.com/kirklandsign
2023-11-21 01:36:00 +00:00
77f16eb00c Fix prod double backward when there are 2+ zeros (#113969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113969
Approved by: https://github.com/albanD
2023-11-21 01:32:10 +00:00
85ce8a602b Pin pywavelets to 1.4.1 (scikit-image dependency) (#114146)
This is to prevent pip from pulling in 1.22.4 and failing Docker image builds, for example, https://github.com/pytorch/pytorch/actions/runs/6923861547/job/18842791777

The new package was released on Nov 17th https://pypi.org/project/PyWavelets/1.5.0/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114146
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-11-21 01:29:33 +00:00
585332fb8d [ProcessGroupNCCL] Fix avoid-record-stream warning for P2P (#114168)
I have been seeing the below warning even though I did not set `TORCH_NCCL_AVOID_RECORD_STREAMS` to 1.
```
Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives.  (function operator())
```

Turns out that `TORCH_WARN_ONCE` is unconditional, so the original code below would print out both the value of `avoidRecordStreams_` and the error message:
```
TORCH_WARN_ONCE(
   avoidRecordStreams_,
   "TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point "
   "collectives.");
```
 That's also where the "0" in the message came from.

Cc: @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114168
Approved by: https://github.com/eqy, https://github.com/fduwjj, https://github.com/H-Huang
2023-11-21 01:29:00 +00:00
6ec344b08f Fix empty cpu tensor output in cudagraph (#114144)
We can ignore empty cpu tensors

Differential Revision: [D51472324](https://our.internmc.facebook.com/intern/diff/D51472324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114144
Approved by: https://github.com/davidberard98
2023-11-21 01:24:48 +00:00
3e49621f3b [DTensor] Cached hash for DTensorSpec (#113915)
**Overview**
Generally, I think we can try to freeze as many of these classes used in DTensor sharding propagation as possible so that we can cache hashes. This PR targets hashing `DTensorSpec`, which turns out to be relatively expensive.

**Details**
It looks like `tensor_meta` is only updated in `_wrap_output_spec_tensor_meta`, which only runs if the propagation was not cached:
ae94c7e491/torch/distributed/_tensor/sharding_prop.py (L137)
ae94c7e491/torch/distributed/_tensor/sharding_prop.py (L153)
In that case, I think we can cache the hash for the `DTensorSpec` and only update it when one of the hashed attributes changes, which we only really expect to happen for `tensor_meta`.

To ensure correctness, we need that all hashed attributes are immutable.
- `DeviceMesh` caches its hash: a9134fa99a/torch/distributed/_device_mesh.py (L181)
- This PR makes each `Placement` a frozen `dataclass`, making them immutable (relying on the fact that they do not have references to any mutable objects).
- `TensorMeta` is a `NamedTuple` of `torch.Size`, `Tuple[int, ...]`, and `torch.dtype`, so it is immutable: 9916d8a9ea/torch/distributed/_tensor/placement_types.py (L369-L375)

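A minimal sketch of the pattern (a hypothetical `Spec`, not the real `DTensorSpec`): hash lazily, and invalidate the cache whenever a hashed field such as `tensor_meta` is reassigned.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

@dataclass
class Spec:
    mesh: Any
    placements: Tuple[Any, ...]
    tensor_meta: Optional[Any] = None

    def __setattr__(self, name: str, value: Any) -> None:
        object.__setattr__(self, name, value)
        if name != "_hash":
            object.__setattr__(self, "_hash", None)  # invalidate cached hash

    def __hash__(self) -> int:  # explicit __hash__ survives @dataclass(eq=True)
        if self._hash is None:
            object.__setattr__(
                self, "_hash", hash((self.mesh, self.placements, self.tensor_meta))
            )
        return self._hash

s = Spec(mesh="mesh0", placements=("Shard(0)",))
h1 = hash(s)            # computed once, then cached
s.tensor_meta = "meta"  # reassignment invalidates the cache
h2 = hash(s)            # recomputed
```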
**Example**
For some simple small GPT model:
Before: 0.125 ms
<img width="509" alt="Screenshot 2023-11-16 at 10 08 05 PM" src="https://github.com/pytorch/pytorch/assets/31054793/10e59401-f635-431f-80b5-1b48df3a706e">

After: 0.048 ms
<img width="294" alt="Screenshot 2023-11-16 at 10 08 47 PM" src="https://github.com/pytorch/pytorch/assets/31054793/09a3b0b9-f68c-4afc-bca1-c29a4b01c2fb">

The overall Adam CPU step time decreases from 7.647 ms to 6.451 ms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113915
Approved by: https://github.com/wanchaol
ghstack dependencies: #113919, #113924, #114134, #113925, #113930, #114141
2023-11-21 01:24:21 +00:00
fb25fd6f86 [DTensor] Replaced neg dim normalization with assert in helper (#114141)
This is a replacement for https://github.com/pytorch/pytorch/pull/113922. I think we can still leave the check for negative shard dimension in `compute_local_shape_and_global_offset` and replace the normalization logic with an assert. This should provide us a stack trace to see which user-facing API did not normalize the dim as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114141
Approved by: https://github.com/wanchaol
ghstack dependencies: #113919, #113924, #114134, #113925, #113930
2023-11-21 01:24:21 +00:00
d70857bd9e [pytorch][lite interpreter] add tracer run under inference guard (#114003)
Summary: This can change the ops called under the hood. It's not safe to always call because of on-device training.

Test Plan: ci

Differential Revision: D51440119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114003
Approved by: https://github.com/Jack-Khuu
2023-11-21 00:45:52 +00:00
e7f12b1eb0 Print the index and summary of the SampleInput that failed an OpInfo test (#99444)
Related to the Reproducible Testing BE project. Goal is to print out the sample input that failed an OpInfo test.

Crazy idea: to avoid requiring widespread changes across tests that use OpInfo sample inputs, return a new special iterator type from `OpInfo.sample_inputs()`, etc. that tracks the most recent item seen. If a test fails later on, print out this info to identify the sample that failed the test.

This solves the problem that the test framework currently has no concept of which sample input is being operated on.

This PR contains the following changes:
* New `TrackedInputIter` that wraps a sample inputs func iterator and tracks the most recent input seen in a `TrackedInput` structure
    * The information is stored in a dictionary on the test function itself, mapping `full test ID -> most recent TrackedInput`
* To determine the test function that is being run, we do some stack crawling hackery in `extract_test_fn_and_id()`
* Above applies only when one of the following is called: `OpInfo.sample_inputs()`, `OpInfo.error_inputs()`, `OpInfo.reference_inputs()`, and `OpInfo.conjugate_sample_inputs()`. This could easily be extended to `ModuleInfo`s and the sparse sample input funcs as well

Example output when a sample input causes a failure:
```
======================================================================
ERROR: test_foo_add_cpu_uint8 (__main__.TestFakeTensorCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 911, in test_wrapper
    return test(*args, **kwargs)
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 1097, in only_fn
    return fn(slf, *args, **kwargs)
  File "/home/jbschlosser/branches/reproducible_testing/test/test_ops.py", line 2211, in test_foo
    self.fail('Example failure')
AssertionError: Example failure

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_utils.py", line 2436, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 414, in instantiated_test
    result = test(self, **param_kwargs)
  File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 917, in test_wrapper
    raise Exception(
Exception: Caused by sample input at index 2: SampleInput(input=Tensor[size=(5, 1), device="cpu", dtype=torch.uint8], args=TensorList[Tensor[size=(5,), device="cpu", dtype=torch.uint8]], kwargs={}, broadcasts_input=True, name='')

To execute this test, run the following from the base repo dir:
     python test/test_ops.py -k test_foo_add_cpu_uint8

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
```

This notably doesn't print the actual `SampleInput` values, as that's hard without fully reproducible random sample generation. I went down this path for a while and it seems infeasible without adding an untenable amount of overhead to set the random seed per SampleInput (see https://github.com/pytorch/pytorch/issues/86694#issuecomment-1614943708 for more details). For now, I am settling for at least spitting out the index and some metadata of the `SampleInput`, as it seems better than nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99444
Approved by: https://github.com/janeyx99
2023-11-21 00:11:20 +00:00
e4a88d9581 Convert SymInts to SymFloats with SymPy (#113683)
Fixes #109365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113683
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-11-20 23:35:40 +00:00
4182092feb [reland][HigherOrderOp] remove _deprecated_global_ns (#113813)
This is a reland of #112757. Cannot land original one internally because internal diff is not in sync with OSS due to issues in dealing with two export repos (executorch and pytorch) using the ghimport-ghexport approach.

Will try the web UI of import and export instead of ghimport and ghexport flow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113813
Approved by: https://github.com/angelayi
2023-11-20 23:16:18 +00:00
c1d9d4a2b5 checkpoint_sequential warns if use_reentrant not passed explicitly (#114158)
Use warning text for deprecation message.
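A sketch of the call that avoids the warning by being explicit:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 8)
)
x = torch.randn(2, 8, requires_grad=True)
# Omitting use_reentrant now emits the deprecation warning.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
```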
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114158
Approved by: https://github.com/albanD
2023-11-20 23:08:44 +00:00
2ca1119d53 Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926)
The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors *at the end* of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor.

This PR is the result of *a lot* of back and forth with @ezyang and @eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same:

1) We cache source->symbol in shape_env
2) We pass policy objects around, stored at dynamo fakification time, and reused for later fakification
3) We create a new fake mode for backends
(from https://github.com/pytorch/pytorch/pull/113605/files)

This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning potentially different tensors than requested, and whether that was an anti-pattern (it is) that we wanted to hack in via the symbol cache (we don't).

We went back to the drawing board here, but with a few concessions:
1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons
2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (@ezyang did this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113926
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-11-20 23:06:37 +00:00
7afceb9f64 [AOTI] add float support of triton (#114014)
Summary: As the title

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_functions.py::DefaultsTests::test_triton_kernel_None_arg' --print-passing-details

Differential Revision: D51421325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114014
Approved by: https://github.com/oulgen, https://github.com/aakhundov
2023-11-20 23:03:37 +00:00
ae00d9623e [inductor] Add ABI shim function for torch.scatter (#114027)
Summary: Scatter fallback calls `at::scatter` in the C++ wrapper codegen. This doesn't work in the ABI compatibility mode, as the latter requires a shim function. One is added in this PR.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_scatter_fallback
s...
----------------------------------------------------------------------
Ran 4 tests in 52.713s

OK (skipped=1)
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/114027
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #114024
2023-11-20 22:51:59 +00:00
4b07fca7d7 [export] Allow shifted constraint ranges in dynamo._export (#114024)
Summary: Previously, when we had two dynamic shape symbols `s0` and `s1` bound by the relationship `s1 == s0 + 1`, even when the range constraints were set in accordance with the relationship (e.g., to `[2, 1024]` for `s0` and to `[3, 1025]` for `s1`), `torch._dynamo.export` raised an error saying that the constraint is violated. Here we add a range check between the expression and the constraint and, if the ranges match, don't declare the constraint violated.

We also add a flag to disable the dim constraint solver in `torch._dynamo.export` (not set by default for BC), passed down from the `torch._export.aot_compile`. This is because, even for simple constraints like `s1 == s0 + 1`, the solver claims that the constraint is too complex and the dimension `s0` must be specialized. The new flag is not exposed as a part of the public API (i.e., the one without `_`s in the module names).

Both changes are required to unblock PT2 compilation of an internal model with AOT Inductor.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_shifted_constraint_ranges
s...
----------------------------------------------------------------------
Ran 4 tests in 53.247s

OK (skipped=1)
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/114024
Approved by: https://github.com/zhxchen17
2023-11-20 22:49:14 +00:00
c39c69953f [DTensor] Used new placements for neg dim in distribute_tensor (#113930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113930
Approved by: https://github.com/wanchaol
ghstack dependencies: #113919, #113924, #114134, #113925
2023-11-20 22:32:58 +00:00
e2095a04ae [DTensor] Ensured grad_placements was tuple (#113925)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113925
Approved by: https://github.com/wanchaol
ghstack dependencies: #113919, #113924, #114134
2023-11-20 22:32:58 +00:00
f4ffd46c08 [DTensor] Used new placements for neg dim in from_local (#114134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114134
Approved by: https://github.com/wanchaol
ghstack dependencies: #113919, #113924
2023-11-20 22:32:51 +00:00
b41ad7d695 [DTensor] Used new placements for neg dim in redistribute (#113924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113924
Approved by: https://github.com/wanchaol
ghstack dependencies: #113919
2023-11-20 22:30:16 +00:00
77e058f055 [DTensor] Made _Partial, Replicate frozen dataclasses (#113919)
This is part of the larger stack to work toward being able to cache hashes for `DTensorSpec`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113919
Approved by: https://github.com/wanchaol
2023-11-20 22:28:47 +00:00
97d2b439ce [BE] Use definitely_true/sym_eq for same_meta (#114137)
Follows https://github.com/pytorch/pytorch/pull/113159

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114137
Approved by: https://github.com/Skylion007
2023-11-20 22:22:26 +00:00
13dd7f0c98 [export] Add missing builtin ops. (#113982)
Summary: Fixing issue https://github.com/pytorch/pytorch/issues/113778

Test Plan: eyes.

Differential Revision: D51436177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113982
Approved by: https://github.com/Skylion007, https://github.com/ydwu4
2023-11-20 21:59:49 +00:00
8c4812be80 Replace expect_int with guard_int (#113921)
The idea is that instead of erroring, we will just specialize at these sites.

Fixes https://github.com/pytorch/pytorch/issues/113142

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113921
Approved by: https://github.com/zou3519
2023-11-20 21:27:48 +00:00
59ad51e10a Insert deferred runtime asserts into Dynamo FX graph (#113958)
During the course of fake tensor propagation (and, potentially, also Dynamo execution, although I do not believe it is possible to exercise this right now), we may generate deferred runtime asserts, which represent "guards" on unbacked symbols which cannot be immediately checked on entry to a code block; instead, they have to be checked at runtime. However, we currently accumulate these deferred runtime asserts into the ShapeEnv, and don't do anything with them.

This PR modifies Dynamo to automatically insert these runtime asserts into the FX graph, before passing it on to the backend compiler. The assert format coincides with the export assert format as practiced in `torch/_export/passes/add_runtime_assertions_for_constraints_pass.py`, but actually these passes are completely disjoint right now as I only handle deferred runtime asserts, while export only handles ranges (which I should probably also handle, but don't in this PR.)

The assertions must be inserted by Dynamo, because you could potentially then pass the asserts onto another backend like "eager" which no longer looks at the ShapeEnv before. Thanks to previous work in export, these asserts are preserved in AOTAutograd, but they are dropped by Inductor, which needs to be fixed in future work. This piece will be a bit awkward, as Inductor would have preferred to work with the Sympy expressions directly, ah well.

Here is what the Dynamo traced FX graph looks like for the test in question:

```
  <eval_with_key>.0 class GraphModule(torch.nn.Module):
     def forward(self, L_x_ : torch.Tensor):
         l_x_ = L_x_

         # File: /data/users/ezyang/c/pytorch/wu.py:8, code: y = x.item()
         item = l_x_.item()

         # No stacktrace found for following nodes
         ge_1 = item >= 0
         scalar_tensor_default = torch.ops.aten.scalar_tensor.default(ge_1);  ge_1 = None
         _assert_async_msg = torch.ops.aten._assert_async.msg(scalar_tensor_default, "Deferred runtime assert failed: i0 >= 0, where i0 was defined by 'item' (for more information, run with TORCH_LOGS=+dynamo,dynamic)");  scalar_tensor_default = None

         # File: /data/users/ezyang/c/pytorch/wu.py:9, code: torch._check_is_size

         _check_is_size = torch._check_is_size(item)

         # File: /data/users/ezyang/c/pytorch/wu.py:10, code: if y >= 0:
         ge = item >= 0;  item = None

         # File: /data/users/ezyang/c/pytorch/wu.py:11, code: return x * 2
         mul = l_x_ * 2;  l_x_ = None
         return (mul,)

```

Note that we actually keep the `_check_is_size` in the graph redundantly. However, assert_async is retained in the graph, whereas _check_is_size ends up getting DCE'ed.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113958
Approved by: https://github.com/aakhundov, https://github.com/tugsbayasgalan
ghstack dependencies: #113978
2023-11-20 21:25:11 +00:00
473b17c4c1 Run sympy expressions with Python values / FX tracing (#113978)
To codegen deferred runtime asserts, I need to be able to convert sympy expressions back into regular Python expressions that I can put in FX graphs. This PR adds some of the machinery to do this: it adds a new sympy analysis that runs operations on all FX traceable operations that can also be run with plain Python int/float/bool/etc. It's tested by symbolic tracing through the analysis, and then testing that this traced graph gives the same result as running the Python analysis directly.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113978
Approved by: https://github.com/aakhundov, https://github.com/lezcano
2023-11-20 21:25:11 +00:00
cd2798943d [dtensor] support convolution ops (#113123)
This PR creates a prototype of training convolutional neural networks based on DTensor.

- Register required ops and implement operator dispatch
- Add unit tests and example

Basically, we shard the activations and replicate the model weights in this prototype. We can scale out to multiple GPUs and reduce the per-GPU memory footprint with this approach, and achieve weak scaling in terms of training performance (i.e., time per iteration).

Reference log (on 2xA100 GPU):

Unit Test
```bash
root@luna-prod-78-80gb:/pytorch# python3 test/distributed/_tensor/test_convolution_ops.py
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (Triggered internally at /opt/conda/conda-bld/pytorch_1699257304556/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2170.)
  return F.conv2d(input, weight, bias, self.stride,
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (Triggered internally at /opt/conda/conda-bld/pytorch_1699257304556/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2170.)
  return F.conv2d(input, weight, bias, self.stride,
..
----------------------------------------------------------------------
Ran 2 tests in 30.354s

OK
root@luna-prod-78-80gb:/pytorch# python3 test/distributed/_tensor/test_other_ops.py
[rank0]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank0]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank1]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank1]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
...
----------------------------------------------------------------------
Ran 3 tests in 16.343s

OK
```
ConvNeXt Example
```bash
root@luna-prod-78-80gb:/pytorch# python3 torch/distributed/_tensor/examples/convnext_example.py
rank 3, 20 iterations, latency     584.80 ms, forward     102.84 ms, backward     297.80 ms, max reserved    16.34 GiB, max allocated    14.75 GiB
rank 1, 20 iterations, latency     584.64 ms, forward     104.85 ms, backward     297.60 ms, max reserved    16.40 GiB, max allocated    14.74 GiB
rank 0, 20 iterations, latency     584.48 ms, forward     104.64 ms, backward     297.90 ms, max reserved    16.39 GiB, max allocated    14.75 GiB
rank 2, 20 iterations, latency     584.96 ms, forward      93.21 ms, backward     297.95 ms, max reserved    16.40 GiB, max allocated    14.74 GiB
```

@wanchaol @fduwjj FYI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113123
Approved by: https://github.com/wanchaol
2023-11-20 21:01:28 +00:00
af51c948ac Add mechanism for make_fx to not error on data-dependent-ops (#114129)
I'm looking for a make_fx(tracing_mode=real) that doesn't error out on
data-dependent operations. This PR adds a flag to do that. We use this
to help implement offline generation, but this is useful by itself:
sometimes we want to trace a function with real tensors and don't care
if we bake values in (because we just want to see what happened).
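
A hedged usage sketch; the keyword `_error_on_data_dependent_ops` reflects this PR's intent, but treat the exact name as an assumption rather than a stable public API:

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    # .item() is data-dependent: tracing with real tensors bakes in the value.
    return x + x.sum().item()

gm = make_fx(f, tracing_mode="real", _error_on_data_dependent_ops=False)(torch.ones(3))
print(gm.code)  # the concrete value appears baked into the traced graph
```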

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114129
Approved by: https://github.com/ezyang
ghstack dependencies: #114128
2023-11-20 20:55:55 +00:00
d1bb0b0e4d Mark more built-in ops as pt2_compliant (#114128)
See title

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114128
Approved by: https://github.com/ezyang
2023-11-20 20:55:55 +00:00
811bec46ef Don't DCE item nodes if they're float (#114135)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114135
Approved by: https://github.com/Skylion007
2023-11-20 20:42:27 +00:00
0c450f4504 [functorch] fix potential race condition while loading vmap decomposition library (#113520)
There can be a potential race condition while loading the `vmap` decomposition library in multi-threading programs.

This PR adds a thread lock to avoid the case of registering the kernel multiple times.

```python
import threading
from torch._functorch.vmap import lazy_load_decompositions

threads = []
for i in range(10000):
    thread = threading.Thread(target=lazy_load_decompositions)
    threads.append(thread)
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

```text
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
    VMAP_DECOMPOSITIONS_LIB.impl(decomp, decomposition_table[decomp])
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113520
Approved by: https://github.com/zou3519
2023-11-20 19:50:54 +00:00
2b97f5a9a1 Disallow fp8 type promotion (#113975)
Fixes #113663

As well as updating the promotion logic to disallow automatic type promotion between fp8 types, this PR also cleans up the table entries.
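
An illustrative repro of the new behavior (a sketch; the exact exception type and message are assumptions):

```python
import torch

a = torch.randn(2).to(torch.float8_e4m3fn)
b = torch.randn(2).to(torch.float8_e5m2)
try:
    a + b  # implicit promotion between different fp8 dtypes is disallowed
except RuntimeError as e:
    print("promotion refused:", e)
```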

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113975
Approved by: https://github.com/albanD, https://github.com/malfet
2023-11-20 19:47:43 +00:00
0bb29f9450 [dynamo] Guard on HAS_GRAPH_BREAKS if graph breaks are present (i.e. cache miss if compiled object requires nopython) (#114073)
Fixes https://github.com/pytorch/pytorch/issues/114059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114073
Approved by: https://github.com/ezyang
2023-11-20 19:32:03 +00:00
2b4c489f71 [lint] Install compatible numpy for 3.8 (#113869)
Not ideal, but better than barring lint on 3.8

Fixes https://github.com/pytorch/pytorch/issues/113864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113869
Approved by: https://github.com/albanD
2023-11-20 19:23:43 +00:00
fc39efc4c1 Fix filename typo 'funtionalized' (#114132)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114132
Approved by: https://github.com/zhxchen17, https://github.com/Skylion007
2023-11-20 19:19:25 +00:00
934e9c3346 Boolean masking backwards doesn't work even with dynamic output shape ops, break accordingly (#114126)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114126
Approved by: https://github.com/albanD
2023-11-20 19:07:37 +00:00
039a4689a2 Update sdpa doctstring to point to flash-attn-v2 (#114124)
# Summary
See title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114124
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-11-20 19:05:30 +00:00
9d2425c8a4 [dynamo] Be clearer about dict subtype source availability (#114069)
```
# [NOTE] OrderedDict, dict subtypes must always have source
# We cannot instantiate such subtypes in-graph due to builtin __new__
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114069
Approved by: https://github.com/ezyang
2023-11-20 18:49:42 +00:00
100b9952b1 [dynamo] Fix user defined object sourceless callable (#114066)
Fixes https://github.com/pytorch/pytorch/issues/114019
We do not need to guard on callable user object defined instantiated in graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114066
Approved by: https://github.com/ezyang
2023-11-20 18:38:03 +00:00
e4ec5545cd [export] Turn on verifier for serialization. (#113980)
Summary: as title.

Test Plan: CI

Differential Revision: D51435909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113980
Approved by: https://github.com/larryliu0820
2023-11-20 18:32:16 +00:00
d1ae5efa94 [torch][fsdp] More informative assertion error when rank mismatch (#113765)
Summary: I had a job fail due to rank mismatch but didn't find enough information in the assertion message. This change makes the message more informative.

Test Plan:
CI tests and I ran a test job which failed as expected:

```
Rank 1 has different values for step: 8016.0. Other ranks: 7870.0
```

Differential Revision: D51322046

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113765
Approved by: https://github.com/wz337, https://github.com/fegin
2023-11-20 17:44:41 +00:00
59bc98e4ae [EASY] Rewrite test_anomaly_aot_autograd to more reliably trigger error (#114122)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114122
Approved by: https://github.com/albanD
2023-11-20 17:42:42 +00:00
95eab508e3 [caffe2] Add non-x86 stub definition for libraryFor too (#114023)
Summary: Fix non-x86 build errors with missing `libraryFor` symbol.

Test Plan:
```
$ buck2 build -c fbcode.arch=aarch64 fbcode//admarket/adfinder:adfinder
```

Reviewed By: malfet

Differential Revision: D51444766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114023
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet
2023-11-20 17:01:47 +00:00
aeb5fd52c7 Remove dead tensor_has_hints. (#114071)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114071
Approved by: https://github.com/aakhundov
2023-11-20 16:02:24 +00:00
7d5e8c1d51 [BE][easy]: Update ruff to 0.1.6 (#114125)
Updates ruff to 0.1.6 for more bugfixes, less false positives / false negatives, and support for more rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114125
Approved by: https://github.com/albanD, https://github.com/malfet
2023-11-20 15:36:27 +00:00
cbc6873538 [Dynamo][Forward fix] Add torch.ao back to is_allowed list (#114016) (#114111)
Summary:

As title

Test Plan: Sandcastle

Reviewed By: drisspg, huydhn, voznesenskym

Differential Revision: D51445366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114111
Approved by: https://github.com/jeanschmidt
2023-11-20 14:59:33 +00:00
140c54e6cc [xla hash update] update the pinned xla hash (#110377)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110377
Approved by: https://github.com/pytorchbot
2023-11-20 10:54:05 +00:00
f36d09fcb7 Revert "Add function to materialize COW storages (#113396)"
This reverts commit e2f090086bd494ee7b25da5b8e4f48d6cf61cc98.

Reverted https://github.com/pytorch/pytorch/pull/113396 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113396#issuecomment-1818769090))
2023-11-20 10:26:01 +00:00
fe428a284b Revert "Add torch._lazy_clone to create COW tensors (#113397)"
This reverts commit 9916d8a9eaaf2c05c131f2a2dbe9eabeeaa9dffc.

Reverted https://github.com/pytorch/pytorch/pull/113397 on behalf of https://github.com/DanilBaibak due to Unfortunately, I need to revert your PR because the lower [PR in the stack](https://github.com/pytorch/pytorch/pull/113396) is failing a bunch of internal build jobs. ([comment](https://github.com/pytorch/pytorch/pull/113397#issuecomment-1818761224))
2023-11-20 10:21:09 +00:00
d40d72d664 Revert "Skip test_lazy_clone for Inductor (#114012)"
This reverts commit ecd8d388b9dec01c5abdf4978e632c9a3db34f95.

Reverted https://github.com/pytorch/pytorch/pull/114012 on behalf of https://github.com/DanilBaibak due to I revert the PR due to the original changes broke the internal build. Here is the original diff stack [D51444337](https://www.internalfb.com/diff/D51444337) ([comment](https://github.com/pytorch/pytorch/pull/114012#issuecomment-1818745425))
2023-11-20 10:12:44 +00:00
7d0339fb9a Revert "[Dynamo][Forward fix] Add torch.ao back to is_allowed list (#114016)"
This reverts commit 09fe36274acb77249a058de0d778b73b29570036.

Reverted https://github.com/pytorch/pytorch/pull/114016 on behalf of https://github.com/DanilBaibak due to The PR was exported as part of the co-dev approach and needs to merged once the internal diff will landed. ([comment](https://github.com/pytorch/pytorch/pull/114016#issuecomment-1818591191))
2023-11-20 09:32:15 +00:00
7963aaac41 add Half support for AdaptiveAvgPool2d and AdaptiveMaxPool2d on CPU (#102079)
### Testing

Single core:

AdaptiveMaxPool2d:
shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
input size: (2, 56, 264, 264), output size: (100, 100)  | 71.5826 | 78.7460 | 85.7195 | 7.3925 | 6.0618 | 6.2596
input size: (2, 56, 264, 264), output size: (50, 50)  | 28.122 | 30.8572 | 36.6366 | 6.2645 | 3.4781 | 3.6628
input size: (32, 32, 100, 100), output size: (50, 50)  | 109.2978 | 115.0330 | 121.9500 | 13.4329 | 10.2769 | 12.1975
input size: (16, 4, 300, 300), output size: (100, 100) | 34.1849 | 36.5876 | 40.9862 | 4.7719 | 4.3362 | 4.1417

28 cores:

AdaptiveMaxPool2d:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
input size: (2, 56, 264, 264), output size: (100, 100)  | 3.1809 | 3.5057 | 3.6728 | 0.6657 | 0.3138 | 0.2934
input size: (2, 56, 264, 264), output size: (50, 50)  | 1.2779 | 1.3869 | 1.5238 | 0.4223 | 0.1775 | 0.1825
input size: (32, 32, 100, 100), output size: (50, 50)  | 4.7942 | 4.9670 | 5.2330 | 1.7146 | 0.6477 | 0.7001
input size: (16, 4, 300, 300), output size: (100, 100) | 1.9522 | 2.0879 | 2.3155 | 0.4370 | 0.3175 | 0.2828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102079
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-11-20 03:01:00 +00:00
5a96a42cea [AOTI] Improve the two-pass wrapper codegen (#114067)
Summary: For the second-pass, we don't have to rerun the whole inductor flow again. This PR moves that second-pass to the codegen time. This change not only speeds up the compilation, but also removes kernel scheduling inconsistency between the two passes. Another future improvement is to make the second-pass reuse the scheduler and do the wrapper codegen only.

This is a copy of https://github.com/pytorch/pytorch/pull/113762 to land in github first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114067
Approved by: https://github.com/chenyang78
2023-11-19 23:30:36 +00:00
cyy
226384b460 [2/N] Cleanup header inclusions in torch_cpu by iwyu (#109964)
Further cleaning up of torch_cpu header inclusions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109964
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-11-19 20:56:32 +00:00
0bd4d1f4ab Add sparse tensors support to dataloader. (#112842)
Fixes https://github.com/pytorch/pytorch/issues/106837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112842
Approved by: https://github.com/cpuhrsch, https://github.com/gokulavasan
2023-11-19 16:05:27 +00:00
12f95df0e9 Eliminate unnecessary multiplications by 1 in addmm with sparse compressed tensor operand (#114026)
This PR:
- updates `torch/sparse/_triton_ops_meta.py` for the API change in `triton.testing.do_bench`
- force `num_stages` to be 1 when blocksize is 128x128 to avoid out of resources exception when `bsr_dense_mm` is called from `nn.linear`.
- as in the title. The performance of `nn.linear` on BSR tensor weights (dtypes `float16` and `bfloat16`) is increased as follows (`NVIDIA A100-SXM4-80GB`):
  - for blocksize 16x16, the average/maximum speed up is about 11/20 %
  - for blocksize 32x32, the average/maximum speed up is about 15/24 %
  - for blocksize 64x64, the average/maximum speed up is about 18/26 %
  - for blocksize 128x128, the average/maximum speed up is about 15/28 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114026
Approved by: https://github.com/cpuhrsch
2023-11-19 12:13:54 +00:00
826ab0e32d [dynamo] report guard failure user stack, fix incorrectly skipping interesting files (#114053)
Fixes https://github.com/pytorch/pytorch/issues/114015

Before:
```
test/dynamo/test_functions.py::DefaultsTests::test_zip_strict [2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'], 94696321555200)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] len(L['ys']) == 3
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'], 94696321555200)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] len(L['zs']) == 3
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][0], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][0] == 1.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][1], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][1] == 2.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][2], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][2] == 3.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][0], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][0] == 2.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][1], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][1] == 5.0
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][2], 94696321556032)
[2023-11-18 23:11:09,316] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][2] == 8.0
[2023-11-18 23:11:09,317] [0/0] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:365 in init_ambient_guards
[2023-11-18 23:11:09,317] [0/0] torch._dynamo.guards.__guards: [DEBUG] (___skip_backend_check() or ___current_backend() == ___lookup_backend(140084534469552))  # _dynamo/output_graph.py:371 in init_ambient_guards
[2023-11-18 23:11:09,317] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['x'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[3], stride=[1])
[2023-11-18 23:11:09,320] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function fn in /home/jonch/Desktop/Programming/mlsys/pytorch/test/dynamo/test_functions.py:2539
[2023-11-18 23:11:09,320] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-18 23:11:09,320] torch._dynamo.guards.__recompiles: [DEBUG]     - L['zs'][2] == 8.0

```

After:
```
test/dynamo/test_functions.py::DefaultsTests::test_zip_strict [2023-11-18 23:07:33,341] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2023-11-18 23:07:33,341] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # x = x.clone()  # test/dynamo/test_functions.py:2540 in fn
[2023-11-18 23:07:33,341] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'], 94568804551424)                     # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] len(L['ys']) == 3                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'], 94568804551424)                     # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] len(L['zs']) == 3                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][0], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][0] == 1.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][1], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][1] == 2.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['ys'][2], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['ys'][2] == 3.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][0], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][0] == 2.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][1], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][1] == 5.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['zs'][2], 94568804552256)                  # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] L['zs'][2] == 8.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:365 in init_ambient_guards
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] (___skip_backend_check() or ___current_backend() == ___lookup_backend(140370726823264))  # _dynamo/output_graph.py:371 in init_ambient_guards
[2023-11-18 23:07:33,342] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['x'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[3], stride=[1])  # x = x.clone()  # test/dynamo/test_functions.py:2540 in fn
[2023-11-18 23:07:33,346] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function fn in /home/jonch/Desktop/Programming/mlsys/pytorch/test/dynamo/test_functions.py:2539
[2023-11-18 23:07:33,346] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-18 23:07:33,346] torch._dynamo.guards.__recompiles: [DEBUG]     - L['zs'][2] == 8.0                                             # for y, z in zip(ys, zs, strict=True):  # test/dynamo/test_functions.py:2541 in fn

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114053
Approved by: https://github.com/ezyang
2023-11-19 10:24:10 +00:00
edc5ae3113 Allow for calling lift_fresh_copy manually (#113923)
In this case, the input could be fake!  Just treat it normally in that case.

Fixes https://github.com/pytorch/pytorch/issues/113331

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113923
Approved by: https://github.com/eellison, https://github.com/bdhirsh, https://github.com/leslie-fang-intel
2023-11-19 07:13:49 +00:00
72a8329ec9 [reland][aotinductor] Add example_value metadata to nodes (#113986)
Test Plan:
`TORCH_LOGS=dynamo,inductor,aot  CUDA_VISIBLE_DEVICES=7 TORCH_COMPILE_DEBUG=0 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf mode/inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /tmp/409501788/66/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR"`

Without passes:
`BS: 2048, MFLOPS/BS: 40.51, TFLOP/s: 37.32, Time per iter: 2.22ms, Threads: 1, QPS: 921146.83, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 66.15s`

With passes:
`BS: 2048, MFLOPS/BS: 40.51, TFLOP/s: 37.49, Time per iter: 2.21ms, Threads: 1, QPS: 925450.82, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 261.11s`

Differential Revision: D51436878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113986
Approved by: https://github.com/zhxchen17
2023-11-19 07:12:24 +00:00
33c6cae13b [pytorch-vulkan][5/n] Enable BMM with the new packing. Massive refactor. (#113943)
Summary:
After the refactoring of matrix multiplication, the `bmm` logic can easily be adapted using the existing `mm` code, since the only difference is on the "batch" dimension, which is not packed now. So we can simply add computation along the "z" dimension.

Further, I realized that `bias` and `beta` are simply a post-processing step after the matrix multiplication, so I have factored them out. The nice part about this factoring is that we can directly leverage the broadcasting logic; hence we don't need a separate shader just to add the bias.

So this diff massively simplifies the `mm` code.
1. Reduce 4 shaders `mm`, `addmm`, `bmm`, `baddbmm` into just `mm`.
2. Remove packing for bias.
3. Add support on `at::bmm(m1.vulkan(), m2.vulkan())` <= This is a blocking feature for the emformer models.

Test Plan:
```

% buck2 run  -c pt.has_backtraces=1 --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64
...

[       OK ] VulkanAPITest.linear_4d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (1 ms)
[ RUN      ] VulkanAPITest.lstm_success
[       OK ] VulkanAPITest.lstm_success (5 ms)
[ RUN      ] VulkanAPITest.lstm_mclareninputs_success
[       OK ] VulkanAPITest.lstm_mclareninputs_success (22 ms)
[ RUN      ] VulkanAPITest.lstm_prepack_success
[       OK ] VulkanAPITest.lstm_prepack_success (3 ms)
[ RUN      ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:8056: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 411 tests from VulkanAPITest (5754 ms total)
[----------] Global test environment tear-down
[==========] 411 tests from 1 test suite ran. (5754 ms total)
[  PASSED  ] 410 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

```

Full Paste: P884697749

Differential Revision: D51421256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113943
Approved by: https://github.com/SS-JIA
2023-11-19 06:24:30 +00:00
e3eca4c49f Revert "Convert SymInts to SymFloats with SymPy (#113683)"
This reverts commit 0ec66b3be5a53ab960872981b5027c49c2e6b7e9.

Reverted https://github.com/pytorch/pytorch/pull/113683 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing in trunk 0ec66b3be5, probably a landrace as this is not failing on your PR ([comment](https://github.com/pytorch/pytorch/pull/113683#issuecomment-1817759130))
2023-11-19 06:09:15 +00:00
fb3bc3949a [Inductor] remove GPT2ForSequenceClassification from ci skip list (#112100)
**Summary**
As discussed in https://github.com/pytorch/pytorch/issues/109019, the accuracy issue of `GPT2ForSequenceClassification` has been fixed in https://github.com/pytorch/pytorch/pull/108690. Remove it from CI Skip list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112100
Approved by: https://github.com/lezcano
2023-11-19 05:12:18 +00:00
84f791e697 Fix checking symbolic shapes inside torch._check (#113811)
Fixes https://github.com/pytorch/pytorch/issues/110719#issuecomment-1768710678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113811
Approved by: https://github.com/ezyang, https://github.com/peterbell10
2023-11-19 04:13:18 +00:00
cyy
bae61ecb96 [Reland 1] Cleanup header inclusions in torch_cpu by iwyu (#112311)
Reland https://github.com/pytorch/pytorch/pull/101178 to use IWYU on torch_cpu. The header file changes are excluded to avoid breaking internal jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112311
Approved by: https://github.com/ezyang
2023-11-19 04:06:36 +00:00
68ab458fe3 Don't recommend max_split_size_mb first (#113481)
I've run into a couple of cases now where max_split_size_mb has been set
in projects as a workaround for fragmentation, but it ends up causing problems
later, such as degraded performance from freeing empty segments. While it
is a useful setting to have, expandable_segments is probably a better first
resort for fixing fragmentation: when it works, it is less likely to require
synchronous GPU operations for the program to continue running.
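
For reference, a hedged sketch of opting into expandable_segments; the env var and option are real, the tiny script around them is illustrative:

```python
import os

# The allocator reads this variable when it initializes, so set it before
# the first CUDA allocation (safest: before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.empty(1024, device="cuda")  # allocations now use expandable segments
```
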
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113481
Approved by: https://github.com/msaroufim, https://github.com/albanD
ghstack dependencies: #113231
2023-11-19 04:05:01 +00:00
d968c4cac3 [torchelastic] ensure grandchild processes are restarted correctly (#113231)
When torchelastic notices that one rank has failed, it will send a SIGTERM
signal to the other trainer ranks to tear them down before restarting. However,
if the trainer itself launches subprocesses, or is launched by a non-python
wrapper script, then the SIGTERM will be delivered only to the direct child of
torchelastic and not to all descendants. This change opens subprocesses in a new
Linux 'session', which starts a new process group whose pgid is the same as the
trainer's pid. Then when we send signals, we deliver them to the process group
rather than just the direct child.
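
A minimal sketch of the pattern described above (not torchelastic's exact code; `wrapper.sh` is a hypothetical launcher script):

```python
import os
import signal
import subprocess

# start_new_session=True calls setsid() in the child, so the child's pgid
# equals its own pid and all of its descendants share that process group.
proc = subprocess.Popen(["bash", "wrapper.sh"], start_new_session=True)

# Later, when another rank fails, signal the whole group, not just the child:
os.killpg(proc.pid, signal.SIGTERM)
```
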
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113231
Approved by: https://github.com/H-Huang
2023-11-19 04:05:01 +00:00
958f3b0df6 [nccl-pg] Migrate to getCvar* functions for env variable checking (#113797)
Summary:
The getCvar* functions allow us to provide multiple environment variables for the same value.  This allows us to deprecate some variables in favor of others, while still allowing users to temporarily use the old variables for some time.
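
A Python sketch of the idea (the real helpers are C++ utilities in the c10d code; the alias names below are purely illustrative):

```python
import os

def get_cvar_string(aliases, default):
    # Earlier names take precedence, so the new variable wins over a deprecated one.
    for name in aliases:
        if name in os.environ:
            return os.environ[name]
    return default

val = get_cvar_string(["TORCH_NCCL_BLOCKING_WAIT", "NCCL_BLOCKING_WAIT"], "0")
```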

Test Plan: OSS CI

Reviewed By: fduwjj, XilunWu

Differential Revision: D51225487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113797
Approved by: https://github.com/fduwjj
2023-11-19 03:48:58 +00:00
09fe36274a [Dynamo][Forward fix] Add torch.ao back to is_allowed list (#114016)
Summary: As title

Test Plan: Sandcastle

Differential Revision: D51445366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114016
Approved by: https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/huydhn
2023-11-19 02:59:34 +00:00
b30580e121 [PT] Include tensor shape info in the error messages of torch split (#113984)
Summary: Include tensor shape info in the error messages of torch split.

Test Plan: CI

Differential Revision: D51436684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113984
Approved by: https://github.com/ezyang
2023-11-19 01:34:57 +00:00
0ec66b3be5 Convert SymInts to SymFloats with SymPy (#113683)
Fixes #109365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113683
Approved by: https://github.com/ezyang
2023-11-18 22:18:24 +00:00
870539670a [Dynamo] Support skip/inline function by name and consolidate skip/inline check logics (#113888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113888
Approved by: https://github.com/mlazos
2023-11-18 21:36:29 +00:00
f0dedb340f [C++] Fix clang compilation issue. (#114017)
Summary:
Clang compilation failed recently due to unfound crtbeginS.o and libgcc, but we should not be using them; for self-containedness, Clang uses compiler-rt instead. I'm also switching to the lld linker, which also comes from the clang release package.

There was another issue where glibc was not found during linking. It looks like the glibc path was passed into the linker via `-B`. It should also be passed via `-L`, which reaches the linker for library reference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114017
Approved by: https://github.com/hl475
2023-11-18 19:56:44 +00:00
11857e9a64 [Inductor] Allow autotuned argument to be anywhere in the argument list (#114002)
Prior to this PR, autotuned arguments could only be at the back of the argument list. This is an Inductor limitation, not a Triton limitation. Fixing this allows more MRS kernels to use user-defined triton kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114002
Approved by: https://github.com/aakhundov
ghstack dependencies: #113967
2023-11-18 18:19:32 +00:00
e0c3936843 [Inductor] Support ReinterpretView in inductor codegen (#113967)
Adding support for ReinterpretView in inductor so that jagged MRS kernels can use native triton kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113967
Approved by: https://github.com/aakhundov
2023-11-18 18:19:32 +00:00
ff7c06a01b Revert "limit fused kernel num args. (#113131)"
This reverts commit 7b442c2b0ae0d9c944a777d7352135f370837c15.

Reverted https://github.com/pytorch/pytorch/pull/113131 on behalf of https://github.com/albanD due to Breaks lint on trunk ([comment](https://github.com/pytorch/pytorch/pull/113131#issuecomment-1817548349))
2023-11-18 16:14:08 +00:00
b53d47a719 [inductor cpp] refactor: CppVecOverrides inherits CppOverrides (#113950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113950
Approved by: https://github.com/Skylion007
2023-11-18 15:33:30 +00:00
f8516cef88 [pytorch-vulkan][2/n] Height packing (#113883)
Summary:
Enable logic for converting a channel-packed tensor into a height-packed one.

Not yet connected with the rest of the system.

Test Plan:
```
(base) yipjustin@yipjustin-mac fbsource % buck2 run  -c pt.has_backtraces=1  --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64  -- --gtest_filter="*packing*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_quantized_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/9a0d6bd6-e4a2-4d58-8f38-f806a0703122
Network: Up: 0B  Down: 0B
Jobs completed: 4. Time elapsed: 0.1s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *packing*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN      ] VulkanAPITest.channel_to_height_packing_test
[       OK ] VulkanAPITest.channel_to_height_packing_test (35 ms)
[----------] 1 test from VulkanAPITest (35 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (36 ms total)
[  PASSED  ] 1 test.
```

Reviewed By: SS-JIA

Differential Revision: D51379737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113883
Approved by: https://github.com/SS-JIA
2023-11-18 09:46:48 +00:00
fdaddec2c3 make_fx can now SymIntify int inputs (#113452)
This PR also contains a basket of fixes that were turned up now that more arguments are tested with SymInt. I fixed as many of the easy ones as I could, some earlier in this stack and a bunch here, but there are some more annoying ones that I xfailed.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113452
Approved by: https://github.com/Chillee
ghstack dependencies: #113877, #113911
2023-11-18 06:39:09 +00:00
33f7c6638f Guard when fetching non-symbolic value out of Scalar (#113911)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113911
Approved by: https://github.com/voznesenskym
ghstack dependencies: #113877
2023-11-18 06:39:09 +00:00
bc0d87cde3 Explicitly enumerate all method to operator mappings (#113968)
This is useful for documentary purposes, since these are precisely the
operators you need to understand to deal with int/float compute inside
make_fx traced graphs with symbolic ints/floats.
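
An illustrative subset of what such a mapping looks like (an assumption for exposition, not the PR's exact table):

```python
import operator

# Magic-method names on symbolic ints/floats paired with the plain Python
# operator that implements each of them.
METHOD_TO_OPERATOR = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "truediv": operator.truediv,
    "floordiv": operator.floordiv,
    "mod": operator.mod,
    "eq": operator.eq,
    "lt": operator.lt,
    "neg": operator.neg,
}
```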

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113968
Approved by: https://github.com/Skylion007
2023-11-18 05:43:39 +00:00
ecd8d388b9 Skip test_lazy_clone for Inductor (#114012)
Half of those tests fail when run individually, but the first failure masks all subsequent ones, e.g.
```
PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -v -k test_lazy_clone_cuda_float32
test_lazy_clone_cuda_float32 (__main__.TestTorchDeviceTypeCUDA) ... FAIL
...
   self.assertTrue(torch._C._is_cow_tensor(t))
AssertionError: False is not true
----------------------------------------------------------------------
Ran 1 test in 19.419s

FAILED (failures=1)
```
But
```
$ PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -k test_lazy_clone_
...
......................
----------------------------------------------------------------------
Ran 24 tests in 24.969s

OK
```
This flaky behavior was already detected, for example see https://github.com/pytorch/pytorch/issues/113953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114012
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-11-18 04:57:00 +00:00
caffa44b1c Correctly use real boolean operators, not bitwise in shape guard prints (#113927)
Fixes https://github.com/pytorch/pytorch/issues/113875

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113927
Approved by: https://github.com/voznesenskym
2023-11-18 04:24:45 +00:00
7b442c2b0a limit fused kernel num args. (#113131)
Fixes #97361

When a fused kernel has more than 1024 parameters, ctypes throws an error.
Limiting the number of args is a mechanism to protect stack memory: C++ passes args via the stack, and stack memory has a size limit.

Code change:

1. The cpp backend checks the fused nodes' arg count; if it reaches the limit, it sets the flush status to ready.
2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen.
3. Add a `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False since it does not support this yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
2023-11-18 03:55:52 +00:00
5e30741754 Clean up optimizer imports in test_optim (#113971)
This is purely a cosmetic change to set up for my optimizer infos, which will benefit from not needing to type optim.SparseAdam or whatever.

The next step is actually adding the OptimizerInfos, similar to my attempt in https://github.com/pytorch/pytorch/pull/102774/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113971
Approved by: https://github.com/cpuhrsch
2023-11-18 03:52:01 +00:00
46542f6ce2 [reland][export] make aot_export_module uses dynamo's fake_mode (#114009)
Retry landing https://github.com/pytorch/pytorch/pull/113681

Fixes https://github.com/pytorch/pytorch/issues/110100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114009
Approved by: https://github.com/angelayi
2023-11-18 03:36:34 +00:00
310e3060b7 [Caffe2] Handle cpuinfo_initialize() failure (#114011)
It can fail on the ARM platform if the `/sys` folder is not accessible.
In that case, call `std::thread::hardware_concurrency()`, which is
aligned with the thread_pool initialization logic of `c10::TaskThreadPoolBase::defaultNumThreads()`

Further addresses issue raised in https://github.com/pytorch/pytorch/issues/113568
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114011
Approved by: https://github.com/kit1980
ghstack dependencies: #113771
2023-11-18 03:20:22 +00:00
855a5cf427 312 test fix in named tensor and TS deprecations (#113981)
Fix existing bugs / deprecations that become hard errors when running CI with Python 3.12

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113981
Approved by: https://github.com/malfet
2023-11-18 03:06:04 +00:00
4667e20b3f Delete a bunch of type-ignores (#113990)
* Replaced `ignore[import]` by mypy config file entries
* Removed a bunch of ignores around previously-fixed attr-defined /
  call-arg issues
* Fixed some invalid / undefined types; added a few more type-ignores to
  squelch the downstream errors this exposed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113990
Approved by: https://github.com/eellison, https://github.com/Skylion007
ghstack dependencies: #113979
2023-11-18 02:48:38 +00:00
47220bc72a fixes multiple GPU detected error for test_fsdp_fine_tune.py (#112406)
fixes "Duplicate GPU detected : rank 1 and rank 0 both on CUDA device" on  test_fsdp_fine_tune.py. Only run the test if GPU number > 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112406
Approved by: https://github.com/awgu
2023-11-18 02:07:18 +00:00
1567917e5a [ROCm] Enable several inductor UTs (#112777)
- test_compiled_optimizers.py
- test_foreach.py
- test_profiler.py
- Fix test_profiler.py:test_inductor_profiling_triton_launch - Look for hipModuleLaunchKernel in the events list for AMD GPUs instead of cuLaunchKernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112777
Approved by: https://github.com/jataylo, https://github.com/malfet
2023-11-18 02:05:57 +00:00
b169f04170 [ONNX] Fix bench w/ iobinding; Remove cpu fallback (#113703)
Summary
- `TORCH_TO_NUMPY_DTYPE` was misplaced previously hence subclasses cannot access it.
- Remove cpu fallback when benching onnx with gpu, expose gpu run failures properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113703
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #113404, #113697
2023-11-18 01:33:06 +00:00
d4189d8007 Extend _TestONNXRuntime to reuses all tests for new model format (#112289)
`_TestONNXRuntime` has infra to test models which are either Callable or a `torch.nn.Module`.

After #111497, we want to re-run all those tests for models of type `torch.export.ExportedProgram`.

This PR adds to `self.run_test_with_fx_to_onnx_exporter_and_onnx_runtime` the capability of detecting the model type to be tested and exporting the incoming `torch.nn.Module` model to `torch.export.ExportedProgram` before running ONNX export tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112289
Approved by: https://github.com/titaiwangms
2023-11-18 00:27:56 +00:00
2efa89a388 [torch/csrc/onnx] Use nested namespaces (3/N) (#113993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113993
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #113991, #113992
2023-11-18 00:20:19 +00:00
d6744a698c [torch/csrc/onnx] Use nested namespaces (2/N) (#113992)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113992
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #113991
2023-11-18 00:20:19 +00:00
c83a897348 [torch/csrc/onnx] Use nested namespaces (1/N) (#113991)
Differential Revision: [D51439849](https://our.internmc.facebook.com/intern/diff/D51439849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113991
Approved by: https://github.com/ZainRizvi
2023-11-18 00:20:10 +00:00
e360f4c6dd [DTensor] Renamed shard_spec -> placements in test file (#113917)
Public APIs like `from_local` and `distribute_tensor` name the argument as `placements`, not `shard_spec` anymore. This was a direct find and replace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113917
Approved by: https://github.com/wanchaol
ghstack dependencies: #113654, #113903
2023-11-18 00:13:30 +00:00
8372983fe3 [AOTInductor] Use ProxyExecutor for aten op if c-shim is missing (#113918)
Summary:
As discussed in the meeting, we are inverting the policy on the use of proxy executor for aten fallbacks.
By default, aten fallback ops will use proxy executor, unless a c-shim is available.

Test Plan: CIs

Differential Revision: D51417683

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113918
Approved by: https://github.com/chenyang78
2023-11-18 00:04:21 +00:00
dab272eed8 [td] Consistent pytest cache (#113804)
Move the pytest cache downloading into the build step and store it in additional ci files so that it stays consistent during sharding.

Only the build env is taken into account now, instead of also the test config, since we might not have the test config at build time. This makes the cache key less specific, but I also think this might be better, since tests are likely to fail across the same test config. (I also think it might be worth not even looking at the build env, but that's a different topic.)

Each cache upload should only include information from the current run.  Do not merge current cache with downloaded cache during upload (shouldn't matter anyways since the downloaded cache won't exist at the time)

From what I can tell of the S3 retention policy, pytest cache files will be deleted after 30 days (cc @ZainRizvi to confirm), so we never have to worry about space or pulling old versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113804
Approved by: https://github.com/ZainRizvi
2023-11-17 23:45:47 +00:00
033d7b670a [Dynamo][6.1/N] Refactor out TorchInGraphFunctionVariable and improve heuristic (#113432)
This is splitted from #113009, please check https://github.com/pytorch/pytorch/pull/113009#issuecomment-1804417925 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113432
Approved by: https://github.com/ezyang
2023-11-17 23:42:00 +00:00
3fc38e6c83 [GHF] Abort merge on rebase failure (#113960)
Abort merges invoked with `-r` if there is nothing to rebase

Make `rebase_onto`/`rebase_ghstack_onto` return False if rebase is no-op and abort merge in that case

Remove `-e` option from both trymerge and tryrebase workflows as  one should never report failures on workflow dispatch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113960
Approved by: https://github.com/clee2000
2023-11-17 23:11:00 +00:00
a450c784da [AotAutograd] Move mutations hidden from autograd in graph (#113454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113454
Approved by: https://github.com/bdhirsh
2023-11-17 22:47:06 +00:00
4d8c73b2b7 Trivial fix for minor typo in torch.jit._script.py (#113892)
Trivial PR to close an open issue regarding a typo.  Looked for more typos in file, but found none.

Fixes #113866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113892
Approved by: https://github.com/janeyx99
2023-11-17 22:20:21 +00:00
e736d27e38 [inductor] Fix slice scatter shape calculation (#113838)
Fixes #113641

As written, there is an off-by-one error whenever `end - start` doesn't evenly
divide into `step`. e.g. if `end - start = 1` and `step = 2` we should get a
single element but `1 // 2 == 0` so this wouldn't take anything from the slice.
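
The fix in one line of arithmetic: the slice length needs ceiling division rather than floor division.

```python
start, end, step = 0, 1, 2
wrong = (end - start) // step             # 0: floor division drops the element
right = (end - start + step - 1) // step  # 1: ceiling division keeps it
assert right == len(range(start, end, step))
```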

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113838
Approved by: https://github.com/Chillee
2023-11-17 22:09:35 +00:00
e5102ccd27 [quant][pt2] Support conv1d-bn QAT fusion (#113714)
Summary: Previously the PT2 QAT code only supported conv2d-bn.
This commit extends all existing QAT fusion support to conv1d-bn,
including support for all variants like relu, no bias, literal
args, cuda etc. This commit also refactors the code such that
we can support conv3d-bn easily in the future.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar

Differential Revision: [D51428979](https://our.internmc.facebook.com/intern/diff/D51428979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113714
Approved by: https://github.com/jerryzh168
2023-11-17 22:09:30 +00:00
d40d2709c9 Minor fix in Unit Test test_max_autotune.py (#113889)
The benchmark method of TestBenchmarkRequest accesses a non-existent property in a codepath. Looks like a typo, this fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113889
Approved by: https://github.com/Skylion007
2023-11-17 21:51:56 +00:00
5d439b07ca Fix failing test_mkldnn_pattern_matcher if built without MKL (#113949)
The test checks for the `mkldnn_fusion.linear` pass, which checks `_is_packable_linear`, which in turn depends on `torch._C.has_mkl`. So skip the test, as it would otherwise fail because no pattern matches are counted.

See https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/mkldnn_fusion.py#L827

CC @XiaobingSuper as the author of the test.

Not sure how many other test are affected by similar issues but this is the one in pattern matcher I see failing.

Strangely the first part of the test succeeds where `bias = True` as it finds a match for `unfuse_bias_add_to_pointwise` (torch/_inductor/fx_passes/post_grad.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113949
Approved by: https://github.com/jansel
2023-11-17 21:29:10 +00:00
69d9267c4f [BE]: ruff - enable PIE804 (#113951)
Enables ruff PIE804 which kills some more unnecessary temporary dicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113951
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-17 21:23:02 +00:00
4b1583fe57 type-ignore issues exposed by import following (#113979)
Some new errors were introduced in a land-race with
https://github.com/pytorch/pytorch/pull/113830. Silence them for now to
get the lintrunner job green again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113979
Approved by: https://github.com/huydhn
2023-11-17 21:20:09 +00:00
0885c58296 Add Bfloat16 scalar support to gloo backend (#113557)
Support for bfloat16 scalars was missing. When I use the gloo backend
`torch.distributed.init_process_group(backend='gloo')`
and run
`torch.nn.parallel.DistributedDataParallel(model)`
and _model_ has bfloat16 features, I receive the following error:
`RuntimeError: Invalid scalar type`

This change fixes the issue.
c10::BFloat16 defines conversions from/to float, so calculations for bfloat16 are performed in float.
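
A minimal shape of the failing scenario (a sketch; it assumes a multi-process launch such as torchrun, and gloo runs on CPU so no GPUs are needed):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")
model = torch.nn.Linear(4, 4).to(torch.bfloat16)
ddp = DDP(model)  # previously raised "RuntimeError: Invalid scalar type"
```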

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113557
Approved by: https://github.com/XilunWu, https://github.com/jgong5
2023-11-17 21:16:54 +00:00
c435b8c10a Fix autograd engine callback error propagation from device thread (#113702)
The existing try-catch doesn't work because it doesn't call err.persist(). This is in contrast to the try-catch for evaluate_function which does work because it calls into python_engine's thread_on_exception which calls persist.

Calling persist on a python_error stashes the PyErr state from the thread-local PyThreadState onto the python_error object, so that when this error object is stored onto the future and passed back to the calling cpu thread, python_engine's execute try-catch can then err.restore() the error state. Finally, the python_engine's execute would re-raise so that this is re-caught by the HANDLE_TH_ERRORS macro.

Fixes https://github.com/pytorch/pytorch/issues/75750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113702
Approved by: https://github.com/albanD
2023-11-17 20:17:02 +00:00
957312a4cf [ONNX] Relax unsupported node analysis on complex dtype (#113785)
In cases like #113444, users usually stop at UnsupportedNodeAnalysis with unsupported-node information. Although in SARIF they can clearly see it's due to lack of COMPLEX support, the on-screen error message only shows the original FX node name, such as `aten.mul.Tensor`. ~~This PR catches the information from diagnostic messages and reveals it to users.~~

The root cause is that UnsupportedNodeAnalysis leverages `onnxfunction_dispatcher.get_function_overloads()` to decide whether an ATen op is supported. However, in `onnxfunction_dispatcher.get_function_overloads()`, lack of complex function support is treated as unsupported. This PR instead defines unsupported FX nodes as those not in the registry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113785
Approved by: https://github.com/thiagocrepaldi
2023-11-17 20:11:20 +00:00
76bf10e551 Revert "Fix checking symbolic shapes inside torch._check (#113811)"
This reverts commit 4f8cb52ed94bcdce16c421d7a5e3e9d32acfa439.

Reverted https://github.com/pytorch/pytorch/pull/113811 on behalf of https://github.com/huydhn due to This one still break inductor tests on main 4f8cb52ed9 ([comment](https://github.com/pytorch/pytorch/pull/113811#issuecomment-1817001514))
2023-11-17 19:56:02 +00:00
c51827b8ce [ez] Hash update to reuse issues again (#113961)
The bot that creates the issue got changed, but the search did not, so it wasn't finding old PRs and was just making new ones.

This PR makes it reuse PRs again instead of making a new one everytime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113961
Approved by: https://github.com/huydhn
2023-11-17 19:06:38 +00:00
ac08022137 [BE][benchmarks] Minor comment cleanup, typos (#113898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113898
Approved by: https://github.com/desertfire
2023-11-17 19:03:41 +00:00
00b67193ef [utils] move config_typing.pyi to torch.utils (#113929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113929
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #111299, #111300, #113901, #113916
2023-11-17 18:51:57 +00:00
a7b701ed21 Update ExecuTorch pinned commit daily (#113832)
WIP

* [X] Update this pinned commit periodically, similar to https://github.com/pytorch/pytorch/pull/113499
* [ ] Increase ET coverage on PT CI, ideally, we should run all ET pull jobs?
* [ ] Switch ExecuTorch's torch, vision, and audio nightly pins to commit pins
* [ ] Update ExecuTorch's torch, vision, and audio commit pins periodically

### Testing

`python .github/scripts/update_commit_hashes.py --repo-name executorch --branch main --pin-folder .ci/docker/ci_commit_pins`

The testing PR is https://github.com/pytorch/pytorch/pull/113834

(I will move the pinned commit out of the Docker image if the Docker build process is flaky; otherwise, refreshing the Docker image daily seems like a good way to catch issues with the images early.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113832
Approved by: https://github.com/clee2000
2023-11-17 18:38:46 +00:00
d4bb16f443 Change functorch import to proxy_tensor import (#113913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113913
Approved by: https://github.com/ezyang, https://github.com/zou3519
2023-11-17 18:32:50 +00:00
631fb33fd6 Enable import following in MYPYNOFOLLOW (now MYPYINDUCTOR) (#113830)
Skipping importing some packages for now to make this change more
tractable.

For some reason, lintrunner on CI raises errors in all imported `.pyi` files,
even though it doesn't on my local machine. The errors are all from missing
generic types, as the MYPYINDUCTOR config has `disallow_any_generics`
set. I have thus added `disable-error-code` comments to the relevant files,
though I fixed a few that were easy enough.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113830
Approved by: https://github.com/Skylion007
ghstack dependencies: #113722, #113721
2023-11-17 18:24:21 +00:00
0c8362de1a [dynamo] Make {guards,eval_frame}.py pass follow_imports typechecking (#113721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113721
Approved by: https://github.com/Skylion007
ghstack dependencies: #113722
2023-11-17 18:24:21 +00:00
e2b114ab9f [BE] Package dynamic_dims/constraint_dims into CreateSymbolicPolicy (#113802)
This will make it more convenient to propagate more information through
all of these functions in the future (e.g., for storage offset
information.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113802
Approved by: https://github.com/davidberard98, https://github.com/voznesenskym
2023-11-17 18:22:46 +00:00
dc3d0caab3 BUG: fix np.ndarray.resize under dynamo (#113931)
Make sure ndarray.resize actually works in-place, so that dynamo does the right thing tracking the result.

Fixes #113539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113931
Approved by: https://github.com/lezcano
2023-11-17 18:12:17 +00:00
6849d75300 Automated submodule update: FBGEMM (#112312)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 049f2a9ac6

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112312
Approved by: https://github.com/malfet
2023-11-17 17:46:18 +00:00
7c35874ad6 Fix for PyTorch mobile flatbuffer loader out of bounds reads (#110162)
Summary:
The mobile_ivalue_size field in the mobile_bytecode flatbuffer schema can be larger than the ivalues vector. This introduces potential for memory corruption when parsing the mobile_bytecode Module.

This diff fixes the issue by ensuring that mobile_ivalue_size is less than the size of the ivalues vector.

Test Plan: contbuild & OSS CI

Differential Revision: D49687548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110162
Approved by: https://github.com/malfet
2023-11-17 17:29:07 +00:00
9f47580ad7 [BE] Don't mutate torch.compile global config in tests (#113882)
We should uniformly use `config.patch` so the configuration changes don't affect
other tests.
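
A minimal sketch of the pattern, using `torch._dynamo.config` as an example (the inductor config module exposes the same `patch` helper):

```python
from torch._dynamo import config

# bad: a bare assignment leaks the setting into every test that runs afterwards
# config.suppress_errors = True

# good: the override is reverted when the block exits
with config.patch(suppress_errors=True):
    assert config.suppress_errors is True
assert config.suppress_errors is False  # restored outside the block
```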

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113882
Approved by: https://github.com/lezcano
2023-11-17 16:49:48 +00:00
4f8cb52ed9 Fix checking symbolic shapes inside torch._check (#113811)
Fixes https://github.com/pytorch/pytorch/issues/110719#issuecomment-1768710678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113811
Approved by: https://github.com/ezyang, https://github.com/peterbell10
2023-11-17 16:14:02 +00:00
dbb96ef30d improve annotation device parameters where a device ordinal is allowed (#113647)
Using mypy in code that depends on pytorch, I noticed that the type annotation doesn't allow a device ordinal.

`error: Argument "device" to "to_empty" of "Module" has incompatible type "int"; expected "str | device"  [arg-type]`
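
A hypothetical illustration of what the widened annotation permits; note that a bare ordinal resolves to a CUDA device at runtime, so this is primarily a typing example:

```python
import torch

m = torch.nn.Linear(2, 2)
m = m.to_empty(device=0)  # previously mypy rejected int; expected str | device
```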

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113647
Approved by: https://github.com/albanD
2023-11-17 14:41:22 +00:00
a56af02913 [dynamo] Added support for is_contiguous with dynamic shapes (#113645)
Description:
- Added support for `x.is_contiguous` with dynamic shapes

On `main` the following code is giving a graph break:
```python
import torch

@torch.compile(backend="eager", dynamic=True, fullgraph=True)
def f(x):
    if x.is_contiguous():
        return x
    else:
        return 0

x = torch.randn(13, 14)
f(x)
```
with the error message:
```
  File "pytorch/torch/_dynamo/variables/builder.py", line 1541, in wrap_fx_proxy_cls
    unimplemented(
  File "pytorch/torch/_dynamo/exc.py", line 193, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: torch.* op returned non-Tensor bool call_method is_contiguous

from user code:
   File "check_is_contig_dynamic_true.py", line 37, in f
    if x.is_contiguous():
```

This PR fixes the issue.
```
TORCH_COMPILE_DEBUG=1 python check_is_contig_dynamic_true.py
[2023-11-14 15:49:04,399] [0/0] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing f check_is_contig_dynamic_true.py:34
[2023-11-14 15:49:04,403] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line check_is_contig_dynamic_true.py:34 in f ()
[2023-11-14 15:49:04,403] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]     @torch.compile(backend="eager", dynamic=True, fullgraph=True)
[2023-11-14 15:49:04,405] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line check_is_contig_dynamic_true.py:37 in f (f)
[2023-11-14 15:49:04,405] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]         if x.is_contiguous():
[2023-11-14 15:49:04,405] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST x []
[2023-11-14 15:49:04,405] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_ATTR is_contiguous [LazyVariableTracker()]
[2023-11-14 15:49:04,804] [0/0] torch._dynamo.output_graph: [DEBUG] create_graph_input L_x_ L['x']
[2023-11-14 15:49:04,805] [0/0] torch._dynamo.variables.builder: [DEBUG] wrap_to_fake L['x'] (5, 4) [<DimDynamic.DUCK: 1>, <DimDynamic.DUCK: 1>] [None, None]
[2023-11-14 15:49:04,839] [0/0] torch._dynamo.output_graph: [DEBUG] create_graph_input s0 L['x'].size()[0]
[2023-11-14 15:49:04,840] [0/0] torch._dynamo.output_graph: [DEBUG] create_graph_input s1 L['x'].size()[1]
[2023-11-14 15:49:04,840] [0/0] torch._dynamo.output_graph: [DEBUG] create_graph_input s2 L['x'].stride()[0]
[2023-11-14 15:49:04,840] [0/0] torch._dynamo.output_graph: [DEBUG] create_graph_input s1 L['x'].stride()[1]
[2023-11-14 15:49:04,840] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 0 [GetAttrVariable(TensorVariable(), is_contiguous)]
[2023-11-14 15:49:04,843] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE POP_JUMP_IF_FALSE 12 [ConstantVariable(bool)]
[2023-11-14 15:49:04,844] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line check_is_contig_dynamic_true.py:42 in f (f)
[2023-11-14 15:49:04,844] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]             return 0
[2023-11-14 15:49:04,844] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST 0 []
[2023-11-14 15:49:04,844] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE RETURN_VALUE None [ConstantVariable(int)]
[2023-11-14 15:49:04,844] [0/0] torch._dynamo.convert_frame: [DEBUG] Skipping frame because no content in function call f                     check_is_contig_dynamic_true.py 34
[2023-11-14 15:49:04,844] [0/0] torch._dynamo.convert_frame: [DEBUG] No graph captured with one_graph=True
[2023-11-14 15:49:04,848] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2023-11-14 15:49:04,848] torch._dynamo.utils: [INFO] Function                           Runtimes (s)
[2023-11-14 15:49:04,848] torch._dynamo.utils: [INFO] -------------------------------  --------------
[2023-11-14 15:49:04,848] torch._dynamo.utils: [INFO] _compile.<locals>.compile_inner          1.2083
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113645
Approved by: https://github.com/lezcano
2023-11-17 12:32:38 +00:00
3df2c42921 [dynamic_shapes] SymNode's hint does not always conform to pytype (#113848)
Fixes https://github.com/pytorch/pytorch/issues/113393

Another chapter in the story of Python's horrible handling of int <-> bool interactions.

```python
print(True and 1)  # 1
print(1 and True)  # True
print(True or 1)  # True
print(1 or True)  # 1
```
For sanity's sake, since we have defined more sane type promotion rules, let's use those and ensure `out_hint` conforms to `SymNode`'s `pytype`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113848
Approved by: https://github.com/ezyang
2023-11-17 11:28:55 +00:00
a5e4d4f25f [dynamo] promote skipfiles logging to verbose (#113916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113916
Approved by: https://github.com/ezyang
ghstack dependencies: #111299, #111300, #113901
2023-11-17 10:00:44 +00:00
b62230a685 [dynamo] remove unused OptimizeCtx field - export (#113901)
This is only an internal API, so it's not really a BC breaking concern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113901
Approved by: https://github.com/ezyang
ghstack dependencies: #111299, #111300
2023-11-17 10:00:44 +00:00
78318d0249 [dynamo] Cache size calc for differing config (#111300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111300
Approved by: https://github.com/ezyang
ghstack dependencies: #111299
2023-11-17 09:59:58 +00:00
5927e9cbf2 [dynamo] guarded config (#111299)
---

Fixes: https://github.com/pytorch/pytorch/issues/110682

Replaces: https://github.com/pytorch/pytorch/pull/111074

The guards are installed based on the config that is valid at the call to `torch.compile`, rather than at any subsequent call / triggered compilation. Subsequent compilations will restore the saved config if the current global config does not match it.

TODO:
- [X] add tests

Follow up PRs:
- [x] add revised cache size computation (follow up PR: #111300 , based on: https://github.com/pytorch/pytorch/pull/107496)
- [ ] handle run-only mode?
- [ ] config restoration itself is not thread-safe (tracked: https://github.com/pytorch/pytorch/issues/111150)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111299
Approved by: https://github.com/ezyang
2023-11-17 09:59:58 +00:00
7731c97e06 Revert "Fix checking symbolic shapes inside torch._check (#113811)"
This reverts commit 7f224f6714419f3d56e64a66079340b0e914a2ca.

Reverted https://github.com/pytorch/pytorch/pull/113811 on behalf of https://github.com/jeanschmidt due to Breaking inductor tests on main ([comment](https://github.com/pytorch/pytorch/pull/113811#issuecomment-1816024288))
2023-11-17 09:29:45 +00:00
f27ab241a4 [dynamo] Fix UnspecializedNNModuleVariable's source (#113852)
Fixes https://github.com/pytorch/pytorch/issues/113041

In the case where we have an object represented as an UnspecializedNNModuleVariable, the source of an attribute on that object is `AttrSource(base=NotNNModuleSource(base=NNModuleSource(base=AttrSource(base=LocalSource(local_name='self', cell_or_freevar=False), member='seq'))), member='b')`. This causes dynamo to add an extra attribute as it doesn't go to this [`register_attr` step](eddce3c054/torch/_dynamo/variables/builder.py (L955-L962)).

However if we have an object represented as a UserDefinedObjectVariable, the source of an attribute on that object is `AttrSource(base=NNModuleSource(base=AttrSource(base=LocalSource(local_name='self', cell_or_freevar=False), member='seq')), member='b')`.

It seems that UnspecializedNNModuleVariables should behave in the same way as UserDefinedObjectVariables, but the sources in these two cases are different. So, I removed the part that changes the source in the UnspecializedNNModuleVariables, and it seems to work! And CI is green (+ reduced graph breaks).

```
   def test_unspecialized_nnmodule(self):
        class TestModule(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.a = torch.tensor(1.0)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return x + self.a

        def forward_hook(
            module: torch.nn.Module, inputs, output
        ) -> torch.Tensor:
            return 2 * output

        seq = torch.nn.Sequential(TestModule()).eval()
        seq.b = torch.tensor(2)
        handle = seq.register_forward_hook(forward_hook)

        class M(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.seq = seq

            def forward(self, x):
                # self.seq.b has source: AttrSource(base=NotNNModuleSource(base=NNModuleSource(base=AttrSource(base=LocalSource(local_name='self', cell_or_freevar=False), member='seq'))), member='b')
                return self.seq(x) + self.seq.b

        inp = (torch.randn(2, 8),)
        ep = export(M(), inp)
```
```
    def test_user_defined_var(self):
        class TestModule(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.a = torch.tensor(1.0)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return x + self.a

        class UserDefined:
            def __init__(self):
                self.test_module = TestModule()
                self.b = torch.tensor(2)

            def __call__(self, x):
                return self.test_module(x)

        class M(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.seq = UserDefined()

            def forward(self, x):
                # self.seq.b has source: AttrSource(base=NNModuleSource(base=AttrSource(base=LocalSource(local_name='self', cell_or_freevar=False), member='seq')), member='b')
                return self.seq(x) + self.seq.b

        inp = (torch.randn(2, 8),)
        ep = export(M(), inp)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113852
Approved by: https://github.com/yanboliang
2023-11-17 08:17:27 +00:00
7c38b76efe Make offsets dynamic by default (#113734)
Copied from @ezyang 's #113693.

The motivation for this change is that we'd like to guard on storage offset in inductor, to make assumptions about data alignment.

create_symbolic_sizes_strides_storage_offset() creates the sizes/strides/offset for fake tensors - they can either be integers or symints. This PR changes storage_offset to always be dynamic. In variables/builder.py, we remove a conditional so that all tensors get added to tracked_fakes. This is because the storage offset will be dynamic even if the other logic in builder.py suggests that it will be static; otherwise, we run into this issue:

1e260c851b/torch/fx/experimental/symbolic_shapes.py (L892-L895)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113734
Approved by: https://github.com/ezyang
2023-11-17 07:57:21 +00:00
c94fdebd3e [dynamo] chore: Fallback on const_handler instead of special-casing on ConstantVariable (#113893)
Fixes https://github.com/pytorch/pytorch/pull/113874#issuecomment-1815269686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113893
Approved by: https://github.com/ezyang
2023-11-17 07:46:58 +00:00
c233cef8fd [dynamo] Enforce lifetime of output fx graph and its metadata (#113517)
Fixes https://github.com/pytorch/pytorch/issues/113516

Also asserts that by the time we modify the output's graph nodes, we are in the irreversible state of `should_exit`.

Remove `creation_timestamp` from graph as it is only consumed by dynamo for checkpoint restore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113517
Approved by: https://github.com/ezyang
2023-11-17 07:34:43 +00:00
16da135550 More replacing assert with CUDA_KERNEL_ASSERT in kernels (#113563)
Fixes #103973

**Background:**
After https://github.com/pytorch/pytorch/pull/113098, users verified that torch.sum() worked in environments where PCIe atomics had been exposed as a problem for such operations.

**Goal:**
This extends the changes to other kernels where assert is called. The goal is the same: once the call sites consistently use CUDA_KERNEL_ASSERT, we can easily disable kernel assertions for those users.

**Test:**
We built wheels with these fixes for the users who had the PCIe atomics issue, and they verified that they can now run their workflows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113563
Approved by: https://github.com/jeffdaily, https://github.com/ezyang
2023-11-17 07:28:00 +00:00
015fd2eb41 [NCCL PG] Add dumping flight recorder in the NCCL watchdog timeout (#113678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113678
Approved by: https://github.com/XilunWu
ghstack dependencies: #113503
2023-11-17 07:00:41 +00:00
0ea126e834 add use_fake_all_gather and use_fake_reduce_scatter to FSDP for ablation studies (#113106)
Summary: As titled
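
Since the summary is terse, here is a hedged sketch of the presumed usage, inferred from the title; the exact placement of these flags on the FSDP constructor is an assumption, and a real run requires an initialized process group:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

module = torch.nn.Linear(8, 8).cuda()
model = FSDP(
    module,
    use_fake_all_gather=True,      # assumed flag: skip real all-gather traffic
    use_fake_reduce_scatter=True,  # assumed flag: skip real reduce-scatter traffic
)
```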

Test Plan: Not needed because this is only for doing ablation studies

Reviewed By: awgu

Differential Revision: D50867908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113106
Approved by: https://github.com/awgu
2023-11-17 05:43:30 +00:00
4979f9c0d7 [EASY] Support SymInt tracing on broadcast_shapes (#113877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113877
Approved by: https://github.com/Skylion007
2023-11-17 04:43:57 +00:00
e8ee14292e Export _C in torch/__init__.py explicitly with from . import (#113887)
This is now required with mypy 1.7. See release blog post: https://mypy-lang.blogspot.com/2023/11/mypy-17-released.html under the heading "New Rules for Re-exports".

Under normal circumstances this isn't noticeable, but when the setting
```
implicit_reexport = false
```
is used in the mypy config file, then mypy can't find `torch._C` when only `torch` has been imported.
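
A minimal sketch of an explicit re-export that mypy 1.7 accepts even with `implicit_reexport = false` (the exact edit in torch/__init__.py may differ); inside a package's `__init__.py`:

```python
from . import _C as _C  # the "as X" form marks the import as an intentional re-export
```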
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113887
Approved by: https://github.com/Skylion007
2023-11-17 03:32:14 +00:00
7f224f6714 Fix checking symbolic shapes inside torch._check (#113811)
Fixes https://github.com/pytorch/pytorch/issues/110719#issuecomment-1768710678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113811
Approved by: https://github.com/ezyang, https://github.com/peterbell10
2023-11-17 03:05:49 +00:00
237cbd5be6 BUG: trace frames with numpy scalar -> ndarray functions (#112959)
Fixes #112951

Make dynamo detect that `np.arange(3)` returns a FakeTensor, so the frame needs to be traced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112959
Approved by: https://github.com/lezcano
2023-11-17 03:00:24 +00:00
99b89db174 [DTensor] Added op_call in no-mesh dispatch assert message (#113903)
This helps debug, e.g. when there is an unsupported op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113903
Approved by: https://github.com/wanchaol
ghstack dependencies: #113654
2023-11-17 02:44:54 +00:00
0894981f6c [HigherOrderOp][BE] change _make_inlined check callable() (#113881)
A follow up of discussion #113814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113881
Approved by: https://github.com/Skylion007
2023-11-17 02:44:12 +00:00
ae94c7e491 [dtensor] add foreach_zero_ support (#113897)
This PR adds foreach_zero_ op support, fixing the case where
optim.zero_grad(set_to_none=False) hits this op and errors out with a
"device mesh not found" issue.

Also moves the test to call zero_grad as the last step, as that's when we'll
have DTensors as grads.
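
An illustrative repro shape, assuming `model` holds DTensor parameters and `inp` is a matching input (requires an initialized device mesh, so this is not runnable standalone):

```python
import torch

opt = torch.optim.SGD(model.parameters(), lr=0.1)
model(inp).sum().backward()        # grads materialize as DTensors here
opt.zero_grad(set_to_none=False)   # dispatches aten._foreach_zero_ on DTensor grads
```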

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113897
Approved by: https://github.com/awgu
2023-11-17 02:11:19 +00:00
9916d8a9ea Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833
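
A minimal sketch of the private API this adds:

```python
import torch

a = torch.randn(4)
b = torch._lazy_clone(a)   # copy-on-write clone: storage is shared with `a`
assert torch.equal(a, b)
b.add_(1.0)                # the first write materializes b's own storage
assert not torch.equal(a, b)
```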

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
ghstack dependencies: #113396
2023-11-17 01:58:51 +00:00
e2f090086b Add function to materialize COW storages (#113396)
Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113396
Approved by: https://github.com/ezyang
2023-11-17 01:58:51 +00:00
a9134fa99a Skip cudagraphs when there is sparsity (#113791)
Fix for dlrm training

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113791
Approved by: https://github.com/Chillee
2023-11-17 01:36:03 +00:00
31459e3e56 [ONNX][dynamo_export] Add 'aten::rsub' type promotion (#113697)
The logic is the same as 'aten::sub'. Needed by llama2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113697
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
ghstack dependencies: #113404
2023-11-17 00:50:05 +00:00
b3308c4856 [FSDP][Docs] Omit "on CPU" (#113753)
This initialization can take place on CPU, GPU, or meta device and the current comment sort of implies users need to do it on CPU for this to work.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/113753
Approved by: https://github.com/wz337
2023-11-17 00:15:41 +00:00
2ac33ad98a [dtensor] group dispatch unwrapping to a method (#113846)
This PR groups the dispatch unwrapping logic into a method, so that even
custom handlers can reuse many parts of the dispatch logic to do custom
things.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113846
Approved by: https://github.com/wz337
2023-11-16 23:54:18 +00:00
769f924bc6 robustify parametrize default name (#113856)
#113340 was reverted initially due to a bad default parametrization name. The test looked like

```python
@common_utils.parametrize(
    "type_fn",
    [
        type,
        lambda obj: obj.__class__,
    ],
)
def test_access_class_method_from_user_class(self, type_fn):
```

This is a valid parametrization, but results in these default test names:

```bash
❯ pytest test/dynamo/test_export.py -k test_access_class_method_from_user_class --co -q
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_type_fn_<class 'type'>
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_type_fn_<function ExportTests_<lambda> at 0x7f3be5de0c10>
```

Ignoring the whitespace in the test names, which can lead to other issues down the line, the problem in #113340 was that the lambda parameter included a memory address. IIUC, internally, the tests are not collected and run in the same process. Meaning, the address of the lambda and in turn the test name is no longer valid on the runner. This is fixed earlier in the stack by giving the parametrization an explicit name with `subtest`, but this PR is about preventing issues in the default case.

`pytest` solves this by simply using the name of the parameter plus its index as id in the test name:

```python
import pytest

class Foo:
    def __repr__(self):
        return str(id(self))

@pytest.mark.parametrize(
    "bar",
    [
        pytest.param(type),
        pytest.param(lambda obj: obj.__class__),
        pytest.param(Foo()),
    ],
)
def test_foo(bar):
    pass
```

```
❯ pytest main.py --co -q
main.py::test_foo[type]
main.py::test_foo[<lambda>]
main.py::test_foo[bar2]
```

`pytest` has better defaults for `type` and `lambda` than we do, but it has a safe default for custom objects.

This PR aligns our default test name with `pytest`. Using the parametrization from above again, we now collect

```bash
❯ pytest test/dynamo/test_export.py -k test_access_class_method_from_user_class --co -q
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_type_fn0
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_type_fn1
```

which might not be as expressive at first glance, but at least prevents bugs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113856
Approved by: https://github.com/malfet, https://github.com/huydhn
ghstack dependencies: #113855
2023-11-16 23:25:04 +00:00
03bebd90f6 cleanup test parametrization (#113855)
Cleanup from https://github.com/pytorch/pytorch/pull/113340#issuecomment-1814020469.

```
❯ pytest test/dynamo/test_export.py -k test_access_class_method_from_user_class --co -q
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_attr
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_builtin
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113855
Approved by: https://github.com/lezcano, https://github.com/huydhn
2023-11-16 23:25:04 +00:00
277229d0c6 [dynamo] Fix incorrectly casting SymNode to int when input is bool (#113871)
Fixes https://github.com/pytorch/pytorch/issues/113393, https://github.com/pytorch/pytorch/pull/113848#issuecomment-1814624510

Incorrectly casting the SymNode type will cause it to take the wrong path in symbolic_shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113871
Approved by: https://github.com/jansel
2023-11-16 23:24:57 +00:00
986634a117 Add Pass to move constructors from cpu to cuda (#109665)
Sometimes indexing tensors are constructed on CPU and then used to index a CUDA tensor. This prevents cudagraphs from being used when it doesn't need to. This adds a pass that moves constructors from CPU to CUDA when we can prove the downstream uses can be safely converted.

This PR allows us to cudagraph `clip` from the blueberries model, which improves perf from ~1.5x to ~4x.
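
A sketch of the pattern the pass targets (requires CUDA to run):

```python
import torch

x = torch.randn(8, device="cuda")
idx = torch.tensor([0, 2, 4])  # indexing tensor constructed on CPU...
y = x[idx]                     # ...then used to index a CUDA tensor
# The pass rewrites such constructors to device="cuda" when all downstream
# uses are provably safe, keeping the graph cudagraph-able.
```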

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109665
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-11-16 23:19:57 +00:00
ec20c9044e [TD] Fix metric emission for split test files (#113789)
Fixes a bug in TD metrics generation where it wouldn't be able to find the rank and relevance that a heuristic gave a test run if that heuristic had divided that test into multiple test runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113789
Approved by: https://github.com/clee2000
2023-11-16 23:19:40 +00:00
1480c670a0 [AOTI] Delay the fallback kernel naming decision to the codegen time (#113660)
Summary: This is to prepare for a later change that changes AOTI's second-pass to perform codegen only.

Differential Revision: [D51382677](https://our.internmc.facebook.com/intern/diff/D51382677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113660
Approved by: https://github.com/chenyang78
2023-11-16 23:07:30 +00:00
bab41f44b8 [dynamo] Fix allow_in_graph decorator doesn't work on autograd.Function (#113510)
Fixes #111032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113510
Approved by: https://github.com/zou3519
2023-11-16 22:44:46 +00:00
3f6e5e87f8 Revert "[1/N] Fixes clang-tidy warnings in header files (#113608)"
This reverts commit cab039fe9b9466f09f98318a11d2dcafef235426.

Reverted https://github.com/pytorch/pytorch/pull/113608 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing with an internal build when -Wpessimizing-move is used ([comment](https://github.com/pytorch/pytorch/pull/113608#issuecomment-1815424448))
2023-11-16 22:38:41 +00:00
d9f2cf9974 [BE]: Enable ruff rule PIE800 - unnecessary nested dict expansion (#113880)
Adds an additional list which removes unnecessary dict literal unpacking, also applies the fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113880
Approved by: https://github.com/albanD
2023-11-16 22:34:38 +00:00
bdf0b196db Quantize bias for conv2d quantized op during setup (#113582)
Summary: Quantize bias in setup step so that we do not incur additional time on quantizing bias in the first iteration.

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
.....
[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

  YOU HAVE 8 DISABLED TESTS

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
[----------] 395 tests from VulkanAPITest (6515 ms total)
.....
[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin, copyrightly

Differential Revision: D50997531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113582
Approved by: https://github.com/yipjustin
2023-11-16 22:31:36 +00:00
e19ea53e1d Add optional torch.export.ExportGraphSignature to ONNXProgram (#113477)
When the ONNX model is exported from a torch.export.ExportedProgram, a
torch.export.ExportedGraphSignature is available with the specification
of the model inputs and outputs.

ExportedGraphSignature includes information such as the mapping between
the exported input/buffer/output ONNX name to the original pytorch input/buffer/output name.

It also specifies the kind of each input, such as user_input, parameter,
buffer, or constant_tensor. Output kinds can be user_output, loss_output,
buffer_mutation, etc.

Such information can be useful for understanding what the ONNX model expects
as inputs and what the output will look like when the ONNX input/output
schema differs from the original PyTorch one.

When the ONNX model is exported from a Callable or a regular
torch.nn.Module, such information is not available and
ONNXProgram.model_signature will yield None.
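
A hedged sketch of accessing the signature; this assumes `torch.onnx.dynamo_export` accepts the ExportedProgram directly, per this PR's framing, and API details may differ across versions:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

args = (torch.randn(2),)
ep = torch.export.export(M(), args)
onnx_program = torch.onnx.dynamo_export(ep, *args)
print(onnx_program.model_signature)  # ExportGraphSignature (None for a plain nn.Module export)
```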
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113477
Approved by: https://github.com/BowenBao
2023-11-16 22:04:44 +00:00
9a9232956f Include job name in the emitted metrics (#113884)
What it says in the title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113884
Approved by: https://github.com/clee2000
2023-11-16 21:26:49 +00:00
2530d47cbe [dynamo] re-add option to log all guard check fails (#113585)
Followup to https://github.com/pytorch/pytorch/pull/110325 - re-add the `report_all_guard_failures config` as a logging artifact `recompiles_verbose` with the following changes:
- evaluating the check must be wrapped with exception handling because subsequent code parts following the first failure may result in errors if evaluated (e.g. if a guard checks first for size, then tries to index - a guard failure due to insufficient size would result in an index error for the latter check).
- Adding a test for this case

Sample:
```python
import torch

def fn(x):
    return torch.rand(x[-1], len(x))

opt_fn = torch.compile(fn)
opt_fn([4, 5, 6])
opt_fn([7, 8])
opt_fn([9])
```

Output (with `TORCH_LOGS="recompiles_verbose"`):
```bash
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG] Recompiling function fn in /data/users/williamwen/pytorch/playground5.py:15
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     triggered by the following guard failure(s):
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     guard 0 failures:
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     - len(L['x']) == 3
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     - L['x'][0] == 4
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     - L['x'][1] == 5
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG] Recompiling function fn in /data/users/williamwen/pytorch/playground5.py:15
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     triggered by the following guard failure(s):
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     guard 0 failures:
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     - len(L['x']) == 2
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG]
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     guard 1 failures:
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     - len(L['x']) == 3
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG]     - L['x'][0] == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113585
Approved by: https://github.com/jon-chuang, https://github.com/ezyang
2023-11-16 21:20:29 +00:00
40dfabf970 Revert "[export] make aot_export_module uses dynamo's fake_mode (#113681)"
This reverts commit 094beca0c6ebc2ac7d70c5badc271a1663e05de6.

Reverted https://github.com/pytorch/pytorch/pull/113681 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing an internal ExecuTorch test ([comment](https://github.com/pytorch/pytorch/pull/113681#issuecomment-1815329750))
2023-11-16 21:20:02 +00:00
2abb04d1dc [inductor] Relax symbolic guard for sizevars.evaluate_min (#113841)
We shorten two conditional guards (guard_equals, guard_lt)
into a single one (guard_leq). This saves recompilation for the
access-the-last-element-of-the-tensor op. [test_torchinductor.test_setitem_with_int_parameter](8efa6ad1fc/test/inductor/test_torchinductor.py (L6896C1-L6902))
will become `frame_count = 2 if torch._dynamo.config.assume_static_by_default else 1`.

Test plan:
`python test/inductor/test_torchinductor.py -k test_setitem_with_int_parameter_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113841
Approved by: https://github.com/peterbell10, https://github.com/aakhundov
2023-11-16 21:16:50 +00:00
98df3088c3 Revert "Make offsets dynamic by default (#113734)"
This reverts commit 9efbb4ea73009950a2d99e4d871351c898aae0dd.

Reverted https://github.com/pytorch/pytorch/pull/113734 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is causing a memory leak in one of the test 9efbb4ea73 ([comment](https://github.com/pytorch/pytorch/pull/113734#issuecomment-1815297222))
2023-11-16 20:56:27 +00:00
3c4e4d9947 Revert "[quant][pt2e] Refactor insert observer to do sharing checking in the same place (#113458)"
This reverts commit 585e315b3afc962bda4449957dc0d25eca3e4d4e.

Reverted https://github.com/pytorch/pytorch/pull/113458 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing executorch export test for llama2 ([comment](https://github.com/pytorch/pytorch/pull/113458#issuecomment-1815280715))
2023-11-16 20:43:38 +00:00
de4fd3843c [Inductor][fx pass] Fix a bug in the merge getitem cat pattern (#113822)
Summary: The split-cat pattern in D50100667 may change the sliced node returned by the split node if the getitems to be merged are not at consecutive indices.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- --exact 'pytorch/benchmark/fb/test_gpu:run_test_gpu - test_train_mimo_cmf_30x_inductor_accuracy (pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu)' --run-disabled
```
Buck UI: https://www.internalfb.com/buck2/1fd8fa6a-83d1-4cfd-bf33-c7ddb28de5b5
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924659080211
Network: Up: 1.3GiB  Down: 48MiB  (reSessionID-acaa2760-abff-442e-989f-3eefd1d1e034)
Jobs completed: 75. Time elapsed: 18:37.5s.
Cache hits: 0%. Commands: 68 (cached: 0, remote: 0, local: 68)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- --exact 'pytorch/benchmark/fb/test_gpu:run_test_gpu - test_train_mimo_cmf_30x_inductor_speedup (pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu)'
```
Buck UI: https://www.internalfb.com/buck2/7de122c6-23e0-4f13-b2b4-934cf780b60b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498613412388
Network: Up: 90KiB  Down: 2.1MiB  (reSessionID-f75d6b7b-93ea-4d47-a52a-8d2429b30ad1)
Jobs completed: 6. Time elapsed: 17:28.0s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D51378532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113822
Approved by: https://github.com/xuzhao9
2023-11-16 20:40:03 +00:00
8dc4b12fa7 [Pytorch][Vulkan] refactor layer_norm (#113676)
Summary: Due to the implementation of `native_layer_norm`, we can simplify the implementation of `layer_norm` by just invoking `native_layer_norm`.

Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (7f66eb77b)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm*
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.layer_norm_invalid_inputs
[       OK ] VulkanAPITest.layer_norm_invalid_inputs (69 ms)
[ RUN      ] VulkanAPITest.layer_norm_2d
[       OK ] VulkanAPITest.layer_norm_2d (292 ms)
[ RUN      ] VulkanAPITest.layer_norm_3d
[       OK ] VulkanAPITest.layer_norm_3d (289 ms)
[ RUN      ] VulkanAPITest.layer_norm_4d
[       OK ] VulkanAPITest.layer_norm_4d (4 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_2d
[       OK ] VulkanAPITest.native_layer_norm_2d (5 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_3d
[       OK ] VulkanAPITest.native_layer_norm_3d (2 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_4d
[       OK ] VulkanAPITest.native_layer_norm_4d (4 ms)
[----------] 7 tests from VulkanAPITest (667 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test suite ran. (667 ms total)
[  PASSED  ] 7 tests.
```

Reviewed By: yipjustin

Differential Revision: D51297971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113676
Approved by: https://github.com/yipjustin
2023-11-16 20:39:58 +00:00
0d6d97d956 Relax constraints on test_cast_round_trip (#113872)
Results of floating-point operations can be affected by execution order, and the compiler is not guaranteed to make trivial optimizations when compiling in debug mode, which might result in a loss of precision.

Fixes https://github.com/pytorch/pytorch/issues/113829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113872
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2023-11-16 19:52:05 +00:00
c4c45ab9b5 Fix resize matrix_power.out dynamic shapes (#113695)
Fixes https://github.com/pytorch/pytorch/issues/113003

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113695
Approved by: https://github.com/bdhirsh, https://github.com/lezcano
2023-11-16 19:36:27 +00:00
8a183bf1ab [BE] Consistently query tracing context for fake mode in Dynamo (#113768)
Split from https://github.com/pytorch/pytorch/pull/113666

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113768
Approved by: https://github.com/bdhirsh
2023-11-16 19:31:10 +00:00
3a3a979984 Add torch.distributed.breakpoint (#113775)
I tested it works by patching

```
diff --git a/test/distributed/test_dynamo_distributed.py b/test/distributed/test_dynamo_distributed.py
index 96b3a82bdfa..dea9bac9302 100644
--- a/test/distributed/test_dynamo_distributed.py
+++ b/test/distributed/test_dynamo_distributed.py
@@ -18,6 +18,7 @@ from torch._dynamo import config
 from torch._dynamo.utils import same
 from torch._dynamo.testing import collect_results
 from torch.utils._triton import has_triton
+import torch.distributed as dist
 from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy, lambda_auto_wrap_policy
 from torch._higher_order_ops.wrap import tag_activation_checkpoint
 from torch.nn.parallel import DistributedDataParallel as DDP
@@ -398,6 +399,7 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase):
     @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
     def test_fsdp_activation_checkpointing(self):
         with _dynamo_dist_per_rank_init(self.rank, self.world_size):
+            dist.breakpoint()
             model, inputs = get_toy_model_for_activation_checkpointing(f"cuda:{self.rank}")
             is_inner = lambda module: isinstance(module, ToyInnerModel)  # noqa: E731
             wrap_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=is_inner)
```

and then running `python test/distributed/test_dynamo_distributed.py -k test_fsdp_activation_checkpointing`

It prints:

```
ATTENTION!!!

Type 'up' to get to the frame that called dist.breakpoint(rank=0)

> /data/users/ezyang/c/pytorch/torch/distributed/__init__.py(71)breakpoint()
-> barrier()
(Pdb) up
> /data/users/ezyang/c/pytorch/test/distributed/test_dynamo_distributed.py(402)test_fsdp_activation_checkpointing()
-> dist.breakpoint()
(Pdb) list
397
398         @skip_if_lt_x_gpu(1)
399         @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
400         def test_fsdp_activation_checkpointing(self):
401             with _dynamo_dist_per_rank_init(self.rank, self.world_size):
402  ->             dist.breakpoint()
403                 model, inputs = get_toy_model_for_activation_checkpointing(f"cuda:{self.rank}")
404                 is_inner = lambda module: isinstance(module, ToyInnerModel)  # noqa: E731
405                 wrap_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=is_inner)
406                 model = apply_fsdp_with_checkpointing(model, wrap_policy, is_inner)
407                 correct_outputs = model(inputs)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113775
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2023-11-16 19:30:57 +00:00
eddce3c054 [AOTInductor] Rename model_runner to model_container_runner (#111324)
Summary:
We rename model_runner to model_container_runner to prepare for
adding tests of a pure model without a container.

Test Plan:
commit itself is a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111324
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-11-16 19:14:22 +00:00
1d96034816 [BE][easy] Simplify the registration of a few metafunctions (#113635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113635
Approved by: https://github.com/Skylion007
ghstack dependencies: #113634, #113674
2023-11-16 19:09:12 +00:00
ef982418df Add OpInfo test that tests meta functions binary ufuncs with different dtypes (#113674)
As per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113674
Approved by: https://github.com/peterbell10
ghstack dependencies: #113634
2023-11-16 19:09:12 +00:00
9b3e694f5d Fix metafunction for many pointwise operations (#113634)
The previous metafunction was completely broken.
It incorrectly used a metafunction that was designed for prims. It also
passed in an incorrect enum class for the type promotion.

Fixes https://github.com/pytorch/pytorch/issues/113119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113634
Approved by: https://github.com/peterbell10
2023-11-16 19:09:12 +00:00
3e3c6cc05e Do not error when printing view created in no-grad modified in-place in no-grad (#113716)
Fixes https://github.com/pytorch/pytorch/issues/99968
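
A minimal repro sketch inferred from the title (exact conditions are per issue #99968):

```python
import torch

a = torch.randn(3, requires_grad=True)
with torch.no_grad():
    v = a[:2]   # view created in no-grad mode
    v.mul_(2)   # modified in place, still in no-grad mode
print(v)        # previously raised when formatting the tensor; now prints
```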

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113716
Approved by: https://github.com/albanD
2023-11-16 18:57:56 +00:00
6cdb6234d6 [ROCm] Supports ROCm6.0 reorganization and cleanup (#111486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111486
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2023-11-16 18:37:12 +00:00
070b2d3cff cholesky_solve_backward: speed up using output_mask (#112981)
Introduces a faster path for `cholesky_solve_backward` when the gradient with respect to the cholesky factor isn't required.

Adds test coverage in `test_linalg.py`.

# Example

## Setup

```py
import torch
torch.set_num_threads(1)
mat = torch.randn(500, 1000)
mat = mat @ mat.T
L = torch.linalg.cholesky(mat, upper=False)

rhs = torch.randn(500, 1)
rhs.requires_grad = True

sol = torch.cholesky_solve(rhs, L, upper=False).sum(dim=0)
```

## Before
```
%timeit torch.autograd.grad(sol, rhs, retain_graph=True)
2.61 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

## After
```
%timeit torch.autograd.grad(sol, rhs, retain_graph=True)
109 µs ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112981
Approved by: https://github.com/lezcano
2023-11-16 18:30:57 +00:00
25fb88cf23 Add all 3.12 binary build for wheel. Let's see how it goes. V2 (#112882)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112882
Approved by: https://github.com/malfet, https://github.com/sammcj
2023-11-16 18:20:12 +00:00
275403be16 [doc] Add nn.parametrizations.weight_norm (#113783)
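For reference, basic usage of the parametrization being documented:

```python
import torch
from torch.nn.utils.parametrizations import weight_norm

layer = weight_norm(torch.nn.Linear(20, 40), name="weight", dim=0)
print(layer.weight.shape)  # torch.Size([40, 20]), recomputed from g and v on access
```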
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113783
Approved by: https://github.com/albanD
2023-11-16 17:42:48 +00:00
62d86f27c2 Revert "Add Pass to move constructors from cpu to cuda (#109665)"
This reverts commit 3bac94b107bb808b158b005d248804895d844d40.

Reverted https://github.com/pytorch/pytorch/pull/109665 on behalf of https://github.com/eellison due to want to maek one last change ([comment](https://github.com/pytorch/pytorch/pull/109665#issuecomment-1814924579))
2023-11-16 17:39:49 +00:00
3bac94b107 Add Pass to move constructors from cpu to cuda (#109665)
Sometimes indexing tensors are constructed on CPU and then used to index a CUDA tensor. This prevents cudagraphs from being used when it doesn't need to. This adds a pass that moves constructors from CPU to CUDA when we can prove the downstream uses can be safely converted.

This PR allows us to cudagraph `clip` from the blueberries model, which improves perf from ~1.5x to ~4x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109665
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-11-16 17:28:46 +00:00
7183926622 [HigherOrderOp][BE] consolidate UserFunctionVariable.call_function pattern to _make_inlined (#113814)
We saw some use cases in higher-order operators that try to directly inline a user-level function (e.g. pytree.tree_flatten and pytree.tree_unflatten) with no tensor operations by manually constructing a UserFunctionVariable and running call_function on it.

This PR consolidates this pattern a bit by adding a _make_inlined helper function to improve the UX (i.e. the calling convention is kept the same as that of the function we'd like to inline), reduce redundancy, and increase readability.

Test Plan:
Existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113814
Approved by: https://github.com/yanboliang
2023-11-16 16:56:24 +00:00
d19cef34fb Do not attempt to compile unwind.cpp on aarch64 (#113782)
Summary:
As almost the entire unwinding logic is built around the x86_64 ABI.
In essence, this reverts https://github.com/pytorch/pytorch/pull/104707 and adds `#ifndef FBCODE_CAFFE2` guards around `symbolize` dummy

Use nested namespaces, as PyTorch is finally C++ compatible.
Remove extraneous semicolon spotted by clang-tidy.

Fixes https://github.com/pytorch/pytorch/issues/113208

Test Plan: CI + `buck2 build fbcode//mode/opt fbcode//caffe2/torch/fb/model_transform/fx2trt/packaging:generate_merge_net_file -c fbcode.arch=aarch64`

Differential Revision: D51358469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113782
Approved by: https://github.com/aaronenyeshi
2023-11-16 16:08:47 +00:00
f9bf104c64 [2/N] Fixes clang-tidy warnings in header files (#113727)
This PR fixes more clang-tidy warnings in common headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113727
Approved by: https://github.com/Skylion007
2023-11-16 13:21:15 +00:00
ecf129565b Avoid adding to lazy device cache if cache size is 0 (#113710)
Fixes #113672

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113710
Approved by: https://github.com/antoniojkim, https://github.com/alanwaketan, https://github.com/desertfire
2023-11-16 12:45:34 +00:00
51cbe780cb [pytorch-vulkan][1/n] Enable Packing for Vulkan Tensors (#113627)
Summary:
The new implementation of mat-mul missed a critical step that does width-packing on GPU: see T169764697.

The existing implementation of mat-mul also missed a case: when the "B" matrix is already in Vulkan, it fails to do packing, leading to wrong results. (I have added a disabled unittest to reflect the issue.)

We will take multiple steps to enable (width / height) packing and transformations between different packings on Vulkan.

This is a first diff that enables a critical toolset: it allows developers to fetch values from the underlying tensor, making it possible to implement tests for the transformation shaders.

Test Plan:
P882053410,
P882053410

Differential Revision: D51291737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113627
Approved by: https://github.com/SS-JIA
2023-11-16 09:04:07 +00:00
5fb1d8f18a [NCCL PG] Enable storing nccl traces into storage and make it configurable (#113503)
This PR enables storing the NCCL flight recorder to storage and makes it configurable by letting users register their own way of storing the debug info. We will then provide users a script to parse and process the dumped blobs offline.

One thing this PR is not trying to resolve is deciding where to dump the debug info; I will send a follow-up PR to address that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113503
Approved by: https://github.com/zdevito
2023-11-16 07:44:15 +00:00
c1c4882367 [aps] Sync thrift (#113810)
Summary:
Based on discussions with Sherlock + Zhengxu in D51118067, updated the internal thrift schema to match the OSS schema.

Verifier failures:
* Test contains a None as input, resulting in no meta["val"]
* Test contains torch.autograd.grad_mode.set_grad_enabled as an op, which also results in no meta["val"]
* torch.autograd.grad_mode.set_grad_enabled is also not a valid op
* Test adds a "parameter" to the state dict but the parameter is not an nn.Parameter, causing an assertion failure

So to bypass these failures I did the following hacks(?):
* Before creating the exported program in deserialization, populate nodes w/o meta["val"] with meta["val"] = None
* Add torch.autograd.grad_mode.set_grad_enabled to the skip opset
* Duplicated ExportGraphSignature into aot_export.py so that the graph signature checks will be skipped

Configerator changes in D51343615

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D51342921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113810
Approved by: https://github.com/zhxchen17
2023-11-16 07:42:30 +00:00
8033f65c0b Don't toggle torch logger to NOTSET if it is not set; always use pre-existing (#113842)
This is kind of hard to test.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113842
Approved by: https://github.com/wanchaol
2023-11-16 07:06:05 +00:00
9efbb4ea73 Make offsets dynamic by default (#113734)
Copied from @ezyang 's #113693.

The motivation for this change is that we'd like to guard on storage offset in inductor, to make assumptions about data alignment.

create_symbolic_sizes_strides_storage_offset() creates the sizes/strides/offset for fake tensors - they can either be integers or symints. This PR changes storage_offset to always be dynamic. In variables/builder.py, we remove a conditional so that all tensors get added to tracked_fakes. This is because the storage offset will be dynamic even if the other logic in builder.py suggests that it will be static; otherwise, we run into this issue:

1e260c851b/torch/fx/experimental/symbolic_shapes.py (L892-L895)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113734
Approved by: https://github.com/ezyang
2023-11-16 06:49:09 +00:00
b612e27221 [Easy] Fix typo in TagActivationCheckpoint comment (#113818)
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113818
Approved by: https://github.com/Chillee, https://github.com/bdhirsh
2023-11-16 06:06:09 +00:00
cffea773e3 Fix bsr_dense_mm with a non-contiguous out argument. (#113801)
Fixes https://github.com/pytorch/pytorch/issues/113754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113801
Approved by: https://github.com/cpuhrsch
2023-11-16 05:56:17 +00:00
0a9dbbbaad Make _inductor/fx_utils.py, _dynamo/utils.py pass follow_imports typechecking (#113722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113722
Approved by: https://github.com/lezcano
2023-11-16 05:44:15 +00:00
bbd73c746e Revert "[ONNX][dynamo_export] Add 'aten::rsub' type promotion (#113697)"
This reverts commit 48800e9bb0fd0d8aa56f961fe207b1040922fa2e.

Reverted https://github.com/pytorch/pytorch/pull/113697 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing in trunk 48800e9bb0.  The failure on the PR is legit https://github.com/pytorch/pytorch/actions/runs/6884783862/job/18728219414, let me take a look on why Dr.CI marks it as flaky ([comment](https://github.com/pytorch/pytorch/pull/113697#issuecomment-1813790907))
2023-11-16 04:59:32 +00:00
8241fe6edb [quant][pt2][be] Rewrite QAT annotations using subgraph matcher (#113709)
Summary: This is the recommended way to write quantizers according
to https://pytorch.org/tutorials/prototype/pt2e_quantizer.html#a-note-on-ir-for-pt2e-quantization-flow.
It is agnostic to changes in the aten IR and can be easily extended
to support conv1d-bn and conv3d-bn fusion patterns in the future.
This is the first step towards rewriting XNNPACKQuantizer using
this subgraph matcher.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar

Differential Revision: [D51366525](https://our.internmc.facebook.com/intern/diff/D51366525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113709
Approved by: https://github.com/jerryzh168
2023-11-16 03:57:37 +00:00
8efa6ad1fc [vision hash update] update the pinned vision hash (#113821)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113821
Approved by: https://github.com/pytorchbot
2023-11-16 03:36:29 +00:00
48800e9bb0 [ONNX][dynamo_export] Add 'aten::rsub' type promotion (#113697)
The logic is the same as 'aten::sub'. Needed by llama2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113697
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
ghstack dependencies: #113404
2023-11-16 03:31:07 +00:00
670311190d [HigherOrderOp] Move _map.py to _higher_order_ops (#111152)
Differential Revision: [D50332159](https://our.internmc.facebook.com/intern/diff/D50332159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111152
Approved by: https://github.com/zou3519
2023-11-16 03:04:12 +00:00
1364f84b42 [easy] encapsulate fb changes from OSS (#113677)
Summary:
encapsulate fb changes into `torch._inductor.fx_passes.fb`, so that adding new passes (`fb.xxx`) won't need to touch OSS code like so:

```
# in torch/_inductor/fx_passes/pre_grad.py
if config.is_fbcode():
 from .fb import xxx  # every new fb/xxx.py would have needed this change in OSS code base
```

Test Plan: CI

Differential Revision: D51315193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113677
Approved by: https://github.com/khabinov, https://github.com/chenyang78
2023-11-16 03:03:57 +00:00
cebad9867b graph break on intermediate leaves that require grad (#113277)
Fixes https://github.com/pytorch/pytorch/issues/90552. This is a simpler fix that just detects the situation where AOTAutograd can't create a proper backward graph, and graph-breaks. This was technically a silent correctness issue before.

This PR tries to always graph break when we see a factory function that returns a tensor requiring grad. I check this by seeing if the op returned a `TensorVariable` in dynamo, and if one of the input arguments was a `requires_grad=True` kwarg. I think this is high-fidelity enough, and I'm also hoping that this is uncommon enough that a graph break is reasonable here.

The fix to avoid the graph break in user land is also pretty easy - just instantiate your tensor outside of the compiled region and plumb it in.
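
A hedged before/after sketch of the user-land workaround (names are illustrative):

```python
import torch

# graph-breaks: a factory call inside the compiled region returns a leaf
# tensor that requires grad
@torch.compile
def f_breaks(x):
    w = torch.ones(3, requires_grad=True)
    return (x * w).sum()

# workaround: create the leaf outside and plumb it in
w = torch.ones(3, requires_grad=True)

@torch.compile
def f_ok(x, w):
    return (x * w).sum()

print(f_ok(torch.randn(3), w))
```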

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113277
Approved by: https://github.com/eellison
ghstack dependencies: #113267, #113416, #113584
2023-11-16 02:47:45 +00:00
c5f26a409a Build and test ExecuTorch on PyTorch (#113364)
This is the first part to start build and test ExecuTorch on PyTorch using a pinned commit.  There will be another PR later to update the pinned commit periodically.

* The pinned commit is in `.ci/docker/ci_commit_pins/executorch.txt` as part of PT Docker image
* I added one simple test: `source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''`. More could be added later; in fact, any ET test on Linux could be run here
* Building and installing vision and audio need to be done in CI after building PyTorch because they would be broken otherwise

Next steps, in sequence:

* [ ] Update this pinned commit periodically, similar to https://github.com/pytorch/pytorch/pull/113499
* [ ] Increase ET coverage on PT CI, ideally, we should run all ET pull jobs?
* [ ] Switch ExecuTorch's torch, vision, and audio nightly pins to commit pins
* [ ] Update ExecuTorch's torch, vision, and audio commit pins periodically
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113364
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/guangy10
2023-11-16 02:19:58 +00:00
c41a32a3bf Move test_utils.py back to MYPY (#113745)
Since MYPYNOFOLLOW is about to turn on import following, there's no
reason to keep test_utils.py in the MYPYNOFOLLOW config. Moreover, I'm
not sure it still takes 10 minutes to typecheck this file; adding it to
the MYPY config takes `lintrunner --take MYPY --all-files` from 53s to
57s on my machine, which is substantial but not horrible. I guess we'll
see how it fares on CI.

(Note that we cannot simply merge MYPY and MYPYNOFOLLOW because the
latter config turns on `disallow_any_generics` and so is in that sense
stricter than the MYPY config.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113745
Approved by: https://github.com/clee2000
2023-11-16 01:57:58 +00:00
a3b859fc67 Drop dynamo-specific type hints on Tensor in favor of type-ignores (#113720)
Per [this][1] discussion, plus some offline discussion. The summary:
@albanD considers the core PyTorch types like Tensor to be extremely
brittle, and does not think the risk of adding these typed attributes to
be worth it.

@eellison mentioned that we could use `WeakTensorKeyDictionary` instead.
However, based on the sparse usage of these bonus attributes, I think
that would be overkill. So I've opted to go with a few more type-ignore
comments instead.
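
For context, a minimal sketch of the `WeakTensorKeyDictionary` alternative mentioned above (usage here is illustrative, not from the PR):

```python
import torch
from torch.utils.weak import WeakTensorKeyDictionary

# side table keyed by tensor identity instead of attributes on Tensor
extra_attrs = WeakTensorKeyDictionary()

t = torch.randn(2)
extra_attrs[t] = {"tangent": None}
print(t in extra_attrs)  # True

del t  # the entry is dropped once the tensor is garbage collected
```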

[1]: https://github.com/pytorch/pytorch/pull/113610#discussion_r1392907367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113720
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/eellison
ghstack dependencies: #113534, #113610
2023-11-16 01:54:00 +00:00
605d274300 [dynamo] Make {mutation_guard,symbolic_convert,side_effects}.py pass follow_imports typechecking (#113610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113610
Approved by: https://github.com/ezyang
ghstack dependencies: #113534
2023-11-16 01:54:00 +00:00
df9acc61fb [inductor] Make {freezing,ir}.py pass follow-imports typechecking (#113534)
I used a couple of type-ignore comments in ir.py because it constructs
short-lived instances of FixedLayout and GraphModuleSerializer, just to
call a single method on them that doesn't use all their members. Making
those unused members optional would make the rest of the code a lot
messier with sprinkled `assert` statements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113534
Approved by: https://github.com/albanD
2023-11-16 01:53:52 +00:00
d52b9ba6a8 [torch.compile + selective checkpoint] Attach context_fn to the checkpointed graph module, fixing flaky tests (#112672)
The torch.compile + SAC unit test was causing adjacent unit tests to be flaky due to its modification of a shared singleton object. This PR attaches the checkpoint context fn to the checkpointed GraphModule and looks it up during execution, avoiding the need to make the higher-order op stateful.

Specifically, we attach the `context_fn` to the checkpointed GraphModule. These two will be gc'ed at the same time, so it satisfies the lifetime requirement.
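
For reference, the user-facing shape of selective checkpointing's `context_fn` (a minimal sketch; the no-op contexts are placeholders for a real save/recompute policy):

```python
import contextlib
import torch
from torch.utils.checkpoint import checkpoint

def context_fn():
    # returns (forward_context, recompute_context); real policies decide
    # which ops to save vs. recompute inside these contexts
    return contextlib.nullcontext(), contextlib.nullcontext()

def block(x):
    return torch.relu(x @ x)

x = torch.randn(4, 4, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False, context_fn=context_fn)
out.sum().backward()
```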

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112672
Approved by: https://github.com/wanchaol
2023-11-16 01:34:52 +00:00
b526aae95a test_lazy: skip HashTest.Scalar (#112747)
Fixes #99883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112747
Approved by: https://github.com/huydhn
2023-11-16 01:22:58 +00:00
72ce5dd13e [2D] Remove enable_2d_with_fsdp() API and make remove_enable_2d_with_fsdp private (#112473)
As we have our new 2D flow out, we want to remove `enable_2d_with_fsdp()`.
In addition, we change pre_dp_module_transform to private, as we may need to change the UX later on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112473
Approved by: https://github.com/fegin, https://github.com/wanchaol
2023-11-16 01:14:00 +00:00
c2c22dc427 [BE] Some debug logging for track_symint in produce_guards (#113774)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113774
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2023-11-16 01:02:43 +00:00
bd6b3c4df4 [BE][profiler] add test for EventList (#113764)
EventList isn't really tested in CI because it seems to be used only when Kineto is not available.

Add a basic sanity test that would have caught #113756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113764
Approved by: https://github.com/malfet
2023-11-16 00:49:29 +00:00
f8eb46d623 index put device error checking (#113729)
Fix for https://github.com/pytorch/pytorch/issues/101371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113729
Approved by: https://github.com/bdhirsh
2023-11-16 00:39:04 +00:00
1e260c851b [ez] Don't retry onnx in shell (#113803)
Is this important? Not really, but the retries given by run_test.py on its own should be enough.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113803
Approved by: https://github.com/BowenBao
2023-11-15 23:45:25 +00:00
5d170fce29 Revert "Support tensors as Dict keys (#111196)"
This reverts commit b0805fa5d0f73f3419129b1606a3e9a58eed2768.

Reverted https://github.com/pytorch/pytorch/pull/111196 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing internally. I will provide the details there ([comment](https://github.com/pytorch/pytorch/pull/111196#issuecomment-1813410149))
2023-11-15 23:08:00 +00:00
463489ec95 [ez] Add some more pyre related files to gitignore (#113796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113796
Approved by: https://github.com/huydhn
2023-11-15 23:07:39 +00:00
7137f5f8c3 Revert "[easy]Remove specialized value (#112252)"
This reverts commit 149b9dfd04ba7dee88168758bf7a5c603dd79d72.

Reverted https://github.com/pytorch/pytorch/pull/112252 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but https://github.com/pytorch/pytorch/pull/111196 is failing internally. I will provide the details there ([comment](https://github.com/pytorch/pytorch/pull/112252#issuecomment-1813401896))
2023-11-15 23:02:49 +00:00
c99d88afa4 [AOTI] Remove try_find_schema (#113617)
Differential Revision: [D51350727](https://our.internmc.facebook.com/intern/diff/D51350727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113617
Approved by: https://github.com/aakhundov, https://github.com/chenyang78, https://github.com/khabinov
2023-11-15 22:42:47 +00:00
b19cf868e8 Back out "Support fp8 in AOTInductor + support optional<> in C ABI (#112527)" (#113747)
Test Plan: sandcastle

Differential Revision: D51330618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113747
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2023-11-15 22:42:22 +00:00
094beca0c6 [export] make aot_export_module uses dynamo's fake_mode (#113681)
Fixes #110100 by making aot_export_module use dynamo.export's fake_mode in export.

Test Plan:
Add new tests. One of the tests places the fake tensor on CUDA devices manually, and we are able to export the program and preserve the device information in the final produced graph module even on a machine with a CPU-only build of PyTorch. One workaround we need is to set all tensors' requires_grad to False, as fake tensors on CUDA devices don't compose well with aot_autograd right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113681
Approved by: https://github.com/SherlockNoMad
2023-11-15 22:34:00 +00:00
6435fc17bb Remove ignore_subclass from FakeTensorMode (#113795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113795
Approved by: https://github.com/ezyang
2023-11-15 22:30:13 +00:00
97a62c715d [BE] Remove duplicate storage_offset equality test (#113790)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113790
Approved by: https://github.com/albanD
2023-11-15 22:25:07 +00:00
9b736c707c [Codemod][python/main_function] caffe2: (#113357)
Differential Revision: D51149464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113357
Approved by: https://github.com/huydhn
2023-11-15 22:17:31 +00:00
87aeb248c9 More random stepcurrent (#113620)
Distributed tests for different backends have the same name, so they end up clashing under the current stepcurrent key, which meant some tests were not being run.

Disabled the following tests because they are failing:
test_ddp_has_finalized

test_broadcast_object_list
<details>

```

2023-11-14T06:44:01.0428686Z
2023-11-14T06:44:01.0430447Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_broadcast_object_list <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.0431048Z [1699943450.893723] [99f90b6e6ff3:10028:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0431625Z [1699943450.914385] [99f90b6e6ff3:10029:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0432314Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0433178Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0434677Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0435435Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0436895Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0437500Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0438917Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0439637Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0441122Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0441873Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0443340Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0444077Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0445769Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0446732Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0448433Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0449187Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0450553Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0451621Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0453161Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0454065Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0455441Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0456183Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0457775Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0458649Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0460923Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 1][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0461471Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0462430Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0463552Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0464082Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0465136Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0465945Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.0466605Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0467303Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0467972Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0468743Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0470233Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0471106Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0472581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0473162Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0474581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0475314Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0476776Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0477535Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0478993Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0479886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0481593Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0482429Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0484145Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0484886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0486271Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0487018Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0488559Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0489470Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0491078Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0491912Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0493369Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0494419Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0496679Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0497211Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0498198Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0499291Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0499838Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0500881Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0501667Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.0502343Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0503024Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0503411Z ('RERUN', {'yellow': True}) [6.1102s] [100%]
```
</details>

test_ddp_sync_bn_training_vs_eval

<details>

```

2023-11-14T06:44:01.1494815Z
2023-11-14T06:44:01.1496630Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_sync_bn_training_vs_eval <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.1497290Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1498119Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1498808Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1499465Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1500160Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1500820Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1501556Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502239Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502952Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1503678Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1504350Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1505119Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1506729Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1507492Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1508992Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1509578Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1510994Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1511725Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1513193Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1513962Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1515697Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1516529Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1518019Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1518910Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1520177Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1521062Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1522238Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1523099Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1523923Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1524470Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1525481Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1526632Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1527180Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1528223Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1529029Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.1529786Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1530576Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1532383Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1533127Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1534608Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1535194Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1536817Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1537575Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1539036Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1539800Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1541531Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1542388Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1544015Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1544907Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1546061Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1546944Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1548142Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1548991Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1549806Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1550350Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1551304Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1552462Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1553095Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1554166Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1554976Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.1555235Z ('RERUN', {'yellow': True}) [6.6107s] [100%]
```
</details>

test_backend_full_group
<details>

```
2023-11-14T22:51:56.4502470Z FAILED [5.2125s] distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_backend_full_group - RuntimeError: Process 0 exited with error code 10 and exception:
2023-11-14T22:51:56.4502665Z Traceback (most recent call last):
2023-11-14T22:51:56.4503603Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4503796Z     getattr(self, test_name)()
2023-11-14T22:51:56.4504710Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4504845Z     fn()
2023-11-14T22:51:56.4505737Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4505896Z     method(*args, **kwargs)
2023-11-14T22:51:56.4506823Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4506992Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4508285Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4508640Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4509798Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4510104Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4510629Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4510650Z
2023-11-14T22:51:56.4510987Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4511525Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4511545Z
2023-11-14T22:51:56.4511970Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4511989Z
2023-11-14T22:51:56.4512242Z Process 1 exited with error code 10 and exception:
2023-11-14T22:51:56.4512454Z Traceback (most recent call last):
2023-11-14T22:51:56.4513380Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4513687Z     getattr(self, test_name)()
2023-11-14T22:51:56.4514612Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4514746Z     fn()
2023-11-14T22:51:56.4515633Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4515791Z     method(*args, **kwargs)
2023-11-14T22:51:56.4516708Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4516895Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4518008Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4518352Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4519509Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4519813Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4520334Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4520355Z
2023-11-14T22:51:56.4528843Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4529492Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4529681Z
2023-11-14T22:51:56.4530122Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4530423Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
```
</details>

Pretty sure the solution for this one is to add ucc to _test_group_override_backend; see the sketch below.
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430019
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430132
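
A hypothetical sketch of the failing pattern (not the actual test helper source; names are illustrative):

```python
BACKEND = "ucc"  # hypothetical module-level setting

def pick_override_backend():
    if BACKEND == "gloo":
        new_backend = "nccl"
    elif BACKEND == "nccl":
        new_backend = "gloo"
    # missing: elif BACKEND == "ucc": ...
    return new_backend  # UnboundLocalError when BACKEND == "ucc"

pick_override_backend()
```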
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113620
Approved by: https://github.com/huydhn
2023-11-15 21:56:10 +00:00
4534cf102a Revert "[funcol] a few optimizations to funcol (#113324)"
This reverts commit 7117bffff916c44122ae73b5ce32a8411138db96.

Reverted https://github.com/pytorch/pytorch/pull/113324 on behalf of https://github.com/huydhn due to Sorry for reverting your change here, but it is failing internal test ([comment](https://github.com/pytorch/pytorch/pull/113324#issuecomment-1813317913))
2023-11-15 21:53:23 +00:00
dd28006d8d SGR/Assistant: making sure linker drops unnecessary dependencies (#112871)
Summary:
Assistant/SGR is linked in a way that drops links to all unreferenced libraries: https://www.internalfb.com/code/fbsource/[c74911ac21d6b90d1fbca8f2de08d6269f44e1fc]/xplat/toolchains/android/ndk/ndk_toolchains.bzl?lines=931
However, `caffe2` overrides this setting: https://www.internalfb.com/code/fbsource/[2536ee6849b08da1adcd5b9da0e455a4af3a06d1][blame]/xplat/caffe2/c2_defs.bzl?lines=496. That results in build breaks like the one discussed here: https://fb.workplace.com/groups/llvm.gcc/permalink/25390586597229949/ : Assistant doesn't use libforce_dlopen but still requires it, so that library must exist on the device.

As we statically link all operators, the `caffe2` override doesn't seem to be necessary.

This diff adds a build parameter affecting `caffe2` linker options.

Test Plan:
Built supernova experimental build, made sure Assistant starts without operator issues.
Tried tts, ocr and asr command in SGR, made sure they work.

Verified that the hypernova build doesn't require libforce_dlopen when D50695343 is applied.

Reviewed By: veselinp

Differential Revision: D50870489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112871
Approved by: https://github.com/vybv, https://github.com/PaliC
2023-11-15 21:12:33 +00:00
585e315b3a [quant][pt2e] Refactor insert observer to do sharing checking in the same place (#113458)
Summary:
Previously this checking was scattered across two different places: before inserting observers and during observer insertion. This PR moves everything to before we insert observers.

* Next: refactor QuantizationSpec and check more fields for sharing

Test Plan:
CI (regression tests)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113458
Approved by: https://github.com/kimishpatel
2023-11-15 21:08:39 +00:00
deec2380c7 Add 0dim Tensor overload for _foreach_div (#113688)
This PR almost just follows the steps from #106677, except that we add one feature. Similar to fused_adam(w), for the CUDA dispatches: when the scalar tensor is on CPU, we call .item() and redispatch to the normal scalar overload. Otherwise, the CUDA kernel would complain about a device mismatch between the scalar and the tensors.

Why do we add this feature? Our optimizers want to allow lr as a tensor, and lr could be a CPU tensor. lr is used with foreach_div_ in Adam, so our CI will break otherwise.

After this PR, `_foreach_mul` and `_foreach_div` will accept either a CPU or a GPU tensor for the scalar tensor (vs. only a GPU tensor). They join the ranks of `fused_adam(w)` in this characteristic. I did not yet do the same thing for foreach_add (the only other foreach op with a .Tensor overload) because there is no use case and it would be more involved.
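
A quick sketch of what this enables (`torch._foreach_div` is a private API, and this snippet assumes a CUDA build; shown for illustration only):

```python
import torch

params = [torch.ones(3, device="cuda"), torch.ones(2, device="cuda")]
lr = torch.tensor(0.5)  # 0-dim scalar tensor living on CPU

# previously the CUDA dispatch rejected the CPU scalar tensor; now it
# .item()s the value and redispatches to the scalar overload
out = torch._foreach_div(params, lr)
```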

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113688
Approved by: https://github.com/mlazos, https://github.com/albanD
2023-11-15 20:59:32 +00:00
2164598c40 Improves comparison of state dicts for Checkpoint E2E Tests (#113181)
Addresses the following comment - https://github.com/pytorch/pytorch/pull/112541#discussion_r1380197424

Changes the comparison of models in the checkpointing E2E test to compare a non-parallelized model against a distributed model after training, saving, and loading.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113181
Approved by: https://github.com/fegin, https://github.com/huydhn, https://github.com/wz337
2023-11-15 20:48:45 +00:00
275a4521a9 [ONNX] Fix scalar type promotion between fp16 tensor and fp32 scalar (#113404)
Fixes https://github.com/pytorch/pytorch/issues/104594.

The reason for the exporter behavior in the originally posted issue is as follows:
The ONNX model tracks shape-related computations that were done in PyTorch with Python
numbers as tensor computations. This is the only way for ONNX to track them properly,
since ONNX only has tensor types; otherwise the computation result would be tracked
statically as a constant, and the model wouldn't work for another input that differs in shape.

Now, for type promotion logic, scalars should be treated differently from tensors.
The exporter mistook the shape-related scalars for tensors in this case and promoted incorrectly.

This PR fixes the behavior and relaxes the criteria for scalar recognition. For floating point,
previously only a value from a model initializer with dtype torch.double and rank 0 was
treated as a scalar. Now this is relaxed to any intermediate value, as well as to dtype torch.float.
The previous assumption was that a Python number is traced with dtype torch.double, which
no longer appears to hold.

NOTE that this might introduce a regression where a real rank-0 tensor is now recognized as
a scalar. The downside is that the model will drop in accuracy for these cases, as certain
computations will happen in lower-precision data types.
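
For a quick illustration of the eager-mode promotion rules the exporter is matching (standard PyTorch behavior, not new to this PR):

```python
import torch

x = torch.ones(2, dtype=torch.float16)
print((x * 2.0).dtype)            # torch.float16: a Python scalar does not promote
print((x * torch.ones(2)).dtype)  # torch.float32: a float32 tensor operand does
```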

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113404
Approved by: https://github.com/justinchuby
2023-11-15 20:32:55 +00:00
12b2dd16b0 [Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)
Summary:
We are planning to lazily initialize CUPTI when profiling is actually performed. Therefore, we need to remove profiler init dependency on CUPTI Callbacks' RESOURCE_CONTEXT_CREATED.

Instead, we can initialize the profilers during the torch profiler pybind set-up, i.e. THPAutograd_initExtension(), and lazily in profilerStep().

Test Plan:
CI and ran internally, see internal diff logs.

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112623
Approved by: https://github.com/albanD
2023-11-15 20:26:13 +00:00
cc11c0d11b aot_autograd: keep input mutations on requires_grad=True tensor out of the graph for inference (#113584)
The original behavior of torch.compile w.r.t. input mutations maintains that if an input to a graph was mutated, **and** requires grad, we will keep the input mutation outside of the graph and replay it at runtime.

This is important because, e.g., an input can have outstanding aliases, and mutating the input in eager mode will cause autograd to change the `grad_fn` of all outstanding aliases.

It looks like landing https://github.com/pytorch/pytorch/pull/111347 changed this behavior slightly:
* The linked PR makes it possible for AOTAutograd to go down the inference code path, even if some inputs require grad (because all of the outputs of the graph were seen to not require grad)
* AOTAutograd's logic in the inference code path today is to **always** keep input mutations in the graph.

This PR fixes that regression: regardless of inference vs. training, we should always keep input mutations outside of the graph if the input requires_grad.
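
A minimal sketch of the situation being fixed (shapes and names are illustrative):

```python
import torch

def f(x):
    x.mul_(2)                # mutates a graph input that requires grad
    return x.sum().detach()  # no output requires grad -> inference path

x = torch.ones(3, requires_grad=True).clone()  # non-leaf, requires_grad=True
alias = x.view(3)                              # outstanding alias of the input
torch.compile(f)(x)
# keeping x.mul_(2) outside the compiled graph and replaying it at runtime
# lets autograd update alias.grad_fn just as eager mode would
```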

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113584
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #113267, #113416
2023-11-15 19:55:47 +00:00
032e5a4528 handle cross-dtype views during AOTAutograd view-replay (#113416)
Fixes https://github.com/pytorch/pytorch/issues/109053

I think "partitioning views out of the graph" will be a more robust fix for the class of errors that we've seen around incorrectly regenerating views at runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113416
Approved by: https://github.com/ezyang
ghstack dependencies: #113267
2023-11-15 19:55:47 +00:00
720e866d18 graph break on out= ops with noncontiguous out args (#113267)
Fixes https://github.com/pytorch/pytorch/issues/113010

In eager mode, when you call an out= op like `add(..., out=out_arg)` with an out argument that is noncontiguous, the noncontiguous out arg will be returned directly. When we functionalize though, functionalization replaces it with a call to `add(...)` which ignores the contiguity of the original out arg.

Instead of trying to support this, this PR detects that situation and graph breaks.
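
A minimal sketch of the eager behavior being preserved (shapes are illustrative):

```python
import torch

out = torch.empty(2, 3).t()  # noncontiguous out tensor, shape (3, 2)
res = torch.add(torch.ones(3, 2), 1, out=out)
print(res.is_contiguous())   # False: eager hands back `out` with its strides
# under torch.compile, functionalization would ignore those strides, so
# dynamo now graph breaks on this pattern instead
```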

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113267
Approved by: https://github.com/albanD
2023-11-15 19:55:47 +00:00
05d949279c [C10] cpuinfo error handling (#113771)
If `cpuinfo_initialize` returns false, calls to subsequent cpuinfo functions may result in `abort()`.
Also, the `defaultNumThreads()` method now works on the assumption that if one method fails, we should try another, and finally return 1.

Alas, there is no good way to test this on the x86 platform, but on ARM one can replicate it by running `sudo chmod 750 /sys` and then `python3 -c "import torch;torch._C.profiler.gather_traceback(True, True, True)"`

Partially addresses https://github.com/pytorch/pytorch/issues/113568
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113771
Approved by: https://github.com/atalman
2023-11-15 19:49:34 +00:00
c1315ae2b9 Only check significant strides in test torchinductor (#113389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113389
Approved by: https://github.com/int3
2023-11-15 19:47:55 +00:00
42b2b9e663 fixed pyi file for ReduceLROnPlateau (#113659)
Fixes #63143

The issue reappeared because the subclassing was not present in the stub file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113659
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kit1980
2023-11-15 19:33:36 +00:00
b3423889fe [inductor][fx pass] handle numpy compatibility arg names (#113078)
Fixes #113038

the "dim" kwarg can also be referred to with "axis" - handle this case.

21b6030ac3/torch/csrc/utils/python_arg_parser.cpp (L72-L77)

Previously, if the "axis" kwarg was used, it would not be matched and "dim" would default to 0.

75adb9f371/torch/_inductor/fx_passes/split_cat.py (L172-L176)
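
A hypothetical helper capturing the idea of the fix (names are illustrative, not the actual pattern-matcher code):

```python
def get_dim(kwargs, default=0):
    # numpy compatibility: "axis" is an alias for "dim"
    if "dim" in kwargs:
        return kwargs["dim"]
    return kwargs.get("axis", default)

print(get_dim({"axis": 1}))  # 1, instead of silently defaulting to 0
```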

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113078
Approved by: https://github.com/eellison
2023-11-15 19:27:24 +00:00
ca9e654353 [FSDP] Fix FSDP submodule with DeviceMesh does not return DTensor state_dict error (#113593)
For scenarios where FSDP is not the root module, the `_use_dtensor` flag would not be switched on. This PR fixes it by checking whether the submodule has a `device_mesh` and turning the `_use_dtensor` flag on accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113593
Approved by: https://github.com/fegin
2023-11-15 19:00:19 +00:00
277474f1a0 Revert "[2d] pass shape/stride during tensor unflatten (#113547)"
This reverts commit 93372455a73043332c16a71cb9dccdf3e0412a57.

Reverted https://github.com/pytorch/pytorch/pull/113547 on behalf of https://github.com/wanchaol due to broken compile test ([comment](https://github.com/pytorch/pytorch/pull/113547#issuecomment-1813048318))
2023-11-15 18:32:54 +00:00
c678c5ef38 [doc] caution torch.multinomial usage (#112892)
Fixes #107406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112892
Approved by: https://github.com/albanD
2023-11-15 18:20:48 +00:00
296c9e3ce7 upgrade lintrunner to the lowest supported versions on python 3.12 (#113562)
As per title, the current versions fail to install on 3.12.

The failures are related to https://github.com/numpy/numpy/issues/25147
They are fixed by adding manual annotations for the code in PyTorch and ignoring them on caffe2 as discussed with @malfet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113562
Approved by: https://github.com/malfet
2023-11-15 18:12:01 +00:00
7f9fafed53 Resolve docstring errors in throughput_benchmark.py, weak.py, _traceback.py, file_baton.py, _contextlib.py, _device.py, cpp_backtrace.py, bundled_inputs.py, run_cpu.py, hooks.py, mobile_optimizer.py, _freeze.py, __init__.py, mkldnn.py, dlpack.py (#113311)
Fixes #112633

Fixed errors relating to pydocstyle in the following files. The remaining errors are not covered in this issue. `torch/utils/dlpack.py` was not modified, as its errors relate to the function signature in the first line of the docstring, which must be maintained as-is for proper Sphinx interpretation.

```python
def from_dlpack(ext_tensor: Any) -> 'torch.Tensor':
    """from_dlpack(ext_tensor) -> Tensor
         .....
    """
```
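
For illustration, a typical fix for the most common errors below (D205/D400) looks like this (hypothetical example, not from the PR):

```python
def extract(frames):
    """Extract a structured summary from captured frames.

    A blank line now separates the one-line summary from the longer
    description, and the summary line ends with a period (D205, D400).
    """
    return [f for f in frames if f is not None]
```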

pydocstyle torch/utils/_contextlib.py --count
before: 4
after: 0

pydocstyle torch/backends/mps/__init__.py --count
before: 8
after: 1

**remaining errors**
```
torch/backends/mps/__init__.py:1 at module level:
        D104: Missing docstring in public package
```

pydocstyle torch/backends/xeon/run_cpu.py --count
before: 13
after: 1

**remaining errors**
```
torch/backends/xeon/run_cpu.py:864 in public function `main`:
        D103: Missing docstring in public function
```

pydocstyle torch/backends/cpu/__init__.py --count
before: 2
after: 1

**remaining errors**
```
torch/backends/cpu/__init__.py:1 at module level:
        D104: Missing docstring in public package
```

pydocstyle torch/utils/cpp_backtrace.py --count
before: 4
after: 1

**remaining errors**
```
torch/utils/cpp_backtrace.py:1 at module level:
        D100: Missing docstring in public module
```

pydocstyle torch/utils/bundled_inputs.py --count
before: 8
after: 1

**remaining errors**
```
torch/utils/bundled_inputs.py:1 at module level:
        D100: Missing docstring in public module
```

pydocstyle torch/utils/file_baton.py --count
before: 8
after: 1

**remaining errors**
```
torch/utils/file_baton.py:1 at module level:
        D100: Missing docstring in public module
```

pydocstyle torch/utils/mobile_optimizer.py --count
before: 6
after: 1

**remaining errors**
```
torch/utils/mobile_optimizer.py:8 in public class `LintCode`:
        D101: Missing docstring in public class
```

pydocstyle torch/backends/opt_einsum/__init__.py --count
before: 7
after: 5

**remaining errors**
```
torch/backends/opt_einsum/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/opt_einsum/__init__.py:67 in public function `set_flags`:
        D103: Missing docstring in public function
torch/backends/opt_einsum/__init__.py:77 in public function `flags`:
        D103: Missing docstring in public function
torch/backends/opt_einsum/__init__.py:93 in public class `OptEinsumModule`:
        D101: Missing docstring in public class
torch/backends/opt_einsum/__init__.py:94 in public method `__init__`:
        D107: Missing docstring in __init__
```

pydocstyle torch/utils/_device.py --count
before:  9
after: 6

**remaining errors**
```
torch/utils/_device.py:58 in public class `DeviceContext`:
        D101: Missing docstring in public class
torch/utils/_device.py:59 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_device.py:62 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/utils/_device.py:68 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/utils/_device.py:73 in public method `__torch_function__`:
        D105: Missing docstring in magic method
torch/utils/_device.py:80 in public function `device_decorator`:
        D103: Missing docstring in public function

```

pydocstyle torch/utils/_freeze.py --count
before: 15
after: 7

**remaining errors**
```
torch/utils/_freeze.py:77 in public function `indent_msg`:
        D103: Missing docstring in public function
torch/utils/_freeze.py:89 in public class `FrozenModule`:
        D101: Missing docstring in public class
torch/utils/_freeze.py:100 in public class `Freezer`:
        D101: Missing docstring in public class
torch/utils/_freeze.py:101 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_freeze.py:106 in public method `msg`:
        D102: Missing docstring in public method
torch/utils/_freeze.py:185 in public method `get_module_qualname`:
        D102: Missing docstring in public method
torch/utils/_freeze.py:206 in public method `compile_string`:
        D102: Missing docstring in public method

```

pydocstyle torch/utils/throughput_benchmark.py --count
before: 25
after: 8
**remaining errors**
```
torch/utils/throughput_benchmark.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/throughput_benchmark.py:27 in public class `ExecutionStats`:
        D101: Missing docstring in public class
torch/utils/throughput_benchmark.py:28 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/throughput_benchmark.py:33 in public method `latency_avg_ms`:
        D102: Missing docstring in public method
torch/utils/throughput_benchmark.py:37 in public method `num_iters`:
        D102: Missing docstring in public method
torch/utils/throughput_benchmark.py:46 in public method `total_time_seconds`:
        D102: Missing docstring in public method
torch/utils/throughput_benchmark.py:50 in public method `__str__`:
        D105: Missing docstring in magic method
torch/utils/throughput_benchmark.py:94 in public method `__init__`:
        D107: Missing docstring in __init__

```

pydocstyle torch/utils/hooks.py --count

before: 14
after: 11

**remaining errors**
```
torch/utils/hooks.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/hooks.py:23 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/hooks.py:34 in public method `remove`:
        D102: Missing docstring in public method
torch/utils/hooks.py:44 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/hooks.py:50 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/hooks.py:64 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/utils/hooks.py:67 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/utils/hooks.py:82 in public function `warn_if_has_hooks`:
        D103: Missing docstring in public function
torch/utils/hooks.py:103 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/hooks.py:188 in public method `setup_input_hook`:
        D102: Missing docstring in public method
torch/utils/hooks.py:197 in public method `setup_output_hook`:
        D102: Missing docstring in public method
```

pydocstyle torch/utils/_traceback.py --count
before: 19
after: 14

**remaining errors**
```
torch/utils/_traceback.py:47 in public function `report_compile_source_on_error`:
        D103: Missing docstring in public function
torch/utils/_traceback.py:160 in public class `CapturedTraceback`:
        D101: Missing docstring in public class
torch/utils/_traceback.py:163 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_traceback.py:167 in public method `cleanup`:
        D102: Missing docstring in public method
torch/utils/_traceback.py:170 in public method `summary`:
        D102: Missing docstring in public method
torch/utils/_traceback.py:182 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/_traceback.py:190 in public method `extract`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_traceback.py:190 in public method `extract`:
        D400: First line should end with a period (not 't')
torch/utils/_traceback.py:213 in public method `format`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_traceback.py:213 in public method `format`:
        D400: First line should end with a period (not 'f')
torch/utils/_traceback.py:213 in public method `format`:
        D401: First line should be in imperative mood (perhaps 'Format', not 'Formats')
torch/utils/_traceback.py:224 in public method `format_all`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/_traceback.py:247 in private function `_extract_symbolized_tb`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_traceback.py:247 in private function `_extract_symbolized_tb`:
        D400: First line should end with a period (not 'f')
```

pydocstyle torch/utils/mkldnn.py --count
before: 28
after: 26

**remaining errors**
```
torch/utils/mkldnn.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/mkldnn.py:4 in public class `MkldnnLinear`:
        D101: Missing docstring in public class
torch/utils/mkldnn.py:5 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/mkldnn.py:19 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/mkldnn.py:23 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/mkldnn.py:29 in public method `forward`:
        D102: Missing docstring in public method
torch/utils/mkldnn.py:75 in public class `MkldnnConv1d`:
        D101: Missing docstring in public class
torch/utils/mkldnn.py:76 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/mkldnn.py:82 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/mkldnn.py:88 in public class `MkldnnConv2d`:
        D101: Missing docstring in public class
torch/utils/mkldnn.py:89 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/mkldnn.py:100 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/mkldnn.py:110 in public class `MkldnnConv3d`:
        D101: Missing docstring in public class
torch/utils/mkldnn.py:111 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/mkldnn.py:122 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/mkldnn.py:133 in public class `MkldnnBatchNorm`:
        D101: Missing docstring in public class
torch/utils/mkldnn.py:136 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/mkldnn.py:155 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/mkldnn.py:163 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/mkldnn.py:171 in public method `forward`:
        D102: Missing docstring in public method
torch/utils/mkldnn.py:184 in public class `MkldnnPrelu`:
        D101: Missing docstring in public class
torch/utils/mkldnn.py:185 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/mkldnn.py:190 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/mkldnn.py:194 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/mkldnn.py:199 in public method `forward`:
        D102: Missing docstring in public method
torch/utils/mkldnn.py:205 in public function `to_mkldnn`:
        D103: Missing docstring in public function
```

pydocstyle torch/utils/weak.py --count
before: 32
after: 30

**remaining errors**
```
torch/utils/weak.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/weak.py:42 in public class `WeakIdRef`:
        D101: Missing docstring in public class
torch/utils/weak.py:45 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/weak.py:54 in public method `__call__`:
        D102: Missing docstring in public method
torch/utils/weak.py:61 in public method `__hash__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:64 in public method `__eq__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:84 in public class `WeakIdKeyDictionary`:
        D101: Missing docstring in public class
torch/utils/weak.py:87 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/weak.py:131 in public method `__delitem__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:135 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:138 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:145 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:148 in public method `__setitem__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:151 in public method `copy`:
        D102: Missing docstring in public method
torch/utils/weak.py:162 in public method `__deepcopy__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:172 in public method `get`:
        D102: Missing docstring in public method
torch/utils/weak.py:175 in public method `__contains__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:182 in public method `items`:
        D102: Missing docstring in public method
torch/utils/weak.py:189 in public method `keys`:
        D102: Missing docstring in public method
torch/utils/weak.py:198 in public method `values`:
        D102: Missing docstring in public method
torch/utils/weak.py:216 in public method `popitem`:
        D102: Missing docstring in public method
torch/utils/weak.py:224 in public method `pop`:
        D102: Missing docstring in public method
torch/utils/weak.py:228 in public method `setdefault`:
        D102: Missing docstring in public method
torch/utils/weak.py:231 in public method `update`:
        D102: Missing docstring in public method
torch/utils/weak.py:241 in public method `__ior__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:245 in public method `__or__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:252 in public method `__ror__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:262 in public method `__eq__`:
        D105: Missing docstring in magic method
torch/utils/weak.py:276 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/weak.py:280 in public method `__call__`:
        D102: Missing docstring in public method

```

@mikaylagawarecki @jbschlosser @svekars
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113311
Approved by: https://github.com/ezyang
2023-11-15 17:40:04 +00:00
e100ff42fd Fix chrome trace entry format (#113763)
Fix regression introduced by https://github.com/pytorch/pytorch/pull/107519

`'"args": {{}}}}, '` was part of a format string, where curly braces are doubled so that they print a single time; the ruff change left the string as-is even though it is no longer run through the formatter, so the doubled braces leak through.
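
A minimal illustration of the brace-escaping behavior (hypothetical snippet, not from the diff):

```python
template = '"args": {{}}}}, '

# As a str.format template, {{ -> { and }} -> }, so each brace prints once:
print(template.format())  # '"args": {}}, '

# If the string is no longer formatted but the escapes are kept, the raw
# doubled braces leak into the output:
print(template)           # '"args": {{}}}}, '
```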

Fixes https://github.com/pytorch/pytorch/issues/113756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113763
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2023-11-15 17:07:40 +00:00
dedb47d94c Revert "Fix resize matrix_power.out dynamic shapes (#113695)"
This reverts commit c3918c18b5f7b98ef83f9022062a9f1990e3324d.

Reverted https://github.com/pytorch/pytorch/pull/113695 on behalf of https://github.com/ezyang due to sorry about that ([comment](https://github.com/pytorch/pytorch/pull/113695#issuecomment-1812705370))
2023-11-15 15:06:08 +00:00
6c187246d6 Add support for float8_e4m3fnuz and _e5m2fnuz (#107586)
This PR relates to the feature in [this feature submission](https://docs.google.com/document/d/1pF2T1xz54IPg1jG7FhykbrpbcJZVelQw0v8vBaoLkfs/edit). It is based on #104242, which adds similar float8 types.

These new types added in this PR are described in the paper at https://arxiv.org/abs/2206.02915. A brief description and comparison of the types with other float8 types can be also found in the [OpenXLA RFC](https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md).
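
A small usage sketch of the new dtypes (assuming the `torch.float8_e4m3fnuz` and `torch.float8_e5m2fnuz` names this PR exposes):

```python
import torch

x = torch.randn(4)
print(x.to(torch.float8_e4m3fnuz))  # cast to the e4m3fnuz variant
print(x.to(torch.float8_e5m2fnuz))  # cast to the e5m2fnuz variant
```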

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107586
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-11-15 15:01:11 +00:00
c3918c18b5 Fix resize matrix_power.out dynamic shapes (#113695)
Fixes https://github.com/pytorch/pytorch/issues/113003

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113695
Approved by: https://github.com/bdhirsh, https://github.com/lezcano
2023-11-15 13:35:54 +00:00
9146ca6a07 use sourceless builder for builtin getattr (#113340)
In TorchVision we use the following (simplified) dispatch mechanism:

```python
import torch

def kernel1(tensor):
    return tensor + 2

def dispatcher1(input):
    kernel = get_kernel(dispatcher1, type(input))
    return kernel(input)

def kernel2(tensor):
    return tensor - 2

def dispatcher2(input):
    kernel = get_kernel(dispatcher2, type(input))
    return kernel(input)

# We actually use the function and type as keys, rather than their names.
# However, this is currently not supported; it should be easy to add after
# https://github.com/pytorch/pytorch/pull/111196
REGISTRY = {
    "dispatcher1": {"Tensor": kernel1},
    "dispatcher2": {"Tensor": kernel2},
}

def get_kernel(dispatcher, input_type):
    dispatcher_registry = REGISTRY[dispatcher.__name__]
    for cls in input_type.__mro__:
        kernel = dispatcher_registry[cls.__name__]
        break
    return kernel
```

This can be compiled without graph breaks:

```python
cfn = torch.compile(dispatcher1, fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 5)

cfn = torch.compile(dispatcher2, fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 1)
```

However, if we start chaining these calls, we hit some issues:

```python
class Pipeline(torch.nn.Module):
    def forward(self, input):
        input = dispatcher1(input)
        input = dispatcher2(input)
        return input

cfn = torch.compile(Pipeline(), fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 3)
```

```
Can't access members of type(obj) for a generated custom object. Please use __class__ instead
```

The error message is not really helpful here. The following happens: when compiling `dispatcher1`, `get_kernel` gets inlined. That means when hitting `dispatcher2`, the `type` call no longer happens on an input with a source. Thus, in the first iteration we hit the top branch, while in the second we hit the bottom:

addb8e29cd/torch/_dynamo/variables/builtin.py (L1264-L1268)

And the error message I posted above originates from the type being treated as constant. This PR replaces this with a `SourcelessBuilder` instead.

With that fix in place, we hit another error, pointing to `input_type.__mro__`:

```
AssertionError: Consider SourcelessBuilder for ephemeral objects, usually objects created locally.
```

The fix is similar: instead of using a `VariableBuilder` here, we use a `SourcelessBuilder` when there is no `source`:

addb8e29cd/torch/_dynamo/variables/builtin.py (L1167-L1168)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113340
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-11-15 13:01:20 +00:00
50101d59ba [export][retry] Move lifted tensors out of state_dict (#113689)
Test Plan: CI

Differential Revision: D51321532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113689
Approved by: https://github.com/zhxchen17
2023-11-15 09:24:49 +00:00
17e2313dd3 Add an API to DDP for dynamically updating the underlying process group. (#113580)
# Motivation

If we would like to reinitialize DDP with a different PG with `torch.compile`, we need to do the following:

```
del old_ddp
del old_pg
pg = init_pg(...)                   # set up the new process group
ddp = DDP(model, process_group=pg)  # re-wrap the module with the new PG
compiled = torch.compile(ddp)       # recompiles the entire model
```

This results in recompilation of the entire model and is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again.

# Proposal

As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again.
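
Continuing from the snippet above, a usage sketch of the new API (the group construction is illustrative):

```python
import torch.distributed as dist

# Illustrative: build a replacement group over the same ranks
new_pg = dist.new_group(ranks=list(range(dist.get_world_size())))

# Swap the PG on the existing DDP instance; the compiled module is reused
# without triggering a recompile:
ddp._update_process_group(new_pg)
```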

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580
Approved by: https://github.com/fduwjj
2023-11-15 09:05:02 +00:00
7f1eda8c29 Minor: fix a typo (#113648)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113648
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-11-15 08:42:58 +00:00
757f36b988 [docs] Fix torch.compile "tensorrt" backend docs (#113711)
- Update description from ONNX to current state (Torch-TensorRT)
- Add clarification about import

Fixes documentation on this page: https://pytorch.org/docs/stable/torch.compiler.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113711
Approved by: https://github.com/msaroufim
2023-11-15 08:42:53 +00:00
9b0f2f8d94 expose sdpa helpers to python (#110496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110496
Approved by: https://github.com/jbschlosser
2023-11-15 07:34:34 +00:00
78f3937ee8 [BE] Handle errors in set_num_threads (#113684)
and `set_num_interop_threads`

Before this change, calling `torch.set_num_threads(2**65)` resulted in a segmentation fault; afterwards it becomes a good old runtime error:
```
% python -c "import torch;torch.set_num_threads(2**65)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Overflow when unpacking long
```

Similar to https://github.com/pytorch/pytorch/pull/60073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113684
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-11-15 06:17:41 +00:00
1a8d076e0c [inductor cpp] simplify test for uint8 add/sub (#113407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113407
Approved by: https://github.com/lezcano
ghstack dependencies: #113261
2023-11-15 06:17:25 +00:00
dadca7aeec remove \ in cache_dir (#110945)
Fixes #110933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110945
Approved by: https://github.com/masnesral, https://github.com/shunting314
2023-11-15 06:01:08 +00:00
fda94124d7 [inductor] Make {cudagraph_trees,decomposition,post_grad}.py pass follow_imports typechecking (#113609)
I added explicit imports to `kernel/__init__.py` as mypy doesn't seem to
understand an empty `__init__.py`.
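
The explicit-import style looks roughly like this (a sketch; the exact submodule names are illustrative):

```python
# torch/_inductor/kernel/__init__.py
from . import bmm, conv, mm  # noqa: F401  -- re-export submodules for mypy
```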

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113609
Approved by: https://github.com/eellison
2023-11-15 05:04:11 +00:00
6f4409073f [doc] two diff meanings of rv generated by torch.tensor.geometric_ and torch.distributions.geometric.Geometric (#113183)
The random variables generated by `torch.tensor.geometric_` and `torch.distributions.geometric.Geometric` have different meanings; they are defined by two different PMFs.
Inform the user, so the user can choose the desired one.
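
For reference, a sketch of the two conventions at play (the docs change spells out which API uses which):

```latex
% Tensor.geometric_: number of Bernoulli trials up to and including the first success
P(X = k) = (1 - p)^{k - 1} \, p, \qquad k = 1, 2, \dots

% distributions.Geometric: number of failures before the first success
P(X = k) = (1 - p)^{k} \, p, \qquad k = 0, 1, \dots
```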

Background: https://github.com/pytorch/pytorch/pull/37984#issuecomment-630336511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113183
Approved by: https://github.com/albanD
2023-11-15 03:49:04 +00:00
fcdfcdeef9 [inductor cpp] fix non-contiguous reduction store (#113261)
Fix https://github.com/pytorch/pytorch/issues/113018

The reduction store in this case works on a non-contiguous buffer. Previously, we only did the scalar fallback for normal stores, not for reduction stores. This PR fixes that.

Before fix
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17L*x2) + (306L*x0)));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0_vec.store(out_ptr1 + static_cast<long>(x0 + (39L*x1))); // this is wrong since x0 is not vector dim
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr1[static_cast<long>(x1 + (17L*x2) + (306L*x0))];
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
                        }
                        out_ptr1[static_cast<long>(x0 + (39L*x1))] = tmp_acc0;
                    }
                }
            }
```

After fix
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17L*x2) + (306L*x0)));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
                        { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp_acc0_vec.store(tmpbuf); for (long x1_inner = 0; x1_inner < 16; x1_inner++) out_ptr1[static_cast<long>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner]; }
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr1[static_cast<long>(x1 + (17L*x2) + (306L*x0))];
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
                        }
                        out_ptr1[static_cast<long>(x0 + (39L*x1))] = tmp_acc0;
                    }
                }
            }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113261
Approved by: https://github.com/lezcano
2023-11-15 03:27:17 +00:00
f9ea697112 [quant][pt2][be] Refactor QAT tests for future patterns (#113658)
Summary: Currently the QAT tests are very specific to conv-bn-2d.
This makes it difficult to test new patterns like conv-bn-1d if
we want to add them. This commit refactors these tests so we can
add and test future patterns easily.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113658
Approved by: https://github.com/jerryzh168
2023-11-15 02:17:13 +00:00
77f66ade66 Revert "use sourceless builder for builtin getattr (#113340)"
This reverts commit d64bc8f0f81bd9b514eb1a5ee6f5b03094e4e6e9.

Reverted https://github.com/pytorch/pytorch/pull/113340 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/113340#issuecomment-1811684167))
2023-11-15 02:06:00 +00:00
84ee7453ad ci: Add clickable PR link to trymerge (#113712)
Adds a link to trymerge so that you can quickly click through the job to
the pull request for debugging.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113712
Approved by: https://github.com/clee2000, https://github.com/malfet
2023-11-15 01:55:33 +00:00
92e3f45f0e Revert "[dynamo] Refactor test cross importing (#113242)"
This reverts commit 4309d38f5d33530cbd875bded551e3fc08286c5d.

Reverted https://github.com/pytorch/pytorch/pull/113242 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113242#issuecomment-1811674395))
2023-11-15 01:53:07 +00:00
6bffde99b0 Revert "[inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)"
This reverts commit 66d09f82170c528698b5ec606ba7838268ae1f8a.

Reverted https://github.com/pytorch/pytorch/pull/113275 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113275#issuecomment-1811666004))
2023-11-15 01:44:26 +00:00
45671be2a0 Revert "Only check significant strides in test torchinductor (#113389)"
This reverts commit 28228e1517738f66f11ba278ed8e821c36dcff63.

Reverted https://github.com/pytorch/pytorch/pull/113389 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is conflicting with this revert https://github.com/pytorch/pytorch/pull/113275#issuecomment-1811651388, so I need to revert this to clean thing up ([comment](https://github.com/pytorch/pytorch/pull/113389#issuecomment-1811663791))
2023-11-15 01:41:16 +00:00
6a25bb8545 [inductor] use fusion_log for verbose logs (#113701)
Fixes https://github.com/pytorch/pytorch/issues/113696

Previously, log hygiene was not respected.
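
With the logs routed through the artifact logger, verbose fusion output can be enabled selectively; a sketch assuming the standard `torch._logging` mechanism and the `fusion` artifact name:

```python
import torch._logging

# Turn on only the scheduler's fusion logs (artifact name assumed):
torch._logging.set_logs(fusion=True)
```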

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113701
Approved by: https://github.com/ezyang
2023-11-15 01:39:03 +00:00
1e60174891 Revert "[dynamo] Add run_inductor_tests entrypoint (#113278)"
This reverts commit b00311ce9e430cf1b98d2103e21ed2179450a424.

Reverted https://github.com/pytorch/pytorch/pull/113278 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113278#issuecomment-1811646325))
2023-11-15 01:19:48 +00:00
9724d0fd87 docstyle _correct_bias.py _equalize.py _learnable_fake_quantize.py backend_config experimental fake_quantize.py fuse_modules.py fuser_method_mappings.py (#112992)
Fixes #112988

For files

__init__.py
_correct_bias.py
_equalize.py
_learnable_fake_quantize.py
backend_config
experimental
fake_quantize.py
fuse_modules.py
fuser_method_mappings.py

Corrected the following errors:

__init__.py:1 at module level:
        D104: Missing docstring in public package
__init__.py:144 in public function `default_eval_fn`:
        D205: 1 blank line required between summary line and description (found 0)
__init__.py:144 in public function `default_eval_fn`:
        D400: First line should end with a period (not 'f')
__init__.py:144 in public function `default_eval_fn`:
        D401: First line should be in imperative mood; try rephrasing (found 'Default')
__init__.py:152 in private class `_DerivedObserverOrFakeQuantize`:
        D204: 1 blank line required after class docstring (found 0)
__init__.py:152 in private class `_DerivedObserverOrFakeQuantize`:
        D205: 1 blank line required between summary line and description (found 0)
__init__.py:152 in private class `_DerivedObserverOrFakeQuantize`:
        D210: No whitespaces allowed surrounding docstring text
__init__.py:152 in private class `_DerivedObserverOrFakeQuantize`:
        D400: First line should end with a period (not 's')
_correct_bias.py:20 in public function `get_module`:
        D200: One-line docstring should fit on one line with quotes (found 2)
_correct_bias.py:20 in public function `get_module`:
        D210: No whitespaces allowed surrounding docstring text
_correct_bias.py:20 in public function `get_module`:
        D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:20 in public function `get_module`:
        D400: First line should end with a period (not 'l')
_correct_bias.py:25 in public function `parent_child_names`:
        D200: One-line docstring should fit on one line with quotes (found 2)
_correct_bias.py:25 in public function `parent_child_names`:
        D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:25 in public function `parent_child_names`:
        D400: First line should end with a period (not 'e')
_correct_bias.py:25 in public function `parent_child_names`:
        D401: First line should be in imperative mood (perhaps 'Split', not 'Splits')
_correct_bias.py:34 in public function `get_param`:
        D205: 1 blank line required between summary line and description (found 0)
_correct_bias.py:34 in public function `get_param`:
        D210: No whitespaces allowed surrounding docstring text
_correct_bias.py:34 in public function `get_param`:
        D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:34 in public function `get_param`:
        D400: First line should end with a period (not 's')
_correct_bias.py:44 in public class `MeanShadowLogger`:
        D204: 1 blank line required after class docstring (found 0)
_correct_bias.py:44 in public class `MeanShadowLogger`:
        D205: 1 blank line required between summary line and description (found 0)
_correct_bias.py:44 in public class `MeanShadowLogger`:
        D400: First line should end with a period (not 'n')
_correct_bias.py:47 in public method `__init__`:
        D107: Missing docstring in __init__
_correct_bias.py:56 in public method `forward`:
        D205: 1 blank line required between summary line and description (found 0)
_correct_bias.py:56 in public method `forward`:
        D210: No whitespaces allowed surrounding docstring text
_correct_bias.py:56 in public method `forward`:
        D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:56 in public method `forward`:
        D401: First line should be in imperative mood; try rephrasing (found 'The')
_correct_bias.py:77 in public method `clear`:
        D102: Missing docstring in public method
_correct_bias.py:85 in public function `bias_correction`:
        D205: 1 blank line required between summary line and description (found 0)
_correct_bias.py:85 in public function `bias_correction`:
        D210: No whitespaces allowed surrounding docstring text
_correct_bias.py:85 in public function `bias_correction`:
        D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:85 in public function `bias_correction`:
        D400: First line should end with a period (not 's')
_correct_bias.py:85 in public function `bias_correction`:
        D401: First line should be in imperative mood (perhaps 'Use', not 'Using')
_equalize.py:22 in public function `set_module_weight`:
        D103: Missing docstring in public function
_equalize.py:28 in public function `set_module_bias`:
        D103: Missing docstring in public function
_equalize.py:34 in public function `get_module_weight`:
        D103: Missing docstring in public function
_equalize.py:40 in public function `get_module_bias`:
        D103: Missing docstring in public function
_equalize.py:47 in public function `max_over_ndim`:
        D200: One-line docstring should fit on one line with quotes (found 2)
_equalize.py:47 in public function `max_over_ndim`:
        D210: No whitespaces allowed surrounding docstring text
_equalize.py:47 in public function `max_over_ndim`:
        D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:47 in public function `max_over_ndim`:
        D400: First line should end with a period (not 's')
_equalize.py:47 in public function `max_over_ndim`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
_equalize.py:55 in public function `min_over_ndim`:
        D200: One-line docstring should fit on one line with quotes (found 2)
_equalize.py:55 in public function `min_over_ndim`:
        D210: No whitespaces allowed surrounding docstring text
_equalize.py:55 in public function `min_over_ndim`:
        D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:55 in public function `min_over_ndim`:
        D400: First line should end with a period (not 's')
_equalize.py:55 in public function `min_over_ndim`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
_equalize.py:63 in public function `channel_range`:
        D200: One-line docstring should fit on one line with quotes (found 2)
_equalize.py:63 in public function `channel_range`:
        D210: No whitespaces allowed surrounding docstring text
_equalize.py:63 in public function `channel_range`:
        D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:63 in public function `channel_range`:
        D400: First line should end with a period (not 'l')
_equalize.py:63 in public function `channel_range`:
        D401: First line should be in imperative mood (perhaps 'Find', not 'finds')
_equalize.py:63 in public function `channel_range`:
        D403: First word of the first line should be properly capitalized ('Finds', not 'finds')
_equalize.py:76 in public function `cross_layer_equalization`:
        D205: 1 blank line required between summary line and description (found 0)
_equalize.py:76 in public function `cross_layer_equalization`:
        D210: No whitespaces allowed surrounding docstring text
_equalize.py:76 in public function `cross_layer_equalization`:
        D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:76 in public function `cross_layer_equalization`:
        D400: First line should end with a period (not 't')
_equalize.py:120 in public function `equalize`:
        D205: 1 blank line required between summary line and description (found 0)
_equalize.py:120 in public function `equalize`:
        D210: No whitespaces allowed surrounding docstring text
_equalize.py:120 in public function `equalize`:
        D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:120 in public function `equalize`:
        D400: First line should end with a period (not 'l')
_equalize.py:159 in public function `converged`:
        D205: 1 blank line required between summary line and description (found 0)
_equalize.py:159 in public function `converged`:
        D210: No whitespaces allowed surrounding docstring text
_equalize.py:159 in public function `converged`:
        D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:159 in public function `converged`:
        D400: First line should end with a period (not 's')
_equalize.py:159 in public function `converged`:
        D401: First line should be in imperative mood (perhaps 'Test', not 'Tests')
_learnable_fake_quantize.py:8 in private class `_LearnableFakeQuantize`:
        D204: 1 blank line required after class docstring (found 0)
_learnable_fake_quantize.py:8 in private class `_LearnableFakeQuantize`:
        D205: 1 blank line required between summary line and description (found 0)
_learnable_fake_quantize.py:8 in private class `_LearnableFakeQuantize`:
        D210: No whitespaces allowed surrounding docstring text
_learnable_fake_quantize.py:8 in private class `_LearnableFakeQuantize`:
        D400: First line should end with a period (not 'h')
_learnable_fake_quantize.py:68 in private method `enable_param_learning`:
        D205: 1 blank line required between summary line and description (found 0)
_learnable_fake_quantize.py:68 in private method `enable_param_learning`:
        D400: First line should end with a period (not 'd')
_learnable_fake_quantize.py:68 in private method `enable_param_learning`:
        D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
_learnable_fake_quantize.py:78 in private method `enable_static_estimate`:
        D205: 1 blank line required between summary line and description (found 0)
_learnable_fake_quantize.py:78 in private method `enable_static_estimate`:
        D400: First line should end with a period (not 'f')
_learnable_fake_quantize.py:78 in private method `enable_static_estimate`:
        D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
_learnable_fake_quantize.py:87 in private method `enable_static_observation`:
        D205: 1 blank line required between summary line and description (found 0)
_learnable_fake_quantize.py:87 in private method `enable_static_observation`:
        D400: First line should end with a period (not 't')
_learnable_fake_quantize.py:87 in private method `enable_static_observation`:
        D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
fake_quantize.py:1 at module level:
        D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:1 at module level:
        D400: First line should end with a period (not 'n')
fake_quantize.py:61 in public class `FakeQuantizeBase`:
        D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:61 in public class `FakeQuantizeBase`:
        D210: No whitespaces allowed surrounding docstring text
fake_quantize.py:61 in public class `FakeQuantizeBase`:
        D400: First line should end with a period (not 'e')
fake_quantize.py:74 in public method `__init__`:
        D107: Missing docstring in __init__
fake_quantize.py:83 in public method `forward`:
        D102: Missing docstring in public method
fake_quantize.py:87 in public method `calculate_qparams`:
        D102: Missing docstring in public method
fake_quantize.py:91 in public method `enable_fake_quant`:
        D102: Missing docstring in public method
fake_quantize.py:95 in public method `disable_fake_quant`:
        D102: Missing docstring in public method
fake_quantize.py:99 in public method `enable_observer`:
        D102: Missing docstring in public method
fake_quantize.py:103 in public method `disable_observer`:
        D102: Missing docstring in public method
fake_quantize.py:107 in public method `with_args`:
        D102: Missing docstring in public method
fake_quantize.py:115 in public class `FakeQuantize`:
        D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:115 in public class `FakeQuantize`:
        D210: No whitespaces allowed surrounding docstring text
fake_quantize.py:115 in public class `FakeQuantize`:
        D412: No blank lines allowed between a section header and its content ('Attributes')
fake_quantize.py:150 in public method `__init__`:
        D107: Missing docstring in __init__
fake_quantize.py:188 in public method `calculate_qparams`:
        D102: Missing docstring in public method
fake_quantize.py:191 in public method `forward`:
        D102: Missing docstring in public method
fake_quantize.py:214 in public method `extra_repr`:
        D102: Missing docstring in public method
fake_quantize.py:262 in public class `FixedQParamsFakeQuantize`:
        D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:262 in public class `FixedQParamsFakeQuantize`:
        D210: No whitespaces allowed surrounding docstring text
fake_quantize.py:262 in public class `FixedQParamsFakeQuantize`:
        D400: First line should end with a period (not 'n')
fake_quantize.py:268 in public method `__init__`:
        D107: Missing docstring in __init__
fake_quantize.py:279 in public method `calculate_qparams`:
        D102: Missing docstring in public method
fake_quantize.py:283 in public method `extra_repr`:
        D102: Missing docstring in public method
fake_quantize.py:292 in public class `FusedMovingAvgObsFakeQuantize`:
        D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:292 in public class `FusedMovingAvgObsFakeQuantize`:
        D400: First line should end with a period (not 'e')
fake_quantize.py:307 in public method `__init__`:
        D107: Missing docstring in __init__
fake_quantize.py:322 in public method `calculate_qparams`:
        D102: Missing docstring in public method
fake_quantize.py:326 in public method `extra_repr`:
        D102: Missing docstring in public method
fake_quantize.py:342 in public method `forward`:
        D102: Missing docstring in public method
fake_quantize.py:480 in private function `_is_fake_quant_script_module`:
        D200: One-line docstring should fit on one line with quotes (found 2)
fake_quantize.py:480 in private function `_is_fake_quant_script_module`:
        D210: No whitespaces allowed surrounding docstring text
fake_quantize.py:480 in private function `_is_fake_quant_script_module`:
        D300: Use """triple double quotes""" (found '''-quotes)
fake_quantize.py:480 in private function `_is_fake_quant_script_module`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
fake_quantize.py:491 in public function `disable_fake_quant`:
        D400: First line should end with a period (not ':')
fake_quantize.py:502 in public function `enable_fake_quant`:
        D400: First line should end with a period (not ':')
fake_quantize.py:513 in public function `disable_observer`:
        D400: First line should end with a period (not ':')
fake_quantize.py:524 in public function `enable_observer`:
        D400: First line should end with a period (not ':')
fuse_modules.py:1 at module level:
        D100: Missing docstring in public module
fuse_modules.py:39 in public function `fuse_known_modules`:
        D205: 1 blank line required between summary line and description (found 0)
fuse_modules.py:39 in public function `fuse_known_modules`:
        D400: First line should end with a period (not 'd')
fuse_modules.py:39 in public function `fuse_known_modules`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
fuse_modules.py:104 in public function `fuse_modules`:
        D400: First line should end with a period (not 'e')
fuse_modules.py:167 in public function `fuse_modules_qat`:
        D200: One-line docstring should fit on one line with quotes (found 2)
fuse_modules.py:167 in public function `fuse_modules_qat`:
        D210: No whitespaces allowed surrounding docstring text
fuse_modules.py:167 in public function `fuse_modules_qat`:
        D400: First line should end with a period (not '`')
fuser_method_mappings.py:1 at module level:
        D100: Missing docstring in public module
fuser_method_mappings.py:18 in public function `fuse_conv_bn`:
        D400: First line should end with a period (not 'e')
fuser_method_mappings.py:55 in public function `fuse_conv_bn_relu`:
        D400: First line should end with a period (not 'e')
fuser_method_mappings.py:102 in public function `fuse_linear_bn`:
        D400: First line should end with a period (not 'e')
fuser_method_mappings.py:131 in public function `fuse_convtranspose_bn`:
        D400: First line should end with a period (not 'e')
fuser_method_mappings.py:154 in private function `_sequential_wrapper2`:
        D205: 1 blank line required between summary line and description (found 0)
fuser_method_mappings.py:154 in private function `_sequential_wrapper2`:
        D210: No whitespaces allowed surrounding docstring text
fuser_method_mappings.py:154 in private function `_sequential_wrapper2`:
        D400: First line should end with a period (not 's')
fuser_method_mappings.py:182 in public function `get_fuser_method`:
        D205: 1 blank line required between summary line and description (found 0)
fuser_method_mappings.py:182 in public function `get_fuser_method`:
        D210: No whitespaces allowed surrounding docstring text
fuser_method_mappings.py:182 in public function `get_fuser_method`:
        D300: Use """triple double quotes""" (found '''-quotes)
fuser_method_mappings.py:182 in public function `get_fuser_method`:
        D400: First line should end with a period (not ',')
fuser_method_mappings.py:205 in private function `_get_valid_patterns`:
        D205: 1 blank line required between summary line and description (found 0)
fuser_method_mappings.py:205 in private function `_get_valid_patterns`:
        D400: First line should end with a period (not ',')
fuser_method_mappings.py:205 in private function `_get_valid_patterns`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
fuser_method_mappings.py:238 in public function `get_fuser_method_new`:
        D205: 1 blank line required between summary line and description (found 0)
fuser_method_mappings.py:238 in public function `get_fuser_method_new`:
        D210: No whitespaces allowed surrounding docstring text
fuser_method_mappings.py:238 in public function `get_fuser_method_new`:
        D400: First line should end with a period (not 'd')
fuser_method_mappings.py:238 in public function `get_fuser_method_new`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
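
For illustration, a typical fix for the `channel_range` errors above might look like this (hypothetical docstring text, not taken from the diff):

```python
# Before: '''-quotes, a one-liner split over two lines, non-imperative first word
def channel_range(input, axis=0):
    '''finds the range of weights
    associated with a specific channel'''

# After: one line, """-quotes, imperative mood, trailing period
def channel_range(input, axis=0):
    """Find the range of weights associated with a specific channel."""
```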

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112992
Approved by: https://github.com/kit1980
2023-11-15 00:59:44 +00:00
252e68a83b Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 54493fe8c4b1cca4c5ff993b99eb3e3dbc984226.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))
2023-11-15 00:51:23 +00:00
c892f1a318 Doc: Add and fix docstrings for torch.distributed files (#112735)
Fixes #112647

Fixed and tested docstrings for all files as defined in the issue.

```
> pydocstyle '/Users/guptaaryan16/Desktop/OSS/pytorch/torch/distributed/pipeline/sync/skip/skippable.py' --count
Before: 15
After: 2

> pydocstyle torch/distributed/elastic/agent/server/local_elastic_agent.py --count
Before: 4
After: 2

> pydocstyle '/Users/guptaaryan16/Desktop/OSS/pytorch/torch/distributed/elastic/agent/server/api.py' --count
Before: 65
After: 12

> pydocstyle torch/distributed/elastic/agent/server/__init__.py --count
Before: 2
After: 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112735
Approved by: https://github.com/kit1980
2023-11-15 00:49:07 +00:00
b8b3c26d3d If we re-fakeify a FakeTensor with the same ShapeEnv, preserve symbols (#113651)
Subsumes half of https://github.com/pytorch/pytorch/pull/113605

We support fakeifying an already fake tensor, which gives you a new fake tensor mirroring the structure of the original; this is what https://github.com/pytorch/pytorch/issues/113643 needs. However, when this refakeification happens, we naively allocate fresh sizes for the entire fake tensor. That is the right thing to do if you are re-fakeifying on a fresh ShapeEnv (because you are reparametrizing the sizes or something), but if you have two fake tensor modes sharing a shape environment, you would rather just reuse the original sizes/strides/offset from the original fake tensor. This ends up being pretty simple. I recommend viewing with whitespace diff turned off.
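
A sketch of the shared-ShapeEnv scenario (internal APIs, shown only for illustration; exact behavior depends on the mode configuration):

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.experimental.symbolic_shapes import ShapeEnv

shape_env = ShapeEnv()
mode_a = FakeTensorMode(shape_env=shape_env)
mode_b = FakeTensorMode(shape_env=shape_env)  # second mode, same ShapeEnv

t = mode_a.from_tensor(torch.empty(4, 8))
# Re-fakeifying t under mode_b should reuse t's sizes/strides/offset
# rather than allocating fresh symbols:
t2 = mode_b.from_tensor(t)
```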

There's some fuzz around jagged tensor handling; that code is probably not quite right, but I fixed it for this particular case in the most straightforward way.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113651
Approved by: https://github.com/albanD, https://github.com/eellison, https://github.com/bdhirsh
2023-11-15 00:36:04 +00:00
cyy
cab039fe9b [1/N] Fixes clang-tidy warnings in header files (#113608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113608
Approved by: https://github.com/Skylion007
2023-11-15 00:32:43 +00:00
31e16847ea [doc] torch.tensor.geometric_, torch.tensor.uniform_ fix PMF vs PDF (#113109)
- The geometric distribution is discrete, so fix the docs to say PMF (probability mass function)
- The continuous uniform distribution has a density, so fix the docs to say PDF (probability density function); see the sketch below
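
For the uniform case, the density the docs should be referring to is (a sketch, matching the `Tensor.uniform_` documentation):

```latex
f(x) = \frac{1}{\text{to} - \text{from}}, \qquad \text{from} \le x < \text{to}
```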

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113109
Approved by: https://github.com/albanD
2023-11-15 00:30:19 +00:00
56c453233f [doc] clarify the range of sampled rv for torch.tensor.exponential_ (#113195)
The range of the sampled random variable needs to be clarified for `torch.tensor.exponential_`: its supported interval (0, inf) differs from the [0, inf) support of the exponential distribution.
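
A quick check of the documented behavior (a sketch; the assertion reflects the clarified docs):

```python
import torch

x = torch.empty(1_000_000).exponential_()
assert (x > 0).all()  # 0 is excluded: the support is (0, inf)
```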

Background: https://github.com/pytorch/pytorch/pull/37984#discussion_r1059527457, https://github.com/pytorch/pytorch/issues/48841#issuecomment-1530439039, https://github.com/pytorch/pytorch/pull/91673#discussion_r1069955813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113195
Approved by: https://github.com/albanD
2023-11-15 00:30:14 +00:00
f5ce4d8baf Fixed docstring errors in gradcheck.py, forward_ad.py, profiler_util.py, profiler_legacy.py, functional.py, grad_mode.py, function.py (#113266)
Fixes #112594

Docstrings updated.

Here are the outputs, with the error counts before and after.

1) torch/autograd/forward_ad.py

Before :

```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:23 in public function `enter_dual_level`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:23 in public function `enter_dual_level`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:42 in public function `exit_dual_level`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:42 in public function `exit_dual_level`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:62 in public function `make_dual`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:62 in public function `make_dual`:
        D400: First line should end with a period (not 'a')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:128 in public class `UnpackedDualTensor`:
        D204: 1 blank line required after class docstring (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:128 in public class `UnpackedDualTensor`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:128 in public class `UnpackedDualTensor`:
        D209: Multi-line docstring closing quotes should be on a separate line
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:134 in public function `unpack_dual`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:165 in public class `dual_level`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:165 in public class `dual_level`:
        D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:199 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:202 in public method `__exit__`:
        D105: Missing docstring in magic method
15
```

After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:205 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:208 in public method `__exit__`:
        D105: Missing docstring in magic method
3
```

2) torch/autograd/functional.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:262 in public function `vjp`:
        D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:262 in public function `vjp`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:262 in public function `vjp`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:262 in public function `vjp`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:359 in public function `jvp`:
        D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:359 in public function `jvp`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:359 in public function `jvp`:
        D400: First line should end with a period (not 'f')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:359 in public function `jvp`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:584 in public function `jacobian`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:841 in public function `hessian`:
        D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:841 in public function `hessian`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:973 in public function `vhp`:
        D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:973 in public function `vhp`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:973 in public function `vhp`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:973 in public function `vhp`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1076 in public function `hvp`:
        D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1076 in public function `hvp`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1076 in public function `hvp`:
        D400: First line should end with a period (not 'r')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1076 in public function `hvp`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
20
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1 at module level:
        D100: Missing docstring in public module
1
```
3) torch/autograd/profiler_legacy.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:27 in public class `profile`:
        D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:29 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:62 in public method `config`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:74 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:86 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:103 in public method `__repr__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:108 in public method `__str__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:117 in public method `table`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:141 in public method `export_chrome_trace`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:148 in public method `export_stacks`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:154 in public method `key_averages`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:161 in public method `total_average`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:170 in public method `self_cpu_time_total`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:170 in public method `self_cpu_time_total`:
        D400: First line should end with a period (not 'f')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:180 in private nested function `_get_record_key`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:180 in private nested function `_get_record_key`:
        D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:180 in private nested function `_get_record_key`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
18
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:29 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:62 in public method `config`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:74 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:86 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:103 in public method `__repr__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:108 in public method `__str__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:117 in public method `table`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:141 in public method `export_chrome_trace`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:148 in public method `export_stacks`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:154 in public method `key_averages`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:161 in public method `total_average`:
        D102: Missing docstring in public method
12
```

4) torch/autograd/gradcheck.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:27 in public class `GradcheckError`:
        D204: 1 blank line required after class docstring (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:27 in public class `GradcheckError`:
        D400: First line should end with a period (not '`')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:258 in private function `_get_numerical_jacobian`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:258 in private function `_get_numerical_jacobian`:
        D400: First line should end with a period (not 'f')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:258 in private function `_get_numerical_jacobian`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:308 in public function `get_numerical_jacobian`:
        D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:459 in public function `get_numerical_jacobian_wrt_specific_input`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:488 in private function `_get_analytical_jacobian_forward_ad`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:488 in private function `_get_analytical_jacobian_forward_ad`:
        D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:488 in private function `_get_analytical_jacobian_forward_ad`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:816 in public function `get_analytical_jacobian`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:1944 in public function `gradcheck`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:1944 in public function `gradcheck`:
        D400: First line should end with a period (not 'l')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:2133 in public function `gradgradcheck`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:2133 in public function `gradgradcheck`:
        D400: First line should end with a period (not 's')
16
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:463 in public function `get_numerical_jacobian_wrt_specific_input`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:820 in public function `get_analytical_jacobian`:
        D103: Missing docstring in public function
3
```
5) torch/autograd/function.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:27 in public class `FunctionCtx`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:29 in public method `save_for_backward`:
        D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:88 in public method `save_for_forward`:
        D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:141 in public method `mark_dirty`:
        D401: First line should be in imperative mood (perhaps 'Mark', not 'Marks')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:177 in public method `mark_shared_storage`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:185 in public method `mark_non_differentiable`:
        D401: First line should be in imperative mood (perhaps 'Mark', not 'Marks')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:217 in public method `set_materialize_grads`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:276 in public class `BackwardCFunction`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:277 in public method `apply`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:291 in public method `apply_jvp`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:308 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:322 in private method `forward`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:322 in private method `forward`:
        D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:322 in private method `forward`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:384 in private method `backward`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:384 in private method `backward`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:384 in private method `backward`:
        D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:416 in private method `jvp`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:416 in private method `jvp`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:416 in private method `jvp`:
        D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:439 in public class `Function`:
        D400: First line should end with a period (not '`')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:472 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:482 in public method `__call__`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:505 in public method `vmap`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:505 in public method `vmap`:
        D400: First line should end with a period (not 'h')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:505 in public method `vmap`:
        D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:536 in public method `apply`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:564 in public function `once_differentiable`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:612 in public function `traceable`:
        D401: First line should be in imperative mood (perhaps 'Mark', not 'Marks')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:626 in public class `InplaceFunction`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:627 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:741 in public class `NestedIOFunction`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:761 in public method `backward`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:768 in public method `forward`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:775 in public method `save_for_backward`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:780 in public method `saved_tensors`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:784 in public method `mark_dirty`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:787 in public method `mark_non_differentiable`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:790 in public method `forward_extended`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:793 in public method `backward_extended`:
        D102: Missing docstring in public method
41
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:27 in public class `FunctionCtx`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:177 in public method `mark_shared_storage`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:276 in public class `BackwardCFunction`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:277 in public method `apply`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:291 in public method `apply_jvp`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:308 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:471 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:481 in public method `__call__`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:536 in public method `apply`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:564 in public function `once_differentiable`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:626 in public class `InplaceFunction`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:627 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:741 in public class `NestedIOFunction`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:761 in public method `backward`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:768 in public method `forward`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:775 in public method `save_for_backward`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:780 in public method `saved_tensors`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:784 in public method `mark_dirty`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:787 in public method `mark_non_differentiable`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:790 in public method `forward_extended`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:793 in public method `backward_extended`:
        D102: Missing docstring in public method
22
```
6) torch/autograd/profiler_util.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:26 in public class `EventList`:
        D400: First line should end with a period (not ')')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:28 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:46 in public method `__str__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:70 in private method `_populate_cpu_children`:
        D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:70 in private method `_populate_cpu_children`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:70 in private method `_populate_cpu_children`:
        D401: First line should be in imperative mood (perhaps 'Populate', not 'Populates')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:166 in public method `self_cpu_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:179 in public method `table`:
        D401: First line should be in imperative mood (perhaps 'Print', not 'Prints')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:210 in public method `export_chrome_trace`:
        D401: First line should be in imperative mood (perhaps 'Export', not 'Exports')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:266 in public method `supported_export_stacks_metrics`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:273 in public method `export_stacks`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:354 in private function `_format_time`:
        D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:354 in private function `_format_time`:
        D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:365 in private function `_format_time_share`:
        D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:365 in private function `_format_time_share`:
        D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:373 in private function `_format_memory`:
        D400: First line should end with a period (not 'g')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:373 in private function `_format_memory`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:408 in public method `cpu_time`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:412 in public method `cuda_time`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:416 in public method `privateuse1_time`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:420 in public class `Interval`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:421 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:425 in public method `elapsed_us`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:435 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:488 in public method `append_kernel`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:504 in public method `set_cpu_parent`:
        D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:518 in public method `self_cpu_memory_usage`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:526 in public method `self_cuda_memory_usage`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:534 in public method `self_privateuse1_memory_usage`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:542 in public method `self_cpu_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:550 in public method `cuda_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:567 in public method `self_cuda_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:579 in public method `cpu_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:586 in public method `self_privateuse1_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:598 in public method `privateuse1_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:615 in public method `key`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:618 in public method `__repr__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:659 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:687 in public method `add`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:726 in public method `__iadd__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:729 in public method `__repr__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:763 in public class `StringTable`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:764 in public method `__missing__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:773 in public class `MemRecordsAcc`:
        D400: First line should end with a period (not 'l')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:775 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:783 in public method `in_interval`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:846 in private function `_build_table`:
        D401: First line should be in imperative mood (perhaps 'Print', not 'Prints')
48
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:28 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:46 in public method `__str__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:166 in public method `self_cpu_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:266 in public method `supported_export_stacks_metrics`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:273 in public method `export_stacks`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:408 in public method `cpu_time`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:412 in public method `cuda_time`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:416 in public method `privateuse1_time`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:420 in public class `Interval`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:421 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:425 in public method `elapsed_us`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:435 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:488 in public method `append_kernel`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:518 in public method `self_cpu_memory_usage`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:526 in public method `self_cuda_memory_usage`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:534 in public method `self_privateuse1_memory_usage`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:542 in public method `self_cpu_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:550 in public method `cuda_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:567 in public method `self_cuda_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:579 in public method `cpu_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:586 in public method `self_privateuse1_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:598 in public method `privateuse1_time_total`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:615 in public method `key`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:618 in public method `__repr__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:659 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:687 in public method `add`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:726 in public method `__iadd__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:729 in public method `__repr__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:763 in public class `StringTable`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:764 in public method `__missing__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:775 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:783 in public method `in_interval`:
        D102: Missing docstring in public method
33
```
7) torch/autograd/grad_mode.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:73 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:78 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:82 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:133 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:137 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:182 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:187 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:190 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:193 in public method `clone`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:198 in public class `inference_mode`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:250 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:257 in public method `__new__`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:262 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:266 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:269 in public method `clone`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:301 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:306 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:309 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:312 in public method `clone`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:354 in private class `_unsafe_preserve_version_counter`:
        D400: First line should end with a period (not '!')
21
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:1 at module level:
        D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:73 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:78 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:82 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:133 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:137 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:182 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:187 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:190 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:193 in public method `clone`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:250 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:257 in public method `__new__`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:262 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:266 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:269 in public method `clone`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:301 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:306 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:309 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:312 in public method `clone`:
        D102: Missing docstring in public method
19
```
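
The fixes behind these diffs are mostly one-line docstring edits. A minimal sketch (hypothetical method, not taken from the PR) of how D401, D400, and D205 get resolved:
```python
# Before: D401 (summary not imperative), D400 (no trailing period),
# D205 (no blank line between summary and description).
def clone(self):
    """Returns a copy of this object
    preserving the current grad mode settings"""

# After: imperative one-line summary ending in a period, then a blank line.
def clone(self):
    """Return a copy of this object.

    The copy preserves the current grad mode settings.
    """
```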

@svekars @kit1980 @subramen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113266
Approved by: https://github.com/aaronenyeshi, https://github.com/soulitzer, https://github.com/kit1980
2023-11-14 23:39:43 +00:00
28228e1517 Only check significant strides in test torchinductor (#113389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113389
Approved by: https://github.com/int3
2023-11-14 22:45:09 +00:00
cf6e9f572e Update xla pin (#113603)
Fixes XLA workflow CI failures
```

======================================================================
FAIL: test_set (__main__.TestAtenXlaTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/test_operations.py", line 1007, in test_set
    self.assertEqual(met.counter_value('DestroyXlaTensor'), 6)
  File "/tmp/pytorch/xla/test/test_utils.py", line 301, in assertEqual
    super(XlaTestCase, self).assertLessEqual(abs(x - y), prec, message)
AssertionError: 1 not less than or equal to 1e-05 :

----------------------------------------------------------------------
```

We've disabled the failing test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113603
Approved by: https://github.com/JackCaoG, https://github.com/bdhirsh, https://github.com/malfet
2023-11-14 22:32:06 +00:00
91973e1c31 Issue113185 (#113523)
Fixes #113185

I have fixed the given docstring errors. The following are the outputs, with error counts, before and after the changes:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113523
Approved by: https://github.com/kit1980
2023-11-14 22:25:28 +00:00
6b01126df5 [Easy] [Dynamo] Catch OSError when calling inspect.getfile (#113671)
Fixes #111328
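
A minimal sketch of the guard this implies (the helper name is hypothetical; `inspect.getfile` raises `TypeError` for built-ins and `OSError` when no source file can be found):
```python
import inspect

def try_getfile(obj):
    # Return None instead of raising when the object has no source file.
    try:
        return inspect.getfile(obj)
    except (OSError, TypeError):
        return None
```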

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113671
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2023-11-14 22:15:32 +00:00
1d640566d4 [BE] Do not warn when safely loading legacy dicts (#113614)
Use the same strategy as for unsafe pickler, i.e. use dummy `torch.serialization.StorageType` to represent legacy typed storage classes during deserialization. Add `_dtype` property to be able to use it for both new and legacy format deserialization.

Parametrize `test_serialization_new_format_old_format_compat`

Add a regression test to validate that legacy models can be loaded
without any warnings
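
A minimal self-contained version of such a check might look like this (a sketch assuming the legacy format via `_use_new_zipfile_serialization=False`, not the exact test from the PR):
```python
import io
import warnings

import torch

buf = io.BytesIO()
# Save in the legacy (non-zipfile) serialization format.
torch.save({"w": torch.ones(2)}, buf, _use_new_zipfile_serialization=False)
buf.seek(0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    torch.load(buf, weights_only=True)
assert not caught, f"Expected no warnings but got {[str(w) for w in caught]}"
```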

Before the change:
```
% python test_serialization.py -v -k test_serialization_new_format_old_format_compat_
test_serialization_new_format_old_format_compat_cpu (__main__.TestBothSerializationCPU) ... ok
test_serialization_new_format_old_format_compat_safe_cpu (__main__.TestBothSerializationCPU) ... /Users/nshulga/git/pytorch/pytorch/torch/_utils.py:836: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
ok

----------------------------------------------------------------------
Ran 2 tests in 0.116s

OK
```
Without the change but update test to catch warnings:
```
 % python test_serialization.py -v -k test_serialization_new_format_old_format_compat_
test_serialization_new_format_old_format_compat_weights_only_False_cpu (__main__.TestBothSerializationCPU) ... ok
test_serialization_new_format_old_format_compat_weights_only_True_cpu (__main__.TestBothSerializationCPU) ... FAIL

======================================================================
FAIL: test_serialization_new_format_old_format_compat_weights_only_True_cpu (__main__.TestBothSerializationCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2536, in wrapper
    method(*args, **kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 415, in instantiated_test
    result = test(self, **param_kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/test/test_serialization.py", line 807, in test_serialization_new_format_old_format_compat
    self.assertTrue(len(w) == 0, msg=f"Expected no warnings but got {[str(x) for x in w]}")
AssertionError: False is not true : Expected no warnings but got ["{message : UserWarning('TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()'), category : 'UserWarning', filename : '/Users/nshulga/git/pytorch/pytorch/torch/_utils.py', lineno : 836, line : None}"]

To execute this test, run the following from the base repo dir:
     python test/test_serialization.py -k test_serialization_new_format_old_format_compat_weights_only_True_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 2 tests in 0.109s

FAILED (failures=1)

```

Fixes problem reported in https://github.com/pytorch/pytorch/issues/52181#issuecomment-1715738910
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113614
Approved by: https://github.com/kit1980, https://github.com/albanD
2023-11-14 22:09:10 +00:00
538114db65 [MPS] Fix and refactor unary/binary ops with non-zero offset or non-contiguous output (#97085)
Fixes #100764

This PR fixes the unary ops implementation and refactors the binary ops implementation a bit.

For unary ops:
Previously we didn't take into account unary ops that have a non-contiguous/storage-offset output, causing incorrect results (because the MPS graph kernel always writes the buffer contiguously). Therefore, this PR creates a temporary output tensor for the graph first and then copies the result back to the original output tensor. We currently do not have a better fix than this, I think.

For binary ops, see https://github.com/pytorch/pytorch/pull/97085#discussion_r1140999125

See the added test for repro.
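
A minimal sketch of the workaround described above (a pure-Python illustration, not the actual MPS graph code):
```python
import torch

def run_unary_into(out, op, inp):
    # The graph kernel always writes its buffer contiguously, so stage the
    # result in a contiguous temporary, then copy it back into the possibly
    # non-contiguous / storage-offset destination.
    tmp = torch.empty(out.shape, dtype=out.dtype, device=out.device)
    tmp.copy_(op(inp))
    out.copy_(tmp)
    return out
```
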
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97085
Approved by: https://github.com/malfet
2023-11-14 22:03:21 +00:00
9f71452331 Disable atomic_add fallback for cpu (#113655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113655
Approved by: https://github.com/eellison
2023-11-14 21:40:29 +00:00
18d7b8e4f7 [BE]: ruff apply rule PLW1510 to find silent subprocess errors (#113644)
Reopens #111682, which I messed up due to a bad rebase that triggered some CLA issues. This explicitly adds `check=True` or `check=False` to subprocess calls where appropriate.
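
For illustration, the pattern the rule enforces (a sketch, not a diff from this PR):
```python
import subprocess

# Before: a non-zero exit status is silently ignored (PLW1510).
subprocess.run(["git", "fetch"])

# After: the intent is explicit; check=True raises CalledProcessError on failure.
subprocess.run(["git", "fetch"], check=True)
```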

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113644
Approved by: https://github.com/ezyang, https://github.com/kit1980
2023-11-14 20:59:40 +00:00
53e7de4b65 Issue 112599 - fix pydocstyle errors (#113177)
Fixes #112599

Fixed errors relating to pydocstyle in the following files. The remaining errors are related to docstrings at the module level and in methods within each module (`forward()`, `reset_parameters`, `__init__`, etc.)

pydocstyle torch/nn/modules/pooling.py --count
before: 49
after: 29

**remaining errors:**
```
torch/nn/modules/pooling.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/modules/pooling.py:90 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:163 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:240 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:315 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pooling.py:321 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:402 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pooling.py:408 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:472 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pooling.py:478 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:541 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pooling.py:550 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:620 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pooling.py:630 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:706 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pooling.py:716 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:720 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/nn/modules/pooling.py:774 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pooling.py:792 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:845 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pooling.py:863 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:925 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:979 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:1026 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:1068 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:1111 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:1150 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:1189 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pooling.py:1228 in public method `forward`:
        D102: Missing docstring in public method
```

pydocstyle torch/nn/modules/upsampling.py --count
before: 14
after: 7

**remaining:**
```
torch/nn/modules/upsampling.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/modules/upsampling.py:142 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/upsampling.py:156 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/upsampling.py:160 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/nn/modules/upsampling.py:166 in public method `extra_repr`:
        D102: Missing docstring in public method
torch/nn/modules/upsampling.py:216 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/upsampling.py:263 in public method `__init__`:
        D107: Missing docstring in __init__
```

pydocstyle torch/nn/modules/rnn.py --count
before: 47
after: 40

**remaining**
```
torch/nn/modules/rnn.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/modules/rnn.py:59 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:160 in public method `__setattr__`:
        D105: Missing docstring in magic method
torch/nn/modules/rnn.py:225 in public method `reset_parameters`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:230 in public method `check_input`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:242 in public method `get_expected_hidden_size`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:256 in public method `check_hidden_size`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:272 in public method `check_forward_args`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:278 in public method `permute_hidden`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:284 in public method `extra_repr`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:305 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/nn/modules/rnn.py:313 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/nn/modules/rnn.py:355 in public method `all_weights`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:471 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:478 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:481 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:503 in public method `forward` (skipping F811):
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:762 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:768 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:771 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:774 in public method `get_expected_cell_size`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:786 in public method `check_forward_args`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:798 in public method `permute_hidden`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:809 in public method `forward` (skipping F811):
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:820 in public method `forward` (skipping F811):
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:1030 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1036 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1039 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1046 in public method `forward` (skipping F811):
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:1054 in public method `forward` (skipping F811):
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:1123 in public class `RNNCellBase`:
        D101: Missing docstring in public class
torch/nn/modules/rnn.py:1134 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1152 in public method `extra_repr`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:1160 in public method `reset_parameters`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:1224 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1230 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:1327 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1332 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/rnn.py:1422 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1427 in public method `forward`:
        D102: Missing docstring in public method
```

pydocstyle torch/nn/modules/pixelshuffle.py --count
before: 13
after: 8

**remaining:**
```
torch/nn/modules/pixelshuffle.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/modules/pixelshuffle.py:52 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pixelshuffle.py:56 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pixelshuffle.py:59 in public method `extra_repr`:
        D102: Missing docstring in public method
torch/nn/modules/pixelshuffle.py:105 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/pixelshuffle.py:109 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/pixelshuffle.py:112 in public method `extra_repr`:
        D102: Missing docstring in public method
```

pydocstyle torch/nn/modules/sparse.py --count
before: 14
after: 8

**remaining errors:**
```
torch/nn/modules/sparse.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/modules/sparse.py:124 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/sparse.py:153 in public method `reset_parameters`:
        D102: Missing docstring in public method
torch/nn/modules/sparse.py:162 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/modules/sparse.py:167 in public method `extra_repr`:
        D102: Missing docstring in public method
torch/nn/modules/sparse.py:320 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/modules/sparse.py:350 in public method `reset_parameters`:
        D102: Missing docstring in public method
torch/nn/modules/sparse.py:396 in public method `extra_repr`:
        D102: Missing docstring in public method
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113177
Approved by: https://github.com/ezyang
2023-11-14 20:55:22 +00:00
a05639cea6 Add some checks about Device and Layout when create/convert named tensor (#113628)
Fixes #113597

As the title states
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113628
Approved by: https://github.com/ezyang
2023-11-14 20:40:27 +00:00
20eaa49dde [PT-D] Made _get_registry return None if no APIs applied (#113654)
I prefer not to modify the module if it does not have any of our APIs applied. The side effect of inserting a registry on the module when calling a getter is non-intuitive to me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113654
Approved by: https://github.com/fegin
2023-11-14 20:28:11 +00:00
afef32bd23 [Pytorch][Vulkan] native_layer_norm (#113573)
Summary: We implement `native_layer_norm`. Compared to [`layer_norm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html), the output of `native_layer_norm` is a tuple of tensors: (layer_norm, mean, 1/sqrt(var + eps)).
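
As a CPU-side sketch of that contract (the op already exists on CPU; this PR adds the Vulkan backend):
```python
import torch

x = torch.randn(4, 3, 8)
out, mean, rstd = torch.native_layer_norm(x, [8], None, None, 1e-5)
# out:  normalized tensor, same shape as x
# mean: mean over the normalized dimensions
# rstd: 1 / sqrt(var + eps) over the same dimensions
```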

Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (2b2052666|remote/fbandroid/stable)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*native_layer_norm*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *native_layer_norm*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.native_layer_norm_2d
[       OK ] VulkanAPITest.native_layer_norm_2d (352 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_3d
[       OK ] VulkanAPITest.native_layer_norm_3d (308 ms)
[ RUN      ] VulkanAPITest.native_layer_norm_4d
[       OK ] VulkanAPITest.native_layer_norm_4d (6 ms)
[----------] 3 tests from VulkanAPITest (667 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (667 ms total)
[  PASSED  ] 3 tests.
```
full test result in P881016177

Reviewed By: yipjustin

Differential Revision: D51247030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113573
Approved by: https://github.com/yipjustin
2023-11-14 20:11:32 +00:00
b7b2178204 [BE]: Remove useless lambdas (#113602)
Applies PLW0108 which removes useless lambda calls in Python, the rule is in preview so it is not ready to be enabled by default just yet. These are the autofixes from the rule.
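
The autofix pattern, for illustration:
```python
# Before: the lambda only forwards its argument (PLW0108).
to_str = lambda x: str(x)

# After: bind the callable directly.
to_str = str
```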

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
2023-11-14 20:06:48 +00:00
2a8a7425be Fix to wrap jagged dims for split() / split_with_sizes() (#113591)
Still need OpInfo-style tests to catch things like this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113591
Approved by: https://github.com/soulitzer
2023-11-14 19:36:08 +00:00
ea39cc34f9 Refactor NestedTensor subclass to remove ragged_size from constructor (#113491)
This PR removes the need for passing `ragged_size` into the `NestedTensor` constructor. This was an artifact of fake-ification, where sometimes we needed the NT to have a symbolic singleton symint shape for the ragged dimension. The new way of achieving this is to also store mappings between fake / functional tensors -> symbolic symints in the ragged structure registry. Now the `NestedTensor` constructor can just query this registry for the `ragged_size`.

Old: `NestedTensor(values, offsets, *, ragged_size=None, **kwargs)`
New: `NestedTensor(values, offsets, **kwargs)`

This makes it possible to have a `_nested_view_from_values_offsets(values, offsets)` without needing to pass a `ragged_size`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113491
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2023-11-14 19:32:21 +00:00
cdc9a05c89 cudagraph_trees.py: remove duplicate line (#113624)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113624
Approved by: https://github.com/eellison
2023-11-14 19:20:23 +00:00
149b9dfd04 [easy]Remove specialized value (#112252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112252
Approved by: https://github.com/jansel
ghstack dependencies: #111196
2023-11-14 19:14:03 +00:00
b0805fa5d0 Support tensors as Dict keys (#111196)
This prepares the PR where we implement sets in terms of dicts.
To do so, rather than internally storing a dictionary that maps literals
to VariableTrackers, it stores (pretty much) a dictionary from VTs to VTs.
Keys are wrapped in an opaque internal class `_Hashable`. The class is
opaque on purpose so that it fails hard if it inadvertently leaks back
into user code.
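
A minimal sketch of the opaque-wrapper idea (the hashing strategy shown is an assumption, not Dynamo's exact implementation):
```python
class _Hashable:
    """Opaque dict key wrapping a VariableTracker-like object."""

    def __init__(self, vt):
        self.vt = vt

    def __hash__(self):
        # Assumption: the wrapped VT exposes some hashable identity.
        return hash(self.vt.as_key())

    def __eq__(self, other):
        # Fails hard (AttributeError) if an unwrapped key leaks in.
        return self.vt.as_key() == other.vt.as_key()
```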

We also found and fixed a number of latent bugs and inconsistencies
in the way dynamo checked what can be a dict key. More generally, we
make it much clearer what needs to be modified to add
a new supported key type to Dicts.
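A rough sketch of the opaque-wrapper idea described above (the method names on the wrapped tracker are hypothetical, not the actual dynamo internals):

```python
class _Hashable:
    """Opaque key wrapper: anything outside dynamo that touches it fails hard."""

    def __init__(self, vt) -> None:
        self.vt = vt  # the VariableTracker used as a dict key

    def __hash__(self) -> int:
        return hash(self.vt.underlying_key())  # hypothetical key extraction

    def __eq__(self, other) -> bool:
        return (
            isinstance(other, _Hashable)
            and self.vt.underlying_key() == other.vt.underlying_key()
        )
```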

Fixes https://github.com/pytorch/pytorch/issues/107595
Fixes https://github.com/pytorch/pytorch/issues/111603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111196
Approved by: https://github.com/jansel
2023-11-14 19:14:03 +00:00
f22486b0fc [doc] scale parameter notation for torch.Tensor.cauchy_ is misleading (#113178)
The scale parameter notation currently used for `torch.Tensor.cauchy_` is misleading.
Sigma (σ) is usually used to denote the square root of the variance, and the variance is undefined for the Cauchy distribution.
Replace sigma (σ) with gamma (γ).
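For reference, the standard Cauchy density that motivates the γ notation (textbook definition, not taken from the PR):

```latex
f(x;\, x_0, \gamma) = \frac{1}{\pi \gamma \left[ 1 + \left( \frac{x - x_0}{\gamma} \right)^2 \right]}
```

Here γ is the half-width at half-maximum; the distribution has no finite mean or variance, so σ is the wrong symbol for its scale.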

Background: https://github.com/pytorch/pytorch/pull/37984#discussion_r1059551749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113178
Approved by: https://github.com/mingxzhao, https://github.com/albanD
2023-11-14 18:55:42 +00:00
e6bffc6b87 Fix docstring errors in default_hooks.py, optimizer_overlap.py, checkpoint_wrapper.py, copy.py, benchmark_ddp_rpc.py, utils.py, dependency.py, phony.py, pipeline.py, checkpoint.py, worker.py, batchnorm.py, quantization.py (#113511)
Fixes #112645

Updated the files by fixing the docstring errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113511
Approved by: https://github.com/weifengpy
2023-11-14 18:52:41 +00:00
3b80577212 [Memory Snapshot] Add timestamps to memory events collected in snapshots (#112266)
Summary: Use the same clock as the profiler to collect timestamps for when memory events occurred. Save these to the snapshot dicts as well, so that they can be stored with the raw memory events.

Test Plan:
CI

Observed that each trace entry now has a `time_us` field, and it is ascending. For example:
```
trace entry: {'action': 'free_requested', 'addr': 140366476918784, 'size': 8192, 'stream': 0, 'time_us': 1698326576864190}
trace entry: {'action': 'free_completed', 'addr': 140366476918784, 'size': 8192, 'stream': 0, 'time_us': 1698326576864190}
trace entry: {'action': 'free_requested', 'addr': 140366476936192, 'size': 8192, 'stream': 0, 'time_us': 1698326576864194}
trace entry: {'action': 'free_completed', 'addr': 140366476936192, 'size': 8192, 'stream': 0, 'time_us': 1698326576864194}
trace entry: {'action': 'free_requested', 'addr': 140366641430528, 'size': 8192000, 'stream': 0, 'time_us': 1698326576864205}
trace entry: {'action': 'free_completed', 'addr': 140366641430528, 'size': 8192000, 'stream': 0, 'time_us': 1698326576864205}
trace entry: {'action': 'free_requested', 'addr': 140366403571712, 'size': 4000, 'stream': 0, 'time_us': 1698326576864209}
trace entry: {'action': 'free_completed', 'addr': 140366403571712, 'size': 4000, 'stream': 0, 'time_us': 1698326576864209}
```

Differential Revision: D50602011

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112266
Approved by: https://github.com/zdevito
2023-11-14 18:48:59 +00:00
5465f2bb6c Revert "Improves comparison of state dicts for Checkpoint E2E Tests (#113181)"
This reverts commit 8f5fead86ea9a9eac85d20c6aee780e06ce04eb7.

Reverted https://github.com/pytorch/pytorch/pull/113181 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing distribute test in trunk 8f5fead86e with a not defined DTensor error ([comment](https://github.com/pytorch/pytorch/pull/113181#issuecomment-1810925052))
2023-11-14 18:42:40 +00:00
cyy
79e3833703 Enable clang-tidy in torch/csrc/quantized and some fixes (#113604)
This PR enables clang-tidy checks in torch/csrc/quantized/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113604
Approved by: https://github.com/Skylion007
2023-11-14 16:51:18 +00:00
14eb92cb43 [quant][pt2][be] Remove add/relu from conv-bn QAT pattern (#113006)
Summary: This commit significantly simplifies the QAT fusion
code for the `conv-bn` pattern by removing add and relu nodes
from the match and replacement patterns. This does not reduce
functionality; patterns like `conv-bn-relu`, `conv-bn-add`,
and `conv-bn-add-relu` are still supported. We simply do not
match these extra nodes, since there is actually no need to
replace them.

This has the additional benefit of reducing the number of
patterns being matched by 16x, since for each add and relu
variant of the `conv-bn` pattern there is also an in-place
variant. This also enables more flexible `conv-bn` pattern
matching in the future and keeps the number of patterns
more scalable.

One important change needed in this commit was to remove
the match filter that requires the input and output
activations to be quantized. This was necessary because
otherwise we would always expect q-dq nodes immediately
after the getitem node, instead of after the add or relu
nodes for example. This has another side benefit of
keeping QAT fusion flexible enough to support weight
only quantization.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113006
Approved by: https://github.com/jerryzh168
2023-11-14 16:08:37 +00:00
a7b75f586a [RELAND] Disallow skipping dynamo (#110222)
Previous discussion: https://github.com/pytorch/pytorch/pull/109476

In this PR, I made following additions to the original PR:
1) The unlifted graph module now runs the runtime assertions in its forward call.
2) When we retrace, we run the assertions to verify that the user is tracing the module with inputs consistent with the assumptions made during the first tracing. To do this, I create a new graph module type with a modified call method. The runtime assertions happen under torchdynamo.disable so that they just run directly in eager mode; we don't want them to be a traced part of the graph.
3) Both ep.module and capture_pre_autograd now return _UnliftedGraphModule.

Differential Revision: [D51078056](https://our.internmc.facebook.com/intern/diff/D51078056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110222
Approved by: https://github.com/zhxchen17
2023-11-14 16:02:01 +00:00
8f5fead86e Improves comparison of state dicts for Checkpoint E2E Tests (#113181)
Addresses the following comment - https://github.com/pytorch/pytorch/pull/112541#discussion_r1380197424

Changes the checkpointing E2E test to compare a non-parallelized model against the distributed model after training, saving, and loading.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113181
Approved by: https://github.com/fegin
2023-11-14 14:54:40 +00:00
93372455a7 [2d] pass shape/stride during tensor unflatten (#113547)
As titled. Built on top of the work @wz337 enabled, this saves some
runtime CPU time when recreating DTensor parameters with the correct
shape/stride, and avoids issues with unevenly sharded parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113547
Approved by: https://github.com/XilunWu
ghstack dependencies: #113323, #113324
2023-11-14 09:28:09 +00:00
7117bffff9 [funcol] a few optimizations to funcol (#113324)
Apply a few optimizations to funcol:

- For allgather on a non-0 dim, the resulting tensor already needs to access
its data in order to do torch.cat, so we sync-wait here so that we don't
need to go through ACT dispatch for the chunk + cat altogether (see the
sketch after this list)
- Add fast-return logic for aten.view, as it's a commonly hit op for
view-related ops
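A minimal sketch of the non-0-dim allgather pattern referenced above, in plain tensor ops (not the funcol implementation itself):

```python
import torch

def cat_gathered(gathered: torch.Tensor, world_size: int, dim: int) -> torch.Tensor:
    # All-gather stacks shards along dim 0; landing them on a non-0 dim
    # requires materialized data to chunk and re-cat, which is why the
    # result is waited on eagerly instead of deferred through ACT dispatch.
    if dim == 0:
        return gathered
    return torch.cat(torch.chunk(gathered, world_size, dim=0), dim=dim)
```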

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113324
Approved by: https://github.com/XilunWu
ghstack dependencies: #113323
2023-11-14 09:28:09 +00:00
b16e3b5373 [funcol] add two APIs: wait() and numpy() (#113323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113323
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/wconstab
2023-11-14 09:27:45 +00:00
a1e3c50165 A small fix for do_bench_using_profiling (#113611)
ATT: there are cases where multiple kernel invocations share the same kernel name, and key_averages() will wrongly average results across those different invocations. This fix uses cuda_time_total / n_repeat instead.
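A hedged sketch of the fixed arithmetic (function and variable names are mine; assumes `prof` is a completed profiler run covering `n_repeat` invocations):

```python
def time_per_iter_ms(prof, n_repeat: int) -> float:
    # key_averages() groups events by name, so distinct launches of
    # same-named kernels get merged; summing cuda_time_total over all
    # entries and dividing by the repeat count avoids that grouping.
    total_us = sum(evt.cuda_time_total for evt in prof.key_averages())
    return total_us / n_repeat / 1000.0
```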

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113611
Approved by: https://github.com/chenyang78
2023-11-14 06:31:22 +00:00
c21320b3b1 CPU Publish: Fix Assign device error, when module has multiple devices (#109149) (#113509)
Summary:
new version of this: https://www.internalfb.com/diff/D49110166?dst_version_fbid=252052334533986

Fix an assign-device error when a module has multiple devices.
If fc_fp16_quantization is enabled for a CPU model and the module REMOTE_OTHER has multiple devices, {device(type='meta'), device(type='cpu')}, we fail on this assertion in fbcode/caffe2/torch/ao/quantization/fx/utils.py:232:
    assert len(devices) <= 1, (
Since CPU models work on CPU devices, a condition was added before the assertion: if CPU is in the module's list of devices, set the device to CPU.
Please see debug details:
https://docs.google.com/document/d/1pMPCeJyMPA15NhFc2uAyNDkS9azR40uaNyOP0DIgHjU/edit
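A hedged sketch of the added condition (not the exact diff):

```python
import torch

def narrow_devices(devices: set) -> set:
    # A CPU model may also report a 'meta' device; prefer CPU so the
    # single-device assertion below still holds.
    cpu = torch.device("cpu")
    if cpu in devices:
        devices = {cpu}
    assert len(devices) <= 1
    return devices
```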

Test Plan:
AIMP_DISAGG_CPU=true buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true lego/scripts:lego_cli -- run-locally --model_entity_id 959168967 --config_version 28 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":12, "worker_thread_per_process_number": 6, "use_work_assignment": true}' 2>&1 | tee /tmp/gmpp_lc.txt
Snapshot:
https://www.internalfb.com/manifold/explorer/ads_storage_fblearner/tree/user/facebook/fblearner/predictor/959168967/47

Differential Revision: D51226114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113509
Approved by: https://github.com/jerryzh168
2023-11-14 06:15:32 +00:00
b3a76ccc12 [BE] Make legacy type storage warning point to the caller (#113601)
The `@classproperty` decorator adds another wrapper, so a warning with the default stacklevel (2) would always point to the wrapper implementation rather than at the call site.

For example, before this change following code
```python
import torch
print(torch.FloatStorage.dtype)
```
will produce a non-actionable warning:
```
/Users/nshulga/git/pytorch/pytorch/torch/_utils.py:836: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()

```
But after the change warning turns into:
```
/Users/nshulga/test/bar.py:2: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  print(torch.FloatStorage.dtype)
```

Discovered while reading https://github.com/pytorch/pytorch/issues/109108
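The mechanics, as a stdlib-only sketch (the PR's exact plumbing may differ): each wrapper frame pushes the reported location one level away from user code, so the warning has to compensate with a larger `stacklevel`.

```python
import warnings

def _warn_typed_storage_removal() -> None:
    # stacklevel=2 would blame the @classproperty wrapper frame; the extra
    # wrapper means stacklevel=3 is needed to point at the actual caller.
    warnings.warn("TypedStorage is deprecated", UserWarning, stacklevel=3)
```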

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113601
Approved by: https://github.com/kit1980
2023-11-14 04:37:57 +00:00
ffc3731dc4 Update TensorBase.to()'s' signature; create {guards,compiled_autograd}.pyi (#113536)
I had to explicitly import submodules in torch/_C/_dynamo/__init__.pyi
because mypy doesn't seem to understand empty `__init__.py[i]` files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113536
Approved by: https://github.com/ezyang
ghstack dependencies: #113412, #113535
2023-11-14 04:31:12 +00:00
5b95715bc0 Make {Tracing,Compile}Context.get() return non-optional type (#113535)
They are used in many contexts that don't actually check if the returned
type is `None`. I have also created `try_get()` for the cases where we
do actually want an Optional type returned.
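A short sketch of the resulting contract (class body is illustrative, not the actual torch code):

```python
from typing import Optional

class TracingContext:
    _current: Optional["TracingContext"] = None

    @classmethod
    def get(cls) -> "TracingContext":
        # Non-optional: callers that assume a live context fail loudly here
        # instead of hitting AttributeError on None somewhere downstream.
        assert cls._current is not None, "no TracingContext active"
        return cls._current

    @classmethod
    def try_get(cls) -> Optional["TracingContext"]:
        return cls._current  # for callers that genuinely handle None
```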

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113535
Approved by: https://github.com/ezyang
ghstack dependencies: #113412
2023-11-14 04:31:12 +00:00
d561654d99 [ONNX] Support more sympy operations in fx-onnx exporter (#112758)
Fix https://github.com/microsoft/onnx-converters-private/issues/190

This PR retires the built-in function mapping by adding built-in ops into torchlib (https://github.com/microsoft/onnxscript/pull/1135), and provides runtime tests to guard the operation conversion.

More built-in ops are supported in torchlib as well.

~~NOTE: `native_batch_norm` regression is caused by https://github.com/microsoft/onnxscript/issues/1140. Will fix it before I merge this.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112758
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-11-14 03:40:48 +00:00
78ae49d104 [vision hash update] update the pinned vision hash (#113598)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113598
Approved by: https://github.com/pytorchbot, https://github.com/PaliC
2023-11-14 03:35:25 +00:00
567db94d87 Add markDynamoStrictTest (#112768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112768
Approved by: https://github.com/zou3519
2023-11-14 02:52:12 +00:00
edd967fe78 Add testing for foreach scalar Tensor overloads in inductor (#111600)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111600
Approved by: https://github.com/mlazos
2023-11-14 02:05:06 +00:00
d94bfaff2e Add TorchFix to the CI (#113403)
Enable flake8 plugin for https://github.com/pytorch/test-infra/tree/main/tools/torchfix - TorchFix 0.1.1.
Disable TorchFix codes that don't make sense for PyTorch itself.
Update deprecated TorchVision APIs to make TorchFix pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113403
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-11-14 01:26:06 +00:00
e1c872e009 Add optimal triton kernel parameters to bsr_dense_mm and scatter_mm for bfloat16 and float32 dtypes (#113553)
As in the title.

This PR is a follow-up to PR https://github.com/pytorch/pytorch/pull/112737 to address bfloat16 and float32 dtype cases. The performance increase is as follows (`NVIDIA A100-SXM4-80GB`):

- bsr_scatter_mm and bfloat16
  - for blocksize 16x16, the average/maximum speed up is about 29/75 %.
  - for blocksize 32x32, the average/maximum speed up is about 23/58 %.
  - for blocksize 64x64, the average/maximum speed up is about 27/66 %.
  - for blocksize 128x128, the average/maximum speed up is about 33/72 %.
- bsr_dense_mm and bfloat16
  - for blocksize 16x16, the average/maximum speed up is about 47/61 %.
  - for blocksize 32x32, the average/maximum speed up is about 29/43 %.
  - for blocksize 64x64, the average/maximum speed up is about 21/41 %.
  - for blocksize 128x128, the average/maximum speed up is about 12/29 %.
- bsr_dense_mm and  float32
  - for blocksize 16x16, the average/maximum speed up is about 35/49 %.
  - for blocksize 32x32, the average/maximum speed up is about 2/5 %.
  - for blocksize 64x64, the average/maximum speed up is about 2/21 %.
  - for blocksize 128x128, the average/maximum speed up is about 79/84 %.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113553
Approved by: https://github.com/cpuhrsch
2023-11-14 00:47:59 +00:00
cyy
ff82dcd8fa [2/N] Enable clang-tidy checks in torch/csrc/profiler (#113439)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113439
Approved by: https://github.com/Skylion007
2023-11-14 00:39:54 +00:00
a43c757275 Fixed error with cuda_ver in cpp_extension.py (#113555)
Reported in 71ca42787f (r132390833)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113555
Approved by: https://github.com/ezyang
2023-11-14 00:12:22 +00:00
4b09b08d2e Fix recompilation issue with content store (#113533)
While running the accuracy minifier, I was getting the error:
```
NotImplementedError("xor_sum only implemented with inductor")
```

The logs showed that the cache limit was exceeded and it was falling back to
eager mode, which doesn't work for this function. The cache failures were due to
the code guarding on the id of the function being compiled, which in this case is
a closure that gets re-created for each call, so the guard always fails.

This fixes the issue by making the storage hash kernel a global function and
working around the dynamo dependency with the `lazy_compile` helper, which defers
the `torch.compile` call to the first invocation.
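A hedged sketch of the `lazy_compile` idea (the real helper may differ): wrap the function so `torch.compile` only runs on the first call, keeping module import free of the dynamo dependency.

```python
import functools
import torch

def lazy_compile(fn):
    """Defer torch.compile(fn) until the wrapper is first invoked."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if wrapper._compiled is None:
            wrapper._compiled = torch.compile(fn)
        return wrapper._compiled(*args, **kwargs)
    wrapper._compiled = None
    return wrapper
```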
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113533
Approved by: https://github.com/Skylion007
2023-11-13 23:58:13 +00:00
ad06e9f060 Support logging aliases to list of modules (#113567)
When SymNode was refactored into its own module, this broke logging for this file, as the `dynamic` alias no longer covered it. This PR adds supports for an alias to point to multiple qualified module names. To drive the refactor, I renamed `log_alias_to_log_qname` to `log_alias_to_log_qnames` and then audited all use sites. I invite you to do so as well.

For good measure, I also add dynamic to dynamo, so that I always get dynamic logs when dynamo is enabled. Empirically this will be helpful because people keep sending me dynamo debug logs that don't have dynamic logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113567
Approved by: https://github.com/Skylion007, https://github.com/lezcano, https://github.com/mlazos
ghstack dependencies: #113566
2023-11-13 23:35:18 +00:00
92ebf74ac1 Refactor loggers to use NOTSET when not set by user (#113566)
Previously, the way our logging system worked was that for every registered log, we would explicit set a log level for it. This would lead to unintuitive behavior when you had multiple overlapping loggers, e.g., from the module hierarchy. Specifically, if you had `TORCH_LOGS=torch`, this would not actually set the logging level for torch._dynamo to be INFO, because the default log level is WARNING, and because torch._dynamo has a registered logger 'dynamo' this would end up setting the level on torch._dynamo to be WARNING, thereby overriding the level of the parent module torch. The 'all' logger did not display this behavior, but only because it was special cased to directly modify the default log level of all other loggers (and so this would not work for any sub-hierarchies).

This PR refactors our code into a much more logical setup using NOTSET. Instead of setting the level of all loggers to some level, we instead default all loggers to NOTSET, unless a user explicitly requested logging from some logger. This means that if we have some logger which isn't explicitly mentioned by the user, parent loggers now have a chance to influence their log behavior. With this, I can eliminate the 'all' special case; 'all' really just means 'torch'. (I keep special handling directing all to torch for backwards compatibility, though arguably I can probably just turn all into an alias.)
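A stdlib-only sketch of the NOTSET behavior this relies on (module names illustrative): a child logger left at NOTSET defers to its nearest explicitly-set ancestor instead of clobbering it.

```python
import logging

logging.getLogger("torch").setLevel(logging.INFO)  # user asked for torch
child = logging.getLogger("torch._dynamo")         # never explicitly set
child.setLevel(logging.NOTSET)                     # the new default
# getEffectiveLevel() walks up the hierarchy past NOTSET loggers:
assert child.getEffectiveLevel() == logging.INFO
```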

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113566
Approved by: https://github.com/mlazos, https://github.com/Skylion007
2023-11-13 23:35:18 +00:00
54493fe8c4 Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` (see the usage sketch below)
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)
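A short usage sketch of the new `generator` argument on the init functions after this change (eager mode):

```python
import torch

g = torch.Generator().manual_seed(0)
w = torch.empty(3, 5)
torch.nn.init.uniform_(w, a=0.0, b=1.0, generator=g)  # reproducible init
```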

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-13 23:18:14 +00:00
3eacdaf1b3 [HigherOrderOp] add pytree operands tests for cond (#112661)
This is a follow-up to #111611. After this PR, we allow pytrees with tensor-only leaves as operands of the branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112661
Approved by: https://github.com/zou3519
2023-11-13 23:09:46 +00:00
68278cf7a8 [dynamo] Initialize tensor_weakref_to_sizes_strides with a weak dict (#113412)
Spotted while working on getting output_graph.py to typecheck.

The type hint indicates that it was intended to be initialized with a
WeakIdKeyDictionary, but the actual runtime value was a regular dict.
Not sure if there's some kind of test we should add for this fix.
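A small sketch of why the distinction matters (using `torch.utils.weak`, which backs the intended type):

```python
import torch
from torch.utils.weak import WeakIdKeyDictionary

# Keys are held by tensor identity, weakly: entries are dropped when the
# tensor dies, which a plain dict (the buggy initializer) never does.
cache = WeakIdKeyDictionary()
t = torch.randn(2)
cache[t] = (t.size(), t.stride())
del t  # the cache entry is now eligible for cleanup
```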

Looks like the code was originally added in
https://github.com/pytorch/pytorch/pull/100128.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113412
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
ghstack dependencies: #113413, #113518, #113519
2023-11-13 22:53:47 +00:00
6ed20af10e [dtensor] refactor op dispatch and fix is_same_size/equal (#112927)
torch.equal/is_same_size currently skip sharding prop and directly do
local tensor compute, which is wrong. For these two ops:

- torch.equal: should not skip sharding prop; the two DTensors need to
have the SAME sharding before comparing local shard values (sketched
below)
- torch.is_same_size: needs to completely skip both sharding prop and
local compute

This PR refactors the existing op_dispatch into a class instance
so that we can do custom op handling, then fixes both torch.equal and
torch.is_same_size.
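A hedged sketch of the torch.equal rule above, expressed outside the dispatcher (not the actual handler code):

```python
import torch
from torch.distributed._tensor import DTensor

def dtensor_equal(a: DTensor, b: DTensor) -> bool:
    # Force identical sharding before comparing local shards.
    if a.placements != b.placements:
        b = b.redistribute(device_mesh=a.device_mesh, placements=a.placements)
    return torch.equal(a.to_local(), b.to_local())
```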

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112927
Approved by: https://github.com/fduwjj, https://github.com/XilunWu
2023-11-13 22:46:31 +00:00
9062e429db Fixed docstring errors in torch/nn/functional.py (Docathon H2) (#112856)
Fixes #112597
### Output:
**BEFORE:**
```functional.py:1 at module level:
        D400: First line should end with a period (not 'e')
functional.py:438 in public function `fractional_max_pool2d_with_indices`:
        D400: First line should end with a period (not ')')
functional.py:537 in public function `fractional_max_pool3d_with_indices`:
        D400: First line should end with a period (not ')')
functional.py:646 in public function `max_pool1d_with_indices`:
        D400: First line should end with a period (not ')')
functional.py:732 in public function `max_pool2d_with_indices`:
        D400: First line should end with a period (not ')')
functional.py:818 in public function `max_pool3d_with_indices`:
        D400: First line should end with a period (not ')')
functional.py:932 in public function `max_unpool1d`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
functional.py:968 in public function `max_unpool2d`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
functional.py:1000 in public function `max_unpool3d`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
functional.py:1031 in public function `lp_pool2d`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:1031 in public function `lp_pool2d`:
        D400: First line should end with a period (not 'f')
functional.py:1031 in public function `lp_pool2d`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1056 in public function `lp_pool1d`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:1056 in public function `lp_pool1d`:
        D400: First line should end with a period (not 'f')
functional.py:1056 in public function `lp_pool1d`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1077 in public function `adaptive_max_pool1d_with_indices`:
        D400: First line should end with a period (not ')')
functional.py:1119 in public function `adaptive_max_pool2d_with_indices`:
        D400: First line should end with a period (not ')')
functional.py:1163 in public function `adaptive_max_pool3d_with_indices`:
        D400: First line should end with a period (not ')')
functional.py:1220 in public function `adaptive_avg_pool2d`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:1220 in public function `adaptive_avg_pool2d`:
        D400: First line should end with a period (not 'f')
functional.py:1220 in public function `adaptive_avg_pool2d`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1237 in public function `adaptive_avg_pool3d`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:1237 in public function `adaptive_avg_pool3d`:
        D400: First line should end with a period (not 'f')
functional.py:1237 in public function `adaptive_avg_pool3d`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1255 in public function `dropout`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:1255 in public function `dropout`:
        D400: First line should end with a period (not 't')
functional.py:1275 in public function `alpha_dropout`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1287 in public function `dropout1d`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:1287 in public function `dropout1d`:
        D400: First line should end with a period (not ',')
functional.py:1325 in public function `dropout2d`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:1325 in public function `dropout2d`:
        D400: First line should end with a period (not ',')
functional.py:1369 in public function `dropout3d`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:1369 in public function `dropout3d`:
        D400: First line should end with a period (not ',')
functional.py:1408 in public function `feature_alpha_dropout`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:1408 in public function `feature_alpha_dropout`:
        D400: First line should end with a period (not ',')
functional.py:1466 in public function `relu`:
        D400: First line should end with a period (not 'r')
functional.py:1466 in public function `relu`:
        D402: First line should not be the function's "signature"
functional.py:1491 in public function `glu`:
        D400: First line should end with a period (not 'r')
functional.py:1491 in public function `glu`:
        D402: First line should not be the function's "signature"
functional.py:1516 in public function `hardtanh`:
        D400: First line should end with a period (not 'r')
functional.py:1516 in public function `hardtanh`:
        D402: First line should not be the function's "signature"
functional.py:1542 in public function `relu6`:
        D400: First line should end with a period (not 'r')
functional.py:1542 in public function `relu6`:
        D402: First line should not be the function's "signature"
functional.py:1558 in public function `elu`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1582 in public function `selu`:
        D400: First line should end with a period (not 'r')
functional.py:1582 in public function `selu`:
        D402: First line should not be the function's "signature"
functional.py:1611 in public function `celu`:
        D400: First line should end with a period (not 'r')
functional.py:1611 in public function `celu`:
        D402: First line should not be the function's "signature"
functional.py:1638 in public function `leaky_relu`:
        D400: First line should end with a period (not 'r')
functional.py:1638 in public function `leaky_relu`:
        D402: First line should not be the function's "signature"
functional.py:1688 in public function `rrelu`:
        D400: First line should end with a period (not 'r')
functional.py:1688 in public function `rrelu`:
        D402: First line should not be the function's "signature"
functional.py:1755 in public function `tanhshrink`:
        D400: First line should end with a period (not 'r')
functional.py:1755 in public function `tanhshrink`:
        D402: First line should not be the function's "signature"
functional.py:1767 in public function `softsign`:
        D400: First line should end with a period (not 'r')
functional.py:1767 in public function `softsign`:
        D402: First line should not be the function's "signature"
functional.py:1806 in public function `softmin`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1832 in public function `softmax`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1868 in public function `gumbel_softmax`:
        D401: First line should be in imperative mood (perhaps 'Sample', not 'Samples')
functional.py:1930 in public function `log_softmax`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1969 in public function `tanh`:
        D400: First line should end with a period (not 'r')
functional.py:1969 in public function `tanh`:
        D402: First line should not be the function's "signature"
functional.py:1980 in public function `sigmoid`:
        D400: First line should end with a period (not 'r')
functional.py:1980 in public function `sigmoid`:
        D402: First line should not be the function's "signature"
functional.py:1990 in public function `hardsigmoid`:
        D400: First line should end with a period (not 'n')
functional.py:1990 in public function `hardsigmoid`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2057 in public function `silu`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:2057 in public function `silu`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2081 in public function `mish`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:2081 in public function `mish`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2100 in public function `hardswish`:
        D400: First line should end with a period (not ':')
functional.py:2100 in public function `hardswish`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2136 in public function `embedding`:
        D202: No blank lines allowed after function docstring (found 1)
functional.py:2136 in public function `embedding`:
        D401: First line should be in imperative mood; try rephrasing (found 'A')
functional.py:2254 in public function `embedding_bag`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:2254 in public function `embedding_bag`:
        D400: First line should end with a period (not 'e')
functional.py:2254 in public function `embedding_bag`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
functional.py:2462 in public function `batch_norm`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2507 in public function `instance_norm`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:2507 in public function `instance_norm`:
        D400: First line should end with a period (not 'a')
functional.py:2507 in public function `instance_norm`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2540 in public function `layer_norm`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2554 in public function `group_norm`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2567 in public function `local_response_norm`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:2567 in public function `local_response_norm`:
        D400: First line should end with a period (not 'f')
functional.py:2567 in public function `local_response_norm`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2611 in public function `ctc_loss`:
        D401: First line should be in imperative mood; try rephrasing (found 'The')
functional.py:2679 in public function `nll_loss`:
        D401: First line should be in imperative mood; try rephrasing (found 'The')
functional.py:2895 in public function `kl_div`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:2895 in public function `kl_div`:
        D400: First line should end with a period (not 's')
functional.py:2895 in public function `kl_div`:
        D401: First line should be in imperative mood; try rephrasing (found 'The')
functional.py:2978 in public function `cross_entropy`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
functional.py:3069 in public function `binary_cross_entropy`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:3069 in public function `binary_cross_entropy`:
        D400: First line should end with a period (not 't')
functional.py:3069 in public function `binary_cross_entropy`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
functional.py:3139 in public function `binary_cross_entropy_with_logits`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:3139 in public function `binary_cross_entropy_with_logits`:
        D400: First line should end with a period (not 't')
functional.py:3139 in public function `binary_cross_entropy_with_logits`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
functional.py:3211 in public function `smooth_l1_loss`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:3211 in public function `smooth_l1_loss`:
        D400: First line should end with a period (not 'e')
functional.py:3211 in public function `smooth_l1_loss`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
functional.py:3251 in public function `huber_loss`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:3251 in public function `huber_loss`:
        D400: First line should end with a period (not 'e')
functional.py:3251 in public function `huber_loss`:
        D401: First line should be in imperative mood; try rephrasing (found 'Function')
functional.py:3282 in public function `l1_loss`:
        D400: First line should end with a period (not 'r')
functional.py:3282 in public function `l1_loss`:
        D402: First line should not be the function's "signature"
functional.py:3313 in public function `mse_loss`:
        D400: First line should end with a period (not 'r')
functional.py:3313 in public function `mse_loss`:
        D402: First line should not be the function's "signature"
functional.py:3346 in public function `margin_ranking_loss`:
        D400: First line should end with a period (not 'r')
functional.py:3346 in public function `margin_ranking_loss`:
        D402: First line should not be the function's "signature"
functional.py:3382 in public function `hinge_embedding_loss`:
        D400: First line should end with a period (not 'r')
functional.py:3382 in public function `hinge_embedding_loss`:
        D402: First line should not be the function's "signature"
functional.py:3411 in public function `multilabel_margin_loss`:
        D400: First line should end with a period (not 'r')
functional.py:3411 in public function `multilabel_margin_loss`:
        D402: First line should not be the function's "signature"
functional.py:3439 in public function `soft_margin_loss`:
        D400: First line should end with a period (not 'r')
functional.py:3439 in public function `soft_margin_loss`:
        D402: First line should not be the function's "signature"
functional.py:3462 in public function `multilabel_soft_margin_loss`:
        D400: First line should end with a period (not 'r')
functional.py:3462 in public function `multilabel_soft_margin_loss`:
        D402: First line should not be the function's "signature"
functional.py:3510 in public function `cosine_embedding_loss`:
        D400: First line should end with a period (not 'r')
functional.py:3510 in public function `cosine_embedding_loss`:
        D402: First line should not be the function's "signature"
functional.py:3543 in public function `multi_margin_loss`:
        D400: First line should end with a period (not 'r')
functional.py:3543 in public function `multi_margin_loss`:
        D402: First line should not be the function's "signature"
functional.py:3708 in public function `upsample` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3713 in public function `upsample` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3718 in public function `upsample` (skipping F811):
        D205: 1 blank line required between summary line and description (found 0)
functional.py:3718 in public function `upsample` (skipping F811):
        D400: First line should end with a period (not 'n')
functional.py:3783 in private function `_is_integer`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:3794 in public function `interpolate` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3799 in public function `interpolate` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3804 in public function `interpolate` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3809 in public function `interpolate` (skipping F811):
        D103: Missing docstring in public function
functional.py:3821 in public function `interpolate` (skipping F811,B950):
        D205: 1 blank line required between summary line and description (found 0)
functional.py:3821 in public function `interpolate` (skipping F811,B950):
        D400: First line should end with a period (not 'n')
functional.py:4062 in public function `upsample_nearest` (skipping F811):
        D103: Missing docstring in public function
functional.py:4067 in public function `upsample_nearest` (skipping F811):
        D103: Missing docstring in public function
functional.py:4100 in public function `upsample_bilinear` (skipping F811):
        D103: Missing docstring in public function
functional.py:4107 in public function `upsample_bilinear` (skipping F811):
        D103: Missing docstring in public function
functional.py:4114 in public function `upsample_bilinear` (skipping F811):
        D103: Missing docstring in public function
functional.py:4121 in public function `upsample_bilinear` (skipping F811):
        D103: Missing docstring in public function
functional.py:4174 in public function `grid_sample`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:4174 in public function `grid_sample`:
        D400: First line should end with a period (not 'e')
functional.py:4315 in public function `affine_grid`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:4315 in public function `affine_grid`:
        D400: First line should end with a period (not 'f')
functional.py:4315 in public function `affine_grid`:
        D401: First line should be in imperative mood (perhaps 'Generate', not 'Generates')
functional.py:4608 in public function `triplet_margin_loss`:
        D200: One-line docstring should fit on one line with quotes (found 3)
functional.py:4608 in public function `triplet_margin_loss`:
        D400: First line should end with a period (not 's')
functional.py:4643 in public function `triplet_margin_with_distance_loss`:
        D200: One-line docstring should fit on one line with quotes (found 3)
functional.py:4705 in public function `normalize`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
functional.py:4733 in public function `assert_int_or_pair`:
        D103: Missing docstring in public function
functional.py:4743 in public function `unfold`:
        D401: First line should be in imperative mood (perhaps 'Extract', not 'Extracts')
functional.py:4773 in public function `fold`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:4773 in public function `fold`:
        D400: First line should end with a period (not 'g')
functional.py:4773 in public function `fold`:
        D401: First line should be in imperative mood (perhaps 'Combine', not 'Combines')
functional.py:4800 in private function `_in_projection_packed`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:4800 in private function `_in_projection_packed`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
functional.py:4867 in private function `_in_projection`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:4867 in private function `_in_projection`:
        D400: First line should end with a period (not 'y')
functional.py:4867 in private function `_in_projection`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
functional.py:5128 in public function `multi_head_attention_forward`:
        D205: 1 blank line required between summary line and description (found 0)
functional.py:5128 in public function `multi_head_attention_forward`:
        D400: First line should end with a period (not ':')
160
```

**AFTER:**

```
functional.py:3709 in public function `upsample` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3714 in public function `upsample` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3798 in public function `interpolate` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3803 in public function `interpolate` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3808 in public function `interpolate` (skipping F811,B950):
        D103: Missing docstring in public function
functional.py:3813 in public function `interpolate` (skipping F811):
        D103: Missing docstring in public function
functional.py:4068 in public function `upsample_nearest` (skipping F811):
        D103: Missing docstring in public function
functional.py:4073 in public function `upsample_nearest` (skipping F811):
        D103: Missing docstring in public function
functional.py:4106 in public function `upsample_bilinear` (skipping F811):
        D103: Missing docstring in public function
functional.py:4113 in public function `upsample_bilinear` (skipping F811):
        D103: Missing docstring in public function
functional.py:4120 in public function `upsample_bilinear` (skipping F811):
        D103: Missing docstring in public function
functional.py:4127 in public function `upsample_bilinear` (skipping F811):
        D103: Missing docstring in public function
functional.py:4742 in public function `assert_int_or_pair`:
        D103: Missing docstring in public function
13
```

The file contained several docstring errors. I have fixed all of them (hopefully) and have tried to improve the overall readability of the code. For the most part, I have included relevant descriptions of the functions (referencing the official PyTorch docs). In cases where a function is purely mathematical or it is difficult to give a one-line description, I have just included references.

For testing, I relied on my local system and created a separate file. For the final edits, I directly changed the contents of the forked repo, as already visible.

Kindly review @svekars @subramen @kit1980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112856
Approved by: https://github.com/kit1980
2023-11-13 22:16:49 +00:00
a2552d5521 Fixed docstring errors inside torch/cuda/ and torch/optim/ (Docathon H2) (#112964)
Fixes #112592
1) **File: torch/cuda/random.py**
```
Before:
/content/pytorch/torch/cuda/random.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/cuda/random.py:21 in public function `get_rng_state`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/random.py:43 in public function `get_rng_state_all`:
        D202: No blank lines allowed after function docstring (found 1)
/content/pytorch/torch/cuda/random.py:43 in public function `get_rng_state_all`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/random.py:54 in public function `set_rng_state`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:79 in public function `set_rng_state_all`:
        D208: Docstring is over-indented
/content/pytorch/torch/cuda/random.py:79 in public function `set_rng_state_all`:
        D209: Multi-line docstring closing quotes should be on a separate line
/content/pytorch/torch/cuda/random.py:79 in public function `set_rng_state_all`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:79 in public function `set_rng_state_all`:
        D414: Section has no content ('Args')
/content/pytorch/torch/cuda/random.py:88 in public function `manual_seed`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/random.py:88 in public function `manual_seed`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:110 in public function `manual_seed_all`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/random.py:110 in public function `manual_seed_all`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:128 in public function `seed`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/random.py:128 in public function `seed`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:146 in public function `seed_all`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/random.py:146 in public function `seed_all`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:167 in public function `initial_seed`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
18
```

```
After:
/content/pytorch/torch/cuda/random.py:1 at module level:
        D100: Missing docstring in public module
1

```
2) **File: torch/cuda/amp/autocast_mode.py**
```
Before: /content/pytorch/torch/cuda/amp/autocast_mode.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/cuda/amp/autocast_mode.py:18 in public class `autocast`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/autocast_mode.py:23 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/cuda/amp/autocast_mode.py:38 in public method `__enter__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/autocast_mode.py:44 in public method `__exit__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/autocast_mode.py:49 in public method `__call__`:
        D102: Missing docstring in public method
/content/pytorch/torch/cuda/amp/autocast_mode.py:90 in public function `custom_fwd`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/autocast_mode.py:90 in public function `custom_fwd`:
        D400: First line should end with a period (not 'f')
/content/pytorch/torch/cuda/amp/autocast_mode.py:90 in public function `custom_fwd`:
        D401: First line should be in imperative mood; try rephrasing (found 'Helper')
/content/pytorch/torch/cuda/amp/autocast_mode.py:130 in public function `custom_bwd`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/autocast_mode.py:130 in public function `custom_bwd`:
        D400: First line should end with a period (not 'f')
/content/pytorch/torch/cuda/amp/autocast_mode.py:130 in public function `custom_bwd`:
        D401: First line should be in imperative mood; try rephrasing (found 'Helper')
12
```
```
After:
/content/pytorch/torch/cuda/amp/autocast_mode.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/cuda/amp/autocast_mode.py:23 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/cuda/amp/autocast_mode.py:38 in public method `__enter__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/autocast_mode.py:44 in public method `__exit__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/autocast_mode.py:49 in public method `__call__`:
        D102: Missing docstring in public method
5
```

3)  **File: torch/cuda/amp/grad_scaler.py**
```
Before: /content/pytorch/torch/cuda/amp/grad_scaler.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/cuda/amp/grad_scaler.py:17 in private class `_MultiDeviceReplicator`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:39 in public class `OptState`:
        D101: Missing docstring in public class
/content/pytorch/torch/cuda/amp/grad_scaler.py:50 in public class `GradScaler`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/grad_scaler.py:50 in public class `GradScaler`:
        D400: First line should end with a period (not 'g')
/content/pytorch/torch/cuda/amp/grad_scaler.py:115 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/cuda/amp/grad_scaler.py:354 in public method `step`:
        D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:456 in public method `update`:
        D401: First line should be in imperative mood (perhaps 'Update', not 'Updates')
/content/pytorch/torch/cuda/amp/grad_scaler.py:529 in public method `get_scale`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:544 in public method `get_growth_factor`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:544 in public method `get_growth_factor`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:550 in public method `set_growth_factor`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/grad_scaler.py:550 in public method `set_growth_factor`:
        D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:557 in public method `get_backoff_factor`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:557 in public method `get_backoff_factor`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:563 in public method `set_backoff_factor`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/grad_scaler.py:563 in public method `set_backoff_factor`:
        D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:570 in public method `get_growth_interval`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:570 in public method `get_growth_interval`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:576 in public method `set_growth_interval`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/grad_scaler.py:576 in public method `set_growth_interval`:
        D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:592 in public method `is_enabled`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:592 in public method `is_enabled`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:598 in public method `state_dict`:
        D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:598 in public method `state_dict`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:624 in public method `load_state_dict`:
        D401: First line should be in imperative mood (perhaps 'Load', not 'Loads')
/content/pytorch/torch/cuda/amp/grad_scaler.py:649 in public method `__getstate__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/grad_scaler.py:665 in public method `__setstate__`:
        D105: Missing docstring in magic method
28
```
```
After:
/content/pytorch/torch/cuda/amp/grad_scaler.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/cuda/amp/grad_scaler.py:40 in public class `OptState`:
        D101: Missing docstring in public class
/content/pytorch/torch/cuda/amp/grad_scaler.py:117 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/cuda/amp/grad_scaler.py:647 in public method `__getstate__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/grad_scaler.py:663 in public method `__setstate__`:
        D105: Missing docstring in magic method
5
```
4) **File: torch/optim/_functional.py**
```
Before:
/content/pytorch/torch/optim/_functional.py:1 at module level:
        D400: First line should end with a period (not 'e')
1
```
```
After:
0

```
5) **File: torch/optim/__init__.py**
```
Before:
/content/pytorch/torch/optim/__init__.py:1 at module level:
        D205: 1 blank line required between summary line and description (found 0)
1
```
```
After:
0

```
6) **File: torch/optim/lbfgs.py**
```
Before:
/content/pytorch/torch/optim/lbfgs.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/lbfgs.py:185 in public class `LBFGS`:
        D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/optim/lbfgs.py:185 in public class `LBFGS`:
        D400: First line should end with a period (not 'c')
/content/pytorch/torch/optim/lbfgs.py:215 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/lbfgs.py:285 in public method `step`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
5
```
```
After:
/content/pytorch/torch/optim/lbfgs.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/lbfgs.py:217 in public method `__init__`:
        D107: Missing docstring in __init__
2
```
7)**File: torch/optim/sparse_adam.py**
```
Before:
/content/pytorch/torch/optim/sparse_adam.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/sparse_adam.py:7 in public class `SparseAdam`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/sparse_adam.py:8 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/sparse_adam.py:40 in public method `step`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
4
```
```
After:
/content/pytorch/torch/optim/sparse_adam.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/sparse_adam.py:7 in public class `SparseAdam`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/sparse_adam.py:8 in public method `__init__`:
        D107: Missing docstring in __init__
3
```
8) **File: torch/optim/adadelta.py**
```
Before:
/content/pytorch/torch/optim/adadelta.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adadelta.py:11 in public class `Adadelta`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adadelta.py:12 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adadelta.py:44 in public method `__setstate__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/optim/adadelta.py:82 in public method `step`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adadelta.py:193 in public function `adadelta`:
        D202: No blank lines allowed after function docstring (found 1)
6
```
```
After:
/content/pytorch/torch/optim/adadelta.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adadelta.py:11 in public class `Adadelta`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adadelta.py:12 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adadelta.py:44 in public method `__setstate__`:
        D105: Missing docstring in magic method
4
```
9) **File: torch/optim/adagrad.py**
```
Before:
/content/pytorch/torch/optim/adagrad.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adagrad.py:11 in public class `Adagrad`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adagrad.py:12 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adagrad.py:63 in public method `__setstate__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/optim/adagrad.py:78 in public method `share_memory`:
        D102: Missing docstring in public method
/content/pytorch/torch/optim/adagrad.py:100 in public method `step`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adagrad.py:201 in public function `adagrad`:
        D202: No blank lines allowed after function docstring (found 1)
7
```
```
After:
/content/pytorch/torch/optim/adagrad.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adagrad.py:11 in public class `Adagrad`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adagrad.py:12 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adagrad.py:63 in public method `__setstate__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/optim/adagrad.py:78 in public method `share_memory`:
        D102: Missing docstring in public method
5
```
10) **File: torch/optim/adam.py**
```
Before:
/content/pytorch/torch/optim/adam.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adam.py:14 in public class `Adam`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adam.py:15 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adam.py:65 in public method `__setstate__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/optim/adam.py:135 in public method `step`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adam.py:281 in public function `adam`:
        D202: No blank lines allowed after function docstring (found 1)
/content/pytorch/torch/optim/adam.py:281 in public function `adam`:
        D205: 1 blank line required between summary line and description (found 0)
7
```
```
After:
/content/pytorch/torch/optim/adam.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adam.py:14 in public class `Adam`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adam.py:15 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adam.py:65 in public method `__setstate__`:
        D105: Missing docstring in magic method
4

```
11) **File: torch/optim/adamax.py**
```
Before:
/content/pytorch/torch/optim/adamax.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adamax.py:12 in public class `Adamax`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adamax.py:13 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adamax.py:47 in public method `__setstate__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/optim/adamax.py:91 in public method `step`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adamax.py:203 in public function `adamax`:
        D202: No blank lines allowed after function docstring (found 1)
6
```
```
After:
/content/pytorch/torch/optim/adamax.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adamax.py:12 in public class `Adamax`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adamax.py:13 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adamax.py:47 in public method `__setstate__`:
        D105: Missing docstring in magic method
4
```
12) **File: torch/optim/adamw.py**
```
Before:
/content/pytorch/torch/optim/adamw.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adamw.py:12 in public class `AdamW`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adamw.py:13 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adamw.py:73 in public method `__setstate__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/optim/adamw.py:153 in public method `step`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adamw.py:304 in public function `adamw`:
        D202: No blank lines allowed after function docstring (found 1)
6

```
```
After:
/content/pytorch/torch/optim/adamw.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/adamw.py:12 in public class `AdamW`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/adamw.py:13 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/adamw.py:73 in public method `__setstate__`:
        D105: Missing docstring in magic method
4

```
13) **File: torch/optim/asgd.py**
```
Before:
/content/pytorch/torch/optim/asgd.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/asgd.py:17 in public class `ASGD`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/asgd.py:18 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/asgd.py:52 in public method `__setstate__`:
        D105: Missing docstring in magic method
/content/pytorch/torch/optim/asgd.py:107 in public method `step`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/asgd.py:195 in public function `asgd`:
        D202: No blank lines allowed after function docstring (found 1)
6
```
```
After:
/content/pytorch/torch/optim/asgd.py:1 at module level:
        D100: Missing docstring in public module
/content/pytorch/torch/optim/asgd.py:17 in public class `ASGD`:
        D101: Missing docstring in public class
/content/pytorch/torch/optim/asgd.py:18 in public method `__init__`:
        D107: Missing docstring in __init__
/content/pytorch/torch/optim/asgd.py:52 in public method `__setstate__`:
        D105: Missing docstring in magic method
4
```
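For reference, here is a minimal invented illustration of the D205/D400/D401 family of fixes applied throughout this PR (not taken from the actual diff):
```python
class Optimizer:
    # Before: the summary is not imperative (D401), does not end with a
    # period (D400), and lacks a blank line before the description (D205).
    def step_before(self):
        """Performs a single optimization step
        Re-evaluates the model and returns the loss.
        """

    # After: an imperative summary ending in a period, then a blank line.
    def step(self):
        """Perform a single optimization step.

        Re-evaluate the model and return the loss.
        """
```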
Resolved the docstring errors as listed. I initially made my changes on the main branch of my forked repo, which caused changes to appear in my PR for another issue. I have fixed that and hope this PR won't have any conflicts.
Kindly review @svekars @jbschlosser.
In case of any other issues, please let me know. Thanks!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112964
Approved by: https://github.com/kit1980
2023-11-13 22:16:44 +00:00
27c3774320 Forward fix efficient attention rocm failure (#113588)
See #110495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113588
Approved by: https://github.com/malfet
2023-11-13 22:15:18 +00:00
b3a7d9208b disable test int_mm for sm90 or later (#113327)
disable test int_mm for sm90 or later

```
python test/test_linalg.py -k test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda

_ TestLinalgCUDA.test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda _
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2410, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2410, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_device_type.py", line 428, in instantiated_test
    raise rte
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_device_type.py", line 415, in instantiated_test
    result = test(self, **param_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_device_type.py", line 1084, in only_fn
    return fn(slf, *args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_linalg.py", line 5719, in test__int_mm
    _test(17, k, n, use_transpose_a, use_transpose_b)
  File "/opt/pytorch/pytorch/test/test_linalg.py", line 5680, in _test
    c_int32 = torch._int_mm(a_int8, b_int8)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 0 transpose_mat2 0 m 32 n 17 k 32 mat1_ld 32 mat2_ld 32 result_ld 32 abType 3 cType 10 computeType 72 scaleType 10
```
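
For reference, a minimal sketch of the kind of capability gate involved, written with plain `unittest` (the real test uses PyTorch's internal test decorators; this is an illustration, not the actual change):
```python
import unittest

import torch

# sm90 (Hopper) and later currently fail these int8 shapes inside
# cublasLtMatmul with CUBLAS_STATUS_NOT_SUPPORTED, hence the skip.
SM90_OR_LATER = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)
)


class TestIntMM(unittest.TestCase):
    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")
    @unittest.skipIf(SM90_OR_LATER, "int8 matmul unsupported on sm90+")
    def test_int_mm_k_32_n_32(self):
        a_int8 = torch.randint(-8, 8, (17, 32), dtype=torch.int8, device="cuda")
        b_int8 = torch.randint(-8, 8, (32, 32), dtype=torch.int8, device="cuda")
        c_int32 = torch._int_mm(a_int8, b_int8)  # int8 x int8 -> int32
        self.assertEqual(c_int32.dtype, torch.int32)
```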

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113327
Approved by: https://github.com/malfet
2023-11-13 22:13:44 +00:00
01478f1afa Fix pydocstyle errors listed in issue 112589 (#113227)
Fixes #112589

Fixed errors relating to pydocstyle in the following files. The remaining errors are related to docstrings at the module level and at methods within each module (see details below).

pydocstyle torch/cuda/_utils.py --count
before: 3
after: 0

pydocstyle torch/cuda/jiterator.py --count
before: 3
after: 1

**remaining errors:**
```
torch/cuda/jiterator.py:1 at module level:
        D100: Missing docstring in public module
```

pydocstyle torch/cuda/graphs.py --count
before: 25
after: 7

**remaining errors:**
```
torch/cuda/graphs.py:1 at module level:
        D100: Missing docstring in public module
torch/cuda/graphs.py:54 in public method `__new__`:
        D102: Missing docstring in public method
torch/cuda/graphs.py:108 in public method `debug_dump`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/graphs.py:108 in public method `debug_dump`:
        D400: First line should end with a period (not ':')
torch/cuda/graphs.py:150 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/graphs.py:172 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/cuda/graphs.py:186 in public method `__exit__`:
        D105: Missing docstring in magic method
```

pydocstyle torch/cuda/_sanitizer.py --count
before: 35
after: 31

**remaining errors:**
```
torch/cuda/_sanitizer.py:43 in public class `AccessType`:
        D101: Missing docstring in public class
torch/cuda/_sanitizer.py:47 in public method `__str__`:
        D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:84 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:96 in public method `__str__`:
        D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:139 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:142 in public method `__str__`:
        D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:218 in public class `StreamSynchronizations`:
        D101: Missing docstring in public class
torch/cuda/_sanitizer.py:219 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:256 in public method `create_stream`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:268 in public method `create_event`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:272 in public method `delete_event`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:276 in public method `update_seq_num`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:280 in public method `record_state`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:291 in public method `stream_wait_for_event`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:298 in public method `all_streams_wait_for_event`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:307 in public method `all_streams_wait_for_stream`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:316 in public method `sync_all_streams`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:323 in public method `is_ordered_after`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:339 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:460 in public function `zip_by_key`:
        D103: Missing docstring in public function
torch/cuda/_sanitizer.py:466 in public function `zip_arguments`:
        D103: Missing docstring in public function
torch/cuda/_sanitizer.py:478 in public class `ArgumentHandler`:
        D101: Missing docstring in public class
torch/cuda/_sanitizer.py:479 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:505 in public method `parse_inputs`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:520 in public method `parse_outputs`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:527 in public class `CUDASanitizerDispatchMode`:
        D101: Missing docstring in public class
torch/cuda/_sanitizer.py:528 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:562 in public method `__torch_dispatch__`:
        D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:597 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:601 in public method `enable`:
        D102: Missing docstring in public method
torch/cuda/_sanitizer.py:605 in public method `__del__`:
        D105: Missing docstring in magic method
```

pydocstyle torch/storage.py --count
before: 90
after: 37

**remaining errors:**
```
torch/storage.py:1 at module level:
        D100: Missing docstring in public module
torch/storage.py:310 in public class `UntypedStorage`:
        D101: Missing docstring in public class
torch/storage.py:311 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/storage.py:317 in public method `is_cuda`:
        D102: Missing docstring in public method
torch/storage.py:321 in public method `is_hpu`:
        D102: Missing docstring in public method
torch/storage.py:325 in public method `share_memory_`:
        D102: Missing docstring in public method
torch/storage.py:444 in public class `TypedStorage`:
        D101: Missing docstring in public class
torch/storage.py:453 in public method `fill_`:
        D102: Missing docstring in public method
torch/storage.py:458 in public method `__new__`:
        D102: Missing docstring in public method
torch/storage.py:530 in public method `__init__`:
        D107: Missing docstring in __init__
torch/storage.py:599 in public method `is_cuda`:
        D102: Missing docstring in public method
torch/storage.py:604 in public method `is_hpu`:
        D102: Missing docstring in public method
torch/storage.py:624 in public method `__len__`:
        D105: Missing docstring in magic method
torch/storage.py:653 in public method `__setitem__`:
        D105: Missing docstring in magic method
torch/storage.py:681 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/storage.py:715 in public method `copy_`:
        D102: Missing docstring in public method
torch/storage.py:723 in public method `nbytes`:
        D102: Missing docstring in public method
torch/storage.py:731 in public method `type`:
        D102: Missing docstring in public method
torch/storage.py:744 in public method `cuda`:
        D102: Missing docstring in public method
torch/storage.py:751 in public method `hpu`:
        D102: Missing docstring in public method
torch/storage.py:758 in public method `element_size`:
        D102: Missing docstring in public method
torch/storage.py:766 in public method `get_device`:
        D102: Missing docstring in public method
torch/storage.py:770 in public method `__str__`:
        D105: Missing docstring in magic method
torch/storage.py:781 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/storage.py:785 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/storage.py:789 in public method `__copy__`:
        D105: Missing docstring in magic method
torch/storage.py:793 in public method `__deepcopy__`:
        D105: Missing docstring in magic method
torch/storage.py:801 in public method `__sizeof__`:
        D105: Missing docstring in magic method
torch/storage.py:877 in public method `device`:
        D102: Missing docstring in public method
torch/storage.py:881 in public method `size`:
        D102: Missing docstring in public method
torch/storage.py:891 in public method `pickle_storage_type`:
        D102: Missing docstring in public method
torch/storage.py:902 in public method `__reduce__`:
        D105: Missing docstring in magic method
torch/storage.py:907 in public method `data_ptr`:
        D102: Missing docstring in public method
torch/storage.py:915 in public method `resize_`:
        D102: Missing docstring in public method
torch/storage.py:931 in public method `from_buffer`:
        D102: Missing docstring in public method
torch/storage.py:1032 in public method `from_file`:
        D402: First line should not be the function's "signature"
torch/storage.py:1075 in public method `is_shared`:
        D102: Missing docstring in public method

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113227
Approved by: https://github.com/kit1980
2023-11-13 22:05:45 +00:00
0e6b6a2483 Revert "AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)"
This reverts commit 3afb4e5cf7b0162c532449fb5c9e7c7058a4c803.

Reverted https://github.com/pytorch/pytorch/pull/111554 on behalf of https://github.com/clee2000 due to the xla failure is real sorry, log classifier is showing the wrong line ([comment](https://github.com/pytorch/pytorch/pull/111554#issuecomment-1809177978))
2023-11-13 21:46:57 +00:00
cfee3bcf97 Add inheritance to ONNX's InputAdaptStep and OutputAdaptSet impl (#113476)
This is a minor compliance change that specifies InputAdaptStep and
OutputAdaptStep as the base classes for the actual implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113476
Approved by: https://github.com/justinchuby
2023-11-13 21:27:44 +00:00
b01e89587e [ROCM][CI] Introduce tests-to-include as rocm-test workflow input (#110511)
Fixes https://github.com/pytorch/pytorch/issues/110181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110511
Approved by: https://github.com/huydhn
2023-11-13 21:25:49 +00:00
2ea3d64f47 fix docstring issues in torch.utils.tensorboard (#113336)
Fixes #112637

Fixed all the issues listed.

### Error Counts

|File | Count Before | Count now|
|---- | ---- | ---- |
|`torch/utils/tensorboard/_proto_graph.py` | 9 | 0|
|`torch/utils/tensorboard/_pytorch_graph.py` | 27 | 14|
|`torch/utils/tensorboard/_utils.py` | 5 | 2|
|`torch/utils/tensorboard/summary.py` | 27 | 12|
|`torch/utils/tensorboard/writer.py` | 42 | 4|
|`torch/utils/tensorboard/_caffe2_graph.py` | 19 | 0|
|`torch/utils/hipify/constants.py` | 2 | 0|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113336
Approved by: https://github.com/ezyang
2023-11-13 20:50:01 +00:00
a144eb502a [aotinductor] add versions for the sdpa shim api (#113487)
In our first implementation of the sdpa shim API, we didn't consider
the case where the optional scale argument could be None. It went
unnoticed because we always got a default argument for the CUDA backend.
The issue was detected with the CPU backend.

This PR implements versioning for shim kernels. Currently, only the
sdpa API has multiple versions. We expect to maintain only a very small
number of ABI-compatible shim APIs with multiple versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113487
Approved by: https://github.com/int3, https://github.com/desertfire
2023-11-13 20:18:58 +00:00
6ea20f5dc5 [AOTI] Use expr_printer to print sympy expr (#113317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113317
Approved by: https://github.com/aakhundov, https://github.com/chenyang78
2023-11-13 20:14:04 +00:00
c0b57d4e3b fix docstring issues in torch.distributed (#113337)
Fixes #112643

Fixes all the issues listed.

### Error Count

|File | Count Before | Count now|
|---- | ---- | ---- |
|`torch/distributed/optim/named_optimizer.py` | 13 | 1|
|`torch/distributed/nn/functional.py` | 7 | 1|
|`torch/distributed/nn/api/remote_module.py` | 25 | 3|
|`torch/distributed/algorithms/join.py` | 43 | 4|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113337
Approved by: https://github.com/ezyang
2023-11-13 19:37:29 +00:00
5e10dd2c78 fix docstring issues in torch.utils (#113335)
Fixes #112634

Fixes all the issues listed except in `torch/utils/_pytree.py` as the file no longer exists.

### Error counts

|File | Count Before | Count now|
|---- | ---- | ---- |
|`torch/utils/collect_env.py` | 39 | 25|
|`torch/utils/cpp_extension.py` | 51 | 13|
|`torch/utils/flop_counter.py` | 25 | 8|
|`torch/utils/_foreach_utils.py` | 2 | 0|
|`torch/utils/_python_dispatch.py` | 26 | 25|
|`torch/utils/backend_registration.py` | 15 | 4|
|`torch/utils/checkpoint.py` | 29 | 21|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113335
Approved by: https://github.com/ezyang
2023-11-13 19:37:25 +00:00
44367c59b2 Update skip reason for failing unit tests on ROCm 5.7 (#113286)
Follow-up to https://github.com/pytorch/pytorch/pull/110465. Updated the skip reason for failing unit tests on ROCm 5.7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113286
Approved by: https://github.com/malfet
2023-11-13 19:29:04 +00:00
1aece432ba Implement narrow from a regular tensor to jagged tensor (#112770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112770
Approved by: https://github.com/cpuhrsch
2023-11-13 19:09:59 +00:00
3700894099 Fix FSDP summon_full_params(..., with_grads=True) when grad precision is not fp32 (#112746)
Fixes #112717

I moved the `torch.empty` call after the conditional so that we don't need to check whether `flat_param.grad` is None.
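
A minimal sketch of the shape of the change (the function and attribute names here are illustrative, not the exact FSDP internals):
```python
from typing import Optional

import torch


def padded_unsharded_grad(
    flat_param: torch.nn.Parameter, padded_numel: int
) -> Optional[torch.Tensor]:
    # Returning early avoids allocating a buffer that would then need a
    # None-check on flat_param.grad.
    if flat_param.grad is None:
        return None
    # Allocate with the grad's own dtype/device, so this stays correct
    # when gradients are kept in reduced precision (the bug in #112717).
    return torch.empty(
        padded_numel,
        dtype=flat_param.grad.dtype,
        device=flat_param.grad.device,
    )
```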

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112746
Approved by: https://github.com/awgu
2023-11-13 19:04:24 +00:00
47a59ee4d1 [ONNX] Update exporter issue report instructions for quantized models (#113494)
Update the instructions to point users to the right place for creating issues.

https://github.com/onnx/onnx/issues/5674#issuecomment-1806505240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113494
Approved by: https://github.com/jerryzh168
2023-11-13 18:18:19 +00:00
c46fc46dba expose mem-eff to autograd (#110495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110495
Approved by: https://github.com/jbschlosser
2023-11-13 17:47:40 +00:00
3afb4e5cf7 AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything no-op out as expected. The main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it, but it isn't needed, and it seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and wrong (it would just clone() and call set_(), which does not do the right thing). I also manually record on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.

(4) Made AOTAutograd's detection of metadata mutations / `set_()` mutations smarter, so it detects no-ops (when the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_metadata_mutation()`, which more accurately distinguish between data mutations, `set_()` calls, and metadata mutations.

**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input.

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations: a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into; can we just detect it and error? That feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
2023-11-13 16:39:25 +00:00
1d9919c46d Fix pydocstyle for issue 112591 (#113233)
Fixes #112591

Fixed errors relating to pydocstyle in the following files. The remaining errors are related to docstrings at the module level and methods within each module (see details below).

pydocstyle torch/cuda/_memory_viz.py --count
before: 7
after: 4

**remaining errors:**
```
torch/cuda/_memory_viz.py:77 in public function `format_flamegraph`:
        D103: Missing docstring in public function
torch/cuda/_memory_viz.py:121 in public function `segments`:
        D103: Missing docstring in public function
torch/cuda/_memory_viz.py:128 in public function `memory`:
        D103: Missing docstring in public function
torch/cuda/_memory_viz.py:135 in public function `compare`:
        D103: Missing docstring in public function
```

pydocstyle torch/cuda/streams.py --count
before: 29
after: 8

**remaining errors:**
```
torch/cuda/streams.py:1 at module level:
        D100: Missing docstring in public module
torch/cuda/streams.py:31 in public method `__new__`:
        D102: Missing docstring in public method
torch/cuda/streams.py:105 in public method `__eq__`:
        D105: Missing docstring in magic method
torch/cuda/streams.py:110 in public method `__hash__`:
        D105: Missing docstring in magic method
torch/cuda/streams.py:113 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/cuda/streams.py:135 in public method `__new__`:
        D102: Missing docstring in public method
torch/cuda/streams.py:163 in public method `__new__`:
        D102: Missing docstring in public method
torch/cuda/streams.py:237 in public method `__repr__`:
        D105: Missing docstring in magic method
```

pydocstyle torch/cuda/__init__.py --count
before: 100
after: 46

**remaining errors:**
```
torch/cuda/__init__.py:251 in public class `DeferredCudaCallError`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:327 in public function `cudart`:
        D103: Missing docstring in public function
torch/cuda/__init__.py:332 in public class `cudaStatus`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:337 in public class `CudaError`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:338 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/__init__.py:343 in public function `check_error`:
        D103: Missing docstring in public function
torch/cuda/__init__.py:369 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/__init__.py:373 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/cuda/__init__.py:376 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/cuda/__init__.py:391 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/__init__.py:473 in public class `StreamContext`:
        D204: 1 blank line required after class docstring (found 0)
torch/cuda/__init__.py:485 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/__init__.py:499 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/cuda/__init__.py:514 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/cuda/__init__.py:541 in public function `set_stream`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:838 in public function `current_blas_handle`:
        D400: First line should end with a period (not 'e')
torch/cuda/__init__.py:894 in public function `memory_usage`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:894 in public function `memory_usage`:
        D400: First line should end with a period (not ')')
torch/cuda/__init__.py:913 in public function `utilization`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:913 in public function `utilization`:
        D400: First line should end with a period (not 'r')
torch/cuda/__init__.py:949 in public function `power_draw`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:949 in public function `power_draw`:
        D400: First line should end with a period (not ')')
torch/cuda/__init__.py:1089 in public class `ByteStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1091 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1100 in public class `DoubleStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1102 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1111 in public class `FloatStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1113 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1122 in public class `HalfStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1124 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1133 in public class `LongStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1135 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1144 in public class `IntStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1146 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1155 in public class `ShortStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1157 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1166 in public class `CharStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1168 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1177 in public class `BoolStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1179 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1188 in public class `BFloat16Storage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1190 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1199 in public class `ComplexDoubleStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1201 in public method `dtype`:
        D102: Missing docstring in public method
torch/cuda/__init__.py:1210 in public class `ComplexFloatStorage`:
        D101: Missing docstring in public class
torch/cuda/__init__.py:1212 in public method `dtype`:
        D102: Missing docstring in public method
```

@mikaylagawarecki @albanD @svekars @jbschlosser

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113233
Approved by: https://github.com/malfet
2023-11-13 16:24:53 +00:00
0fd856ca22 Revert "[ONNX] Fix scalar type promotion between fp16 tensor and fp32 scalar (#113404)"
This reverts commit 39ca5a3226331428465a84d53d5b50dfb4406cfe.

Reverted https://github.com/pytorch/pytorch/pull/113404 on behalf of https://github.com/jeanschmidt due to sorry it is breaking CI jobs on main ([comment](https://github.com/pytorch/pytorch/pull/113404#issuecomment-1808314277))
2023-11-13 14:56:35 +00:00
d64bc8f0f8 use sourceless builder for builtin getattr (#113340)
In TorchVision we use the following (simplified) dispatch mechanism:

```python
import torch

def kernel1(tensor):
    return tensor + 2

def dispatcher1(input):
    kernel = get_kernel(dispatcher1, type(input))
    return kernel(input)

def kernel2(tensor):
    return tensor - 2

def dispatcher2(input):
    kernel = get_kernel(dispatcher2, type(input))
    return kernel(input)

# We actually use the function and type as keys, rather than their names.
# However, this is currently not supported, but should be easy to add after
# https://github.com/pytorch/pytorch/pull/111196
REGISTRY = {
    "dispatcher1": {"Tensor": kernel1},
    "dispatcher2": {"Tensor": kernel2},
}

def get_kernel(dispatcher, input_type):
    dispatcher_registry = REGISTRY[dispatcher.__name__]
    for cls in input_type.__mro__:
        kernel = dispatcher_registry[cls.__name__]
        break
    return kernel
```

This can be compiled without graph breaks:

```python
cfn = torch.compile(dispatcher1, fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 5)

cfn = torch.compile(dispatcher2, fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 1)
```

However, if we start chaining these calls, we hit some issues:

```python
class Pipeline(torch.nn.Module):
    def forward(self, input):
        input = dispatcher1(input)
        input = dispatcher2(input)
        return input

cfn = torch.compile(Pipeline(), fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 3)
```

```
Can't access members of type(obj) for a generated custom object. Please use __class__ instead
```

The error message is not really helpful here. The following happens: when compiling `dispatcher1`, `get_kernel` gets inlined. That means when hitting `dispatcher2`, the `type` call no longer happens on an input with a source. Thus, in the first iteration we hit the top branch, while in the second we hit the bottom:

addb8e29cd/torch/_dynamo/variables/builtin.py (L1264-L1268)

And the error message I posted above originates from the type being treated as constant. This PR replaces this with a `SourcelessBuilder` instead.

With that fix in place, we hit another error, this time pointing to `input_type.__mro__`:

```
AssertionError: Consider SourcelessBuilder for ephemeral objects, usually objects created locally.
```

The fix is similar: instead of using a `VariableBuilder` here, we use a `SourcelessBuilder` when we have no `source`:

addb8e29cd/torch/_dynamo/variables/builtin.py (L1167-L1168)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113340
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-11-13 14:29:17 +00:00
115da02432 [xla hash update] update the pinned xla hash (#113549)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113549
Approved by: https://github.com/pytorchbot
2023-11-13 12:07:20 +00:00
44f1c6e41c [inductor] Handle variance corrections larger than number of data points (#113284)
Fixes #113167

When the correction is larger than the number of data points, we should return a NaN
by dividing by zero, as is done in the eager implementation.

5ea76f1760/aten/src/ATen/native/SharedReduceOps.h (L137)
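
A small repro sketch of the now-matching behavior, assuming the semantics described above:
```python
import torch

x = torch.randn(3)
cfn = torch.compile(lambda t: torch.var(t, correction=10))

# correction (10) exceeds the number of data points (3): eager yields
# nan by dividing by zero, and the compiled version now matches it.
print(torch.var(x, correction=10))  # nan
print(cfn(x))                       # nan after this fix
```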

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113284
Approved by: https://github.com/lezcano
2023-11-13 11:16:17 +00:00
2bcff4d8e3 [state_dict][11/N] Implement cpu_offload and full_state_dict for get_state_dict (#112837)
As title
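
Since the summary is terse, here is a hedged usage sketch of the two options (assuming the `torch.distributed.checkpoint.state_dict` API; exact signatures may differ across versions):
```python
import torch
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_state_dict,
)

model = torch.nn.Linear(8, 8)
optim = torch.optim.Adam(model.parameters())

# full_state_dict=True gathers sharded tensors into full tensors;
# cpu_offload=True moves the gathered result to CPU.
opts = StateDictOptions(full_state_dict=True, cpu_offload=True)
model_sd, optim_sd = get_state_dict(model, optim, options=opts)
```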

Differential Revision: [D50962991](https://our.internmc.facebook.com/intern/diff/D50962991/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112837
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #112836, #112885
2023-11-13 10:03:06 +00:00
b910d9eaa6 Add tensor.is_privateuseone (#113421)
We found a scenario where ``tensor.device().is_privateuseone()`` is used to determine whether a tensor is on the PrivateUse1 backend, but it fails.
For example, in the ``Autograd`` code:
```
::std::tuple<at::Tensor,at::Tensor,at::Tensor> native_batch_norm(c10::DispatchKeySet ks, const at::Tensor & input, const c10::optional<at::Tensor> & weight, const c10::optional<at::Tensor> & bias, const c10::optional<at::Tensor> & running_mean, const c10::optional<at::Tensor> & running_var, bool training, double momentum, double eps) {
  auto& input_ = unpack(input, "input", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( input, weight, bias );

  [[maybe_unused]] auto _any_has_forward_grad_result0 = (isFwGradDefined(input) || isFwGradDefined(weight) || isFwGradDefined(bias));
  check_no_requires_grad(running_mean, "running_mean", "native_batch_norm");
  check_no_requires_grad(running_var, "running_var", "native_batch_norm");
  std::shared_ptr<NativeBatchNormBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<NativeBatchNormBackward0>(new NativeBatchNormBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( input, weight, bias ));
    grad_fn->eps = eps;
    grad_fn->input_ = SavedVariable(input, false);
    grad_fn->running_mean_ = SavedVariable(running_mean, false);
    grad_fn->running_var_ = SavedVariable(running_var, false);
    grad_fn->training = training;
    grad_fn->weight_ = SavedVariable(weight, false);
  }
  ...
}
```
When ``weight`` is ``None``, an empty tensor is automatically generated and passed to the backward calculation:
c7e12c7427/torch/csrc/autograd/saved_variable.cpp (L121-L128)
At the beginning of the backward calculation in our scenario, we need to determine whether the input tensor is ``PrivateUse1``. However, if we use ``tensor.device().is_privateuseone()``, we will get the error ``"tensor does not have a device"``:
c7e12c7427/c10/core/TensorImpl.h (L1223-L1235)
I think this part of the code can be optimized; what do you think?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113421
Approved by: https://github.com/albanD
2023-11-13 01:51:27 +00:00
7afb503e3c [inductor] Label align() with [[maybe_unused]] (#113502)
This squelches the "defined but not used" warning that occurs when
memory planning is disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113502
Approved by: https://github.com/jansel
2023-11-12 16:33:47 +00:00
8b61daaf73 Prune more unnecessary includes from CUDA transformers (#113493)
These kernels are incredibly slow to compile and for the most part are
completely independant of ATen/c10 yet they still end up including
half of `c10` transitively through `CUDAGeneratorImpl.h` and
`CUDAContext.h`.

This trims the fat so `mem_eff_attention` doesn't depend on ATen/c10 at all,
and `flash_attn` now only depends on `PhiloxUtils.cuh` (split out from
`CUDAGeneratorImpl.h`) and `CUDAContextLight.h` which doesn't transitively
include `TensorImpl.h`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113493
Approved by: https://github.com/lezcano
2023-11-12 16:00:05 +00:00
9c331be919 [pytorch] Remove dot if no suffix (#113273)
Summary: Appending the suffix to the version string shouldn't happen if there is no suffix.

Test Plan:
```
/data/users/wbland/fbsource/buck-out/v2/gen/fbcode/param_bench/train/comms/pt/comms.par \
--backend nccl --device cuda --collective all_gather \
--master-ip <snip> --log INFO --b 256 --e 1K \
--num-coll-per-iteration 10 --mode comms --num_iters 5 --w 1 --z 1
...
I1108 07:58:33.852557 2344130 ProcessGroupNCCL.cpp:990] [Rank 0] ProcessGroupNCCL initialization options: NCCL version: 2.17.1, NCCL_ASYNC_ERROR_HANDLING: 3, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=139992854228992
...
```
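
The gist of the fix as a small illustrative sketch (Python for brevity; the actual change is in the C++ version-string formatting, and all names here are hypothetical):
```python
def format_version(major: int, minor: int, patch: int, suffix: str = "") -> str:
    # Append the "." separator only when a suffix is present, so a
    # suffix-less build prints "2.17.1" rather than "2.17.1.".
    version = f"{major}.{minor}.{patch}"
    if suffix:
        version += f".{suffix}"
    return version


print(format_version(2, 17, 1))         # 2.17.1
print(format_version(2, 17, 1, "rc1"))  # 2.17.1.rc1
```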

Differential Revision: D51116095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113273
Approved by: https://github.com/kwen2501
2023-11-12 15:41:27 +00:00
7f1cbc8b5a remove intel_extension_for_pytorch from THIRDPARTY_SKIPLIST (#112840)
Motivation: Because `intel_extension_for_pytorch` is in `THIRDPARTY_SKIPLIST`, the functions defined in IPEX are skipped when an IPEX-optimized model uses `torch.compile`: Dynamo cannot trace them into FX graphs, they cannot be optimized by the compiler, and unnecessary graph breaks occur. This PR removes `intel_extension_for_pytorch` from `THIRDPARTY_SKIPLIST` so that IPEX and `torch.compile` work better together.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112840
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-11-12 09:40:51 +00:00
70064ac416 [Dynamo] Match closures by code ID (#109427)
Closes https://github.com/pytorch/pytorch/issues/107866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109427
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-11-12 08:20:14 +00:00
fe5d8850e2 Fixed docstring errors in _fuser.py, _state.py, __init__.py, _freeze.py, _async.py, _recursive.py, _tensorboard_vis.py, _trace.py, _await.py, _check.py, _serialization.py, _script.py, annotations.py, _monkeytype_config.py (#113371)
Fixes #113194

Docstrings updated.

Here are the error counts before and after:

1) torch/sparse/__init__.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:1 at module level:
        D104: Missing docstring in public package
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:183 in public function `sum`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:183 in public function `sum`:
        D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:183 in public function `sum`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:391 in public class `check_sparse_tensor_invariants`:
        D207: Docstring is under-indented
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:436 in public method `is_enabled`:
        D207: Docstring is under-indented
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:436 in public method `is_enabled`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:448 in public method `enable`:
        D207: Docstring is under-indented
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:468 in public method `disable`:
        D207: Docstring is under-indented
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:475 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:479 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:486 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:492 in public method `__call__`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:502 in public function `as_sparse_gradcheck`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:502 in public function `as_sparse_gradcheck`:
        D400: First line should end with a period (not 'l')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:502 in public function `as_sparse_gradcheck`:
        D401: First line should be in imperative mood (perhaps 'Decorate', not 'Decorator')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:518 in private nested function `gradcheck_with_sparse_support`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:518 in private nested function `gradcheck_with_sparse_support`:
        D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:518 in private nested function `gradcheck_with_sparse_support`:
        D401: First line should be in imperative mood; try rephrasing (found 'Same')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:528 in private nested function `convert_to_strided_representation`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:528 in private nested function `convert_to_strided_representation`:
        D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:559 in private nested function `restore_from_strided_representation`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:559 in private nested function `restore_from_strided_representation`:
        D400: First line should end with a period (not 'd')
23
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:1 at module level:
        D104: Missing docstring in public package
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:476 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:480 in public method `__enter__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:487 in public method `__exit__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:493 in public method `__call__`:
        D102: Missing docstring in public method
5
```
2) torch/contrib/_tensorboard_vis.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/contrib/_tensorboard_vis.py:21 in public function `dump_tensorboard_summary`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/contrib/_tensorboard_vis.py:54 in public function `visualize_graph_executor`:
        D401: First line should be in imperative mood (perhaps 'Append', not 'Appends')
2
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/contrib/_tensorboard_vis.py:21 in public function `dump_tensorboard_summary`:
        D103: Missing docstring in public function
1
```
3) torch/jit/_state.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:1 at module level:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:20 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:25 in public method `parse_env`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:41 in public method `__bool__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:48 in public function `disable`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:52 in public function `enable`:
        D103: Missing docstring in public function
6
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:20 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:25 in public method `parse_env`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:41 in public method `__bool__`:
        D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:48 in public function `disable`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:52 in public function `enable`:
        D103: Missing docstring in public function
5
```
4) torch/jit/_monkeytype_config.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:27 in public function `is_torch_native_class`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:40 in public function `get_type`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:40 in public function `get_type`:
        D401: First line should be in imperative mood; try rephrasing (found 'Helper')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:62 in public function `get_optional_of_element_type`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:62 in public function `get_optional_of_element_type`:
        D400: First line should end with a period (not 'l')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:62 in public function `get_optional_of_element_type`:
        D401: First line should be in imperative mood; try rephrasing (found 'Helper')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:75 in public function `get_qualified_name`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:84 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:87 in public method `log`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:90 in public class `JitTypeTraceStore`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:91 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:98 in public method `add`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:103 in public method `filter`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:111 in public method `analyze`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:122 in public method `consolidate_types`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:139 in public method `get_args_types`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:142 in public class `JitTypeTraceConfig`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:143 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:148 in public method `trace_logger`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:148 in public method `trace_logger`:
        D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:148 in public method `trace_logger`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:154 in public method `trace_store`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:157 in public method `code_filter`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:163 in public class `JitTypeTraceStoreLogger`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:164 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:167 in public class `JitTypeTraceStore`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:168 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:171 in public class `JitTypeTraceConfig`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:172 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:179 in public function `jit_code_filter`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:179 in public function `jit_code_filter`:
        D401: First line should be in imperative mood; try rephrasing (found 'Custom')
31
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:27 in public function `is_torch_native_class`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:74 in public function `get_qualified_name`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:83 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:86 in public method `log`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:89 in public class `JitTypeTraceStore`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:90 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:97 in public method `add`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:102 in public method `filter`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:110 in public method `analyze`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:121 in public method `consolidate_types`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:138 in public method `get_args_types`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:141 in public class `JitTypeTraceConfig`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:142 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:150 in public method `trace_store`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:153 in public method `code_filter`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:159 in public class `JitTypeTraceStoreLogger`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:160 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:163 in public class `JitTypeTraceStore`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:164 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:167 in public class `JitTypeTraceConfig`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:168 in public method `__init__`:
        D107: Missing docstring in __init__
21
```
5) torch/jit/_fuser.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:9 in public function `optimized_execution`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:9 in public function `optimized_execution`:
        D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:9 in public function `optimized_execution`:
        D401: First line should be in imperative mood; try rephrasing (found 'A')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:23 in public function `fuser`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:23 in public function `fuser`:
        D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:23 in public function `fuser`:
        D401: First line should be in imperative mood; try rephrasing (found 'A')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:136 in public function `set_fusion_strategy`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
7
```
After:
```
0
```
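For reference, a rewrite that clears D205, D400, and D401 takes the following shape (a hypothetical function, not the actual torch/jit code):

```python
def set_fuser(name: str) -> str:
    """Set the active fuser backend.

    The summary line above is in the imperative mood ("Set", not "Sets",
    satisfying D401), ends with a period (D400), and is separated from this
    description by a blank line (D205).
    """
    return name
```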
6) torch/jit/_async.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:1 at module level:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:1 at module level:
        D400: First line should end with a period (not 'I')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:20 in public function `fork`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:20 in public function `fork`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:20 in public function `fork`:
        D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:88 in public function `wait`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:88 in public function `wait`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:88 in public function `wait`:
        D401: First line should be in imperative mood (perhaps 'Force', not 'Forces')
8
```
After:
```
0
```
7) torch/jit/_await.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:11 in private function `_awaitable`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:11 in private function `_awaitable`:
        D400: First line should end with a period (not ',')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:11 in private function `_awaitable`:
        D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:19 in private function `_awaitable_wait`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:19 in private function `_awaitable_wait`:
        D400: First line should end with a period (not ',')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:19 in private function `_awaitable_wait`:
        D401: First line should be in imperative mood (perhaps 'Request', not 'Requests')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:27 in private function `_awaitable_nowait`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:27 in private function `_awaitable_nowait`:
        D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
8
```
After:
```
0
```
8) torch/jit/_check.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:10 in public class `AttributeTypeIsSupportedChecker`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:10 in public class `AttributeTypeIsSupportedChecker`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:10 in public class `AttributeTypeIsSupportedChecker`:
        D412: No blank lines allowed between a section header and its content ('Example')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:61 in public method `check`:
        D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:110 in public method `visit_Assign`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:110 in public method `visit_Assign`:
        D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:132 in public method `visit_AnnAssign`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:132 in public method `visit_AnnAssign`:
        D400: First line should end with a period (not '`')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:187 in public method `visit_Call`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:187 in public method `visit_Call`:
        D400: First line should end with a period (not '`')
10
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:58 in public method `check`:
        D102: Missing docstring in public method
1
```
9) torch/jit/_freeze.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:1 at module level:
        D400: First line should end with a period (not 'g')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:16 in public function `freeze`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:16 in public function `freeze`:
        D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:127 in public function `run_frozen_optimizations`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:127 in public function `run_frozen_optimizations`:
        D401: First line should be in imperative mood (perhaps 'Run', not 'Runs')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:182 in public function `optimize_for_inference`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:182 in public function `optimize_for_inference`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:182 in public function `optimize_for_inference`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
8
```
After:
```
0
```
10) torch/jit/_recursive.py

Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:69 in public function `make_stub`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:75 in public function `make_stub_from_method`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:90 in public function `make_stubs_from_exported_methods`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:103 in public function `jit_ignored_properties`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:155 in public class `SourceContext`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:156 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:160 in public function `get_annotations`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:186 in public function `infer_concrete_type_builder`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:186 in public function `infer_concrete_type_builder`:
        D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:423 in public class `ConcreteTypeStore`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:427 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:434 in public method `get_or_create_concrete_type`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:434 in public method `get_or_create_concrete_type`:
        D400: First line should end with a period (not 'T')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:459 in public function `create_methods_and_properties_from_stubs`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:474 in public function `create_hooks_from_stubs`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:485 in public function `get_module_concrete_type`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:485 in public function `get_module_concrete_type`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:485 in public function `get_module_concrete_type`:
        D401: First line should be in imperative mood (perhaps 'Get', not 'Gets')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:539 in public function `create_script_module`:
        D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:539 in public function `create_script_module`:
        D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:725 in public function `script_model_defines_attr`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:735 in public function `add_python_attr_to_scripted_model`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:740 in public function `get_overload_annotations`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:772 in public function `get_overload_name_mapping`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:797 in public function `make_stubs_for_overloads`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:816 in public function `check_module_initialized`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:842 in public function `infer_methods_to_compile`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:842 in public function `infer_methods_to_compile`:
        D400: First line should end with a period (not 'g')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:842 in public function `infer_methods_to_compile`:
        D401: First line should be in imperative mood (perhaps 'Implement', not 'Implements')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:904 in public function `get_hook_stubs`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:904 in public function `get_hook_stubs`:
        D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:904 in public function `get_hook_stubs`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:940 in public function `get_property_stubs`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:940 in public function `get_property_stubs`:
        D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:963 in public function `interface_script`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:963 in public function `interface_script`:
        D400: First line should end with a period (not 'r')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:963 in public function `interface_script`:
        D401: First line should be in imperative mood (perhaps 'Make', not 'Makes')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:977 in private nested function `infer_interface_methods_to_compile`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:977 in private nested function `infer_interface_methods_to_compile`:
        D400: First line should end with a period (not 'h')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:989 in public function `try_compile_fn`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1014 in public function `wrap_cpp_class`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1021 in public function `wrap_cpp_module`:
        D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1021 in public function `wrap_cpp_module`:
        D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1040 in public function `compile_unbound_method`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1052 in public function `lazy_bind`:
        D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1052 in public function `lazy_bind`:
        D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1052 in public function `lazy_bind`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
47
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:69 in public function `make_stub`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:75 in public function `make_stub_from_method`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:90 in public function `make_stubs_from_exported_methods`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:103 in public function `jit_ignored_properties`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:155 in public class `SourceContext`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:156 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:160 in public function `get_annotations`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:424 in public class `ConcreteTypeStore`:
        D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:428 in public method `__init__`:
        D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:457 in public function `create_methods_and_properties_from_stubs`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:472 in public function `create_hooks_from_stubs`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:724 in public function `script_model_defines_attr`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:734 in public function `add_python_attr_to_scripted_model`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:739 in public function `get_overload_annotations`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:771 in public function `get_overload_name_mapping`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:796 in public function `make_stubs_for_overloads`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:815 in public function `check_module_initialized`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:979 in public function `try_compile_fn`:
        D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1026 in public function `compile_unbound_method`:
        D103: Missing docstring in public function
19
```

@svekars

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113371
Approved by: https://github.com/davidberard98
2023-11-12 03:19:02 +00:00
15a2caea8e Enables copy/clone/reshape/contiguous operations for bits types (#113508)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113508
Approved by: https://github.com/albanD
2023-11-11 22:51:50 +00:00
d00c983b63 [dynamo] Make {testing,debug_utils,utils}.py pass follow_imports typechecking (#113519)
Notes:

* `debug_insert_nops` in testing.py was passing `None` to the compiler_fn
parameter of `OutputGraph`, hence the modifications there.
* I added `disable-error-code="method-assign"` to debug_utils.py as it
does several such assignments. I guess mypy doesn't like it because it
makes code near-impossible to safely typecheck.
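For reference, that setting can be applied with a module-level mypy comment; a minimal sketch (the class and assignment below are illustrative, not the debug_utils.py code):

```python
# mypy: disable-error-code="method-assign"


class Patchable:
    def run(self) -> None:
        print("original")


p = Patchable()
p.run = lambda: print("patched")  # would normally be flagged as [method-assign]
p.run()
```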

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113519
Approved by: https://github.com/Skylion007
ghstack dependencies: #113413, #113518
2023-11-11 22:15:46 +00:00
6805d1e1d6 [inductor] Make graph.py pass follow_imports typechecking (#113518)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113518
Approved by: https://github.com/Skylion007
ghstack dependencies: #113413
2023-11-11 22:15:46 +00:00
a8cf04fd2a [inductor] Make {output_graph,pad_mm}.py pass follow_imports typechecking (#113413)
I changed OutputGraph.nn_modules' type to `Dict[str, Any]` because it
seems that `register_attr_or_module` can populate it with essentially
any type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113413
Approved by: https://github.com/Skylion007
2023-11-11 22:15:46 +00:00
8d41a5c605 [inductor] Fix cat decomp when first tensor is empty (#113514)
Summary: Previously, when the first tensor argument to `aten.cat` was empty and there was only one non-empty tensor argument, the first (empty) tensor was erroneously returned by the `aten.cat` decomposition. Here we fix the bug.
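
A minimal sketch of the corrected selection logic (illustrative names; the real decomposition lives in the inductor decomposition table and handles more cases):

```python
import torch

def cat_decomposition(tensors, dim=0):
    # Empty tensors contribute nothing to the concatenation, so drop them.
    non_empty = [t for t in tensors if t.numel() > 0]
    if not non_empty:
        return tensors[0].clone()
    if len(non_empty) == 1:
        # The bug was returning tensors[0] here, which may be the empty one;
        # return the surviving non-empty tensor instead.
        return non_empty[0].clone()
    return torch.cat(non_empty, dim=dim)
```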

Test Plan:

```
$ python test/inductor/test_torchinductor.py -k test_cat_empty
...
----------------------------------------------------------------------
Ran 2 tests in 5.760s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113514
Approved by: https://github.com/jansel
2023-11-11 20:34:22 +00:00
39ca5a3226 [ONNX] Fix scalar type promotion between fp16 tensor and fp32 scalar (#113404)
Fixes https://github.com/pytorch/pytorch/issues/104594.

The reason for the exporter behavior in the originally posted issue is as follows:
the ONNX model tracks shape-related computations that were done in PyTorch with
Python numbers as tensor computations. This is the only way for ONNX to track
them properly, since ONNX only has a tensor type; otherwise the computation
result would be baked in statically as a constant, and the model would not work
for another input that differs in shape.

In the type promotion logic, scalars should be treated differently from tensors.
The exporter mistook these shape-related scalars for tensors and promoted them
incorrectly.

This PR fixes the behavior and relaxes the criteria for scalar recognition. For
floating point, previously only a rank-0 value from a model initializer with
dtype torch.double was treated as a scalar. Now any intermediate value qualifies,
and dtype torch.float is accepted as well. The previous assumption was that a
Python number is traced with dtype torch.double, which also no longer appears to
hold.

NOTE that this might introduce a regression: a real rank-0 tensor is now
recognized as a scalar. The downside is that the model will lose accuracy in
these cases, since certain computations will happen in lower-precision data
types.
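
For context, eager PyTorch does not let a Python scalar promote a tensor's dtype, which is the behavior the exporter needs to reproduce (a quick illustration, not exporter code):

```python
import torch

t = torch.ones(3, dtype=torch.float16)
# A Python float is a "weak" scalar: the fp16 tensor dtype wins.
print((t * 2.0).dtype)                                  # torch.float16
# A non-scalar fp32 tensor, by contrast, promotes the result.
print((t * torch.ones(3, dtype=torch.float32)).dtype)   # torch.float32
```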

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113404
Approved by: https://github.com/justinchuby
2023-11-11 15:08:07 +00:00
b00311ce9e [dynamo] Add run_inductor_tests entrypoint (#113278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113278
Approved by: https://github.com/yanboliang
2023-11-11 08:54:43 +00:00
fb9a136383 [pytorch-vulkan] Add operator<< for uvec3 (#112113)
Summary: Useful for debugging.

Test Plan:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin  | pastry
```
Test all pass: P865112285

Differential Revision: D50676692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112113
Approved by: https://github.com/manuelcandales
2023-11-11 08:21:35 +00:00
ef49f61f19 [vision hash update] update the pinned vision hash (#113499)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113499
Approved by: https://github.com/pytorchbot
2023-11-11 03:21:37 +00:00
66d09f8217 [inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)
This PR is just moving things around, so code shared by multiple test files is in torch/testing/_internal/inductor_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113275
Approved by: https://github.com/yanboliang
ghstack dependencies: #113242
2023-11-11 03:17:35 +00:00
4309d38f5d [dynamo] Refactor test cross importing (#113242)
Having tests import tests is a bit annoying because fbcode/oss have different paths.  This moves that stuff into a helper function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113242
Approved by: https://github.com/yanboliang
2023-11-11 03:17:35 +00:00
5e03af8295 [inductor] Enable floor_div indexing to work under ABI-compat mode (#113276)
Previously, floor_div operations were defined in
ATen/native/BinaryOps.h. Since this header was not included under
ABI-compat mode, trying to use those indexing operations would result in
compilation errors.

Technically, it is safe to use aten::native::floor_div_* functions in
ABI-compat mode as they are header-only; we could simply include
BinaryOps.h. However, there are other declarations in BinaryOps.h that
are not binary-compatible, so this is not ideal. Thus, I have moved those
functions into a separate file, and put them under c10/util, since they
don't really have tensor-specific logic.

c10 functions are not all header-only, so this still isn't ideal, but
this still seems like an improvement. Moreover, cpp_prefix.h -- used
when compiling cpp kernels -- already includes c10 header files, so
ABI-compatibility already depends on maintaining some c10 functions as
header-only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-11-11 02:51:29 +00:00
e75e01e6b9 Skip if the max element is 0 to avoid invalid config for CAT (#113321)
Summary:
We observed a CUDA invalid-configuration error during training (example log: https://www.internalfb.com/phabricator/paste/view/P876519113). It is actually caused by a grid dimension of 0.

Here is an example failed job: https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-zorror-996644c19c?version=0&env=PRODUCTION
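
The shape of the fix is a guard before the launch, since a grid dimension of 0 is an invalid CUDA launch configuration; a minimal sketch with assumed names (`max_elem`, `block`):

```python
def launch_cat_kernel(max_elem: int, block: int = 256) -> None:
    if max_elem == 0:
        # All inputs are empty: grid would be (0,), which CUDA rejects with
        # an invalid-configuration error, so skip the kernel entirely.
        return
    grid = ((max_elem + block - 1) // block,)
    print(f"launching with grid={grid}")  # stand-in for the real kernel launch
```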

Test Plan: unit test

Differential Revision: D51136494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113321
Approved by: https://github.com/jianyuh
2023-11-11 02:45:43 +00:00
3b915f9de0 [pt2] enable meta tests for foreach ops (#113484)
Try https://github.com/pytorch/pytorch/pull/113059 again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113484
Approved by: https://github.com/lezcano
2023-11-11 02:43:41 +00:00
28e11f54ab [dynamo] skip test_internal_error_suppress_errors in fbcode (#113482)
Summary: This test generates a different stack trace in fbcode and seems to have been failing for a while.

Test Plan: sandcastle

Differential Revision: D51210355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113482
Approved by: https://github.com/oulgen
2023-11-11 02:41:29 +00:00
575be044c3 [TD] Disable HistoricalClassFailurCorrelation (#113497)
Ex. https://github.com/pytorch/pytorch/actions/runs/6829618325/job/18576593307
```
  File "test/run_test.py", line 1806, in main
    test_stats = aggregated_heuristics.get_test_stats(test)
  File "/var/lib/jenkins/pytorch/tools/testing/target_determination/heuristics/interface.py", line 391, in get_test_stats
    metrics = heuristic_results.get_priority_info_for_test(test)
  File "/var/lib/jenkins/pytorch/tools/testing/target_determination/heuristics/interface.py", line 307, in get_priority_info_for_test
    relevance = self._get_test_relevance_group(test_run)
  File "/var/lib/jenkins/pytorch/tools/testing/target_determination/heuristics/interface.py", line 278, in _get_test_relevance_group
    raise ValueError(f"Test {test_run} not found in any relevance group")
ValueError: Test test_cuda_expandable_segments not found in any relevance group
```
I believe that the root cause is that HistoricalClassFailurCorrelation splits `test_cuda_expandable_segments` into two sets: one with a class and one without the class. Then, when the entire `test_cuda_expandable_segments` fails (because we currently don't do class-level granularity in TD), it is unable to find the rank that HistoricalClassFailurCorrelation assigned to the test, since the test is split in two.

I don't think this is that important for normal CI users since this code only runs if a test failed in the first place.  However, it does mean that we can't gather TD stats, so I am going to disable it for now.

One possible solution is to switch the contains call and take the worst or best of the bunch, which I think is what https://github.com/pytorch/pytorch/blob/main/tools/testing/target_determination/heuristics/interface.py#L272 is trying to do, though that is unclear.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113497
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/malfet
2023-11-11 02:34:27 +00:00
3cb6cf1e8a Revert "[ONNX] Fix scalar type promotion between fp16 tensor and fp32 scalar (#113404)"
This reverts commit f2cd68102a56cd0427f25b748bbe3b463d43807b.

Reverted https://github.com/pytorch/pytorch/pull/113404 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing in trunk f2cd68102a; it may be a land race or flakiness of some sort ([comment](https://github.com/pytorch/pytorch/pull/113404#issuecomment-1806613497))
2023-11-11 02:09:22 +00:00
9f15fbae53 [Dynamo] Fix bug for bytecode hook and leave a test case (#113457)
Fixes https://github.com/pytorch/pytorch/pull/113234#issuecomment-1805584787.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113457
Approved by: https://github.com/jansel
2023-11-11 01:59:48 +00:00
670abff6ff docs: Fix docstring lint errors in torch/distributed/fsdp/_flat_param.py & torch/distributed/fsdp/_init_utils.py (#113358)
Fixes #113189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113358
Approved by: https://github.com/kit1980
2023-11-11 01:53:02 +00:00
4916a7e94f Revert "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)"
This reverts commit a62a88bb84f633581242bd0107e01d2a075884a3.

Reverted https://github.com/pytorch/pytorch/pull/112623 on behalf of https://github.com/huydhn due to This breaks TestCuda::test_lazy_init on ROCm ([comment](https://github.com/pytorch/pytorch/pull/112623#issuecomment-1806597750))
2023-11-11 00:35:56 +00:00
0a7eef9bcf [BE] Remove stale CUDA version check from cpp_extension.py (#113447)
At least CUDA 11.x is now required to build PyTorch on the latest trunk, so the old version check is stale.
We still skip `--generate-dependencies-with-compile` when running on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113447
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/PaliC, https://github.com/huydhn
2023-11-11 00:20:08 +00:00
740e8a536f Perf improvements for eager GridSampler (#113341)
Description:
- Added a vectorized `cast` and fixed a `mask_gather` signature bug, both used in GridSampler

Perf speed-up results:
- CPU capability usage: AVX2
```
[--------------------------------------------------------------------------------------- Affine grid sampling, cpu ----------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.2.0a0+git971a50e) PR  |  Eager (2.2.0a0+git3ca81ae) nightly  |  Speed-up PR vs Nightly
1 threads: -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (1, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |        698.871 (+-42.998)       |         1196.590 (+-16.223)          |     1.712 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=True, mode=nearest    |       1363.909 (+-49.798)       |         2658.933 (+-62.208)          |     1.949 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |        542.857 (+-3.547)        |         1166.259 (+-13.349)          |     2.148 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.channels_last, align_corners=True, mode=nearest        |       1110.957 (+-173.044)      |         2472.511 (+-37.322)          |     2.226 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        666.702 (+-3.624)        |         1211.040 (+-15.933)          |     1.816 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=False, mode=nearest   |       1383.907 (+-52.735)       |         2680.096 (+-72.214)          |     1.937 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        552.020 (+-4.574)        |         1165.713 (+-13.829)          |     2.112 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.channels_last, align_corners=False, mode=nearest       |       1195.561 (+-43.627)       |         2479.525 (+-37.279)          |     2.074 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |       1434.594 (+-18.829)       |         3713.318 (+-53.087)          |     2.588 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=True, mode=bilinear   |       2584.424 (+-61.646)       |         6266.618 (+-70.403)          |     2.425 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |       1064.318 (+-17.605)       |         3689.232 (+-35.200)          |     3.466 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.channels_last, align_corners=True, mode=bilinear       |       2227.200 (+-46.111)       |         6053.448 (+-43.859)          |     2.718 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |       1479.566 (+-23.023)       |         3695.113 (+-48.203)          |     2.497 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=False, mode=bilinear  |       2551.005 (+-58.898)       |         6244.574 (+-66.058)          |     2.448 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |       1081.029 (+-13.911)       |         3680.292 (+-35.145)          |     3.404 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.channels_last, align_corners=False, mode=bilinear      |       2209.528 (+-61.779)       |         6073.101 (+-99.366)          |     2.749 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |       4607.162 (+-40.688)       |        14703.326 (+-564.378)         |     3.191 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=True, mode=bicubic    |      30132.017 (+-679.033)      |        38338.429 (+-768.288)         |     1.272 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |       4274.459 (+-33.603)       |        14766.649 (+-260.509)         |     3.455 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.channels_last, align_corners=True, mode=bicubic        |      29137.615 (+-617.591)      |        37420.822 (+-785.526)         |     1.284 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |       4954.048 (+-79.199)       |        14704.016 (+-330.618)         |     2.968 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=False, mode=bicubic   |      30068.414 (+-792.686)      |        38409.600 (+-691.079)         |     1.277 (+-0.000)
      Input: (1, 3, 500, 400) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |       4274.381 (+-35.679)       |        14756.324 (+-236.034)         |     3.452 (+-0.000)
      Input: (1, 3, 500, 400) torch.float64, torch.channels_last, align_corners=False, mode=bicubic       |      29148.286 (+-780.277)      |        37389.990 (+-663.702)         |     1.283 (+-0.000)

      Input: (8, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |       9656.722 (+-66.127)       |        13726.028 (+-112.412)         |     1.421 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=True, mode=nearest    |      19947.575 (+-108.492)      |        41501.452 (+-327.186)         |     2.081 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |       7597.021 (+-52.866)       |         10839.269 (+-93.029)         |     1.427 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.channels_last, align_corners=True, mode=nearest        |      28164.663 (+-179.955)      |        34985.201 (+-350.970)         |     1.242 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |       9703.983 (+-154.907)      |        13858.466 (+-128.411)         |     1.428 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=False, mode=nearest   |      34086.142 (+-212.213)      |        41104.817 (+-433.195)         |     1.206 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |       7626.922 (+-56.371)       |         10916.952 (+-96.023)         |     1.431 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.channels_last, align_corners=False, mode=nearest       |      28277.855 (+-228.616)      |        34851.453 (+-260.788)         |     1.232 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |      14180.691 (+-184.150)      |        36243.299 (+-350.811)         |     2.556 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=True, mode=bilinear   |      40699.798 (+-234.600)      |        68053.260 (+-1057.869)        |     1.672 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |      11190.905 (+-103.419)      |        30729.080 (+-381.639)         |     2.746 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.channels_last, align_corners=True, mode=bilinear       |      35965.958 (+-298.474)      |        63030.143 (+-390.692)         |     1.752 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |      14461.459 (+-120.555)      |        36150.986 (+-293.416)         |     2.500 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=False, mode=bilinear  |      40891.653 (+-195.887)      |        67757.076 (+-991.072)         |     1.657 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |      11437.092 (+-100.145)      |        30465.192 (+-282.936)         |     2.664 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.channels_last, align_corners=False, mode=bilinear      |      36112.937 (+-306.527)      |        63729.695 (+-678.976)         |     1.765 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |      39512.380 (+-368.172)      |       129854.028 (+-1635.314)        |     3.286 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=True, mode=bicubic    |     283835.203 (+-2166.425)     |       352072.211 (+-3178.250)        |     1.240 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |      35804.934 (+-341.254)      |       126762.714 (+-1740.266)        |     3.540 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.channels_last, align_corners=True, mode=bicubic        |     275862.511 (+-2549.251)     |       341804.886 (+-2974.238)        |     1.239 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |      39514.504 (+-307.814)      |       130436.644 (+-3081.411)        |     3.301 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.contiguous_format, align_corners=False, mode=bicubic   |     283929.198 (+-2373.485)     |       353432.316 (+-3600.725)        |     1.245 (+-0.000)
      Input: (8, 3, 500, 400) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |      35776.293 (+-267.109)      |       126884.936 (+-1718.414)        |     3.547 (+-0.000)
      Input: (8, 3, 500, 400) torch.float64, torch.channels_last, align_corners=False, mode=bicubic       |     276278.294 (+-2150.899)     |       326207.948 (+-2578.309)        |     1.181 (+-0.000)

Times are in microseconds (us).
```
[Source](https://github.com/vfdev-5/pth-grid-sampler/blob/master/output/20231109-161300-pr_vs_nightly-speedup.md)

TODO:
- [ ] Add AVX512 benchmark results (I have no access to a cpu with avx512 capabilities anymore)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113341
Approved by: https://github.com/lezcano
2023-11-11 00:16:26 +00:00
e8e3afb784 [ONNX] Refactor MaxPool to support dynamic inputs (#113318)
In https://github.com/pytorch/pytorch/pull/106270, the solution managed to solve the [`ceil_mode` corner issue](https://github.com/onnx/onnx/issues/5711) with the usage of `get_pool_ceil_padding`. However, padding for ceil mode on the converter side only works when the input shapes are already known, so a regression occurred for users who need dynamic inputs.

This PR (1) refactors the code with the torchlib implementation, (2) adds a dynamic-shapes test, and (3) disables the corner-case tests, with comments saying to re-enable them when the [real fix from ONNX](https://github.com/onnx/onnx/pull/5741) is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113318
Approved by: https://github.com/thiagocrepaldi
2023-11-10 23:23:49 +00:00
a4dc3716c0 Deprecated verbose parameter in LR schedulers (#111302)
Fixes https://github.com/pytorch/pytorch/issues/100847

This PR follows the comment in https://github.com/pytorch/pytorch/issues/100847#issuecomment-1546247239 by deprecating the `verbose` parameter and removing the print statements. Removing the print statements is technically BC breaking, so I would be okay with putting them back in.

To be less annoying, this PR raises a warning only when `verbose` is explicitly passed in.
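
A minimal sketch of that pattern (hypothetical names, not the torch.optim code): a sentinel default distinguishes "not passed" from any explicit value, so only explicit callers see the warning.

```python
import warnings

_NOT_PASSED = object()


def resolve_verbose(verbose=_NOT_PASSED) -> bool:
    if verbose is _NOT_PASSED:
        return False  # caller relied on the default: stay silent
    warnings.warn(
        "The verbose parameter is deprecated and will be removed.",
        UserWarning,
    )
    return bool(verbose)
```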
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111302
Approved by: https://github.com/albanD
2023-11-10 23:17:27 +00:00
d4e670c37c Add pyre internal configs to gitignore (#113480)
TSIA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113480
Approved by: https://github.com/clee2000
2023-11-10 22:44:13 +00:00
06dc2f162d [AOTI] Implement support for user defined kernels that use triton.autotune (#113229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113229
Approved by: https://github.com/chenyang78
2023-11-10 22:40:51 +00:00
f2cd68102a [ONNX] Fix scalar type promotion between fp16 tensor and fp32 scalar (#113404)
Fixes https://github.com/pytorch/pytorch/issues/104594.

The reason for the exporter behavior in the originally posted issue is as follows:
the ONNX model tracks shape-related computations that were done in PyTorch with
Python numbers as tensor computations. This is the only way for ONNX to track
them properly, since ONNX only has a tensor type; otherwise the computation
result would be baked in statically as a constant, and the model would not work
for another input that differs in shape.

In the type promotion logic, scalars should be treated differently from tensors.
The exporter mistook these shape-related scalars for tensors and promoted them
incorrectly.

This PR fixes the behavior and relaxes the criteria for scalar recognition. For
floating point, previously only a rank-0 value from a model initializer with
dtype torch.double was treated as a scalar. Now any intermediate value qualifies,
and dtype torch.float is accepted as well. The previous assumption was that a
Python number is traced with dtype torch.double, which also no longer appears to
hold.

NOTE that this might introduce a regression: a real rank-0 tensor is now
recognized as a scalar. The downside is that the model will lose accuracy in
these cases, since certain computations will happen in lower-precision data
types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113404
Approved by: https://github.com/justinchuby
2023-11-10 22:31:25 +00:00
cbf12dfba6 [LLVM] Replaced getInt8PtrTy with getUnqual (#113455)
[llvm-fb-staging] The build failed in PyTorch JIT due to an upstream LLVM API change. The fix is simply to replace getInt8PtrTy with getUnqual. The corresponding task: [T169468309](https://www.internalfb.com/tasks/?t=169468309).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113455
Approved by: https://github.com/malfet
2023-11-10 22:26:20 +00:00
48c2f89399 [BE] Add friendly error message if compile_fx_inner does not return a tuple/list (#113451)
Previously it would fail here:

```
  File "/data/users/ezyang/a/pytorch/torch/_inductor/fx_passes/post_grad.py", line 597, in remove_noop_ops
    for out in tuple(graph.nodes)[-1].args[0]:
```

Now you'll trigger this assert instead.
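
The friendlier failure amounts to validating the return container up front; a hedged sketch with an assumed helper name:

```python
def check_compiled_output(result):
    assert isinstance(result, (tuple, list)), (
        "compile_fx_inner is expected to return a tuple or list of output "
        f"tensors, got {type(result).__name__}"
    )
```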

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113451
Approved by: https://github.com/albanD
2023-11-10 21:43:58 +00:00
dfa9e7b511 Allow inferring divisibility on unbacked SymInts and do replacement trick (#113165)
We want something like torch.empty(i0, 12).view(4, -1, 12) to work. Right now, it chokes on guards on data-dependent accesses. It turns out we are very close to having it work, based on experiments in https://github.com/pytorch/pytorch/issues/112347, if we do the replacement trick: setting i0 = i1 * 4 to explicitly encode the divisibility. This is good enough for Sympy to handle the rest.

There are two parts to this PR.

* First, we must discover that there is this divisibility constraint. The place where this happens for view is in `infer_size`; however, we are unable to discover the modulus test with `expect_true` because the condition is currently written with a Python boolean operator that forces guarding too early: `numel == newsize or (dim is not None and newsize > 0 and numel % newsize == 0)`. We rewrite this into an equivalent version which first tests whether dim is None, before performing the individual tests (see the sketch after this list). The main nontrivial reasoning here is that the tests in the `dim is not None` branch must be sufficient when `numel == newsize`; but if `numel == newsize`, then the modulus test passes, so the rewrite is equivalent.
* Given the modifications to `infer_size`, this suffices to produce a runtime assert `Eq(Mod(192*i0, 2304), 0)`. Now we must simply turn this into the replacement automatically. I wasn't really sure how to use Sympy to do this for me, so I just manually pattern matched this particular expression form and perform the replacement when it matches.
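
A simplified sketch of the rewritten condition from the first bullet (the real code raises descriptive errors and runs under the symbolic-shapes machinery rather than returning a bool):

```python
def infer_size_ok(numel: int, newsize: int, dim) -> bool:
    if dim is None:
        # No wildcard dimension: the sizes must match exactly.
        return numel == newsize
    # With a wildcard dimension, test positivity and divisibility directly.
    # When numel == newsize these tests pass anyway, so this branch subsumes
    # the dropped disjunct while letting each test be recorded individually
    # (e.g. via expect_true) on unbacked symbols.
    return newsize > 0 and numel % newsize == 0
```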

Note that this is mostly useful for export, because inductor chokes on views involving unbacked SymInts. That will be a follow-up.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113165
Approved by: https://github.com/lezcano, https://github.com/aakhundov
2023-11-10 21:28:02 +00:00
91c90f232a Fix docstring errors in reductions.py, spawn.py, pool.py, parameter.py, cpp.py, grad.py, __init__.py, profiler.py, queue.py, graph.py (#113052)
Fixes #112595
- `torch/autograd/profiler.py` </br>
**Before: 37**

```
torch/autograd/profiler.py:1 at module level:
        D100: Missing docstring in public module
torch/autograd/profiler.py:91 in public class `profile`:
        D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:175 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:261 in public method `config`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:272 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:290 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:308 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:313 in public method `__str__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:322 in public method `table`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:346 in public method `export_chrome_trace`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:355 in public method `export_stacks`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:361 in public method `key_averages`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:368 in public method `total_average`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:377 in public method `self_cpu_time_total`:
        D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:377 in public method `self_cpu_time_total`:
        D400: First line should end with a period (not 'f')
torch/autograd/profiler.py:555 in public class `record_function`:
        D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:555 in public class `record_function`:
        D400: First line should end with a period (not 'f')
torch/autograd/profiler.py:591 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:602 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:608 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:625 in private method `_call_end_callbacks_on_future`:
        D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:625 in private method `_call_end_callbacks_on_future`:
        D400: First line should end with a period (not 'c')
torch/autograd/profiler.py:707 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:712 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:733 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:826 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:831 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:853 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:863 in public function `load_nvprof`:
        D401: First line should be in imperative mood (perhaps 'Open', not 'Opens')
torch/autograd/profiler.py:874 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:877 in public method `see`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:883 in public function `parse_nvprof_trace`:
        D103: Missing docstring in public function
torch/autograd/profiler.py:951 in public class `KinetoStepTracker`:
        D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:991 in public method `init_step_count`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:995 in public method `erase_step_count`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:1000 in public method `increment_step`:
        D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:1023 in public method `current_step`:
        D102: Missing docstring in public method
37
```

**After: 27**

```
torch/autograd/profiler.py:1 at module level:
        D100: Missing docstring in public module
torch/autograd/profiler.py:176 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:262 in public method `config`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:273 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:291 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:309 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:314 in public method `__str__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:323 in public method `table`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:347 in public method `export_chrome_trace`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:356 in public method `export_stacks`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:362 in public method `key_averages`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:369 in public method `total_average`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:593 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:604 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:610 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:708 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:713 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:734 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:827 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:832 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:854 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/profiler.py:875 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/profiler.py:878 in public method `see`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:884 in public function `parse_nvprof_trace`:
        D103: Missing docstring in public function
torch/autograd/profiler.py:993 in public method `init_step_count`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:997 in public method `erase_step_count`:
        D102: Missing docstring in public method
torch/autograd/profiler.py:1025 in public method `current_step`:
        D102: Missing docstring in public method
27
```

- `torch/autograd/graph.py` <br>
**Before: 22**

```
torch/autograd/graph.py:1 at module level:
        D100: Missing docstring in public module
torch/autograd/graph.py:24 in public class `Node`:
        D101: Missing docstring in public class
torch/autograd/graph.py:27 in public method `name`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/autograd/graph.py:42 in public method `next_functions`:
        D102: Missing docstring in public method
torch/autograd/graph.py:47 in public method `metadata`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/autograd/graph.py:56 in public method `register_hook`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/autograd/graph.py:94 in public method `register_prehook`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/autograd/graph.py:129 in public method `__subclasshook__`:
        D105: Missing docstring in magic method
torch/autograd/graph.py:147 in public function `get_gradient_edge`:
        D205: 1 blank line required between summary line and description (found 0)
torch/autograd/graph.py:147 in public function `get_gradient_edge`:
        D400: First line should end with a period (not 'f')
torch/autograd/graph.py:147 in public function `get_gradient_edge`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/autograd/graph.py:166 in public function `increment_version`:
        D205: 1 blank line required between summary line and description (found 0)
torch/autograd/graph.py:166 in public function `increment_version`:
        D400: First line should end with a period (not 'd')
torch/autograd/graph.py:166 in public function `increment_version`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/autograd/graph.py:243 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/graph.py:251 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/graph.py:256 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/graph.py:261 in public class `save_on_cpu`:
        D205: 1 blank line required between summary line and description (found 0)
torch/autograd/graph.py:261 in public class `save_on_cpu`:
        D400: First line should end with a period (not 'e')
torch/autograd/graph.py:303 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/graph.py:365 in public function `register_multi_grad_hook`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/autograd/graph.py:588 in public function `allow_mutation_on_saved_tensors`:
        D400: First line should end with a period (not 'd')
22
```

**After: 8**

```
torch/autograd/graph.py:1 at module level:
        D100: Missing docstring in public module
torch/autograd/graph.py:24 in public class `Node`:
        D101: Missing docstring in public class
torch/autograd/graph.py:42 in public method `next_functions`:
        D102: Missing docstring in public method
torch/autograd/graph.py:129 in public method `__subclasshook__`:
        D105: Missing docstring in magic method
torch/autograd/graph.py:244 in public method `__init__`:
        D107: Missing docstring in __init__
torch/autograd/graph.py:252 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/autograd/graph.py:257 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/autograd/graph.py:303 in public method `__init__`:
        D107: Missing docstring in __init__
8
```

- `torch/multiprocessing/pool.py` <br>
**Before: 6**

```
torch/multiprocessing/pool.py:1 at module level:
        D100: Missing docstring in public module
torch/multiprocessing/pool.py:7 in public function `clean_worker`:
        D103: Missing docstring in public function
torch/multiprocessing/pool.py:18 in public class `Pool`:
        D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/pool.py:18 in public class `Pool`:
        D209: Multi-line docstring closing quotes should be on a separate line
torch/multiprocessing/pool.py:29 in private method `_repopulate_pool`:
        D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/pool.py:29 in private method `_repopulate_pool`:
        D400: First line should end with a period (not ',')
6
```

**After: 2**

```
torch/multiprocessing/pool.py:1 at module level:
        D100: Missing docstring in public module
torch/multiprocessing/pool.py:7 in public function `clean_worker`:
        D103: Missing docstring in public function
2
```

- `torch/multiprocessing/queue.py` <br>
**Before: 11**

```
torch/multiprocessing/queue.py:1 at module level:
        D100: Missing docstring in public module
torch/multiprocessing/queue.py:8 in public class `ConnectionWrapper`:
        D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/queue.py:8 in public class `ConnectionWrapper`:
        D209: Multi-line docstring closing quotes should be on a separate line
torch/multiprocessing/queue.py:8 in public class `ConnectionWrapper`:
        D400: First line should end with a period (not 'o')
torch/multiprocessing/queue.py:11 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/queue.py:14 in public method `send`:
        D102: Missing docstring in public method
torch/multiprocessing/queue.py:19 in public method `recv`:
        D102: Missing docstring in public method
torch/multiprocessing/queue.py:23 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/multiprocessing/queue.py:29 in public class `Queue`:
        D101: Missing docstring in public class
torch/multiprocessing/queue.py:30 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/queue.py:38 in public class `SimpleQueue`:
        D101: Missing docstring in public class
11
```

**After: 8**

```
torch/multiprocessing/queue.py:1 at module level:
        D100: Missing docstring in public module
torch/multiprocessing/queue.py:10 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/queue.py:13 in public method `send`:
        D102: Missing docstring in public method
torch/multiprocessing/queue.py:18 in public method `recv`:
        D102: Missing docstring in public method
torch/multiprocessing/queue.py:22 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/multiprocessing/queue.py:28 in public class `Queue`:
        D101: Missing docstring in public class
torch/multiprocessing/queue.py:29 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/queue.py:37 in public class `SimpleQueue`:
        D101: Missing docstring in public class
8
```

- `torch/multiprocessing/reductions.py` <br>
**Before: 31**

```
torch/multiprocessing/reductions.py:1 at module level:
        D100: Missing docstring in public module
torch/multiprocessing/reductions.py:24 in public class `StorageWeakRef`:
        D209: Multi-line docstring closing quotes should be on a separate line
torch/multiprocessing/reductions.py:31 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:38 in public method `from_weakref`:
        D102: Missing docstring in public method
torch/multiprocessing/reductions.py:44 in public method `expired`:
        D102: Missing docstring in public method
torch/multiprocessing/reductions.py:47 in public method `__del__`:
        D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:50 in public method `__hash__`:
        D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:53 in public method `__eq__`:
        D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:60 in public class `SharedCache`:
        D400: First line should end with a period (not 'f')
torch/multiprocessing/reductions.py:62 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:75 in public method `get`:
        D102: Missing docstring in public method
torch/multiprocessing/reductions.py:79 in public method `__setitem__`:
        D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:85 in public method `free_dead_references`:
        D102: Missing docstring in public method
torch/multiprocessing/reductions.py:99 in public function `rebuild_event`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:103 in public function `reduce_event`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:108 in public function `rebuild_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:121 in public function `rebuild_cuda_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:189 in public function `reduce_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:347 in public function `rebuild_nested_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:364 in public function `reduce_nested_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:389 in public function `fd_id`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:397 in public function `storage_from_cache`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:404 in public function `rebuild_storage_fd`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:417 in public function `rebuild_storage_filename`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:437 in public function `rebuild_storage_empty`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:441 in public function `rebuild_typed_storage`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:446 in public function `reduce_typed_storage`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:450 in public function `rebuild_typed_storage_child`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:455 in public function `reduce_typed_storage_child`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:459 in public function `reduce_storage`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:488 in public function `init_reductions`:
        D103: Missing docstring in public function
31
```

**After: 29**

```
torch/multiprocessing/reductions.py:1 at module level:
        D100: Missing docstring in public module
torch/multiprocessing/reductions.py:32 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:39 in public method `from_weakref`:
        D102: Missing docstring in public method
torch/multiprocessing/reductions.py:45 in public method `expired`:
        D102: Missing docstring in public method
torch/multiprocessing/reductions.py:48 in public method `__del__`:
        D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:51 in public method `__hash__`:
        D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:54 in public method `__eq__`:
        D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:63 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:76 in public method `get`:
        D102: Missing docstring in public method
torch/multiprocessing/reductions.py:80 in public method `__setitem__`:
        D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:86 in public method `free_dead_references`:
        D102: Missing docstring in public method
torch/multiprocessing/reductions.py:100 in public function `rebuild_event`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:104 in public function `reduce_event`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:109 in public function `rebuild_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:122 in public function `rebuild_cuda_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:190 in public function `reduce_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:348 in public function `rebuild_nested_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:365 in public function `reduce_nested_tensor`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:390 in public function `fd_id`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:398 in public function `storage_from_cache`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:405 in public function `rebuild_storage_fd`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:418 in public function `rebuild_storage_filename`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:438 in public function `rebuild_storage_empty`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:442 in public function `rebuild_typed_storage`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:447 in public function `reduce_typed_storage`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:451 in public function `rebuild_typed_storage_child`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:456 in public function `reduce_typed_storage_child`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:460 in public function `reduce_storage`:
        D103: Missing docstring in public function
torch/multiprocessing/reductions.py:489 in public function `init_reductions`:
        D103: Missing docstring in public function
29
```

- `torch/multiprocessing/spawn.py` <br>
**Before: 19**

```
torch/multiprocessing/spawn.py:1 at module level:
        D100: Missing docstring in public module
torch/multiprocessing/spawn.py:11 in public class `ProcessException`:
        D101: Missing docstring in public class
torch/multiprocessing/spawn.py:14 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:20 in public method `__reduce__`:
        D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:25 in public class `ProcessRaisedException`:
        D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/spawn.py:25 in public class `ProcessRaisedException`:
        D400: First line should end with a period (not 'n')
torch/multiprocessing/spawn.py:30 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:40 in public class `ProcessExitedException`:
        D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/spawn.py:40 in public class `ProcessExitedException`:
        D400: First line should end with a period (not 'l')
torch/multiprocessing/spawn.py:47 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:59 in public method `__reduce__`:
        D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:85 in public class `ProcessContext`:
        D101: Missing docstring in public class
torch/multiprocessing/spawn.py:86 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:93 in public method `pids`:
        D102: Missing docstring in public method
torch/multiprocessing/spawn.py:97 in public method `join`:
        D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/spawn.py:97 in public method `join`:
        D401: First line should be in imperative mood (perhaps 'Try', not 'Tries')
torch/multiprocessing/spawn.py:166 in public class `SpawnContext`:
        D101: Missing docstring in public class
torch/multiprocessing/spawn.py:167 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:180 in public function `start_processes`:
        D103: Missing docstring in public function
19
```

**After: 13**

```
torch/multiprocessing/spawn.py:1 at module level:
        D100: Missing docstring in public module
torch/multiprocessing/spawn.py:11 in public class `ProcessException`:
        D101: Missing docstring in public class
torch/multiprocessing/spawn.py:14 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:20 in public method `__reduce__`:
        D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:27 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:41 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:53 in public method `__reduce__`:
        D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:79 in public class `ProcessContext`:
        D101: Missing docstring in public class
torch/multiprocessing/spawn.py:80 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:87 in public method `pids`:
        D102: Missing docstring in public method
torch/multiprocessing/spawn.py:161 in public class `SpawnContext`:
        D101: Missing docstring in public class
torch/multiprocessing/spawn.py:162 in public method `__init__`:
        D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:175 in public function `start_processes`:
        D103: Missing docstring in public function
13
```

- `torch/multiprocessing/__init__.py` <br>
**Before: 5**

```
torch/multiprocessing/__init__.py:1 at module level:
        D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/__init__.py:1 at module level:
        D400: First line should end with a period (not '`')
torch/multiprocessing/__init__.py:57 in public function `set_sharing_strategy`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/multiprocessing/__init__.py:69 in public function `get_sharing_strategy`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/multiprocessing/__init__.py:74 in public function `get_all_sharing_strategies`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
5
```

**After: 0**

- `torch/nn/__init__.py` <br>
**Before: 3**

```
torch/nn/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/nn/__init__.py:14 in public function `factory_kwargs`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/__init__.py:14 in public function `factory_kwargs`:
        D400: First line should end with a period (not 'd')
3
```

**After: 1**

```
torch/nn/__init__.py:1 at module level:
        D104: Missing docstring in public package
1
```

- `torch/nn/cpp.py` <br>
**Before: 16**

```
torch/nn/cpp.py:7 in public class `OrderedDictWrapper`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/cpp.py:7 in public class `OrderedDictWrapper`:
        D400: First line should end with a period (not 'e')
torch/nn/cpp.py:16 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/cpp.py:21 in public method `cpp_dict`:
        D102: Missing docstring in public method
torch/nn/cpp.py:27 in public method `items`:
        D102: Missing docstring in public method
torch/nn/cpp.py:30 in public method `keys`:
        D102: Missing docstring in public method
torch/nn/cpp.py:33 in public method `values`:
        D102: Missing docstring in public method
torch/nn/cpp.py:36 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/nn/cpp.py:39 in public method `__len__`:
        D105: Missing docstring in magic method
torch/nn/cpp.py:42 in public method `__contains__`:
        D105: Missing docstring in magic method
torch/nn/cpp.py:45 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/nn/cpp.py:50 in public class `ModuleWrapper`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/cpp.py:50 in public class `ModuleWrapper`:
        D400: First line should end with a period (not 'd')
torch/nn/cpp.py:55 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/cpp.py:83 in public method `training`:
        D102: Missing docstring in public method
torch/nn/cpp.py:90 in public method `__repr__`:
        D105: Missing docstring in magic method
16
```

**After: 12**

```
torch/nn/cpp.py:16 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/cpp.py:21 in public method `cpp_dict`:
        D102: Missing docstring in public method
torch/nn/cpp.py:27 in public method `items`:
        D102: Missing docstring in public method
torch/nn/cpp.py:30 in public method `keys`:
        D102: Missing docstring in public method
torch/nn/cpp.py:33 in public method `values`:
        D102: Missing docstring in public method
torch/nn/cpp.py:36 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/nn/cpp.py:39 in public method `__len__`:
        D105: Missing docstring in magic method
torch/nn/cpp.py:42 in public method `__contains__`:
        D105: Missing docstring in magic method
torch/nn/cpp.py:45 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/nn/cpp.py:52 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/cpp.py:80 in public method `training`:
        D102: Missing docstring in public method
torch/nn/cpp.py:87 in public method `__repr__`:
        D105: Missing docstring in magic method
12
```

- `torch/nn/grad.py` <br>
**Before: 10**

```
torch/nn/grad.py:1 at module level:
        D400: First line should end with a period (not 'e')
torch/nn/grad.py:8 in public function `conv1d_input`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/grad.py:8 in public function `conv1d_input`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:40 in public function `conv1d_weight`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:71 in public function `conv2d_input`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/grad.py:71 in public function `conv2d_input`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:103 in public function `conv2d_weight`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:134 in public function `conv3d_input`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/grad.py:134 in public function `conv3d_input`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:166 in public function `conv3d_weight`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
10
```

**After: 0**

- `torch/nn/parameter.py` <br>
**Before: 17**

```
torch/nn/parameter.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/parameter.py:14 in public class `Parameter`:
        D204: 1 blank line required after class docstring (found 0)
torch/nn/parameter.py:33 in public method `__new__`:
        D102: Missing docstring in public method
torch/nn/parameter.py:54 in public method `__deepcopy__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:62 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:65 in public method `__reduce_ex__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:84 in public class `UninitializedTensorMixin`:
        D101: Missing docstring in public class
torch/nn/parameter.py:105 in public method `materialize`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parameter.py:125 in public method `shape`:
        D102: Missing docstring in public method
torch/nn/parameter.py:132 in public method `share_memory_`:
        D102: Missing docstring in public method
torch/nn/parameter.py:138 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:141 in public method `__reduce_ex__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:149 in public method `__torch_function__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:164 in public function `is_lazy`:
        D103: Missing docstring in public function
torch/nn/parameter.py:186 in public method `__new__`:
        D102: Missing docstring in public method
torch/nn/parameter.py:191 in public method `__deepcopy__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:217 in public method `__new__`:
        D102: Missing docstring in public method
17
```

**After: 15**

```
torch/nn/parameter.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/parameter.py:34 in public method `__new__`:
        D102: Missing docstring in public method
torch/nn/parameter.py:55 in public method `__deepcopy__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:63 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:66 in public method `__reduce_ex__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:85 in public class `UninitializedTensorMixin`:
        D101: Missing docstring in public class
torch/nn/parameter.py:127 in public method `shape`:
        D102: Missing docstring in public method
torch/nn/parameter.py:134 in public method `share_memory_`:
        D102: Missing docstring in public method
torch/nn/parameter.py:140 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:143 in public method `__reduce_ex__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:151 in public method `__torch_function__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:166 in public function `is_lazy`:
        D103: Missing docstring in public function
torch/nn/parameter.py:188 in public method `__new__`:
        D102: Missing docstring in public method
torch/nn/parameter.py:193 in public method `__deepcopy__`:
        D105: Missing docstring in magic method
torch/nn/parameter.py:219 in public method `__new__`:
        D102: Missing docstring in public method
15
```
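
Most of the fixes in this PR follow the same few patterns. A minimal sketch (hypothetical function, not taken from the actual diff) of what resolving D205, D400, and D401 together looks like:

```python
# Before: violates D205 (no blank line between summary and description),
# D400 (summary does not end with a period), D401 (not imperative mood).
def load_trace(path):
    """Opens a trace file
    and parses the recorded events"""


# After: imperative summary ending with a period, then one blank line
# before the extended description.
def load_trace(path):
    """Open a trace file and parse the recorded events.

    The extended description continues after a single blank line.
    """
```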

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113052
Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer
2023-11-10 21:19:17 +00:00
9752ef595c [BE] Consistently use the sym_stride lowering, instead of short-circuiting before (#113071)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113071
Approved by: https://github.com/voznesenskym
2023-11-10 21:19:12 +00:00
958f755a0e [FX][CodeGen] Make sure fx code is valid in python (#113345)
This PR fixes two cases where fx-generated code is invalid Python (a syntax error); a short demonstration follows the list:

1. multiple type annotations in one line: `var1: annotation1, var2: annotation2 = function_call()`
2. invalid type annotation for scalars like `var1: f32[] = function_call()`.
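
Both patterns are rejected by CPython's parser; a quick illustrative check (not part of the PR itself):

```python
# Each of the two patterns above is a SyntaxError in plain Python.
for src in (
    "var1: int, var2: int = fn()",  # case 1: multiple annotated targets
    "var1: f32[] = fn()",           # case 2: empty subscript in annotation
):
    try:
        compile(src, "<fx-generated>", "exec")
    except SyntaxError as e:
        print(f"rejected: {src!r} ({e.msg})")
```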

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113345
Approved by: https://github.com/ezyang
2023-11-10 21:12:16 +00:00
5540d276ce Fix docstring errors in container.py, _functions.py, transformer.py, comm.py, parallel_apply.py, data_parallel.py, scatter_gather.py (#113250)

Fixes #112603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113250
Approved by: https://github.com/mikaylagawarecki
2023-11-10 21:07:25 +00:00
7b28f8c5ea Better error message when applying interpolation on non-4D tensors (#113459)
Fixes #113445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113459
Approved by: https://github.com/albanD
2023-11-10 21:06:51 +00:00
a62a88bb84 [Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)
Summary:
We are planning to lazily initialize CUPTI when profiling is actually performed. Therefore, we need to remove profiler init dependency on CUPTI Callbacks' RESOURCE_CONTEXT_CREATED.

Instead, we can initialize the profilers during torch profiler pybind set-up, i.e. in THPAutograd_initExtension(), and lazily in profilerStep().

Test Plan:
CI and ran internally, see internal diff logs.

Differential Revision: D50894961

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112623
Approved by: https://github.com/albanD
2023-11-10 20:50:54 +00:00
6b38836c73 [BE] Don't reify entire graph.nodes just to access last element (#113450)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113450
Approved by: https://github.com/albanD
2023-11-10 20:50:14 +00:00
ae2c219de2 Revert "[BE] Remove stale CUDA version check from cpp_extension.py (#113447)"
This reverts commit 7ccca60927cdccde63d6a1d40480950f24e9877a.

Reverted https://github.com/pytorch/pytorch/pull/113447 on behalf of https://github.com/malfet due to Broke ROCM ([comment](https://github.com/pytorch/pytorch/pull/113447#issuecomment-1806407892))
2023-11-10 20:46:13 +00:00
a2c32b8bd0 [inductor] Make codegen/{common,wrapper,cuda/cutlass_utils}.py pass follow_imports typechecking (#113411)
SymIntType is referenced by wrapper.py, so I added its .pyi definition.
I also added SymBoolType along the way for completeness.

The `isinstance` checks in wrapper.py reference torch.Type, which seems
to cause mypy to choke. Not entirely sure why; I've just added
type-ignore comments for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113411
Approved by: https://github.com/Skylion007
ghstack dependencies: #113409, #113410
2023-11-10 19:58:08 +00:00
5a9f08feb5 [inductor] Make {joint_graph,inductor_prims,utils}.py pass follow_imports typechecking (#113410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113410
Approved by: https://github.com/lezcano
ghstack dependencies: #113409
2023-11-10 19:58:08 +00:00
b0ede09682 [inductor] Make pattern_matcher.py pass follow_imports typechecking (#113409)
Import following reveals that a good number of hints were wrong...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113409
Approved by: https://github.com/Skylion007
2023-11-10 19:58:08 +00:00
6e243f475d [inductor] Move has_torchvision_roi_align check inside test_roi_align (#113385)
Currently `test_torchinductor.py` imports `torchvision` at import time, which
is problematic when you have a broken `torchvision` install as test collection
will fail. This could happen, for example, if `torchvision` was built against a
different version of PyTorch, as may happen regularly in development.

This moves the check inside `test_roi_align` so a failure to import
`torchvision` only causes a test failure and the other tests can run fine.
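
A sketch of the deferred-import pattern (test and class names are illustrative):

```python
import unittest

class TestRoiAlign(unittest.TestCase):
    def test_roi_align(self):
        # Importing inside the test means a broken torchvision install
        # fails or skips just this test instead of test collection.
        try:
            from torchvision.ops import roi_align  # noqa: F401
        except ImportError:
            self.skipTest("torchvision not importable")
        # ... exercise roi_align through inductor here ...
```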

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113385
Approved by: https://github.com/lezcano
ghstack dependencies: #113384
2023-11-10 19:45:33 +00:00
c4fe817a69 [inductor] Fix test_dist on pre-sm80 and add skipCUDAIf decorator (#113384)
`test_dist` uses bfloat16 which isn't well supported by triton on pre-sm80
hardware, so split the test in two and add a skip. This also adds a
`skipCUDAIf` decorator which only skips on CUDA devices so the test still runs
on CPU.
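
Roughly, such a decorator looks like the sketch below (the real one lives in the device-type test framework; the `device_type` attribute is assumed to come from that framework):

```python
import functools
import unittest

def skipCUDAIf(condition, reason):
    """Skip a device-parametrized test, but only when it runs on CUDA."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if condition and self.device_type == "cuda":
                raise unittest.SkipTest(reason)
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator
```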

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113384
Approved by: https://github.com/lezcano
2023-11-10 19:45:33 +00:00
7ccca60927 [BE] Remove stale CUDA version check from cpp_extension.py (#113447)
As at least CUDA-11.x is needed to build PyTorch on latest trunk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113447
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/PaliC, https://github.com/huydhn
2023-11-10 18:54:19 +00:00
cb233dada4 Fix docstrings on torch/nn/modules (#113260)
Fixes #112598

## Description
Fixes the docstrings on following files.

```bash
pydocstyle path-to-file --count
```
| File                                  |  Count  |
| ------------------------------------- | ------- |
| torch/nn/modules/adaptive.py          |  20 -> 4 |
| torch/nn/modules/channelshuffle.py    |  7 -> 4 |
| torch/nn/modules/conv.py              |  37 -> 25 |
| torch/nn/modules/distance.py          |  7 -> 5 |
| torch/nn/modules/dropout.py           |  17 -> 7 |
| torch/nn/modules/flatten.py           |  10 -> 7 |
| torch/nn/modules/fold.py              |  11 -> 7 |
| torch/nn/modules/instancenorm.py      |  13 -> 1 |
| torch/nn/modules/lazy.py              |  11 -> 2 |
| torch/nn/modules/linear.py            |  20 -> 14 |
| torch/nn/modules/normalization.py     |  25 -> 16 |
| torch/nn/modules/padding.py           |  33 -> 19 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113260
Approved by: https://github.com/mikaylagawarecki
2023-11-10 18:22:48 +00:00
b794bec581 [PyTorch] AOTI: add AOTIInductorModelGetNumOutputs & use for internal runner (#113299)
I don't see why you couldn't get the number of outputs for a model directly without going through a container. Now you can.

Differential Revision: [D51050435](https://our.internmc.facebook.com/intern/diff/D51050435/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D51050435/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113299
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-11-10 18:03:24 +00:00
b1eb9e172a remove jit from dynamo benchmark (#113338)
Continuation of https://github.com/pytorch/pytorch/pull/106071; without this, dynamo dist cannot run at the moment.

Related to https://github.com/pytorch/benchmark/pull/1787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113338
Approved by: https://github.com/ezyang
2023-11-10 18:02:08 +00:00
2cd8c0565c Revert "[AOTI] Implement support for user defined kernels that use triton.autotune (#113229)"
This reverts commit 1488bafb274fcc82c8aac429bad61738bc3f950e.

Reverted https://github.com/pytorch/pytorch/pull/113229 on behalf of https://github.com/PaliC due to breaking test_aot_inductor.py tests though a forward fix is coming ([comment](https://github.com/pytorch/pytorch/pull/113229#issuecomment-1806159396))
2023-11-10 17:46:14 +00:00
3c9a59cb8d Revert "[BE] [cuDNN] Always build assuming cuDNN >= 8.0 (#95722)"
This reverts commit df4f0b3829f8e8b623f4e94a8536cfa58ccfb9af.

Reverted https://github.com/pytorch/pytorch/pull/95722 on behalf of https://github.com/PaliC due to is breaking a bunch of internal pytorch users ([comment](https://github.com/pytorch/pytorch/pull/95722#issuecomment-1806131675))
2023-11-10 17:26:36 +00:00
2a271a3efa Revert "[pytree] register pytree node type in both C++ pytree and Python pytree (#112111)"
This reverts commit a0d00349edbe09087b7bb8769cd1f49fbe7117ca.

Reverted https://github.com/pytorch/pytorch/pull/112111 on behalf of https://github.com/PaliC due to _private_register_pytree_node now checks for duplicate registering, unfortunately, this breaks composability with torchrec internally :(  ([comment](https://github.com/pytorch/pytorch/pull/112111#issuecomment-1806130993))
2023-11-10 17:24:40 +00:00
6e714d7315 [state_dict] Rewrite _gather_state_dict to extract the traversal logic (#112885)
This allows us to do cpu_offload with the same traversal logic

Differential Revision: [D50982355](https://our.internmc.facebook.com/intern/diff/D50982355/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112885
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #112836
2023-11-10 17:07:52 +00:00
c197c48ceb [aotinductor] Add a demo tutorial (#112457)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112457
Approved by: https://github.com/msaroufim, https://github.com/albanD
2023-11-10 17:01:03 +00:00
91e4b0fc4e Improve torch.unique docs (#113424)
Related issue: https://github.com/pytorch/pytorch/issues/105742.
In fact, `torch.unique` always sorts the tensor at the beginning, regardless of the `sorted` argument and the `dim` argument.
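
A quick demonstration of the behavior being documented:

```python
import torch

t = torch.tensor([3, 1, 2, 1, 3])
# Even with sorted=False, the kernels sort internally, so the
# returned unique values still come back in ascending order.
print(torch.unique(t, sorted=False))  # tensor([1, 2, 3])
```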

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113424
Approved by: https://github.com/malfet
ghstack dependencies: #113420
2023-11-10 16:36:30 +00:00
23e0923c74 Revert "[pytree] reorganize submodule structure for C++ and Python pytree (#112278)"
This reverts commit eeeb40b32717bab75bd7d8f28f8343385688b3ab.

Reverted https://github.com/pytorch/pytorch/pull/112278 on behalf of https://github.com/PaliC due to Reverting this pr as the one under it in the stack is causing regressions in torchrec ([comment](https://github.com/pytorch/pytorch/pull/112278#issuecomment-1806044435))
2023-11-10 16:30:36 +00:00
d4c810cc11 [state_dict] Add cpu_only and ranks_only support for _gather_state_dict (#112836)
Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112836
Approved by: https://github.com/LucasLLC, https://github.com/wz337
2023-11-10 16:03:46 +00:00
08641a3232 Make FakeProcessGroup traceable (#113314)
This PR mimics what we have done to trace ProcessGroup. This allows us to use FakeProcessGroup with torch.compile. FakeProcessGroup allows us to use world_size > 1 without creating multiple processes, thus enabling the usage of PDB to debug bucketing DDP allreduce in the Inductor. We can theoretically use GLOO with world_size==1 to achieve the same goal. However, the `wait()` seems to be optimized away when the world_size is 1.
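
Typical usage looks roughly like this (FakeStore and the "fake" backend are internal test utilities, so the exact import path may vary across versions):

```python
import torch
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

# One process pretends to be rank 0 of a 4-rank job; collectives are
# no-ops, so you can single-step compiled DDP code under pdb.
dist.init_process_group("fake", rank=0, world_size=4, store=FakeStore())
t = torch.ones(4)
dist.all_reduce(t)  # returns immediately, no real communication
```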

Differential Revision: [D51136463](https://our.internmc.facebook.com/intern/diff/D51136463/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113314
Approved by: https://github.com/wanchaol
2023-11-10 16:03:38 +00:00
c3c4e70b2c Revert "Revert 107846 and 109695 (#111099)" (#113420)
The algorithm is taken from the NumPy implementation at https://github.com/numpy/numpy/blob/main/numpy/lib/arraysetops.py#L323: it first does a sort on the input sequence and then uses a `mask` to record the unique element of each consecutive section.

We don't yet have parallel sort for 1-dimensional float tensors; it will be enabled in a next step. Parallel radix sort is used for 1-dimensional int tensors.

The following data was collected with the script in the issue on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.5GHz with a single socket (20 cores):

#### before (dtype int64)
```
Numpy just sort: 0.4271528720855713 s
Numpy sort + indexes: 6.383563041687012 s
Torch just sort: 0.46924352645874023 s
Torch sort + indexes: 1.8140404224395752 s
```

#### after (dtype int64)
```
Torch just sort: 0.2540090084075928 s
Torch sort + indexes: 0.2766146659851074 s
```

#### before (float32)
```
Numpy just sort: 0.41129398345947266 s
Numpy sort + indexes: 6.422696590423584 s
Torch just sort: 9.109549283981323 s
Torch sort + indexes: 37.59021711349487 s
```

#### after (float32)
```
Torch just sort: 3.5369982719421387 s
Torch sort + indexes: 3.582240581512451 s
```

If we enable parallel sort on 1-dimensional float tensors, the performance is:
```
Torch just sort: 0.3212606906890869 s
Torch sort + indexes: 0.36211371421813965 s
```

Since I have fused the `inverse_indices` and `count` calculations into a single parallel loop (the algorithm is identical to NumPy's but better optimized), they add only a small amount of additional time.
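
For reference, a minimal eager-mode sketch of the sort + mask idea for the 1-D case (illustrative only; the actual implementation is a fused parallel C++ loop):

```python
import torch

def unique_via_sort(x):
    sorted_x, _ = torch.sort(x)
    # Mark positions whose value differs from the predecessor: those
    # are exactly the first occurrences of each unique element.
    mask = torch.ones_like(sorted_x, dtype=torch.bool)
    mask[1:] = sorted_x[1:] != sorted_x[:-1]
    uniques = sorted_x[mask]
    # Counts fall out of the gaps between consecutive first occurrences.
    first = torch.nonzero(mask).flatten()
    counts = torch.diff(torch.cat([first, torch.tensor([x.numel()])]))
    return uniques, counts

print(unique_via_sort(torch.tensor([3, 1, 2, 1, 3])))
# (tensor([1, 2, 3]), tensor([2, 1, 2]))
```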

Use a reduction implementation for unique when dtype is bool on CPU.

This reverts commit 6dca81c054c1f7e378e956900265b085ca521e47, as the `torch.sort` errors have been fixed in FBGEMM by 70c6e83c29.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113420
Approved by: https://github.com/malfet
2023-11-10 15:45:28 +00:00
8880584015 Improve test_float8.py (#113361)
The numeric test for round-trip casting of float8 dtypes originally consisted of generating a 100x100 tensor in the range 0..max.

This change refactors the test, adds further edge cases and fixes multiple issues with the lower precision simulation which the results of the round-trip cast test were checked against.

Set atol=0 and rtol=0 to ensure an exact equality comparison.
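
For illustration, the exact-comparison style now used (assuming a float8-capable build):

```python
import torch

x = torch.linspace(0, 448.0, steps=100)  # 448 is the e4m3 max normal
a = x.to(torch.float8_e4m3fn).to(torch.float32)
b = x.to(torch.float8_e4m3fn).to(torch.float32)
# With atol=0 and rtol=0, assert_close degenerates into an exact
# element-wise equality check, so any simulation mismatch is fatal.
torch.testing.assert_close(a, b, atol=0, rtol=0)
```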
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113361
Approved by: https://github.com/malfet, https://github.com/Neilblaze
2023-11-10 15:23:22 +00:00
574e313643 Add thiagocrepaldi as person of interest for onnx exporter (#113402)
@malfet @kit1980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113402
Approved by: https://github.com/malfet
2023-11-10 15:19:58 +00:00
71ca42787f Replaced deprecated pkg_resources.packaging with packaging module (#113023)
Usage of `from pkg_resources import packaging` leads to a deprecation warning:
```
DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
```
and in strict tests where warnings are errors, this leads to CI breaks, e.g.: https://github.com/pytorch/vision/pull/8092

Replacing `pkg_resources.packaging` with `packaging`, as it is now a PyTorch dependency:
fa9045a872/requirements.txt (L19)
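
The migration itself is a one-line swap:

```python
# Before: emits DeprecationWarning (an error under strict warning filters)
# from pkg_resources import packaging

# After: depend on the standalone `packaging` distribution directly.
from packaging import version
import torch

if version.parse(torch.__version__) >= version.parse("2.1"):
    print("new enough")
```
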
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113023
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-11-10 15:06:03 +00:00
f49b8e9313 Register SymInt-aware meta function for mm out, symintify resize (#113202)
Fixes https://github.com/pytorch/pytorch/issues/112489

Fixes https://github.com/pytorch/pytorch/issues/112494

New OpInfo tests for out variants added, since these were not exercised previously.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113202
Approved by: https://github.com/albanD
2023-11-10 14:27:05 +00:00
4f2b2883dc [Inductor] [Quant] Enable QLinear int8-mixed-bf16 Lowering (#112486)
**Summary**
- PR 7 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable the QLinear int8-mixed-bf16 weight prepack and post grad lowering inside inductor.

**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qlinear
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112486
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/jerryzh168
2023-11-10 12:35:13 +00:00
eb1534027f Back out "[inductor] scale up num_warps for reductions to lower register pressure (#113039)" (#113400)
Test Plan: CI

Differential Revision: D51180501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113400
Approved by: https://github.com/htyu
2023-11-10 09:22:29 +00:00
86d32bedc2 [Inductor] [Quant] Enable QConv2d Binary int8-mixed-bf16 Lowering (#112551)
**Summary**
- PR 6 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable the QConv2d Binary int8-mixed-bf16 post grad lowering inside inductor.

**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112551
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/jerryzh168
ghstack dependencies: #112550
2023-11-10 09:11:11 +00:00
65e99357ae [Inductor] [Quant] Enable QConv2d Unary int8-mixed-bf16 Lowering (#112550)
**Summary**
- PR 5 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable the QConv2d Unary int8-mixed-bf16 weight prepack and post grad lowering inside inductor.

**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112550
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-11-10 08:59:40 +00:00
63d65dd6cd Correct output shape of meta registration for qlinear_pointwise (#112390)
Corrected output shape of meta registration for qlinear_pointwise.
Because the weight of `qlinear_pointwise` is transposed during the qlinear weight-prepack process, its shape is (in_features, out_features).
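
A hypothetical sketch of the corrected shape logic (names illustrative, not the actual registration):

```python
# Since the prepacked weight is stored as (in_features, out_features),
# out_features must be read from dim 1, not dim 0 as for nn.Linear.
def meta_qlinear_pointwise(x, packed_weight, *args, **kwargs):
    return x.new_empty((*x.shape[:-1], packed_weight.shape[1]))
```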

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112390
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2023-11-10 07:50:59 +00:00
cyy
41e8632ca4 [1/N] Fix clang-tidy warnings in torch/csrc/profiler (#112360)
This PR fixes some clang-tidy warnings in torch/csrc/profiler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112360
Approved by: https://github.com/ezyang
2023-11-10 07:37:23 +00:00
0f7ac2635d Uniformly use SourcelessBuilder to handle user defined types (#113390)
Subsumes https://github.com/pytorch/pytorch/pull/110794

Fixes https://github.com/pytorch/pytorch/issues/110315

This is not really a 100% sound fix, a deeper analysis of the bug can be found at https://docs.google.com/document/d/1y-nRAPdbZEji52MPKYzC0U3VhvW9yEAEDqP5t5GhWZ0/edit

The general idea behind the fix here is that we are going to play fast and loose with user defined classes: as Dynamo is written today, we are willing to pull out these types and directly manipulate them (e.g., look at their `__mro__`, etc) without an intervening VariableTracker. As such, if I use `python_type` to extract out the Python type of a VT or if I am manually reading out the `__bases__` of a type, which may be a user defined class, if it is sourceless, all I need to do is use SourcelessBuilder instead of ConstantVariable to make sure I wrap it into the correct VT class.

The approach in https://github.com/pytorch/pytorch/pull/110794 was "more correct", but we'd have to go substantially further to get it all working. So I am doing this to unblock suo for now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113390
Approved by: https://github.com/suo
2023-11-10 07:26:52 +00:00
59592389fc Revert "[dynamo] Refactor test cross importing (#113242)"
This reverts commit 8858edad656f505728c9810093f796f96e1285cb.

Reverted https://github.com/pytorch/pytorch/pull/113242 on behalf of https://github.com/PaliC due to this diff appears to be causing inductor failures internally ([comment](https://github.com/pytorch/pytorch/pull/113242#issuecomment-1805132719))
2023-11-10 05:43:08 +00:00
eeeb40b327 [pytree] reorganize submodule structure for C++ and Python pytree (#112278)
Reorganized the two C++ and Python pytree submodules into a subpackage. I think this would be easier to implement the abstract `PyTreeAPI` class with two implementations. And it will be much easier for the user to switch between the two implementations.

Before:

```text
torch
├── utils
│   ├── _pytree.py
│   ├── _cxx_pytree.py
│   ...
...
```

After:

```text
torch
├── utils
│   ├── _pytree
│   │   ├── __init__.py
│   │   └── api
│   │       ├── __init__.py
│   │       ├── cxx.py
│   │       └── python.py
│   ...
...
```

The `torch.utils._pytree` module will import all APIs from `torch.utils._pytree.api.python`.
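
Callers are unaffected; the package keeps re-exporting the Python implementation, e.g.:

```python
from torch.utils._pytree import tree_flatten, tree_unflatten

leaves, spec = tree_flatten({"a": 1, "b": (2, 3)})
assert tree_unflatten(leaves, spec) == {"a": 1, "b": (2, 3)}
```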

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112278
Approved by: https://github.com/zou3519
ghstack dependencies: #112111
2023-11-10 05:41:32 +00:00
68bf0f1e7d Revert "[inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)"
This reverts commit c967dc526a40f4b15003f9c99383acabe66367a6.

Reverted https://github.com/pytorch/pytorch/pull/113275 on behalf of https://github.com/PaliC due to the diff this is stacked on top of appears to be causing inductor failures internally ([comment](https://github.com/pytorch/pytorch/pull/113275#issuecomment-1805131017))
2023-11-10 05:40:55 +00:00
8943207925 [dynamo] Support kwargs for lazy module call. (#113387)
Summary: Seems like we already support kwargs in _infer_argument, so we don't need the extra assertion here.

Test Plan: buck test caffe2/test:test_export -- -r lazy_module_kwargs

Differential Revision: D51170339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113387
Approved by: https://github.com/yanboliang
2023-11-10 05:17:58 +00:00
7a1314c548 [Kineto] Fix the Chrome trace loading issue with all_to_all input split length > 30 (#113392)
Summary:
This change fixes the Chrome trace loading issue with all_to_all input split length > 30.

Currently, when the `all_to_all` input split size is larger than 30, we truncate the content and add `...` at the end, which causes trouble when loading the trace in Chrome.

Test Plan:
**Trace with length = 2**:
- Link: https://fburl.com/perfdoctor/b94u4x82
 {F1145436735}

**Looking into the json file**:
```
Before:
"In split size": [6058496, 5942784]

After
"In split size": "[6058496, 5942784]"
```

Reviewed By: aaronenyeshi

Differential Revision: D51167843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113392
Approved by: https://github.com/aaronenyeshi
2023-11-10 05:03:18 +00:00
f9114193bd [NCCL PG] ADD a separate monitoring thread to ensure we collect debug info and check watchdog heartbeat (#112518)
This PR has the following goals:
1. Detect an unhealthy nccl watchdog thread by implementing a heartbeat. NCCL watchdog sometimes can hang for several reasons such as nccl/cuda API bugs or unexpected blocking behaviors. This is the last resort to ensure that we don't silently keep the training job running for hours.
2. Sometimes, the process gets stuck in the destroy of NCCL PG, and this PR will ensure that we will eventually abort it after some time (by default 2 mins)
3. Once heartbeat cannot be heard, we dump debug information (for now, we just use the flight recorder implemented in https://github.com/pytorch/pytorch/pull/110960/files) to disk. (How and where to dump the debug info will be addressed in the following PR).
4. Finally, we initiate std::abort via `LOG(FATAL)` to kill the process.

To clarify further what this PR is trying to solve, we first list the four cases an NCCL PG can end up in:
- case 1: ncclwatchdog gets stuck (maybe some blocking API) and heartbeat monitor kills it during regular heartbeat monitor loop.
- case 2: ncclwatchdog times out and the desync report or destroy kicks in (let's call it shutdown), but this shutdown takes so long that the heartbeat monitor decides it has to kill the process anyway.
- case 3: ncclwatchdog aborts the process (heartbeat monitor not involved)
- case 4: program exits cleanly (heartbeat monitor not involved)

As we can see here, this PR is trying to address cases one and two, and we also want to ensure that adding one more monitor thread does not interfere with what we are currently doing in cases three and four. That's why we added the two flags `terminateHeartbeatMonitorThread_` and `collectiveDebugInfoMode_`.

For cases three and four, either `monitorWakeUpCV_` will be woken up in the destructor or `terminateHeartbeatMonitorThread_` will be set to true, so the monitor thread will just exit ASAP.

For case one, both `terminateHeartbeatMonitorThread_` and `collectiveDebugInfoMode_` will still be false when the monitor thread sees there is no heartbeat, so it will directly kill the process. For case two, either `terminateHeartbeatMonitorThread_` or `collectiveDebugInfoMode_` will be true, and the monitor thread will wait extra time before killing the process.
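
The actual implementation is C++ inside ProcessGroupNCCL; the following is a language-agnostic sketch of the heartbeat-monitor idea (illustrative names, Python for brevity):

```python
import os, signal, threading, time

class HeartbeatMonitor:
    def __init__(self, timeout_s=120):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic()
        self.stop = threading.Event()
        threading.Thread(target=self._monitor, daemon=True).start()

    def beat(self):
        # Called by the watchdog thread on every loop iteration.
        self.last_beat = time.monotonic()

    def _monitor(self):
        while not self.stop.wait(self.timeout_s / 4):
            if time.monotonic() - self.last_beat > self.timeout_s:
                # Watchdog is presumed stuck: dump debug info (flight
                # recorder) to disk here, then abort the process.
                os.kill(os.getpid(), signal.SIGABRT)
```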

Differential Revision: [D51146305](https://our.internmc.facebook.com/intern/diff/D51146305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112518
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2023-11-10 04:41:14 +00:00
265d6aac0b [MPS] Fix crashes during Conv backward pass (#113398)
By adding the weights tensor to the MPSGraph cache key.
Also add a regression test to validate that the collision no longer happens.

Fixes https://github.com/pytorch/pytorch/issues/112998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113398
Approved by: https://github.com/kulinseth
2023-11-10 04:29:33 +00:00
7064fbf1ea Fix selective activation checkpointing with subclasses that override sizes() (#113380)
The problem is that we have a subclass (FunctionalTensor) that overrides size/stride calls, causing them to go through __torch_dispatch__.

But when SAC is active, we have _CachingTorchDispatchMode.__torch_dispatch__ active, that intercepts those size/stride calls first, and does something different with them instead of letting FunctionalTensor.__torch_dispatch__ handle them.

This PR updates the SAC torch dispatch mode to know to not handle metadata calls, and let its tensor arguments handle them directly.

Right now, `FunctionalTensor` has a hardcoded list of metadata ops, but we should probably put them somewhere more general.

I'll add better testing before landing this PR.
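
A minimal sketch of the dispatch-mode change, with a hypothetical metadata-op set standing in for FunctionalTensor's hardcoded list:
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

# Hypothetical stand-in for FunctionalTensor's hardcoded metadata-op list.
_METADATA_OPS = {
    torch.ops.aten.sym_size.default,
    torch.ops.aten.sym_stride.default,
    torch.ops.aten.sym_numel.default,
}

class CachingModeSketch(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func in _METADATA_OPS:
            # Don't intercept metadata calls; let the tensor arguments
            # (e.g. FunctionalTensor.__torch_dispatch__) handle them.
            return func(*args, **kwargs)
        # ... SAC caching logic for all other ops would go here ...
        return func(*args, **kwargs)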

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113380
Approved by: https://github.com/yf225, https://github.com/wanchaol
2023-11-10 04:12:50 +00:00
cb48f7855a [inductor cpu] fix uint8 add and sub (#113253)
Fix https://github.com/pytorch/pytorch/issues/113016 and https://github.com/pytorch/pytorch/issues/113020 and https://github.com/pytorch/pytorch/issues/113141 and https://github.com/pytorch/pytorch/issues/113143 and https://github.com/pytorch/pytorch/issues/113144
Explicitly typecast the result of add/sub to uint8 (similar to how we fixed mul previously) to avoid C's implicit integer promotion.
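
For reference, a quick eager check of the semantics the generated C++ must match: uint8 arithmetic wraps modulo 256 instead of promoting to int.
```python
import torch

a = torch.tensor([200], dtype=torch.uint8)
b = torch.tensor([100], dtype=torch.uint8)
print(a + b)  # tensor([44], dtype=torch.uint8): 300 wraps, stays uint8
print(b - a)  # tensor([156], dtype=torch.uint8): underflow wraps too
# Without the explicit cast, C's integer promotion keeps the intermediate
# as int (300 / -100), which changes downstream results in the kernel.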

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113253
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-11-10 04:06:42 +00:00
c7e12c7427 Rerun disabled tests on MacOS x86 (#113315)
After the recent change https://github.com/pytorch/pytorch/pull/112103 to get the correct job name for the GitHub runner, I expected rerun disabled tests and memory leak checks to start running on MacOS x86, but they are still not there. It turns out that we also need to fix the schedule there.

Pretty simple change, I guess I will let it test in trunk?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113315
Approved by: https://github.com/clee2000
2023-11-10 03:24:27 +00:00
866457e746 Fix pydocstyle errors in fully_sharded_data_parallel.py, api.py, graph_utils.py, distribute.py, iter_graph_module.py, comm_tensor.py, experimental_ops.py, batch_dim_utils.py, data_parallel.py, graph_optimization.py (#113216)
Fixes #113191

```
pydocstyle torch/distributed/fsdp/fully_sharded_data_parallel.py --count
```

On master: 80
After my changes on this PR: 3

```
pydocstyle torch/distributed/_spmd/comm_tensor.py --count
```
On master: 5
After my changes on this PR: 3

```
pydocstyle torch/distributed/_spmd/experimental_ops.py --count
```
On master: 3
After my changes on this PR: 1

```
pydocstyle torch/distributed/_spmd/iter_graph_module.py --count
```
On master: 39
After my changes on this PR: 27

```
pydocstyle torch/distributed/_spmd/graph_utils.py --count
```
On master: 16
After my changes on this PR: 4

```
pydocstyle torch/distributed/_spmd/distribute.py --count
```
On master: 19
After my changes on this PR: 10

```
pydocstyle torch/distributed/_spmd/api.py --count
```
On master: 10
After my changes on this PR: 3

```
pydocstyle torch/distributed/_spmd/batch_dim_utils.py  --count
```
On master: 14
After my changes on this PR: 3

```
pydocstyle torch/distributed/_spmd/data_parallel.py --count
```
On master: 34
After my changes on this PR: 2

```
pydocstyle torch/distributed/_spmd/graph_optimization.py --count
```
On master: 35
After my changes on this PR: 13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113216
Approved by: https://github.com/ezyang
2023-11-10 03:08:32 +00:00
773b1cbe4f [BE] Parenthesize `and` clauses for clarity (#113362)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113362
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-11-10 03:01:48 +00:00
a0d00349ed [pytree] register pytree node type in both C++ pytree and Python pytree (#112111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112111
Approved by: https://github.com/zou3519
2023-11-10 02:41:30 +00:00
5e2adc8650 [pytree] align function signature between C++ and Python pytree (#112482)
Change the argument name in C++ and Python pytree APIs. Also add a test to ensure the function signatures are the same in the two implementations.

- #112485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112482
Approved by: https://github.com/zou3519
2023-11-10 02:37:48 +00:00
605236af06 Force fp16 for vision_maskrcnn inference (#113110)
Force fp16 for maskrcnn inference (it doesn't support bf16). Also skip phi_1_5 in training - it OOMs even with batch size 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113110
Approved by: https://github.com/xmfan
2023-11-10 02:25:11 +00:00
8bdce9bb74 Fix UntypedStorage.resize_ to keep same CUDA device index (#113386)
Fixes #113300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113386
Approved by: https://github.com/albanD
2023-11-10 01:57:25 +00:00
1488bafb27 [AOTI] Implement support for user defined kernels that use triton.autotune (#113229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113229
Approved by: https://github.com/chenyang78
2023-11-10 01:39:00 +00:00
44d0226690 Fix logging exception/stacks from logging (#113394)
We were accidentally dropping them in our formatter, oops.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113394
Approved by: https://github.com/albanD
2023-11-10 01:17:29 +00:00
66150b29e3 Revert "[pytree] align function signature between C++ and Python pytree (#112482)"
This reverts commit 4893a2814ffb5adeec102c17d71d2f25ba5eeb3c.

Reverted https://github.com/pytorch/pytorch/pull/112482 on behalf of https://github.com/PaliC due to changing _register_pytree_node's signature is bc breaking, please revert the signature and reland ([comment](https://github.com/pytorch/pytorch/pull/112482#issuecomment-1804909926))
2023-11-10 00:59:23 +00:00
9a90989121 Revert "[pytree] register pytree node type in both C++ pytree and Python pytree (#112111)"
This reverts commit 95f52611c735ad5d4eb7967f8588fec065a1b323.

Reverted https://github.com/pytorch/pytorch/pull/112111 on behalf of https://github.com/PaliC due to in the bottom diff in the stack changing _register_pytree_node's signature is bc breaking, please revert the signature and reland ([comment](https://github.com/pytorch/pytorch/pull/112111#issuecomment-1804892924))
2023-11-10 00:38:28 +00:00
d18d7a603e [fbgemm_gpu] add pt2_compliant tag to some ops (#113201)
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/2119

Logs show these ops are being used with PT2, so we are grandfathering in these
ops to the pt2_compliant tag. Most of these ops are tested, some aren't.

Test Plan: - existing tests

Differential Revision: D51076460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113201
Approved by: https://github.com/williamwen42
2023-11-10 00:32:30 +00:00
cada6c7fee [dynamo] Fix a bug by desugaring in-place ops on constants (#113117)
Summary:

Python allows users to write code like
```
x = 1
x += y
x += z
```

This code has well-defined semantics: because x is an immutable primitive, the first `+=` will actually re-bind x; it is equivalent to `x = x + y`.

The second in-place operation will either similarly desugar (if the result of `x + y` is itself immutable), or possibly result in "true" in-place operation.

Now, this is a problem for us because today, dynamo tries to both resolve constant variables to their literal values at compile time and also compile in a way that treats `operator.*` builtin functions consistently. This leads to a bug where code like
```
x = 1
x += y
```
actually gets compiled to
```
1 += y
```
which is both semantically meaningless and a syntax error.

A very simple fix that we've already used to fix the special case of `+=` is to detect this, treat it as an edge case, and desugar eagerly into `x = x + y`.

The problem with that fix is that it only patched `iadd`, but actually *all* of the in-place operators exhibit this behavior.

This commit proposes that we tackle all of the in-place operators supported by fx in the same way: eagerly remap the operation to an assignment when the left-hand side is actually an immutable constant.

**Alternatives?**

There might be some other fix possible that wouldn't produce a hardcoded remapping; I know that we generally don't like the growth of mappings and blocklists in dynamo.

I'm a little skeptical about a general solution though, because the bug is due precisely to Python's highly dynamic dispatching of inplace operations by type; since the fx graph has to be purely static, I suspect that we actually have to desugar this somewhere, because the dataflow is fundamentally different for true inplace operations on types that define `__iadd__`, etc vs the desugaring on primitives.

I'm open to other suggestions
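
To make the proposed remapping concrete, here is a minimal sketch (the helper and table names are hypothetical; the real change lives in dynamo's handling of the `operator` builtins):
```python
import operator

# In-place builtin -> out-of-place equivalent, consulted only when the
# left-hand side is an immutable constant at trace time.
_INPLACE_TO_BINARY = {
    operator.iadd: operator.add,
    operator.isub: operator.sub,
    operator.imul: operator.mul,
    operator.itruediv: operator.truediv,
    operator.ipow: operator.pow,
}

def desugar_inplace(op, lhs, rhs, lhs_is_immutable_constant):
    if lhs_is_immutable_constant and op in _INPLACE_TO_BINARY:
        # `x += y` on a constant re-binds x, i.e. behaves as `x = x + y`.
        return _INPLACE_TO_BINARY[op](lhs, rhs)
    return op(lhs, rhs)

print(desugar_inplace(operator.iadd, 1, 2, True))  # 3; no `1 += y` emitted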

Test Plan:

I verified that the code in
https://github.com/pytorch/pytorch/issues/112656
compiles with this fix, and the compiled functions produce the same outputs as the originals.

This needs unit tests, but I'd like to get feedback on the approach in the meantime.

Fixes #112656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113117
Approved by: https://github.com/yanboliang
2023-11-10 00:22:55 +00:00
bf452dcde6 Revert "[pytree] reorganize submodule structure for C++ and Python pytree (#112278)"
This reverts commit fa895da968ec6f1ae128ee95fcb96ba9addac8a0.

Reverted https://github.com/pytorch/pytorch/pull/112278 on behalf of https://github.com/PaliC due to in the bottom diff in the stack changing _register_pytree_node's signature is bc breaking, please revert the signature and reland ([comment](https://github.com/pytorch/pytorch/pull/112278#issuecomment-1804870560))
2023-11-10 00:12:52 +00:00
c967dc526a [inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)
This PR is just moving things around, so code shared by multiple tests files is in torch/testing/_internal/inductor_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113275
Approved by: https://github.com/yanboliang
2023-11-10 00:11:09 +00:00
8a91138f60 Dont error on returned constant, fix for levit_128 (#112544)
Previously, levit_128 would fail on inference because we would return a view of a constant, which messed up our assertions of outputs being in the cuda graph pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112544
Approved by: https://github.com/ezyang
ghstack dependencies: #112543
2023-11-10 00:04:25 +00:00
f8a6ea770c [UCC] Fix input tensor in scatter (#112246)
The input tensor is valid only on the root rank. Fixes https://github.com/openucx/ucc/issues/859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112246
Approved by: https://github.com/Aidyn-A, https://github.com/Fuzzkatt, https://github.com/kwen2501
2023-11-09 22:53:40 +00:00
c7e0fa49b6 [UCC][CUDA] Overlap p2p (#111608)
The process group needs to set different streams for send and recv ops to make them asynchronous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111608
Approved by: https://github.com/kwen2501
2023-11-09 22:48:25 +00:00
bb06725ee0 Update mentions of deprecated functions in complex_numbers.rst (#113391)
`torch.svd` is deprecated, and `torch.solve` is completely removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113391
Approved by: https://github.com/malfet, https://github.com/lezcano
2023-11-09 22:32:26 +00:00
afbf345807 [ROCm] Unskip functorch tests that now work (#110760)
This PR unskips some of the now-working tests that were skipped as a result of https://github.com/pytorch/pytorch/issues/96560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110760
Approved by: https://github.com/zou3519, https://github.com/jeffdaily
2023-11-09 22:26:02 +00:00
0aed86a175 Fix docstring errors in Zero Redundancy Optimizer (#113200)
This PR reduces docstring errors from 98 to 0. This can be verified by running:
`pydocstyle path-to-zero_redundancy_optimizer.py --count`

BEFORE the PR:
`pydocstyle torch/distributed/optim/zero_redundancy_optimizer.py --count`
98
AFTER the PR:
`pydocstyle torch/distributed/optim/zero_redundancy_optimizer.py --count`
0

Fixes #112642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113200
Approved by: https://github.com/weifengpy
2023-11-09 22:21:40 +00:00
e6f0960762 [inductor] Make debug.py pass follow-imports typechecking (#113307)
pydot accepts both a str and a list of str for its `prog` parameter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113307
Approved by: https://github.com/Skylion007
ghstack dependencies: #113304, #113305, #113306
2023-11-09 22:08:17 +00:00
a65969928c [inductor] Make codecache.py pass follow-imports typechecking (#113306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113306
Approved by: https://github.com/Skylion007
ghstack dependencies: #113304, #113305
2023-11-09 22:08:17 +00:00
87082bd025 Reduce single reader check time for inline_container (#113328)
Differential Revision: D51089711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113328
Approved by: https://github.com/jiayisuse
2023-11-09 22:02:28 +00:00
a3a55df4af [dynamo] Add .pyi declaration of _CacheEntry (#113305)
This is required for enabling follow-imports=silent; referenced by
_dynamo/types.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113305
Approved by: https://github.com/Skylion007, https://github.com/ezyang
ghstack dependencies: #113304
2023-11-09 21:55:49 +00:00
767ce2b81c [dynamo] Make decorators.py pass follow-import typechecking (#113304)
I am trying to turn on `follow_imports=silent` for MYPYNOFOLLOW.
However, this requires a huge number of changes, so I am breaking it
down to a per-file basis.

Unfortunately, we will not be able to turn on `follow_imports` until all
files are fixed, so there is no way to stop regressions. So I hope to get
these fixes in as fast as possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113304
Approved by: https://github.com/Skylion007
2023-11-09 21:55:49 +00:00
4e2e0437ea [fx] stylistic improvements for fx.split_module (#113373)
Was overly verbose before. Less qualified / long names = more clarity

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113373
Approved by: https://github.com/wconstab
2023-11-09 21:49:27 +00:00
82369e44a9 Add sym_node to uninteresting files (#113349)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113349
Approved by: https://github.com/Skylion007
2023-11-09 21:38:57 +00:00
ff592f1038 [iOS][PTMCoreMLCompiler] Refactor use of deprecated writeToFile:atomically: (#113377)
Summary:
The NSString writeToFile:atomically: method was deprecated in iOS 2.0.
This diff replaces it with a call to writeToFile:atomically:encoding:error:

duplicate of D51003188 to fix gh permissions

Test Plan: ci

Differential Revision: D51164941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113377
Approved by: https://github.com/kirklandsign
2023-11-09 21:08:23 +00:00
b8a302ae6a Disable flaky cpp test (#113302)
Fixes [#113251](https://github.com/pytorch/pytorch/issues/113251)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113302
Approved by: https://github.com/clee2000
2023-11-09 20:30:31 +00:00
501d118255 [quant][pt2e] Add transform_for_annotation method in Quantizer (#113115)
Summary:
Adding the method so that people can do some transformations before annotation to make the graph easier to annotate

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_transform_for_annotation

Differential Revision: [D51141080](https://our.internmc.facebook.com/intern/diff/D51141080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113115
Approved by: https://github.com/kimishpatel
2023-11-09 20:23:29 +00:00
e53da90fe6 [Execution Trace] record global rank in pg_config_info (#113316)
Summary:
pg_config_info is used to dump PG information in the Execution Trace (ET). For trace analysis purposes and the PARAM replay benchmark, the global rank is more meaningful than group ranks.

P.S. `ranks` is a map of global rank -> group rank.

Test Plan: Tested in HPC

Differential Revision: D51136587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113316
Approved by: https://github.com/XilunWu
2023-11-09 20:04:43 +00:00
5ccd22502f [contextlib] Wrapping a function with set_grad_enabled will consume its global mutation (#113359)
Fixes https://github.com/pytorch/pytorch/issues/113298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113359
Approved by: https://github.com/soulitzer, https://github.com/jansel
2023-11-09 19:16:20 +00:00
0381d8ce68 Quantized max pool 2d (#112937)
Summary: Add quantized max pool 2d operation

Test Plan:
Check that all quantized tests pass:

buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
[       OK ] VulkanAPITest.uniform_buffer_copy (66 ms)
[ RUN      ] VulkanAPITest.copy_to_buffer
[       OK ] VulkanAPITest.copy_to_buffer (61 ms)
[ RUN      ] VulkanAPITest.copy_to_buffer_channels_last
[       OK ] VulkanAPITest.copy_to_buffer_channels_last (28 ms)
[ RUN      ] VulkanAPITest.cpu_to_vulkan_and_dequantize_quint8
[       OK ] VulkanAPITest.cpu_to_vulkan_and_dequantize_quint8 (58 ms)
[ RUN      ] VulkanAPITest.cpu_to_vulkan_and_dequantize_qint8
[       OK ] VulkanAPITest.cpu_to_vulkan_and_dequantize_qint8 (44 ms)
[ RUN      ] VulkanAPITest.cpu_to_vulkan_and_dequantize_qint32
[       OK ] VulkanAPITest.cpu_to_vulkan_and_dequantize_qint32 (72 ms)
[ RUN      ] VulkanAPITest.quantize_dequantize
[       OK ] VulkanAPITest.quantize_dequantize (2 ms)
[ RUN      ] VulkanAPITest.quantize_per_tensor_and_dequantize_quint8
[       OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_quint8 (69 ms)
[ RUN      ] VulkanAPITest.quantize_per_tensor_and_dequantize_quint8_qparams
[       OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_quint8_qparams (58 ms)
[ RUN      ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint8
[       OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint8 (77 ms)
[ RUN      ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint8_qparams
[       OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint8_qparams (54 ms)
[ RUN      ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint32
[       OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint32 (93 ms)
[ RUN      ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint32_qparams
[       OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint32_qparams (90 ms)
[ RUN      ] VulkanAPITest.quantized_add
[       OK ] VulkanAPITest.quantized_add (2 ms)
[ RUN      ] VulkanAPITest.quantized_add_broadcast
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1103 17:42:18.018113 4075724928 Resize.cpp:35] Warning: An output with one or more elements was resized since it had shape [2, 13, 1, 27], which does not match the required output shape [2, 13, 32, 27]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
[       OK ] VulkanAPITest.quantized_add_broadcast (2 ms)
[ RUN      ] VulkanAPITest.quantized_add_broadcast1
[       OK ] VulkanAPITest.quantized_add_broadcast1 (1 ms)
[ RUN      ] VulkanAPITest.quantized_add_broadcast2
W1103 17:42:18.022008 4075724928 Resize.cpp:35] Warning: An output with one or more elements was resized since it had shape [32, 1], which does not match the required output shape [32, 27]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
[       OK ] VulkanAPITest.quantized_add_broadcast2 (0 ms)
[ RUN      ] VulkanAPITest.quantized_add_broadcast3
[       OK ] VulkanAPITest.quantized_add_broadcast3 (0 ms)
[ RUN      ] VulkanAPITest.quantized_add_dif_params
[       OK ] VulkanAPITest.quantized_add_dif_params (1 ms)
[ RUN      ] VulkanAPITest.conv2d
[       OK ] VulkanAPITest.conv2d (4 ms)
[ RUN      ] VulkanAPITest.conv2d_pw
[       OK ] VulkanAPITest.conv2d_pw (88 ms)
[ RUN      ] VulkanAPITest.conv2d_dw
[       OK ] VulkanAPITest.conv2d_dw (32 ms)
[ RUN      ] VulkanAPITest.quantized_sub
[       OK ] VulkanAPITest.quantized_sub (1 ms)
[ RUN      ] VulkanAPITest.quantized_mul
[       OK ] VulkanAPITest.quantized_mul (1 ms)
[ RUN      ] VulkanAPITest.quantized_div
[       OK ] VulkanAPITest.quantized_div (1 ms)
[ RUN      ] VulkanAPITest.quantized_upsample_nearest2d
[       OK ] VulkanAPITest.quantized_upsample_nearest2d (0 ms)
[ RUN      ] VulkanAPITest.max_pool2d_qint8
[       OK ] VulkanAPITest.max_pool2d_qint8 (5 ms)
[ RUN      ] VulkanAPITest.max_pool2d_quint8
[       OK ] VulkanAPITest.max_pool2d_quint8 (4 ms)
[ RUN      ] VulkanAPITest.quantized_add_tests
[       OK ] VulkanAPITest.quantized_add_tests (77 ms)
[ RUN      ] VulkanAPITest.quantized_sub_tests
[       OK ] VulkanAPITest.quantized_sub_tests (104 ms)
[ RUN      ] VulkanAPITest.quantized_mul_tests
[       OK ] VulkanAPITest.quantized_mul_tests (78 ms)
[ RUN      ] VulkanAPITest.quantized_div_tests
[       OK ] VulkanAPITest.quantized_div_tests (124 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_fixed_params_uint8
[       OK ] VulkanAPITest.conv2d_quantized_fixed_params_uint8 (1 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_computed_params_uint8
[       OK ] VulkanAPITest.conv2d_quantized_computed_params_uint8 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_random_params_uint8
[       OK ] VulkanAPITest.conv2d_quantized_random_params_uint8 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_prepack_fixed_params_uint8
[       OK ] VulkanAPITest.conv2d_quantized_prepack_fixed_params_uint8 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_prepack_computed_params_uint8
[       OK ] VulkanAPITest.conv2d_quantized_prepack_computed_params_uint8 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_prepack_random_params_uint8
[       OK ] VulkanAPITest.conv2d_quantized_prepack_random_params_uint8 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_fixed_params_uint8
[       OK ] VulkanAPITest.conv2d_dw_quantized_fixed_params_uint8 (4 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_computed_params_uint8
[       OK ] VulkanAPITest.conv2d_dw_quantized_computed_params_uint8 (3 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_random_params_uint8
[       OK ] VulkanAPITest.conv2d_dw_quantized_random_params_uint8 (3 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_prepack_fixed_params_uint8
[       OK ] VulkanAPITest.conv2d_dw_quantized_prepack_fixed_params_uint8 (3 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_prepack_computed_params_uint8
[       OK ] VulkanAPITest.conv2d_dw_quantized_prepack_computed_params_uint8 (3 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_prepack_random_params_uint8
[       OK ] VulkanAPITest.conv2d_dw_quantized_prepack_random_params_uint8 (3 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_fixed_params_uint8
[       OK ] VulkanAPITest.conv2d_pw_quantized_fixed_params_uint8 (11 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_computed_params_uint8
input_dif too big: 0.0175897. generating input again ...
[       OK ] VulkanAPITest.conv2d_pw_quantized_computed_params_uint8 (17 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_random_params_uint8
[       OK ] VulkanAPITest.conv2d_pw_quantized_random_params_uint8 (11 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_prepack_fixed_params_uint8
[       OK ] VulkanAPITest.conv2d_pw_quantized_prepack_fixed_params_uint8 (11 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_prepack_computed_params_uint8
[       OK ] VulkanAPITest.conv2d_pw_quantized_prepack_computed_params_uint8 (11 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_uint8
[       OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_uint8 (11 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_fixed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_quantized_fixed_params_int8_int32 (1 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_computed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_quantized_computed_params_int8_int32 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_random_params_int8_int32
[       OK ] VulkanAPITest.conv2d_quantized_random_params_int8_int32 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_prepack_fixed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_quantized_prepack_fixed_params_int8_int32 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_prepack_computed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_quantized_prepack_computed_params_int8_int32 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_quantized_prepack_random_params_int8_int32
[       OK ] VulkanAPITest.conv2d_quantized_prepack_random_params_int8_int32 (0 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_fixed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_dw_quantized_fixed_params_int8_int32 (3 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_computed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_dw_quantized_computed_params_int8_int32 (4 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_random_params_int8_int32
[       OK ] VulkanAPITest.conv2d_dw_quantized_random_params_int8_int32 (3 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_prepack_fixed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_dw_quantized_prepack_fixed_params_int8_int32 (4 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_prepack_computed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_dw_quantized_prepack_computed_params_int8_int32 (3 ms)
[ RUN      ] VulkanAPITest.conv2d_dw_quantized_prepack_random_params_int8_int32
[       OK ] VulkanAPITest.conv2d_dw_quantized_prepack_random_params_int8_int32 (3 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_fixed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_pw_quantized_fixed_params_int8_int32 (11 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_computed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_pw_quantized_computed_params_int8_int32 (12 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_random_params_int8_int32
[       OK ] VulkanAPITest.conv2d_pw_quantized_random_params_int8_int32 (11 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_prepack_fixed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_pw_quantized_prepack_fixed_params_int8_int32 (11 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_prepack_computed_params_int8_int32
[       OK ] VulkanAPITest.conv2d_pw_quantized_prepack_computed_params_int8_int32 (12 ms)
[ RUN      ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32
[       OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32 (11 ms)
[ RUN      ] VulkanAPITest.quantized_tensor_get_scale_zero_point
[       OK ] VulkanAPITest.quantized_tensor_get_scale_zero_point (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (3 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (2 ms)
[----------] 78 tests from VulkanAPITest (1537 ms total)

[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1537 ms total)
[  PASSED  ] 78 tests.

Differential Revision: D50821920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112937
Approved by: https://github.com/yipjustin
2023-11-09 19:15:26 +00:00
44c0521e8c fix: docstring error in torch/distributed module (#113241)
Fixes: #113193

`pydocstyle <all_files_in_issue> --count`

- Before: 345
- After: 130

For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
2023-11-09 19:10:20 +00:00
977e555ca6 Skip conv-bn folding on multiple conv uses (#112543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112543
Approved by: https://github.com/XiaobingSuper, https://github.com/davidberard98
2023-11-09 18:38:10 +00:00
b0c9ccdc4b Add standard deviation of metrics over runs to inference benchmark (#113309)
Run each `(batch_size, compile)` benchmark 10 times in `./runner.sh` and get mean and standard deviation of metrics in output table

Only report `warmup latency`, `average_latency`, `throughput` and `gpu_util`

Break `output.md` file into a single markdown file per `(batch_size, compile)` configuration. Further runs of `./runner.sh` will append one row to the table in each file for easy comparison
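
A small sketch of the per-configuration aggregation (the metric name and values are illustrative):
```python
import statistics

# e.g. average_latency (ms) collected over 10 runs of one
# (batch_size, compile) configuration
runs = [12.1, 11.8, 12.4, 12.0, 12.2, 11.9, 12.3, 12.1, 12.0, 12.2]
print(f"{statistics.mean(runs):.2f} ± {statistics.stdev(runs):.2f} ms")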

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113309
Approved by: https://github.com/albanD
2023-11-09 18:38:05 +00:00
d977f118ad Update ruff linter to v0.1.5 (#113355)
Update ruff linter to v0.1.5. Mainly bugfixes, primarily to autofixes, but good to include since there is at least one pydocstyle autofix update in there as people prepare their pydocstyle PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113355
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-11-09 18:06:54 +00:00
9834fb7fd0 [dtensor] full_tensor to return synchronously (#113322)
The full_tensor API should return synchronously instead of returning an
AsyncCollectiveTensor; if the return value is one, we now wait on it
directly. This makes the full_tensor API more precise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113322
Approved by: https://github.com/wz337
2023-11-09 18:02:40 +00:00
4da5d4b2ef Fix Clang compilation error with Lib ATen for ppc64le (#106446)
This patch fixes errors while compiling Lib ATen with Clang for ppc64le.
I used clang version 15.0.7.
The errors are as follows:
```
No matching function for call to 'vec_sel’
No matching function for call to 'vec_splats'
Excess elements in scalar initializer
Use of undeclared identifier 'vec_vsubudm'
Fix for multiple error within int64_t  DEFINE_MEMBER_OP_AND_ONE
```
References:
- https://releases.llvm.org/9.0.0/tools/clang/docs/AttributeReference.html
- https://reviews.llvm.org/D81083

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106446
Approved by: https://github.com/malfet
2023-11-09 17:27:14 +00:00
289d887a41 Fix ZeroDivisionError when unfolding a zero-dimension tensor in compile mode (#113259)
Fixes #113026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113259
Approved by: https://github.com/peterbell10
2023-11-09 17:25:36 +00:00
1d56e7b5af Adds broadcast to functional collectives (#112668)
Adds `broadcast` to functional collectives, including inductor support.

Test with `python test_inductor_collectives.py -- TestCollectivesMultiProc.test_broadcast_inductor`
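
A usage sketch; the exact signature is an assumption modeled on the other functional collectives (tensor, source rank, group):
```python
import torch
import torch.distributed._functional_collectives as funcol

def broadcast_from_rank0(t: torch.Tensor, group) -> torch.Tensor:
    # Functional collectives return a new tensor instead of mutating the
    # input in place, so inductor can reason about the dataflow.
    return funcol.broadcast(t, src=0, group=group)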

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112668
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2023-11-09 15:47:52 +00:00
bf2c20be55 [inductor test] enable dynamic loop for test_adaptive_avg_pool1d_argmax (#113339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113339
Approved by: https://github.com/lezcano
ghstack dependencies: #113168
2023-11-09 15:14:52 +00:00
f98ba596f1 Use CapturedTraceback symbolizer for C++ exceptions from Python library (#113207)
This is the cheap and cheerful implementation, which is only enabled on TORCH_SHOW_CPP_STACKTRACES, because it *eagerly* symbolizes immediately at exception throw time, even if the exception will end up getting caught. It would be better to do this lazily and only symbolize when we try to print the exception, but that requires a more involved refactor of c10::Error that I don't feel like doing.

Compare the output before:

```
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x95 (0x7fa21b99d975 in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)
frame #1: c10::TensorImpl::throw_cannot_call_with_symbolic(char const*) const + 0x8d (0x7fa21b951269 in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)
frame #2: c10::TensorImpl::sizes_custom() const + 0x9f (0x7fa21b9770df in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)
frame #3: at::meta::structured_mm::meta(at::Tensor const&, at::Tensor const&) + 0x31e (0x7fa20a202a8e in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x29f34de (0x7fa20b5f34de in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2a1fd8e (0x7fa20b61fd8e in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x6b907b (0x7fa2142b907b in /data/users/ezyang/c/pytorch/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6b6175 (0x7fa2142b6175 in /data/users/ezyang/c/pytorch/torch/lib/libtorch_python.so)
```

and after:

```
#4 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#5 c10::TensorImpl::throw_cannot_call_with_symbolic(char const*) const from ??:0
#6 c10::TensorImpl::sizes_custom() const [clone .localalias] from TensorImpl.cpp:0
#7 at::meta::structured_mm::meta(at::Tensor const&, at::Tensor const&) from ??:0
#8 at::(anonymous namespace)::wrapper_Meta_mm_out_out(at::Tensor const&, at::Tensor const&, at::Tensor&) from RegisterMeta.cpp:0
#9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (at::Tensor const&, at::Tensor const&, at::Tensor&), &at::(anonymous namespace)::wrapper_Meta_mm_out_out>, at::Tensor&, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from RegisterMeta.cpp:0
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113207
Approved by: https://github.com/Skylion007
2023-11-09 15:06:08 +00:00
e6eab49e11 [dynamo] graph break on setattr requires_grad (#113163)
On main: `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`
With this PR: we graph break, eager applies the mutation, and new tensors are tracked.

Fixes https://github.com/pytorch/pytorch/issues/109505 (the original bug does not occur, but a new bug where the mutation isn't applied - because AOTAutograd is not `requires_grad` mutation aware - is mitigated)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113163
Approved by: https://github.com/bdhirsh
2023-11-09 13:13:29 +00:00
8c704f7a0e [inductor cpp] fix argmax with >1 reduction dims (#113168)
Fix #113013.

The argmax (and argmin) implementation doesn't handle the index computation properly when the number of reduction dims is larger than one; it wrongly assumed a single reduction dim.

With the given reproducer, the generated code before the change:
```c++
#include "/tmp/torchinductor_jgong5/tb/ctbgktuhgnnlel6ipqkfk76lfztr5pledachdkcq3asdqtlxpzt6.h"
extern "C" void kernel(const double* in_ptr0,
                       long* out_ptr0)
{
    {
        {
            struct IndexValue_1 {size_t index; double value;};
            IndexValue_1 tmp_acc0{0, -std::numeric_limits<double>::infinity()};
            #if !defined(__clang_major__) || __clang_major__ > 9
            #pragma omp declare reduction(argmax : IndexValue_1 :\
                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
            	initializer(omp_priv = {0, -std::numeric_limits<double>::infinity()})
            #endif
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(9L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(2L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = c10::convert<long>(0);
                    auto tmp1 = c10::convert<long>(1);
                    auto tmp2 = tmp0 < tmp1;
                    auto tmp3 = c10::convert<long>(at::native::div_floor_integer((3L*x1), 2L));
                    auto tmp4 = c10::convert<long>(2L + (at::native::div_floor_integer((3L*x1), 2L)));
                    auto tmp5 = tmp3 < tmp4;
                    auto tmp6 = tmp2 & tmp5;
                    auto tmp7 = [&]
                    {
                        auto tmp8 = in_ptr0[static_cast<long>((3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))];
                        return tmp8;
                    }
                    ;
                    auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0);
                    auto tmp10 = c10::convert<long>(1L + (at::native::div_floor_integer((3L*x1), 2L)));
                    auto tmp11 = tmp10 < tmp4;
                    auto tmp12 = tmp2 & tmp11;
                    auto tmp13 = [&]
                    {
                        auto tmp14 = in_ptr0[static_cast<long>(1L + (3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))];
                        return tmp14;
                    }
                    ;
                    auto tmp15 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0);
                    auto tmp16 = tmp15 + tmp9;
                    auto tmp17 = [&]
                    {
                        auto tmp18 = c10::convert<double>(1.0);
                        return tmp18;
                    }
                    ;
                    auto tmp19 = tmp6 ? tmp17() : static_cast<decltype(tmp17())>(0.0);
                    auto tmp20 = [&]
                    {
                        auto tmp21 = c10::convert<double>(1.0);
                        return tmp21;
                    }
                    ;
                    auto tmp22 = tmp12 ? tmp20() : static_cast<decltype(tmp20())>(0.0);
                    auto tmp23 = tmp22 + tmp19;
                    auto tmp24 = tmp16 / tmp23;
                    if (tmp_acc0.value < tmp24) {
                        tmp_acc0.index = x1; tmp_acc0.value = tmp24; // both x0 and x1 are reduction vars while only x1 is assigned to tmp_acc0.index
                    }
                }
            }
            out_ptr0[static_cast<long>(0L)] = tmp_acc0.index;
        }
    }
}
```
After fix:
```c++
#include "/tmp/torchinductor_jgong5/tb/ctbgktuhgnnlel6ipqkfk76lfztr5pledachdkcq3asdqtlxpzt6.h"
extern "C" void kernel(const double* in_ptr0,
                       long* out_ptr0)
{
    {
        {
            struct IndexValue_1 {size_t index; double value;};
            IndexValue_1 tmp_acc0{0, -std::numeric_limits<double>::infinity()};
            #if !defined(__clang_major__) || __clang_major__ > 9
            #pragma omp declare reduction(argmax : IndexValue_1 :\
                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
            	initializer(omp_priv = {0, -std::numeric_limits<double>::infinity()})
            #endif
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(9L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(2L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = c10::convert<long>(0);
                    auto tmp1 = c10::convert<long>(1);
                    auto tmp2 = tmp0 < tmp1;
                    auto tmp3 = c10::convert<long>(at::native::div_floor_integer((3L*x1), 2L));
                    auto tmp4 = c10::convert<long>(2L + (at::native::div_floor_integer((3L*x1), 2L)));
                    auto tmp5 = tmp3 < tmp4;
                    auto tmp6 = tmp2 & tmp5;
                    auto tmp7 = [&]
                    {
                        auto tmp8 = in_ptr0[static_cast<long>((3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))];
                        return tmp8;
                    }
                    ;
                    auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0);
                    auto tmp10 = c10::convert<long>(1L + (at::native::div_floor_integer((3L*x1), 2L)));
                    auto tmp11 = tmp10 < tmp4;
                    auto tmp12 = tmp2 & tmp11;
                    auto tmp13 = [&]
                    {
                        auto tmp14 = in_ptr0[static_cast<long>(1L + (3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))];
                        return tmp14;
                    }
                    ;
                    auto tmp15 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0);
                    auto tmp16 = tmp15 + tmp9;
                    auto tmp17 = [&]
                    {
                        auto tmp18 = c10::convert<double>(1.0);
                        return tmp18;
                    }
                    ;
                    auto tmp19 = tmp6 ? tmp17() : static_cast<decltype(tmp17())>(0.0);
                    auto tmp20 = [&]
                    {
                        auto tmp21 = c10::convert<double>(1.0);
                        return tmp21;
                    }
                    ;
                    auto tmp22 = tmp12 ? tmp20() : static_cast<decltype(tmp20())>(0.0);
                    auto tmp23 = tmp22 + tmp19;
                    auto tmp24 = tmp16 / tmp23;
                    if (tmp_acc0.value < tmp24) {
                        tmp_acc0.index = static_cast<long>(x1 + (2L*x0)); tmp_acc0.value = tmp24;
                    }
                }
            }
            out_ptr0[static_cast<long>(0L)] = tmp_acc0.index;
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113168
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-11-09 11:47:51 +00:00
be66d5e845 Add file name and size to the serialization metadata logging (#113077)
Summary:
To be able to get more info on serialization/deserialization events, add these two fields to the metadata logging:
- file_name
- file_size

Test Plan: buck2 test mode/dev caffe2/caffe2/serialize:inline_container_test

Reviewed By: davidberard98

Differential Revision: D51040426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113077
Approved by: https://github.com/davidberard98
2023-11-09 11:14:24 +00:00
addb8e29cd Enable 2d + AC torch.compile (#112536)
This PR enables AC + torch.compile to work with FSDP + TP. The fix to the
higher-order op path is that we need to check both tensor and
tensor-subclass bases when making a sourceless builder.

NOTE: selective AC + 2D is still not working, need to fix this
separately

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112536
Approved by: https://github.com/yf225
2023-11-09 06:12:13 +00:00
acd595e352 [easy][tp] Fix typo (#113292)
Summary: as title

Test Plan: buck test mode/opt  -c fbcode.enable_gpu_sections=true //caffe2/test/distributed/_tensor/experimental:tp_transform

Differential Revision: D51124333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113292
Approved by: https://github.com/Skylion007
2023-11-09 06:02:18 +00:00
0093e23e52 [dynamo] GradModeVariable should only be eagerly initialized when doing the equivalent of set_grad_enabled (#113293)
The grad mode variable was previously initialized eagerly when called, which is wrong when it is not explicitly used via `set_grad_enabled`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113293
Approved by: https://github.com/jansel
2023-11-09 06:00:14 +00:00
b3ad29e269 [export] Fix executorch models. (#113296)
Summary: yolo fixing issues. See Test plan

Test Plan:
buck2 run 'fbcode//mode/dev' fbcode//executorch/examples/portable/test:test_export -- -r test_mv3_export_to_executorch

[Need acl to repro this but the error message looks straightforward]
buck2 test 'fbcode//mode/dev-nosan' fbcode//pye/model_inventory/nlu_stella_cap:nlu_stella_cap_test -- --exact 'pye/model_inventory/nlu_stella_cap:nlu_stella_cap_test - test_export_to_backend_dynamic_quantized (pye.model_inventory.nlu_stella_cap.NluStellaCapTest.NluStellaCapTest)'

Differential Revision: D51128480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113296
Approved by: https://github.com/tugsbayasgalan
2023-11-09 03:58:16 +00:00
fbf7866ac9 [Inductor] Fallback scatter when src dtype is bf16 (#113204)
basic_gnn_gcn, basic_gnn_gin, basic_gnn_sage now pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113204
Approved by: https://github.com/eellison
2023-11-09 03:43:11 +00:00
31ded95cd5 [2D] Bind _fsdp_extension to FSDP instances (#113237)
Currently, when we have 2D composition, a global variable _extensions controls the 2D deviation we need to take in state_dict calls (See https://github.com/pytorch/pytorch/blob/release/2.1/torch/distributed/fsdp/_fsdp_extensions.py#L66-L68). This is problematic when we have both a 2D model and a plain FSDP model in the same dist environment, as the _extensions will be mistakenly turned on for the plain FSDP model, resulting in state_dict error (RuntimeError: No parent device_mesh is found for FSDP device_mesh.).

This PR binds _fsdp_extension to the FSDP instances to make sure that state_dict calls do not interfere with each other when mixing 2D and 1D parallelism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113237
Approved by: https://github.com/fduwjj, https://github.com/fegin
2023-11-09 03:31:03 +00:00
204ec11e6d [inductor][easy] Fix fusion logging (#113308)
We should use %s instead of %d, as the numels may be sympy Exprs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113308
Approved by: https://github.com/lezcano
2023-11-09 03:19:39 +00:00
adcf9bb2bd optimize case where div denominator is -1 (#112878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112878
Approved by: https://github.com/lezcano
2023-11-09 02:41:05 +00:00
b694f88ef6 Grandfather in built-in TorchScript ops to being pt2_compliant (#113061)
I'm seeing ops like torch.ops.aten.mul.complex being used with
torch.compile (though this seems strange to me), but we should
grandfather these in.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113061
Approved by: https://github.com/ezyang
ghstack dependencies: #113050
2023-11-09 02:35:33 +00:00
c88a36ebce Grandfather in some more pytorch ops to be pt2_compliant (#113050)
We're not directly testing these, but in general the policy is to assume
that PyTorch ops inside the pytorch repo are compliant.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113050
Approved by: https://github.com/ezyang
2023-11-09 02:35:33 +00:00
e2236ae097 [tp] Fix test_tp_transform_with_uncovered_op (#113310)
Summary:
Test fails on CPU currently with some weird error when `wait_tensor = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor);` runs.
```
[rank2]:[2023-11-08 13:30:29,940] torch.testing._internal.common_distributed: [ERROR] RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
```

https://www.internalfb.com/intern/test/562950070959214/

Test Plan: buck test mode/opt  -c fbcode.enable_gpu_sections=true //caffe2/test/distributed/_tensor/experimental:tp_transform

Reviewed By: weifengpy

Differential Revision: D51131676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113310
Approved by: https://github.com/wanchaol
2023-11-09 02:00:11 +00:00
15b61d6c1a TensorImpl: Lazily compute numel and contiguity when symbolic (#112785)
Currently whenever the sizes or strides are modified for a `TensorImpl` we
eagerly recompute the numel and memory format flags. This is fine for static
shapes as it's all fast C++ code, but for symbolic shapes it runs slow python code.

This instead changes the `SymbolicShapeMeta` object to compute the derived
quantities lazily at the first request. This has the added benefit that we can
now pass assumptions in `empty_tensor_restride` which remove the need to compute
some contiguity flags at all.
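
A Python sketch of the laziness pattern described above: invalidate on mutation, compute on first read (names are illustrative; the real change is in the C++ `SymbolicShapeMeta`).
```python
import math

class SymbolicShapeMetaSketch:
    def __init__(self, sizes):
        self._sizes = list(sizes)
        self._numel = None  # derived quantity, computed lazily

    def set_sizes(self, sizes):
        self._sizes = list(sizes)
        self._numel = None  # invalidate instead of eagerly recomputing

    @property
    def numel(self):
        if self._numel is None:  # first request runs the (slow) computation
            self._numel = math.prod(self._sizes)
        return self._numel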

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112785
Approved by: https://github.com/ezyang
ghstack dependencies: #112689, #112890
2023-11-09 01:36:37 +00:00
8c4bdac560 TensorImpl: Move symbolic refresh_numel and refresh_contiguous into their own class (#112890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112890
Approved by: https://github.com/lezcano
ghstack dependencies: #112689
2023-11-09 01:36:37 +00:00
8858edad65 [dynamo] Refactor test cross importing (#113242)
Having tests import tests is a bit annoying because fbcode/oss have different paths.  This moves that stuff into a helper function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113242
Approved by: https://github.com/yanboliang
2023-11-09 01:36:27 +00:00
325e0fdfdd Enable masked_scatter_backward for inductor (#109642)
masked_scatter_backward was previously implemented as a
CompositeExplicitAutograd, which involved a decomp that calls
masked_select, and masked_select in general produces data-dependent
shapes that inductor doesn't support. But masked_scatter_backward
reshapes the return value of masked_select such that the end result has
a static shape again.

I have converted masked_scatter_backward into an aten op to avoid this
issue.
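
An illustration of the shape issue: the `masked_select` output size depends on the mask's contents, but scattering back restores the static input shape.
```python
import torch

x = torch.randn(4, 4)
mask = torch.rand(4, 4) > 0.5
selected = torch.masked_select(x, mask)   # data-dependent shape
print(selected.shape)                     # depends on how many True entries

out = torch.zeros_like(x).masked_scatter_(mask, selected)
print(out.shape)                          # torch.Size([4, 4]): static again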

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109642
Approved by: https://github.com/ezyang
ghstack dependencies: #108170
2023-11-09 01:27:57 +00:00
14811d69d7 [BE] Cleanup sdpa test helper usage (#113294)
# Summary

standardizes usage of the rand_sdpa_tensor helper

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113294
Approved by: https://github.com/soulitzer
2023-11-09 01:16:53 +00:00
84d64d72d6 Persist copy_ in training graph for inputs that don't require grad (#111046)
In this PR, we try to keep the input mutations in the forward graph IFF input mutation is data mutation and not metadata mutation and doesn't require grad. This is for optimizing inductor training graphs. (For more details: https://github.com/pytorch/pytorch/issues/109240)

We keep the input mutation in the graph by wrapping the original callable in a wrapper function that, at the end, adds an input.copy_(updated_input) call, which is then traced via make_fx. Previously, this was only enabled for the forward-only path and unconditionally disabled for the joint graph.

Another caveat is that when we are tracing through tensor subclasses, we won't allow any input mutations to be preserved in the graph. The reason is that it makes the code logic quite ugly for no obvious performance improvement.

Most of the changes in this PR are mechanical, and I didn't have to make any changes to the partitioner. Previously, forward/backward heavily relied on the metadata field `num_mutated_inps` to figure out whether something is returned as an extra output or not. But now, since we keep some mutations in the graph, we need to propagate something similar to `num_mutated_inps - num_graph_handled_inps`.
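
A minimal sketch of the wrapping trick (function names are illustrative), showing how the `copy_` ends up in the traced graph:
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def compute(x):          # stand-in for the original callable
    return x.sin()

def wrapper(x):
    out = compute(x)
    x.copy_(out)         # persist the input mutation inside the graph
    return out

gm = make_fx(wrapper)(torch.randn(3))
print(gm.graph)          # contains a copy_ node applied to the input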

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111046
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-11-09 00:40:29 +00:00
2c4be77f02 Revert "[dynamo] Graph break on setattr(Tensor, "data", Tensor) (#113043)" (#113297)
This reverts commit ddfe5725342b0c0f707222879ca9dac305f97210.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113297
Approved by: https://github.com/PaliC
2023-11-09 00:26:21 +00:00
94d95a91a2 Revert "[dynamo] graph break on setattr requires_grad (#113163)"
This reverts commit d261687d5f56ac8148fab2567cf1fa6dd5264def.

Reverted https://github.com/pytorch/pytorch/pull/113163 on behalf of https://github.com/PaliC due to relevant tests are not running for this pr, however, this is fixed after landing https://github.com/pytorch/pytorch/pull/113297/ ([comment](https://github.com/pytorch/pytorch/pull/113163#issuecomment-1802967236))
2023-11-09 00:23:04 +00:00
12c257cc00 [qunat][pt2e] Support allow_implicit_sharing flag (#112929)
Summary:
For a node `node1` and an edge `(node1, node2)`: since they observe the same
Tensor, we may want to implicitly share observers. This flag allows people to
turn off this behavior for the output of the node.

See the test_allow_implicit_sharing test for use case

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_allow_implicit_sharing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112929
Approved by: https://github.com/kimishpatel
2023-11-08 23:47:17 +00:00
625958d8bc Inductor support for native c10d_functional (#112439)
This PR adds Inductor support for [native c10d_functional ops](https://github.com/pytorch/pytorch/pull/110570).

The Inductor IRs introduced in this PR will replace the existing `CollectiveKernel` IR hierarchy. Compared to the existing collective IRs, the new IRs:
- Are target language agnostic and support AOTInductor.
- Express the constraints solely with read/write deps. This maximizes the potential for buffer reuse.
- Address an issue where out-of-place collective's input buffers could be mutated while being volatilely read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112439
Approved by: https://github.com/Chillee
2023-11-08 23:40:21 +00:00
297c26bb8e Support fp8 in AOTInductor + support optional<> in C ABI (#112527)
This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358

It turns out that we need to add support for optional types in order to
support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface
can't support optional<> directly, I am passing in optional types via
pointer instead.

`AtenTensorHandle`s are already pointers, so nothing needs to change
there. Only value types need to change.

We decided on this approach instead of adding an extra `bool` param to
the callee because this simplifies things. Having the same number of
arguments regardless of whether we are emitting Python / C++ /
ABI-compatible C++ makes codegen easier.
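
A Python-side sketch of the convention, using ctypes purely for illustration (the actual interface is the generated C shim): `None` crosses the ABI-stable boundary as a NULL pointer.
```python
import ctypes

def to_optional_double_arg(value):
    """Convert an Optional[float] into the nullable pointer the C callee expects."""
    if value is None:
        return ctypes.POINTER(ctypes.c_double)()   # NULL pointer == nullopt
    return ctypes.pointer(ctypes.c_double(value))  # pointer to the value

# The callee always receives the same number of arguments; it checks the
# pointer for NULL instead of taking an extra `bool` presence flag.
arg = to_optional_double_arg(None)
print(bool(arg))  # False: a NULL ctypes pointer is falsy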

There are a number of existing ABI-compatible functions that have
optional-typed value parameters. Previously, they just assumed they
would never be passed a `nullopt` / `None` at runtime. Changing them to
use pointer types now would break ABI stability, so I have created an
exclude list for those functions.

Finally, I think the current implementation is kind of messy, and only
works for FallbackKernels, even though technically ExternKernels could
also have the same issue. It also doesn't support optional types nested
in lists. I've left FIXME comments for both issues.

Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-11-08 22:56:48 +00:00
ee777a7c3c docs: Add docstring for torch.masked._ops.logaddexp (#113206)
logaddexp is neither a reduction nor a normalization, so
_apply_docstring_templates cannot be used to add a docstring.

Fixes https://github.com/pytorch/pytorch/issues/113082

Also fix another misspelling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113206
Approved by: https://github.com/cpuhrsch
2023-11-08 22:45:35 +00:00
f6c00b16c8 [aotinductor] Update the benchmarking script to clone an eager model (#113046)
Summary: Fix https://github.com/pytorch/pytorch/issues/113029, where running a model in eager mode can somehow change a weight's stride.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113046
Approved by: https://github.com/angelayi
2023-11-08 22:05:03 +00:00
24bb60d8a1 [inductor] Add test for debug.trace mode (#113240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113240
Approved by: https://github.com/oulgen
2023-11-08 21:50:18 +00:00
5506b9db43 [decomp] Fix _scaled_dot_product_flash_attention decomposition bug (#113102)
For `_scaled_dot_product_flash_attention` we don't have

`Tensor? attn_mask=None`

but `scaled_dot_product_attention` does. In the original decomp there's a
mixup where I added this argument to
`_scaled_dot_product_flash_attention`.

Fix it so that `_scaled_dot_product_flash_attention` is decomposed correctly.
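For illustration, a minimal sketch of the signature mismatch the decomposition has to respect:

```
import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 4, 8, 16)

# The public API accepts an optional mask (Tensor? attn_mask=None)...
out = F.scaled_dot_product_attention(q, k, v, attn_mask=None)

# ...whereas aten::_scaled_dot_product_flash_attention has no such
# parameter, so its decomposition must not take one either.
```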

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113102
Approved by: https://github.com/ezyang
2023-11-08 21:47:37 +00:00
aef9e43fe6 Revert "Replaced deprecated pkg_resources.packaging with packaging module (#113023)"
This reverts commit 81ea7a489a85d6f6de2c3b63206ca090927e203a.

Reverted https://github.com/pytorch/pytorch/pull/113023 on behalf of https://github.com/atalman due to breaks nightlies ([comment](https://github.com/pytorch/pytorch/pull/113023#issuecomment-1802720774))
2023-11-08 21:39:59 +00:00
b30f178d09 Replace assert with CUDA_KERNEL_ASSERT in Reduce.cuh for consistency (#113098)
Related to #94891

**Problem:**
We are trying to disable `printf` in kernels for the PyTorch build on ROCm, to fix the `torch.sum()` issues certain community users hit, by disabling `CUDA_KERNEL_ASSERT`. However, we found that hostcall `printf`s still happen in `ReduceSumProdKernel`, which is used by `torch.sum`.

**Reason:**
The reason is that there are `assert` calls inside `Reduce.cuh` (defined as `__assert_fail`), which cause the `printf`s.

**Fix:**
This pull request changes `assert` to `CUDA_KERNEL_ASSERT` so that we can consistently disable assertions/printf in CUDA/HIP kernel code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113098
Approved by: https://github.com/ezyang
2023-11-08 21:25:54 +00:00
77e8e8fd2d Rewrite docs so that it is OK to use record_stream before uses (#113282)
The previous documentation did not appear to accurately describe
the actual semantics in CUDA caching allocator.

When you call record_stream, we only record a stream use:

```
  void recordStream(Block* block, cuda::CUDAStream stream) {
    std::lock_guard<std::recursive_mutex> lock(mutex);
    if (stream.stream() == block->stream) {
      // ignore uses on the allocation stream, since those don't require any
      // special synchronization
      return;
    }
    block->stream_uses.insert(stream);
  }
```

It is only at deallocation time when we actually install an event on
stream uses that we will subsequently query to determine if the block
can be reused or not.
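In user terms, a minimal sketch of the now-documented pattern (requires a CUDA machine; the comments restate the semantics above):

```
import torch

s = torch.cuda.Stream()
buf = torch.empty(1024, device="cuda")
with torch.cuda.stream(s):
    buf.fill_(1.0)

# Recording before/at the time of use is fine: this only inserts s
# into the block's stream_uses set.
buf.record_stream(s)

# Only when the block is freed are events installed on the recorded
# streams, gating reuse of the memory.
del buf
```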

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113282
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-11-08 21:24:50 +00:00
5da9abfec2 [dynamo] Enable typechecking for comptime.py (#112999)
I made `comptime` a callable instance instead of a function because mypy
doesn't allow creating extra attributes on a plain function.
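The pattern, reduced to a generic sketch (names here are illustrative, not dynamo's actual attributes):

```
from typing import Any

class _Comptime:
    # A callable instance may carry extra attributes that mypy will
    # type-check, unlike attributes tacked onto a plain function object.
    def __call__(self, fn: Any) -> Any:
        return fn

    def print_graph(self) -> None:
        ...

comptime = _Comptime()

@comptime
def hook(ctx: Any) -> None:
    pass

comptime.print_graph()  # attribute access mypy can verify
```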

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112999
Approved by: https://github.com/ezyang
ghstack dependencies: #112130, #112970, #112971, #112972, #112973, #112974, #112975
2023-11-08 21:17:45 +00:00
26f907e09b [dynamo] Enable typechecking for skipfiles.py (#112975)
Not sure why mypy thinks `importlib.util.find_spec` is not a valid
lookup, but it seems OK if I explicitly import it.
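A small sketch of the mypy behavior described above:

```
# Referencing the function through the package attribute confuses mypy:
import importlib.util
spec = importlib.util.find_spec("json")

# Importing the name explicitly type-checks cleanly:
from importlib.util import find_spec
spec = find_spec("json")
```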

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112975
Approved by: https://github.com/yanboliang, https://github.com/eellison
ghstack dependencies: #112130, #112970, #112971, #112972, #112973, #112974
2023-11-08 21:17:45 +00:00
7fb56993ba [dynamo] Enable typechecking for device_interface.py (#112974)
One small runtime change: `get_interface_for_device()` now throws
instead of returning None when an interface is not found. Inspecting all
the callsites in the codebase shows that none of them actually check if
the return type is None, so I think this is safe.

I also silenced a bunch of mypy errors around method assignment; mypy
seems unable to handle the subtype checks correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112974
Approved by: https://github.com/eellison
ghstack dependencies: #112130, #112970, #112971, #112972, #112973
2023-11-08 21:17:45 +00:00
152f9bbb9a [dynamo] Switch MYPYNOFOLLOW config from includes to excludes (#112973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112973
Approved by: https://github.com/Skylion007
ghstack dependencies: #112130, #112970, #112971, #112972
2023-11-08 21:17:45 +00:00
bea2b703b0 [dynamo] Enable typechecking for bytecode_analysis.py (#112972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112972
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #112130, #112970, #112971
2023-11-08 21:17:45 +00:00
c1fa708b03 [dynamo] Enable typechecking for utils.py (#112971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112971
Approved by: https://github.com/lezcano, https://github.com/jansel
ghstack dependencies: #112130, #112970
2023-11-08 21:17:45 +00:00
1c40d1c683 [dynamo] Enable typechecking for profiler.py (#112970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112970
Approved by: https://github.com/ezyang
ghstack dependencies: #112130
2023-11-08 21:17:45 +00:00
dc63248b76 Make dynamo configs more amenable to static type checking (#112130)
`install_config_module` makes a regular module into a ConfigModule with
extra methods defined on it. mypy thinks those extra methods (or module
functions) are undefined since it cannot analyze something so
dynamic. As a workaround, I've created a fake module that defines these
extra functions, which I import into the config modules during type
checking.
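Roughly the following pattern, assuming a hypothetical stub module name (the import only runs for the type checker, so the dynamic `install_config_module` machinery is untouched at runtime):

```
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Stub that declares save_config(), patch(), etc. for mypy only.
    from torch.utils._config_typing import *  # noqa: F401,F403
```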

As part of this change, I've also added more types to config_utils.py
and enabled typechecking for torch/_dynamo/config.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112130
Approved by: https://github.com/jansel
2023-11-08 21:17:45 +00:00
d5eb9f725c Fix test_add_scalar_with_empty_list_tensor (#113262)
By actually instantiating the test method for different types and devices, rather than always creating the tensor on CPU.
Also, remove `bool` from the list, as adding 1 to bool is not supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113262
Approved by: https://github.com/jeanschmidt, https://github.com/atalman, https://github.com/lezcano
2023-11-08 20:56:37 +00:00
d261687d5f [dynamo] graph break on setattr requires_grad (#113163)
Main: `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`
This PR: graph breaks, eager applies the mutation, and the new tensors are tracked

Fixes https://github.com/pytorch/pytorch/issues/109505 (the original bug does not occur, but a new bug where the mutation isn't applied - because AOTAutograd is not `requires_grad` mutation aware - is mitigated)
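A repro-style sketch of the new behavior (assuming a build with this change):

```
import torch

@torch.compile(backend="eager")
def f(x):
    y = x + 1
    # This setattr now triggers a graph break; the mutation is applied
    # eagerly and the resulting tensor is tracked afterwards.
    y.requires_grad = True
    return y * 2

print(f(torch.ones(3)))
```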

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113163
Approved by: https://github.com/bdhirsh
2023-11-08 19:51:23 +00:00
a66f2a1b99 [state_dict] Move _gather_state_dict to dcp module (#112835)
This API is used by more than just FSDP, so this PR moves it to the DCP module.

Differential Revision: [D50962966](https://our.internmc.facebook.com/intern/diff/D50962966/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112835
Approved by: https://github.com/wz337
2023-11-08 19:42:56 +00:00
d98182e34e Revert "Grandfather in built-in TorchScript ops to being pt2_compliant (#113061)"
This reverts commit 493b52b3d9395bde3c0dc072885a15e71f786c78.

Reverted https://github.com/pytorch/pytorch/pull/113061 on behalf of https://github.com/PaliC due to breaking internal tests - contacted author with errors ([comment](https://github.com/pytorch/pytorch/pull/113061#issuecomment-1802528592))
2023-11-08 19:36:41 +00:00
81bf0bd68d [no ci] Fix typo in persons_of_interest.rst (#113283)
There is no `c` in `Hirsh`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113283
Approved by: https://github.com/bdhirsh
2023-11-08 19:36:32 +00:00
e49b9492c6 Revert "Grandfather in some more pytorch ops to be pt2_compliant (#113050)"
This reverts commit 85832c0b9b2c7a4299aa1640a952f4d0f48efa66.

Reverted https://github.com/pytorch/pytorch/pull/113050 on behalf of https://github.com/PaliC due to breaking internal tests - contacted author with errors ([comment](https://github.com/pytorch/pytorch/pull/113050#issuecomment-1802524046))
2023-11-08 19:33:15 +00:00
16f82198ca Export ReduleL1/ReduceL2 ONNX ops for aten::linalg_vector_norm(ord={1,2}) (#113173)
After #84624, aten::linalg_vector_norm started being used instead of aten::norm. In the ONNX exporter, the latter leveraged Reduce{L1,L2} when p={1,2}, which resulted in more optimized code in the ONNX Runtime

This PR extends aten::linalg_vector_norm to also use Reduce{L1,L2} when ord={1,2}, producing an equivalent ONNX subgraph.

This PR is a WIP. Pending work includes checking argument equivalence between `aten::norm` and `aten::linalg_vector_norm`, and maybe re-enabling the tests disabled by #84624.
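The calls affected, in a minimal sketch:

```
import torch

x = torch.randn(4, 5)

# With this change, ord=1 and ord=2 export to ONNX ReduceL1 and
# ReduceL2 respectively, matching what aten::norm used to produce.
l1 = torch.linalg.vector_norm(x, ord=1)
l2 = torch.linalg.vector_norm(x, ord=2)
```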
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113173
Approved by: https://github.com/justinchuby
2023-11-08 19:08:43 +00:00
81b0166ca2 [Inductor][fx pass] Normalize nodes created by users (#113179)
Summary: We noticed that nodes created by users lack an example value, so they could not be normalized in the normalization pass. We therefore convert them to the normalized format to enable the split-cat merge.

Test Plan: N/A

Differential Revision: D51058817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113179
Approved by: https://github.com/jackiexu1992
2023-11-08 19:08:18 +00:00
0ab2a48e7e Reland: [TD] Add heuristic for class level historical correlations (#113213)
Relands PR https://github.com/pytorch/pytorch/pull/112162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113213
Approved by: https://github.com/clee2000
2023-11-08 19:06:20 +00:00
5ea76f1760 [DeviceMesh][Test] Update 2D related test to use init_device_mesh (#113236)
This PR:
1. Update all 2D-related tests to use DeviceMesh and remove `tp_mesh_dim` from TP calls.
2. Remove `test_fsdp_tp_checkpoint_integration` from `test/distributed/fsdp/test_fsdp_tp_integration.py` as checkpointing tests are covered in https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/parallel/test_fsdp_2d_parallel.py#L330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113236
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin
2023-11-08 18:41:50 +00:00
e138d80e8e [DTensor][2/N][forward fix] extend util function normalize_to_torch_size to accept single int size (#113244)
**Summary**:
In #113105 I used the util function `normalize_to_torch_size` to unify the `size` argument, which may arrive in multiple formats. However, that function only handled inputs of type `Sequence[int]`, so I am submitting this forward fix to make `normalize_to_torch_size` also able to handle a size argument of type `int` or `torch.Size`. A side benefit of this fix is that it also enables 3 dtensor op tests (check `test_dtensor_ops.py`).
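An illustrative sketch of the widened helper (the real one lives in DTensor's internals; this only shows the accepted input shapes):

```
from collections.abc import Sequence
from typing import Union

import torch

def normalize_to_torch_size(size: Union[int, Sequence[int], torch.Size]) -> torch.Size:
    if isinstance(size, torch.Size):
        return size
    if isinstance(size, int):
        return torch.Size([size])
    # A flat sequence of ints, or a single wrapped sequence like ((2, 3),)
    if len(size) == 1 and isinstance(size[0], Sequence):
        return torch.Size(size[0])
    return torch.Size(size)
```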

**Test**:
`pytest test/distributed/_tensor/test_dtensor_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113244
Approved by: https://github.com/wanchaol
ghstack dependencies: #113105
2023-11-08 18:29:08 +00:00
088587574d [DTensor][1/N] add forward layer norm support (#113105)
**Summary**:
This PR adds DTensor implementation for ATen op `native_layer_norm`.

**Test**:
`pytest test/distributed/_tensor/test_dtensor_ops.py -s -k layer_norm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113105
Approved by: https://github.com/wanchaol
2023-11-08 18:29:08 +00:00
9e6e9587c1 Make numel/sym_numel PyInterpreter work symmetrically to others (#113065)
Just some better engineering code cleanup.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113065
Approved by: https://github.com/voznesenskym
2023-11-08 17:44:29 +00:00
78b8465565 [Distributed] Limit world_size to 8 for FSDP Unit tests (#103412)
There are a few unit tests in FSDP that can support up to 8 GPUs.
For example, test_fsdp_uneven has an input of size [8, 3], and each process/rank indexes the data as input[self.rank] (see the links below). So when we run these tests on 16 GPUs, they throw an index/key error. To avoid such corner cases, this change caps the tests at 8 GPUs when more than 8 are available. This is applicable to both ROCm and CUDA builds.

https://github.com/pytorch/pytorch/blob/main/test/distributed/fsdp/test_fsdp_uneven.py#L44
https://github.com/pytorch/pytorch/blob/main/test/distributed/fsdp/test_fsdp_uneven.py#L55
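The cap itself, as a one-line sketch (helper name is hypothetical):

```
import torch

def capped_world_size(limit: int = 8) -> int:
    return min(torch.cuda.device_count(), limit)
```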

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103412
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2023-11-08 17:21:38 +00:00
66577c0f3b Update ROCm triton pin (#111129)
Changes:
- Enables bfloat16 support in MFMA dot on MI200 (23979098c8)
- Add support for int8 to bfloat16 conversion (2d3e38e182) fixing a bug in bf16 triton gemm workloads.
- Enable scanOp lowering by adding shfl_up support https://github.com/ROCmSoftwarePlatform/triton/pull/324
- MFMA16 support - support for the mfma_16x16xX instructions - these help perf on smaller sized GEMMs - 7e34c244c2
- configurable wavefront-per-eu - this helps us increase our occupancy in certain use cases such as Flash Attention - e801638b40
- Many bug fixes and optimisations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111129
Approved by: https://github.com/malfet, https://github.com/pruthvistony
2023-11-08 17:16:48 +00:00
9bda1e874c Reland "[aot inductor] Move constant loading logic from Container to Model" (#112197)
Trying again, hopefully with 100% fewer merge conflicts

Original diff: D50582959
Revert diff: D50657400

Differential Revision: [D50710815](https://our.internmc.facebook.com/intern/diff/D50710815/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112197
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-11-08 15:08:26 +00:00
6e73ae2022 [ci][ez] Add job_id to emit_metrics (#113099)
As in title.

Also print the job id in the step, since I'm struggling to find it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113099
Approved by: https://github.com/seemethere
2023-11-08 10:32:41 +00:00
3914566c73 [dynamo] Refactor OrderedDict to dict (#113234)
In Python 3.7+, all dicts preserve insertion order, so `OrderedDict` is no longer needed for ordering.
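A quick demonstration of why the refactor is safe:

```
# Plain dicts have guaranteed insertion order since Python 3.7, so
# OrderedDict adds nothing for ordering alone.
d = {}
d["b"] = 1
d["a"] = 2
assert list(d) == ["b", "a"]
```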

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113234
Approved by: https://github.com/oulgen, https://github.com/lezcano
2023-11-08 09:27:08 +00:00
728ed37663 [AOTInductor] Allow using ProxyExecutor for ATen fallbacks (#112976)
Summary: Use ProxyExecutor for aten._scaled_dot_product_efficient_attention in ABI-mode

Test Plan: OSS CI

Differential Revision: D51005807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112976
Approved by: https://github.com/chenyang78, https://github.com/jansel
2023-11-08 08:34:11 +00:00
df4f0b3829 [BE] [cuDNN] Always build assuming cuDNN >= 8.0 (#95722)
### <samp>🤖 Generated by Copilot at 27084ed</samp>

This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.

This is a re-land of https://github.com/pytorch/pytorch/pull/91527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet
2023-11-08 07:53:23 +00:00
8ba11bf79d [AOTI] Support non auto-tuned triton kernels in aoti (#113090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113090
Approved by: https://github.com/aakhundov, https://github.com/chenyang78, https://github.com/desertfire
2023-11-08 07:48:15 +00:00
9f3e378125 [nested tensor]add split and layer_norm_backward operations (#113108)
Summary:
Add split and layer_norm_backward.

Note: it is non-trivial to support the backward of split_with_sizes, so we add the split operation to support the use case in the model.

Test Plan: unit tests

Differential Revision: D51052966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113108
Approved by: https://github.com/soulitzer
2023-11-08 07:44:35 +00:00
3a429423fc Upgrade CI to ROCm5.7 (#110465)
This PR upgrades CI to ROCm 5.7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110465
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2023-11-08 06:11:10 +00:00
fa895da968 [pytree] reorganize submodule structure for C++ and Python pytree (#112278)
Reorganized the two C++ and Python pytree submodules into a subpackage. I think this will make it easier to implement the abstract `PyTreeAPI` class with two implementations, and much easier for the user to switch between the two implementations.

Before:

```text
torch
├── utils
│   ├── _pytree.py
│   ├── _cxx_pytree.py
│   ...
...
```

After:

```text
torch
├── utils
│   ├── _pytree
│   │   ├── __init__.py
│   │   └── api
│   │       ├── __init__.py
│   │       ├── cxx.py
│   │       └── python.py
│   ...
...
```

The `torch.utils._pytree` module will import all APIs from `torch.utils._pytree.api.python`.
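Usage is unchanged for existing callers, e.g.:

```
import torch.utils._pytree as pytree

# The Python implementation remains the default import surface.
leaves, spec = pytree.tree_flatten({"a": 1, "b": (2, 3)})
assert pytree.tree_unflatten(leaves, spec) == {"a": 1, "b": (2, 3)}
```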

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112278
Approved by: https://github.com/zou3519
ghstack dependencies: #112111
2023-11-08 06:05:39 +00:00
3e4d14702a On grad access, check if grad has changed and update stored example grad as needed (#112811)
Fixes https://github.com/pytorch/pytorch/issues/112446

This is a doozy of a PR; there are a few important things to keep in mind here:

1) We MUST lift all tensors accessed via attrs to inputs, getattr is a no go in the graph, it violates the aot_autograd contract. Furthermore, aot_autograd does not know how to apply in-place ops to intermediary tensors that are attributes (aka from getattr) anyway. Views from ops are fine.

2) `.grad` access handling in dynamo peeks at the underlying value, the real tensor, because re-piping FakeTensors already made with this fake_mode through builder anew is a no go.

3) We have no proper mechanism for updating the hint / grapharg.example (the real value in (2) above) midway through trace

Therefore, what we need to do is reconcile the difference in grad stashed on grapharg.example. The easiest way to do this is lazily, upon .grad access, by reading the new value off the right fake tensors. We can then make a tensor using that data as a hint to VariableBuilder to make the right VariableTracker. Note that the example value used here in the PR (torch.zeros) is a dummy value used only as a tracing hint; it does not leak out into real runtime code.

Alternatively, we could implement accumulate_grad_ in python...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112811
Approved by: https://github.com/jansel
2023-11-08 05:45:00 +00:00
d01f8b291d Fix visualize_overlap for Inductor comm reordering (#113066)
The following assumptions are not always valid and need checking:
1. `snode.node` exists
2. `snode.node.layout.size` exists
3. `snode.node.layout.stride` exists
4. `snode.node.name` exists

Also there is no guarantee that there won't be two collectives running at the same time. But it's hard to visualize the overlap in that case. So disable the visualization for that case for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113066
Approved by: https://github.com/wanchaol
2023-11-08 05:27:15 +00:00
95f52611c7 [pytree] register pytree node type in both C++ pytree and Python pytree (#112111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112111
Approved by: https://github.com/zou3519
2023-11-08 05:02:03 +00:00
1f3fa13f0a Handle unbacked SymInt sized outputs in AOTAutograd (#113159)
Thanks aakhundov for constructing the test case. This PR was constructed by running the failing test case, and then fixing problems until we got all the way to the end. There are a few distinct fixes:

* AOTAutograd performs equality tests on tensor metadata to determine if a metadata mutation had occurred. If we test i0 vs i1, we should report these are NOT equal, since obviously we have somehow resized the tensor from i0 to i1 (even if, on a particular run, it is possible i0 == i1).
* There's a sketchy fix for `test_aot_autograd_exhaustive_matmul_cpu_float32` where we check if the output shape equals the tangent shape. Unfortunately, the same `definitely_true` treatment does not work here; it still fails on the example. I piled an extra sketchy fix on top of it, where I just try my best to avoid doing the view. Maybe we should have some sort of logging here.
* Partitioner needs to get out a size for unbacked SymInt when partitioning. I just feed it a random heuristic value in this case, similar to how we've been dealing with this in Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113159
Approved by: https://github.com/aakhundov, https://github.com/bdhirsh
2023-11-08 04:28:38 +00:00
aa376e31fd [export] Enable verifier [2/n] (#113075)
Summary: Turn on the verifier check in the exported program constructor. Note that this effectively detects a large surface of spec violations, so we also spent some time fixing them one by one in this diff.

Test Plan: CI

Differential Revision: D51014944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113075
Approved by: https://github.com/angelayi
2023-11-08 03:32:11 +00:00
f2963642c2 [DDP] Add device_mesh to DDP ctor (#112761)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112761
Approved by: https://github.com/fegin
2023-11-08 03:08:08 +00:00
9d765d28ca [pytorch] Add binding to get nccl version suffix (#112884)
Summary: Adds a Python-to-C binding to get the NCCL_SUFFIX value for more accurate NCCL version information, and adds that to the NCCL version tuple.

Differential Revision: D50978181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112884
Approved by: https://github.com/kwen2501
2023-11-08 02:51:22 +00:00
93cea394de CMake: Loosen CUDA consistency check (#113174)
Closes #108931, closes #108932, see also conda-forge/pytorch-cpu-feedstock#203

Currently we compare `CUDA_INCLUDE_DIRS` and expect exact equality
with `CUDAToolkit_INCLUDE_DIR`; however, this fails in the presence of
symbolic links, or for split installs where there are multiple include paths.
Given that, it makes sense to loosen the requirement to just version
equality, under the assumption that two installs of the same version
should still be compatible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113174
Approved by: https://github.com/malfet
2023-11-08 02:51:18 +00:00
b7acd374c9 Remove unnecessary warning when getting storage.filename (#113212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113212
Approved by: https://github.com/vmoens
2023-11-08 02:09:59 +00:00
ceb07656c2 [dynamo] use APIs to use device interface instead of raw object in dynamo capture (#113000)
This PR makes up for https://github.com/pytorch/pytorch/pull/108312.
It uses `get_registered_device_interfaces` to look up the device interface instead of using raw objects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113000
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-11-08 01:45:00 +00:00
a6ed86bfdb Add torch.onnx.dynamo_export test using ExportedProgram from file (#112271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112271
Approved by: https://github.com/BowenBao
2023-11-08 01:34:18 +00:00
2043d92472 [PyTorch][Vulkan] Add LayerNorm performance test binary (#112915)
Summary:
We add a performance test binary for the recently implemented operator `LayerNorm` D50436726. The difference of this test compared to the existing `vulkan_conv_arithmetic_perf_test.cpp` and `vulkan_mm_perf_test.cpp` is that
- the existing tests benchmark a specific Vulkan shader such as `vulkan.mm`, `vulkan.addmm`, etc.
- but `LayerNorm` is implemented by invoking [a sequence of other operators (shader files)](https://www.internalfb.com/code/fbsource/[ff4989384cacda66a2eed4c800f69c69f6832c52]/fbcode/caffe2/aten/src/ATen/native/vulkan/ops/Layernorm.cpp?lines=94) instead of a dedicated single shader file. Reusing the existing test code wouldn't print a meaningful result.

To deal with this, we add a function `extractTotalShaderResultsAndSetState`, which aggregates the latency of all invoked shaders except `nchw_to_image` and `image_to_nchw`. This test can be applied to other operators that don't have a dedicated shader file.

Test Plan:
- build the binary
```
(base) luwei@luwei-mbp fbsource % buck2 build  -c ndk.debug_info_level=0  -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_layernorm_perf_test_binAndroid  --show-output  -c pt.vulkan_full_precision=1
```
- push to device
```
(base) luwei@luwei-mbp fbsource % adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_layernorm_perf_test_binAndroid__/pt_vulkan_layernorm_perf_test_binAndroid /data/local/tmp
```
- test on device
```
(base) luwei@luwei-mbp ~ % adb shell /data/local/tmp/pt_vulkan_layernorm_perf_test_binAndroid
```
- output, excerpt below, full test result in P871803721
**it shows that the aggregation of invoked shaders takes 14.2 ms on average**
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {75, 75, 19}                     1310660
vulkan.nchw_to_image     {75, 75, 19}                     1313260
vulkan.nchw_to_image     {75, 75, 19}                     1268748
vulkan.mean_dim_keepdim  {1, 75, 19}                       878124
vulkan.mean_dim_keepdim  {1, 1, 19}                         53300
vulkan.mean_dim_keepdim  {1, 1, 1}                          62660
vulkan.mean_dim_keepdim  {1, 75, 19}                       871260
vulkan.mean_dim_keepdim  {1, 1, 19}                         53144
vulkan.mean_dim_keepdim  {1, 1, 1}                          62400
vulkan.sub               {75, 75, 19}                     1787760
vulkan.mul               {75, 75, 19}                     1866904
vulkan.mean_dim_keepdim  {1, 75, 19}                       868764
vulkan.mean_dim_keepdim  {1, 1, 19}                         56212
vulkan.mean_dim_keepdim  {1, 1, 1}                          62400
vulkan.sub               {75, 75, 19}                     1782872
vulkan.add_scalar        {1, 1, 1}                           2028
vulkan.pow_tensor_scalar {1, 1, 1}                           2236
vulkan.mul               {75, 75, 19}                     1771276
...
vulkan.add               {75, 75, 19}                     1909544
vulkan.image_to_nchw     {75, 75, 19}                     1143844
------------------------------------------------------------------------------------------------------------------
Benchmark                                                                        Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------------
layer_norm_benchmark/N:75/M:75/P:75/iterations:50/manual_time/threads:1       14.2 ms         48.9 ms           50
```

Reviewed By: yipjustin

Differential Revision: D50940613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112915
Approved by: https://github.com/yipjustin
2023-11-08 01:29:10 +00:00
e5b758b855 S390x complex division (#108516)
Adopt the algorithm from the AVX2 implementation.
This change fixes the test test_complex_div_underflow_overflow_cpu_complex128
from test/test_binary_ufuncs.py.

At the same time it breaks some of the Arithmetics/*.Division tests
from vec_test_all_types_ZVECTOR,
but those are also broken on AVX2 and AVX512.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108516
Approved by: https://github.com/ezyang
2023-11-08 01:28:29 +00:00
a8097ed479 Fix docstring errors in _composable_state.py, remote_device.py, value_ranges.py, utils.py, run.py, rendezvous.py, launch.py, argparse_util.py, __init__.py, _cycles.py (#112953)
Fixes #112639

```txt
 torch/utils/_sympy/value_ranges.py
 torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:129 in public method `increasing_map`:
        D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:135 in public method `decreasing_map`:
        D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:141 in public method `monotone_map`:
        D400: First line should end with a period (not 'g')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
        D400: First line should end with a period (not '0')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
        D403: First word of the first line should be properly capitalized ('Fn', not 'fn')
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
        D400: First line should end with a period (not ':')
torch/utils/_sympy/value_ranges.py:171 in public method `coordinatewise_monotone_map`:
        D400: First line should end with a period (not 'e')
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
        D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
        D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:488 in public class `ValueRangeAnalysis`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:489 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:501 in public method `bool_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:506 in public method `default_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:511 in public method `load`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:514 in public method `store`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:517 in public method `reduction`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:520 in public method `index_expr`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:525 in public method `to_dtype`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:558 in public method `square`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:562 in public method `neg`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:566 in public method `truncdiv`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:577 in public method `sub`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:580 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:585 in public function `bound_sympy`:
        D103: Missing docstring in public function
36
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
        D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
        D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:490 in public class `ValueRangeAnalysis`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:491 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:503 in public method `bool_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:508 in public method `default_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:513 in public method `load`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:516 in public method `store`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:519 in public method `reduction`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:522 in public method `index_expr`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:527 in public method `to_dtype`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:560 in public method `square`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:564 in public method `neg`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:568 in public method `truncdiv`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:579 in public method `sub`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:582 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:587 in public function `bound_sympy`:
        D103: Missing docstring in public function
28

torch/utils/viz/_cycles.py
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
        D400: First line should end with a period (not 'g')
torch/utils/viz/_cycles.py:256 in public class `Node`:
        D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
        D400: First line should end with a period (not 'p')
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
        D401: First line should be in imperative mood; try rephrasing (found 'Reference')
14
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:256 in public class `Node`:
        D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
        D103: Missing docstring in public function
9

torch/distributed/argparse_util.py
torch/distributed/argparse_util.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/argparse_util.py:13 in public class `env`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:13 in public class `env`:
        D400: First line should end with a period (not 'g')
torch/distributed/argparse_util.py:13 in public class `env`:
        D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:43 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
        D102: Missing docstring in public method
torch/distributed/argparse_util.py:61 in public class `check_env`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:61 in public class `check_env`:
        D400: First line should end with a period (not 's')
torch/distributed/argparse_util.py:61 in public class `check_env`:
        D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:97 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
        D102: Missing docstring in public method
11
torch/distributed/argparse_util.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/argparse_util.py:43 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
        D102: Missing docstring in public method
torch/distributed/argparse_util.py:97 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
        D102: Missing docstring in public method
5

torch/distributed/_composable_state.py
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
        D202: No blank lines allowed after function docstring (found 1)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
        D400: First line should end with a period (not '`')
3
0

torch/distributed/launch.py
torch/distributed/launch.py:1 at module level:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/launch.py:1 at module level:
        D400: First line should end with a period (not 'd')
torch/distributed/launch.py:156 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/launch.py:171 in public function `launch`:
        D103: Missing docstring in public function
torch/distributed/launch.py:180 in public function `main`:
        D103: Missing docstring in public function
5
torch/distributed/launch.py:157 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/launch.py:172 in public function `launch`:
        D103: Missing docstring in public function
torch/distributed/launch.py:181 in public function `main`:
        D103: Missing docstring in public function
3

torch/distributed/remote_device.py
torch/distributed/remote_device.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/remote_device.py:81 in private method `worker_name`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:81 in private method `worker_name`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:88 in private method `rank`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:88 in private method `rank`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:95 in private method `device`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/remote_device.py:95 in private method `device`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
7
torch/distributed/remote_device.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/remote_device.py:85 in private method `rank`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:85 in private method `rank`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
3

torch/distributed/rendezvous.py
torch/distributed/rendezvous.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/rendezvous.py:23 in public function `register_rendezvous_handler`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/distributed/rendezvous.py:88 in public function `rendezvous`:
        D103: Missing docstring in public function
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
        D400: First line should end with a period (not 'r')
5
torch/distributed/rendezvous.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/rendezvous.py:89 in public function `rendezvous`:
        D103: Missing docstring in public function
2

torch/distributed/run.py
torch/distributed/run.py:9 at module level:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:9 at module level:
        D400: First line should end with a period (not '`')
torch/distributed/run.py:393 in public function `get_args_parser`:
        D202: No blank lines allowed after function docstring (found 1)
torch/distributed/run.py:393 in public function `get_args_parser`:
        D401: First line should be in imperative mood; try rephrasing (found 'Helper')
torch/distributed/run.py:610 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:615 in public function `parse_min_max_nnodes`:
        D103: Missing docstring in public function
torch/distributed/run.py:629 in public function `determine_local_world_size`:
        D103: Missing docstring in public function
torch/distributed/run.py:670 in public function `get_rdzv_endpoint`:
        D103: Missing docstring in public function
torch/distributed/run.py:677 in public function `get_use_env`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:677 in public function `get_use_env`:
        D401: First line should be in imperative mood (perhaps 'Retrieve', not 'Retrieves')
torch/distributed/run.py:689 in public function `config_from_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:770 in public function `run_script_path`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:770 in public function `run_script_path`:
        D401: First line should be in imperative mood (perhaps 'Run', not 'Runs')
torch/distributed/run.py:781 in public function `run`:
        D103: Missing docstring in public function
torch/distributed/run.py:804 in public function `main`:
        D103: Missing docstring in public function
15
torch/distributed/run.py:611 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:616 in public function `parse_min_max_nnodes`:
        D103: Missing docstring in public function
torch/distributed/run.py:630 in public function `determine_local_world_size`:
        D103: Missing docstring in public function
torch/distributed/run.py:671 in public function `get_rdzv_endpoint`:
        D103: Missing docstring in public function
torch/distributed/run.py:691 in public function `config_from_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:784 in public function `run`:
        D103: Missing docstring in public function
torch/distributed/run.py:807 in public function `main`:
        D103: Missing docstring in public function
7

torch/distributed/__init__.py
torch/distributed/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/distributed/__init__.py:8 in public function `is_available`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/__init__.py:8 in public function `is_available`:
        D400: First line should end with a period (not ',')
torch/distributed/__init__.py:8 in public function `is_available`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
4
torch/distributed/__init__.py:1 at module level:
        D104: Missing docstring in public package
1

torch/distributed/utils.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
        D400: First line should end with a period (not ')')
torch/distributed/utils.py:47 in private function `_cast_forward_inputs`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:88 in private function `_recursive_to`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/utils.py:141 in private function `_p_assert`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:141 in private function `_p_assert`:
        D209: Multi-line docstring closing quotes should be on a separate line
torch/distributed/utils.py:141 in private function `_p_assert`:
        D400: First line should end with a period (not 't')
torch/distributed/utils.py:141 in private function `_p_assert`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:275 in private function `_sync_module_states`:
        D400: First line should end with a period (not 'n')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
        D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
        D400: First line should end with a period (not 'y')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
        D401: First line should be in imperative mood (perhaps 'Synchronize', not 'Synchronizes')
15
torch/distributed/utils.py:1 at module level:
        D100: Missing docstring in public module
1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112953
Approved by: https://github.com/weifengpy
2023-11-08 01:13:09 +00:00
bae8506589 [TorchElastic] Add option to configure log prefix for each rank (#112357)
Summary:
Add the ability to customize log lines, with additional template-like behavior to enrich log information.

Motivation:
a) Log stream processing/aggregation gains additional value when it includes information about the global rank. An extension of that is that it becomes easier to map ranks to hosts from log stream information (less relevant at the moment).
b) Users can easily map a failure to the right rank without matching node rank offset + local rank.

Implementation
- BC change: keeps the log line prefix as `[<role name><local rank>]:`
- Optional env variable TORCHELASTIC_LOG_LINE_HEADER that will be used as a prefix when specified; it currently exposes the `role_name`, `rank`, and `local_rank` variables, which are bound when the agent assigns the ranks.
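For example, something along these lines (the exact substitution syntax is an assumption on my part; the PR is authoritative):

```
import os

# Hypothetical template using the exposed variables; set before the
# elastic agent spawns workers.
os.environ["TORCHELASTIC_LOG_LINE_HEADER"] = "[${role_name}-${rank}]:"
```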

Test Plan:
CI

https://fburl.com/mlhub/mzx5xspv

Differential Revision: D50584590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112357
Approved by: https://github.com/kiukchung
2023-11-08 01:00:26 +00:00
d1c092ae1b Update impl_abstract_pystub to be less boilerplatey (#113182)
Summary:

We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
  Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
  the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
  Library.define in Python appends the op to a global set, which is analogous
  to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
  we require that it has an `impl_abstract_pystub` specified and we also check
  that the module in the `impl_abstract_pystub` is the same as the module where
  the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
  buck-based systems) because buck sits above us.
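On the Python side, the contract looks roughly like this ("mylib::foo" is a hypothetical op defined in Python, for which no pystub is required; a C++-defined op would instead need `m.impl_abstract_pystub(module, context)` in its TORCH_LIBRARY block):

```
import torch

lib = torch.library.Library("mylib", "DEF")
lib.define("foo(Tensor x) -> Tensor")

@torch.library.impl_abstract("mylib::foo")
def foo_abstract(x):
    # Abstract/fake impl: compute output metadata without real data.
    return torch.empty_like(x)
```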

bypass-github-export-checks

Test Plan: - existing tests

Differential Revision: D51080493

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113182
Approved by: https://github.com/ezyang
2023-11-08 00:39:00 +00:00
aae418aea6 Remove TODOs to add docstrings (#113197)
Because the docstrings are actually defined in torch/_torch_docs.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113197
Approved by: https://github.com/atalman, https://github.com/malfet
2023-11-08 00:34:26 +00:00
eb5487361d docs: fix docstring errors in quantized modules and others (#112695)
Fixes #112632

Before: 171
```
torch/backends/_nnapi/prepare.py:24 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/_nnapi/prepare.py:46 in public method `init`:
        D102: Missing docstring in public method
torch/backends/_nnapi/prepare.py:60 in public method `forward`:
        D102: Missing docstring in public method
torch/backends/_nnapi/prepare.py:94 in public function `convert_model_to_nnapi`:
        D103: Missing docstring in public function
torch/backends/_nnapi/prepare.py:153 in public function `process_for_nnapi`:
        D103: Missing docstring in public function
torch/backends/_nnapi/prepare.py:177 in private nested class `ShapeComputeModule`:
        D400: First line should end with a period (not 'n')
torch/backends/_nnapi/serializer.py:19 in public class `NNAPI_OperandCode`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:35 in public class `NNAPI_OperationCode`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:133 in public class `NNAPI_FuseCode`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:140 in public class `OperandValueSourceType`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:150 in public class `TorchScalarTypes`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:154 in public function `approx_equal`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:158 in public function `tensor_size`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:172 in public function `change_element`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:194 in public class `DimOrder`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:225 in public method `use_nchw`:
        D102: Missing docstring in public method
torch/backends/_nnapi/serializer.py:233 in public function `broadcast_shapes`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:260 in public function `get_conv_pool_shape`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:284 in public function `fix_shape`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:301 in public function `reverse_map_dim`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:312 in public function `flex_name`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:1337 in private method `_do_add_binary`:
        D400: First line should end with a period (not 's')
torch/backends/_nnapi/serializer.py:1337 in private method `_do_add_binary`:
        D401: First line should be in imperative mood; try rephrasing (found 'Helper')
torch/backends/_nnapi/serializer.py:2180 in public function `serialize_model`:
        D202: No blank lines allowed after function docstring (found 1)
torch/backends/_nnapi/serializer.py:2180 in public function `serialize_model`:
        D205: 1 blank line required between summary line and description (found 0)
torch/backends/_nnapi/serializer.py:2180 in public function `serialize_model`:
        D400: First line should end with a period (not ':')
torch/backends/cuda/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/cuda/__init__.py:30 in public function `is_built`:
        D205: 1 blank line required between summary line and description (found 0)
torch/backends/cuda/__init__.py:30 in public function `is_built`:
        D209: Multi-line docstring closing quotes should be on a separate line
torch/backends/cuda/__init__.py:30 in public function `is_built`:
        D400: First line should end with a period (not 's')
torch/backends/cuda/__init__.py:30 in public function `is_built`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/cuda/__init__.py:37 in public class `cuFFTPlanCacheAttrContextProp`:
        D101: Missing docstring in public class
torch/backends/cuda/__init__.py:40 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:44 in public method `__get__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:47 in public method `__set__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:54 in public class `cuFFTPlanCache`:
        D205: 1 blank line required between summary line and description (found 0)
torch/backends/cuda/__init__.py:54 in public class `cuFFTPlanCache`:
        D400: First line should end with a period (not 'e')
torch/backends/cuda/__init__.py:60 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:73 in public method `clear`:
        D102: Missing docstring in public method
torch/backends/cuda/__init__.py:78 in public class `cuFFTPlanCacheManager`:
        D205: 1 blank line required between summary line and description (found 0)
torch/backends/cuda/__init__.py:78 in public class `cuFFTPlanCacheManager`:
        D400: First line should end with a period (not ',')
torch/backends/cuda/__init__.py:89 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:93 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:106 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:109 in public method `__setattr__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:116 in public class `cuBLASModule`:
        D101: Missing docstring in public class
torch/backends/cuda/__init__.py:117 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:126 in public method `__setattr__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:147 in public function `preferred_linalg_library`:
        D202: No blank lines allowed after function docstring (found 1)
torch/backends/cuda/__init__.py:204 in public class `SDPBackend`:
        D204: 1 blank line required after class docstring (found 0)
torch/backends/cudnn/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/cudnn/__init__.py:81 in public function `version`:
        D400: First line should end with a period (not 'N')
torch/backends/cudnn/__init__.py:81 in public function `version`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/cudnn/__init__.py:95 in public function `is_available`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/cudnn/__init__.py:99 in public function `is_acceptable`:
        D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:122 in public function `set_flags`:
        D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:150 in public function `flags`:
        D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:174 in public class `CudnnModule`:
        D101: Missing docstring in public class
torch/backends/cudnn/__init__.py:175 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/mkl/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/mkl/__init__.py:5 in public function `is_available`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/mkl/__init__.py:14 in public class `verbose`:
        D205: 1 blank line required between summary line and description (found 0)
torch/backends/mkl/__init__.py:14 in public class `verbose`:
        D400: First line should end with a period (not 'y')
torch/backends/mkl/__init__.py:41 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/mkl/__init__.py:44 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/backends/mkl/__init__.py:53 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/mkldnn/__init__.py:9 in public function `is_available`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/mkldnn/__init__.py:19 in public class `verbose`:
        D205: 1 blank line required between summary line and description (found 0)
torch/backends/mkldnn/__init__.py:19 in public class `verbose`:
        D400: First line should end with a period (not 'y')
torch/backends/mkldnn/__init__.py:47 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/mkldnn/__init__.py:50 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:59 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:64 in public function `set_flags`:
        D103: Missing docstring in public function
torch/backends/mkldnn/__init__.py:71 in public function `flags`:
        D103: Missing docstring in public function
torch/backends/mkldnn/__init__.py:81 in public class `MkldnnModule`:
        D101: Missing docstring in public class
torch/backends/mkldnn/__init__.py:82 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/openmp/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/openmp/__init__.py:5 in public function `is_available`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/intrinsic/qat/modules/conv_fused.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/intrinsic/qat/modules/linear_fused.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/intrinsic/qat/modules/linear_relu.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/qat/__init__.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/qat/dynamic/__init__.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/qat/dynamic/modules/linear.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/qat/modules/__init__.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/qat/modules/conv.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/qat/modules/embedding_ops.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/qat/modules/linear.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantizable/modules/activation.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantizable/modules/rnn.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/__init__.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/conv.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/linear.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/rnn.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/sparse.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/utils.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/dynamic/modules/__init__.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/dynamic/modules/conv.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/dynamic/modules/linear.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/dynamic/modules/rnn.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/functional.py:1 at module level:
        D400: First line should end with a period (not 'l')
torch/nn/quantized/modules/__init__.py:1 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/activation.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/batchnorm.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/conv.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/dropout.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/embedding_ops.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/functional_modules.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/linear.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/normalization.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/rnn.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/quantized/modules/utils.py:2 at module level:
        D400: First line should end with a period (not 's')
torch/nn/utils/_expanded_weights/conv_utils.py:13 in public function `conv_picker`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:23 in public function `conv_args_and_kwargs`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:31 in public function `conv_normalizer`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:35 in public function `conv_input_for_string_padding`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:43 in public function `int_padding_for_string_padding`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:59 in public function `conv_padding_for_same`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:66 in public function `conv_backward`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:131 in public function `conv_unfold_weight_grad_sample`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:166 in public function `conv_group_weight_grad_sample`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:189 in public function `unfold3d`:
        D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/_expanded_weights/conv_utils.py:189 in public function `unfold3d`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_expanded_weights/conv_utils.py:189 in public function `unfold3d`:
        D401: First line should be in imperative mood (perhaps 'Extract', not 'Extracts')
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:6 in public function `is_batch_first`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:19 in public function `standard_kwargs`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:19 in public function `standard_kwargs`:
        D300: Use """triple double quotes""" (found '''-quotes)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:19 in public function `standard_kwargs`:
        D400: First line should end with a period (not 'e')
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:28 in public function `forward_helper`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:28 in public function `forward_helper`:
        D300: Use """triple double quotes""" (found '''-quotes)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:28 in public function `forward_helper`:
        D400: First line should end with a period (not ')')
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:84 in public function `maybe_scale_by_batch_size`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:90 in public function `set_grad_sample_if_exists`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:108 in public function `unpack_expanded_weight_or_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:123 in public function `sum_over_all_but_batch_and_last_n`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:123 in public function `sum_over_all_but_batch_and_last_n`:
        D400: First line should end with a period (not 't')
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:123 in public function `sum_over_all_but_batch_and_last_n`:
        D401: First line should be in imperative mood (perhaps 'Calculate', not 'Calculates')
torch/nn/utils/convert_parameters.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/utils/convert_parameters.py:57 in private function `_check_param_device`:
        D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/convert_parameters.py:57 in private function `_check_param_device`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/convert_parameters.py:57 in private function `_check_param_device`:
        D400: First line should end with a period (not 'd')
torch/nn/utils/convert_parameters.py:57 in private function `_check_param_device`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/utils/rnn.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/utils/rnn.py:28 in public class `PackedSequence`:
        D204: 1 blank line required after class docstring (found 0)
torch/nn/utils/rnn.py:63 in public method `__new__`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:73 in public method `pin_memory`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:80 in public method `cuda`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:87 in public method `cpu`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:94 in public method `double`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:97 in public method `float`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:100 in public method `half`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:103 in public method `long`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:106 in public method `int`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:109 in public method `short`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:112 in public method `char`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:115 in public method `byte`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:119 in public method `to`:
        D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/rnn.py:119 in public method `to`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
torch/nn/utils/rnn.py:146 in public method `is_cuda`:
        D400: First line should end with a period (not 'u')
torch/nn/utils/rnn.py:150 in public method `is_pinned`:
        D400: First line should end with a period (not 'y')
torch/nn/utils/rnn.py:150 in public method `is_pinned`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/utils/rnn.py:198 in public function `invert_permutation`:
        D103: Missing docstring in public function
torch/nn/utils/rnn.py:274 in public function `pad_packed_sequence`:
        D401: First line should be in imperative mood (perhaps 'Pad', not 'Pads')
torch/nn/utils/rnn.py:347 in public function `pad_sequence`:
        D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/rnn.py:347 in public function `pad_sequence`:
        D400: First line should end with a period (not '`')
torch/nn/utils/rnn.py:408 in public function `unpad_sequence`:
        D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/rnn.py:408 in public function `unpad_sequence`:
        D400: First line should end with a period (not 's')
torch/nn/utils/rnn.py:454 in public function `pack_sequence`:
        D400: First line should end with a period (not 's')
torch/nn/utils/rnn.py:490 in public function `unpack_sequence`:
        D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/rnn.py:490 in public function `unpack_sequence`:
        D400: First line should end with a period (not 's')
171
```

After: 81
```
torch/backends/_nnapi/prepare.py:24 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/_nnapi/prepare.py:46 in public method `init`:
        D102: Missing docstring in public method
torch/backends/_nnapi/prepare.py:60 in public method `forward`:
        D102: Missing docstring in public method
torch/backends/_nnapi/prepare.py:94 in public function `convert_model_to_nnapi`:
        D103: Missing docstring in public function
torch/backends/_nnapi/prepare.py:153 in public function `process_for_nnapi`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:19 in public class `NNAPI_OperandCode`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:35 in public class `NNAPI_OperationCode`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:133 in public class `NNAPI_FuseCode`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:140 in public class `OperandValueSourceType`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:150 in public class `TorchScalarTypes`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:154 in public function `approx_equal`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:158 in public function `tensor_size`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:172 in public function `change_element`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:194 in public class `DimOrder`:
        D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:225 in public method `use_nchw`:
        D102: Missing docstring in public method
torch/backends/_nnapi/serializer.py:233 in public function `broadcast_shapes`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:260 in public function `get_conv_pool_shape`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:284 in public function `fix_shape`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:301 in public function `reverse_map_dim`:
        D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:312 in public function `flex_name`:
        D103: Missing docstring in public function
torch/backends/cuda/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/cuda/__init__.py:39 in public class `cuFFTPlanCacheAttrContextProp`:
        D101: Missing docstring in public class
torch/backends/cuda/__init__.py:42 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:46 in public method `__get__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:49 in public method `__set__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:63 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:76 in public method `clear`:
        D102: Missing docstring in public method
torch/backends/cuda/__init__.py:91 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:95 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:108 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:111 in public method `__setattr__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:118 in public class `cuBLASModule`:
        D101: Missing docstring in public class
torch/backends/cuda/__init__.py:119 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:128 in public method `__setattr__`:
        D105: Missing docstring in magic method
torch/backends/cudnn/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/cudnn/__init__.py:99 in public function `is_acceptable`:
        D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:122 in public function `set_flags`:
        D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:150 in public function `flags`:
        D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:174 in public class `CudnnModule`:
        D101: Missing docstring in public class
torch/backends/cudnn/__init__.py:175 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/mkl/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/mkl/__init__.py:42 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/mkl/__init__.py:45 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/backends/mkl/__init__.py:54 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/backends/mkldnn/__init__.py:48 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/mkldnn/__init__.py:51 in public method `__enter__`:
        D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:60 in public method `__exit__`:
        D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:65 in public function `set_flags`:
        D103: Missing docstring in public function
torch/backends/mkldnn/__init__.py:72 in public function `flags`:
        D103: Missing docstring in public function
torch/backends/mkldnn/__init__.py:82 in public class `MkldnnModule`:
        D101: Missing docstring in public class
torch/backends/mkldnn/__init__.py:83 in public method `__init__`:
        D107: Missing docstring in __init__
torch/backends/openmp/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/nn/utils/_expanded_weights/conv_utils.py:13 in public function `conv_picker`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:23 in public function `conv_args_and_kwargs`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:31 in public function `conv_normalizer`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:35 in public function `conv_input_for_string_padding`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:43 in public function `int_padding_for_string_padding`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:59 in public function `conv_padding_for_same`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:66 in public function `conv_backward`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:131 in public function `conv_unfold_weight_grad_sample`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:166 in public function `conv_group_weight_grad_sample`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:6 in public function `is_batch_first`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:87 in public function `maybe_scale_by_batch_size`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:93 in public function `set_grad_sample_if_exists`:
        D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:111 in public function `unpack_expanded_weight_or_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/convert_parameters.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/utils/rnn.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/utils/rnn.py:64 in public method `__new__`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:74 in public method `pin_memory`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:81 in public method `cuda`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:88 in public method `cpu`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:95 in public method `double`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:98 in public method `float`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:101 in public method `half`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:104 in public method `long`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:107 in public method `int`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:110 in public method `short`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:113 in public method `char`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:116 in public method `byte`:
        D102: Missing docstring in public method
torch/nn/utils/rnn.py:198 in public function `invert_permutation`:
        D103: Missing docstring in public function
81
```
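
For reference, a minimal before/after for the two most common violations in the listings above (D400: trailing period, D401: imperative mood), modeled on `is_available` from the report:

```python
# Before: fails D400 (no trailing period) and D401 (not imperative mood).
def is_available():
    """Returns whether MKL is available"""
    return True

# After: imperative summary line ending with a period; both checks pass.
def is_available():  # noqa: F811 (intentional redefinition for contrast)
    """Return whether MKL is available."""
    return True
```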

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112695
Approved by: https://github.com/mikaylagawarecki
2023-11-07 23:52:16 +00:00
edcbd5a895 Make TORCH_COMPILE_DEBUG=1 work again (#112917)
As titled. After the fix, `self.node` is `Optional[ir.Buffer]` in `FusedSchedulerNode` and `ForeachKernelSchedulerNode`, but `ir.Buffer` in `BaseSchedulerNode`. Using `ir.Buffer` for `BaseSchedulerNode.node` avoids all mypy complaints about Optionals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112917
Approved by: https://github.com/davidberard98, https://github.com/int3, https://github.com/leslie-fang-intel, https://github.com/aakhundov
2023-11-07 23:34:30 +00:00
041b6b5c6b TorchInductor Opinfo fixes for rng ops (#108170)
Tests rng ops under both configurations (a sketch follows this list):
- fallback_random=True, assertEqual=True
- fallback_random=False, assertEqual=False
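
A hedged sketch of the first configuration; `fn` is a made-up op, and the exact-equality check is the premise the tests rely on when `fallback_random=True`:

```python
import torch
from torch._inductor import config as inductor_config

def fn(x):
    return torch.rand_like(x)

x = torch.randn(4)
# fallback_random=True makes inductor fall back to eager RNG, so seeded
# runs can be compared exactly (the assertEqual=True case above).
with inductor_config.patch(fallback_random=True):
    torch.manual_seed(0)
    eager_out = fn(x)
    torch.manual_seed(0)
    compiled_out = torch.compile(fn)(x)
    assert torch.equal(eager_out, compiled_out)
```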

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108170
Approved by: https://github.com/davidberard98
2023-11-07 23:13:57 +00:00
498a760802 Update comm_analysis.py license (#113184)
Consulted with legal; this is the right way to do it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113184
Approved by: https://github.com/Chillee, https://github.com/malfet
2023-11-07 22:58:56 +00:00
a3a2486be8 [dynamo] Avoid eager imports of classes with custom VariableTrackers (#112319)
Currently custom VariableTrackers exist for classes that live outside of pytorch.
For these cases dynamo currently eagerly imports the module to get the class
object to compare against.

This instead uses `sys.modules.get("module.path")`, so the module is never
imported by dynamo itself; if the user has imported the module, we can still
access it and grab the type we need to compare against.
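
A hedged sketch of that lazy-lookup pattern (the module path here is a guess for illustration, not necessarily the path dynamo uses):

```python
import sys

def get_keyed_jagged_tensor_class():
    # Never triggers an import; the module is visible only if the user
    # has already imported it somewhere else.
    mod = sys.modules.get("torchrec.sparse.jagged_tensor")
    return getattr(mod, "KeyedJaggedTensor", None) if mod is not None else None

cls = get_keyed_jagged_tensor_class()
if cls is not None:
    ...  # safe to compare values against `cls` and build the VariableTracker
```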

I noticed this issue because importing `KeyedJaggedTensor` fails half-way
through if `fbgemm_gpu` has been built against an incompatible PyTorch version,
in which case dynamo retried the import each time!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112319
Approved by: https://github.com/lezcano, https://github.com/ezyang
2023-11-07 22:45:54 +00:00
e4c8737a0c [PT-D] Updated Dynamo skip message for @contract tests (#112793)
Even though Dynamo can now trace through module hooks, its regex matcher for `HASATTR` does not like the state key:
12a6f5aa6b/torch/distributed/_composable/contract.py (L10-L14)
12a6f5aa6b/torch/_dynamo/guards.py (L353-L355)

```
PYTORCH_TEST_WITH_DYNAMO=1 python -m pytest test/distributed/_composable/test_contract.py
```

```
------------------------------------- Captured stderr call -------------------------------------
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT resume_in_test_registry /data/users/andgu/pytorch/test/distributed/_composable/test_contract.py line 125
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING] due to:
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]   File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 687, in _convert_frame
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]   File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 148, in _fn
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]     return fn(*args, **kwargs)
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]   File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 406, in _convert_frame_assert
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]     return _compile(
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]   File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 614, in _compile
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]   File "/data/users/andgu/pytorch/torch/_dynamo/utils.py", line 221, in time_wrapper
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]     r = func(*args, **kwargs)
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]   File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 594, in compile_inner
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]     check_fn = CheckFunctionManager(
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]   File "/data/users/andgu/pytorch/torch/_dynamo/guards.py", line 987, in __init__
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]     guard.create(builder)
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]   File "/data/users/andgu/pytorch/torch/_guards.py", line 244, in create
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]     return self.create_fn(builder, self)
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]   File "/data/users/andgu/pytorch/torch/_dynamo/guards.py", line 354, in HASATTR
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING]     assert m, f"invalid hasattr check {guard.name}"
[2023-11-02 14:40:02,242] torch._dynamo.convert_frame: [WARNING] AssertionError: invalid hasattr check getattr(L['___stack0'], '__composable_api_state_key_643e6a56-3313-4c8f-9401-a5af7bd3ee26')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112793
Approved by: https://github.com/wanchaol
2023-11-07 22:42:03 +00:00
0fee7a0181 Revert "[TD] Add heuristic for class level historical correlations (#112162)"
This reverts commit ff1ae3520506045c266463a05b0ce346552363c7.

Reverted https://github.com/pytorch/pytorch/pull/112162 on behalf of https://github.com/clee2000 due to broke lint? probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/112162#issuecomment-1800310012))
2023-11-07 22:40:35 +00:00
356f3458c4 [dynamo] Remove incorrect sources (#112961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112961
Approved by: https://github.com/voznesenskym, https://github.com/Skylion007
ghstack dependencies: #111306, #111415, #111725, #111726, #112962
2023-11-07 22:01:40 +00:00
bd8d924e9b [dynamo] Relax NullContextVariable and RangeVariable guards (#112962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112962
Approved by: https://github.com/voznesenskym
ghstack dependencies: #111306, #111415, #111725, #111726
2023-11-07 22:01:40 +00:00
8cee0a25bd fix: Flake8-BugBear code B-026 for PyTorch (#111362)
Fixes #106571

I have fixed the B-026 error codes for Flake8 tests on the codebase. Please review and let me know if anything else needs doing.
Thanks; I'm excited to make this first contribution to PyTorch.

I also refer to the issue that introduced [B-026](https://github.com/PyCQA/flake8-bugbear/issues/286) in `flake8-bugbear`, which discusses the error code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111362
Approved by: https://github.com/Skylion007
2023-11-07 21:38:18 +00:00
2da062da51 [pytorch-vulkan] fix zero-dim test (#113116)
Summary:
Fix the zero-dim test. Use `at::zeros` instead of `at::empty`, since the initial values inside an `at::empty` tensor are undefined; this is likely the cause of the test flakiness.

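The same pitfall is visible at the Python level; a minimal analogue of the `at::empty` vs `at::zeros` distinction:

```python
import torch

uninitialized = torch.empty(3)  # contents are undefined: whatever was in memory
deterministic = torch.zeros(3)  # contents are well-defined (all zeros)
```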

Test Plan:
Run on devserver

```
$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin
...
[       OK ] VulkanAPITest.linear_4d_large (2 ms)
[ RUN      ] VulkanAPITest.lstm_success
[       OK ] VulkanAPITest.lstm_success (4 ms)
[ RUN      ] VulkanAPITest.lstm_mclareninputs_success
[       OK ] VulkanAPITest.lstm_mclareninputs_success (45 ms)
[ RUN      ] VulkanAPITest.lstm_prepack_success
[       OK ] VulkanAPITest.lstm_prepack_success (2 ms)
[ RUN      ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7773: Skipped
QueryPool is not available

[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 402 tests from VulkanAPITest (24598 ms total)

[----------] Global test environment tear-down
[==========] 402 tests from 1 test suite ran. (24598 ms total)
[  PASSED  ] 399 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] VulkanAPITest.conv2d_pw_prepack
[  FAILED  ] VulkanAPITest.conv2d_pw_prepack_bc

 2 FAILED TESTS
  YOU HAVE 7 DISABLED TESTS

```

The last two are known failures on the devserver.

Full output: P875058890

Differential Revision: D51055623

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113116
Approved by: https://github.com/manuelcandales
2023-11-07 21:32:03 +00:00
ff1ae35205 [TD] Add heuristic for class level historical correlations (#112162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112162
Approved by: https://github.com/clee2000
2023-11-07 20:57:03 +00:00
056f2cba17 Deprecate "fallthrough" as autograd fallback default (#113166)
This got reverted a couple of months ago. We have since fixed the known
problems with the flag. It is time to try again.

Context:

This PR adds a new fallback to the Autograd dispatch keys. The previous
behavior was a big footgun; we are deprecating it.

If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
torch._C._set_autograd_fallback("nothing") (see the snippet after this list)
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8
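
A minimal snippet for the first option, using the private call quoted in the list above (unsupported API, so gate it defensively):

```python
import torch

# Unsupported escape hatch: restore the previous "nothing" fallback
# behavior on the Autograd dispatch keys.
if hasattr(torch._C, "_set_autograd_fallback"):
    torch._C._set_autograd_fallback("nothing")
```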

It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113166
Approved by: https://github.com/soulitzer
2023-11-07 20:26:39 +00:00
f496c8c4a7 [tp] handle non-covered ops (#112530)
Summary: Only propagate sharding if the op has a sharding strategy registered; otherwise, mark the op's inputs and outputs as `Replicate`.
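
A hedged sketch of that rule; `SHARDING_STRATEGIES` is a stand-in for the real per-op strategy registry, which this summary does not show:

```python
from torch.distributed._tensor import Replicate

SHARDING_STRATEGIES = {}  # hypothetical: op name -> propagation function

def propagate_or_replicate(op_name, input_placements):
    strategy = SHARDING_STRATEGIES.get(op_name)
    if strategy is not None:
        return strategy(input_placements)
    # No registered strategy: mark the op's inputs and outputs as Replicate.
    return [Replicate() for _ in input_placements], [Replicate()]
```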

Test Plan: buck test mode/opt  -c fbcode.enable_gpu_sections=true //caffe2/test/distributed/_tensor/experimental:tp_transform

Differential Revision: D50747611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112530
Approved by: https://github.com/wanchaol
2023-11-07 20:20:44 +00:00
0af8fb71ab add test for consecutive aot inductor compiles (#111170)
Differential Revision: D50246956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111170
Approved by: https://github.com/khabinov
2023-11-07 20:11:53 +00:00
ad1c3467e2 [dynamo] run guard fail hooks for each cache entry for which there is a cache miss (#110325)
Attempt number 2 at https://github.com/pytorch/pytorch/issues/108950.

Improves debugging for guard failures/recompilations by:
- only running guard fail reason generation during recompilation, instead of when a guard fails during dynamo cache lookup (so generating guard failure reasons is not on the critical path)
- ~~always reporting all guard failures~~ Reports the first-failing guard failure for each cache entry.

We don't expect a performance hit since the guard fail reasons are only generated at recompile time rather than runtime. Perf benchmark to check this (https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Fri,%2027%20Oct%202023%2017:42:43%20GMT&stopTime=Fri,%2003%20Nov%202023%2017:42:43%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/williamwen42/62/head&lCommit=f4724f5ffc6d17ceae513a42fc18627be7b85482&rBranch=main&rCommit=29f3d392bf230072e3bffae37b078e770cae1956). We may also need to verify this on benchmarks where guard failures are common.

Sample script:
```python
import torch
def generate_data(b):
    return (
        torch.randn(b, 3, 32, 32).to(torch.float32).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )

from torchvision.models import resnet18
def init_model():
    return resnet18().to(torch.float32).cuda()

model = init_model()
model_opt = torch.compile(model, dynamic=False)

for b in range(16, 32):
    data = generate_data(b)
    model_opt(data[0])
```

Sample logs:
```bash
(/data/users/williamwen/py310-env) [williamwen@devgpu020.odn1 /data/users/williamwen/pytorch (wwen/log-all-guards)]$ python playground5.py
/data/users/williamwen/pytorch/torch/_inductor/compile_fx.py:141: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING]    function: 'forward' (/data/users/williamwen/torchvision/torchvision/models/resnet.py:284)
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING]    last reason: tensor 'L['x']' size mismatch at index 0. expected 16, actual 24
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
(/data/users/williamwen/py310-env) [williamwen@devgpu020.odn1 /data/users/williamwen/pytorch (wwen/log-all-guards)]$ TORCH_LOGS="recompiles" python playground5.py
/data/users/williamwen/pytorch/torch/_inductor/compile_fx.py:141: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
[2023-11-06 14:53:31,591] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:31,591] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:53:31,591] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 16, actual 17
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 17, actual 18
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 16, actual 18
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 18, actual 19
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 17, actual 19
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 16, actual 19
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 19, actual 20
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 18, actual 20
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 17, actual 20
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 16, actual 20
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 20, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 19, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 18, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 17, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 16, actual 21
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 21, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 20, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 19, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 18, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 17, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 16, actual 22
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 22, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 21, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 20, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 19, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 18, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 17, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 16, actual 23
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 23, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 22, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 21, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 20, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 19, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 18, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 17, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 16, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING]    function: 'forward' (/data/users/williamwen/torchvision/torchvision/models/resnet.py:284)
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING]    last reason: tensor 'L['x']' size mismatch at index 0. expected 16, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
[2023-11-06 14:54:45,922] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:54:45,922] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:54:45,922] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 24, actual 25
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 25, actual 26
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 24, actual 26
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 26, actual 27
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 25, actual 27
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 24, actual 27
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 27, actual 28
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 26, actual 28
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 25, actual 28
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 24, actual 28
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 28, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 27, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 26, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 25, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 24, actual 29
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 29, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 28, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 27, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 26, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 25, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 24, actual 30
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG]     triggered by the following guard failure(s):
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 30, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 29, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 28, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 27, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 26, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 25, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG]     - tensor 'L['x']' size mismatch at index 0. expected 24, actual 31
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110325
Approved by: https://github.com/ezyang, https://github.com/jon-chuang
2023-11-07 20:10:59 +00:00
c0aba9be41 [quant][pt2] Fix custom dtype per channel weight in QAT (#112612)
Summary: Previously we only copied over q/dq args for the per
tensor case. This was because the qparams for `quantize_per_tensor`
are literals while the qparams for `quantize_per_channel` are
`get_attr` nodes (tensors), which disappear from the original
nodes in the graph after subgraph rewriting.

However, this is problematic because, in the per channel case,
not all q/dq args are tensors. In particular, the args after
the qparams (axis, qmin, qmax, dtype) are all literals. For
these literal args we simply used the hardcoded ones
(0, -127, 127, torch.int8 respectively), even if the user
explicitly specified to use a different weight dtype. This
commit fixes this by copying over these literal args for the
per channel case as well.
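
As a rough sketch of the call shape involved (the op spelling and the registration import are assumptions for illustration; the literal-vs-tensor split follows the description above):
```python
import torch
import torch.ao.quantization.fx._decomposed  # assumption: registers the quantized_decomposed ops

w = torch.randn(8, 4)
scales = torch.rand(8)
zero_points = torch.zeros(8, dtype=torch.int64)

# scales/zero_points are tensors (get_attr nodes in the graph); everything
# after them is a literal arg that used to be hardcoded by the QAT pass.
q = torch.ops.quantized_decomposed.quantize_per_channel(
    w, scales, zero_points,
    0,           # axis -- literal
    -127, 127,   # qmin / qmax -- literals, previously always (-127, 127)
    torch.int8,  # dtype -- literal, previously always torch.int8
)
```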

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_per_channel_weight_custom_dtype

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112612
Approved by: https://github.com/jerryzh168
2023-11-07 20:10:53 +00:00
538ec4942a Do not generate zero-numel NT by default in helper and improve to_padded_tensor msg (#113162)
Improves the to_padded_tensor error message when passed an NT with zero numel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113162
Approved by: https://github.com/jbschlosser
ghstack dependencies: #113031, #112519, #113091
2023-11-07 19:56:26 +00:00
0c991acab0 Factor out test_nestedtensor setUp tearDown and call super (#113091)
Fixes https://github.com/pytorch/pytorch/issues/112845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113091
Approved by: https://github.com/jbschlosser
ghstack dependencies: #113031, #112519
2023-11-07 19:56:26 +00:00
5fe96eaaf4 [dynamo] Remove VariableTracker.propagate (#111726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111726
Approved by: https://github.com/voznesenskym
ghstack dependencies: #111306, #111415, #111725
2023-11-07 19:55:19 +00:00
843a8ecd24 [dynamo] Remove VariableTracker.add_options (#111725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111725
Approved by: https://github.com/voznesenskym
ghstack dependencies: #111306, #111415
2023-11-07 19:55:19 +00:00
9664190952 [dynamo] Eagerly install guards (#111415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111415
Approved by: https://github.com/voznesenskym
ghstack dependencies: #111306
2023-11-07 19:55:19 +00:00
2964682490 [dynamo] Add LazyVariableTracker (#111306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111306
Approved by: https://github.com/voznesenskym
2023-11-07 19:55:19 +00:00
2322d989e8 Apply release only changes to core (#109208)
Utility script to run after the branch cut has been completed.
Execute: ``RELEASE_VERSION=2.1 apply-release-changes.sh``
Similar to: https://github.com/pytorch/audio/pull/3590

Test PR: https://github.com/pytorch/pytorch/pull/109210

Automate generation of PRs:
https://github.com/pytorch/pytorch/pull/108053
https://github.com/pytorch/pytorch/pull/108688
https://github.com/pytorch/pytorch/pull/108064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109208
Approved by: https://github.com/seemethere
2023-11-07 19:47:30 +00:00
0c448526a4 [experiment][TD] Rating number system (#112676)
Emits an excessive amount of heuristic info, but that just means I can do more with it later?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112676
Approved by: https://github.com/ZainRizvi
2023-11-07 19:40:11 +00:00
82875e69fe [inductor][fx pass] Fix a bug for the merge_stack_tahn_unbind pattern (#113101)
Summary:
Context:
https://fb.workplace.com/groups/1075192433118967/permalink/1328366351134906/

Test Plan:
local reproduce igctr:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch-group
```
P874994427

Differential Revision: D51052304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113101
Approved by: https://github.com/jackiexu1992
2023-11-07 19:25:50 +00:00
785e586eb0 [CUDA][cuBLAS] Separate reduced precision reductions on/off for addmm tests (#112545)
CC @malfet @ptrblck
~~We've been seeing a lot of noise from Ampere and later devices due to reduced precision reductions, so preemptively disabling them for addmm tests.~~

Breaking out addmm tests into one with and without reduced precision reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112545
Approved by: https://github.com/malfet
2023-11-07 19:09:29 +00:00
bc3e2e03cd Revert "Update impl_abstract_pystub to be less boilerplatey (#112851)"
This reverts commit 6ae4e3a8d249a96d9a8bbfba389d0509783e11e1.

Reverted https://github.com/pytorch/pytorch/pull/112851 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112851#issuecomment-1799539354))
2023-11-07 18:53:13 +00:00
2fc940e0c4 [DTensorTestbase] Add run_subtests to DTensorTestbase and fix test_ddp checkpoint test error (#113051)
This PR:

1. Adds  `run_subtests` to `DTensorTestbase`, which runs test functions given by `test_fn` as subtests. This amortizes the costly dist setup.
2. Update `test/distributed/checkpoint/test_state_dict.py` to use `DTensorTestbase`. This fixes the "Duplicate GPU detected: rank 0 and rank 1 both on CUDA device 11000" error when running on 1 GPU, since `skip_if_lt_x_gpu` currently happens after dist setup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113051
Approved by: https://github.com/fegin
2023-11-07 18:48:01 +00:00
7c4e49ec80 [Fix] add validation logics to TCPStore queries (#107607)
This PR fixes #106294.

Due to the lack of a request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple fix enforces that the very first query from a client is a validation query carrying a predefined magic number; if validation fails, the server terminates the connection.
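
A minimal sketch of the handshake idea with generic sockets (the magic number, framing, and names here are invented for illustration and are not TCPStore's actual wire format):
```python
import socket
import struct

MAGIC = 0x3C85F7CE  # hypothetical constant, not the real TCPStore value

def serve_once(server: socket.socket) -> None:
    conn, _ = server.accept()
    header = conn.recv(4)
    # The very first message from a legitimate client must be the magic
    # number; anything else (e.g. an nmap probe) gets the connection dropped.
    if len(header) < 4 or struct.unpack("!I", header)[0] != MAGIC:
        conn.close()
        return
    conn.sendall(b"ok")  # validated: normal query handling would follow
    conn.close()
```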
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607
Approved by: https://github.com/cbalioglu, https://github.com/XilunWu
2023-11-07 18:36:25 +00:00
56e514aefb [dtensor][BE][1/N] fix DTensor Ops test (#113104)
**Summary**:
The dtensor_ops test has a helper function `assert_ref_dtensor_equal` that was written expecting a `DTensor` argument `dtensor_rs` but actually receives a `torch.Tensor` in the test. This PR removes the `to_local()` call on that object since it is actually a `torch.Tensor`.

This PR is a part of internal task [T169242924](https://www.internalfb.com/intern/tasks/?t=169242924) for better engineering.

**Test**:
`pytest test/distributed/_tensor/test_dtensor_ops.py`
`pytest test/distributed/_tensor/test_dtensor_ops.py -s -k baddbmm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113104
Approved by: https://github.com/wanchaol
2023-11-07 18:23:14 +00:00
92e7f79609 Doc: Add and Fix docstrings for torch.util.data files (#112817)
Fixes #112635

Fix docstrings for `torch.utils.data` files.

```
> pydocstyle torch/utils/data/graph.py --count
Before: 5
After: 1

> pydocstyle torch/utils/data/graph_settings.py --count
Before: 8
After: 3

> pydocstyle torch/utils/data/dataloader.py --count
Before: 12
After: 6

> pydocstyle torch/utils/data/dataset.py --count
Before: 28
After: 23

> pydocstyle torch/utils/data/sampler.py --count
Before: 24
After: 19

> pydocstyle torch/utils/data/_utils/signal_handling.py --count
Before: 1
After: 0

> pydocstyle torch/utils/data/_utils/__init__.py --count
Before: 2
After: 0

> pydocstyle torch/utils/data/_utils/collate.py --count
Before: 20
After: 6

> pydocstyle torch/utils/data/_utils/fetch.py --count
Before: 3
After: 0

> pydocstyle torch/utils/data/_utils/pin_memory.py --count
Before: 4
After: 1

> pydocstyle torch/utils/data/datapipes/_decorator.py --count
Before: 19
After: 16

> pydocstyle torch/utils/data/datapipes/_hook_iterator.py --count
Before: 13
After: 0

> pydocstyle torch/utils/data/datapipes/_typing.py --count
Before: 17
After: 4

> pydocstyle torch/utils/data/datapipes/gen_pyi.py --count
Before: 19
After: 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112817
Approved by: https://github.com/kit1980
2023-11-07 17:59:56 +00:00
740137df6f [MPS] Add bucketize op (#112830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112830
Approved by: https://github.com/kulinseth, https://github.com/malfet
ghstack dependencies: #112829
2023-11-07 17:22:08 +00:00
c4bb77323d [MPS] Add searchsorted op (#112829)
The metal kernels implemented are closely following `Bucketization.cu`.

Benchmark:
```
[----------------------------- searchsorted ----------------------------]
                                                         |  cpu   |  mps
1 threads: --------------------------------------------------------------
      Batch size: 8; In features: 64; Sorter: True       |    44  |   530
      Batch size: 8; In features: 64; Sorter: False      |    31  |    12
      Batch size: 8; In features: 256; Sorter: True      |   131  |   520
      Batch size: 8; In features: 256; Sorter: False     |   107  |    12
      Batch size: 8; In features: 1024; Sorter: True     |   499  |   590
      Batch size: 8; In features: 1024; Sorter: False    |   398  |    12
      Batch size: 16; In features: 64; Sorter: True      |    71  |   540
      Batch size: 16; In features: 64; Sorter: False     |    57  |    12
      Batch size: 16; In features: 256; Sorter: True     |   242  |   610
      Batch size: 16; In features: 256; Sorter: False    |   200  |    12
      Batch size: 16; In features: 1024; Sorter: True    |   999  |   720
      Batch size: 16; In features: 1024; Sorter: False   |   842  |    12
      Batch size: 32; In features: 64; Sorter: True      |   124  |   509
      Batch size: 32; In features: 64; Sorter: False     |   103  |    12
      Batch size: 32; In features: 256; Sorter: True     |   477  |   650
      Batch size: 32; In features: 256; Sorter: False    |   407  |    12
      Batch size: 32; In features: 1024; Sorter: True    |  1940  |   833
      Batch size: 32; In features: 1024; Sorter: False   |  1710  |    12
      Batch size: 64; In features: 64; Sorter: True      |   231  |   590
      Batch size: 64; In features: 64; Sorter: False     |   194  |    12
      Batch size: 64; In features: 256; Sorter: True     |   937  |   710
      Batch size: 64; In features: 256; Sorter: False    |   800  |    13
      Batch size: 64; In features: 1024; Sorter: True    |  3980  |  1290
      Batch size: 64; In features: 1024; Sorter: False   |  3330  |    12
      Batch size: 128; In features: 64; Sorter: True     |   448  |   650
      Batch size: 128; In features: 64; Sorter: False    |   390  |    13
      Batch size: 128; In features: 256; Sorter: True    |  1830  |   850
      Batch size: 128; In features: 256; Sorter: False   |  1590  |    12
      Batch size: 128; In features: 1024; Sorter: True   |  7790  |  2850
      Batch size: 128; In features: 1024; Sorter: False  |  6670  |    13
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112829
Approved by: https://github.com/malfet
2023-11-07 17:22:08 +00:00
70eeb82f00 s390x: skip tests relying on specific openblas precision (#112843)
This change skips test_forward_mode_AD_linalg_det_singular_cpu_complex128 and test_forward_mode_AD_linalg_det_singular_cpu_float64 from test/test_ops_fwd_gradients.py
due to https://github.com/OpenMathLib/OpenBLAS/issues/4194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112843
Approved by: https://github.com/kit1980
2023-11-07 17:18:20 +00:00
611a7457ca [Inductor] Kill MutationLayout from ir.py (#112925)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112925
Approved by: https://github.com/jansel
2023-11-07 17:03:52 +00:00
562c4ae4bc Update Pillow pin to 10.0.1 (#113111)
To err on the side of caution and fix dependabot warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113111
Approved by: https://github.com/clee2000
2023-11-07 16:41:06 +00:00
4fecbebc37 Fix OOM in test_large_block_sizes (#113153)
This test is causing flakiness on CPU, see #113134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113153
Approved by: https://github.com/lezcano
2023-11-07 16:12:19 +00:00
6ce5de5275 Avoid calling as_tensor twice (#112866)
Sometimes doing so may copy and that's not good. We avoid that by
setting global flags.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112866
Approved by: https://github.com/kit1980, https://github.com/ev-br
2023-11-07 16:10:59 +00:00
6ae4e3a8d2 Update impl_abstract_pystub to be less boilerplatey (#112851)
Summary:
We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
  Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
  the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
  Library.define in Python appends the op to a global set, which is analogous
  to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
  we require that it has an `impl_abstract_pystub` specified and we also check
  that the module in the `impl_abstract_pystub` is the same as the module where
  the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
  buck-based systems) because buck sits above us.
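
A sketch of the Python side under these rules (the op name and module are invented for illustration):
```python
# mylib/abstract_impls.py -- must be the module named in the C++ call
# m.impl_abstract_pystub("mylib.abstract_impls", ...), or the check fails.
import torch

@torch.library.impl_abstract("mylib::my_op")
def my_op_abstract(x):
    # Abstract/meta implementation: compute output metadata only.
    return torch.empty_like(x)
```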

Test Plan: - existing tests

Differential Revision: D50972148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112851
Approved by: https://github.com/ezyang
2023-11-07 16:07:42 +00:00
9a28a7b498 Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 27e31ab6e86259b27d816d6fb6e7a69de526a0e4.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1799003164))
2023-11-07 15:53:32 +00:00
bb7ac12cbf [ProcessGroupNCCL] Avoid recording stream for broadcast and scatter (#112896)
Summary: Follows PR #111431, save memory for DTensor init

Test Plan: Sandcastle

Reviewed By: wanchaol

Differential Revision: D50985365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112896
Approved by: https://github.com/wanchaol
2023-11-07 15:44:04 +00:00
98564d2d7a If you have i0 = i1 * 12, perform this replacement directly (#112653)
In https://github.com/pytorch/pytorch/pull/112156 I added support for creating replacements on unbacked SymInts, so if you asserted that `i0 == s0`, we would replace i0 with s0 (only ever replacing unbacked with backed.)

However, if we have assertions involving only unbacked SymInts, we can also replace in this case! E.g., `i0 == i1` or `i0 == i1 * 12`. The previous logic for generating replacements would reject these cases, because you're not allowed to replace unbacked with unbacked. Modifying the logic is not so easy though; ordinarily, we decide what substitution to prioritize by trying to replace the largest hinted symbol, but for unbacked integers we don't have this. To get around this problem, for now I only setup replacements for trivial symbol equals something else situations. Check the diff with whitespace ignored, the addition is quite small.
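
A rough sketch of the situation this enables (simplified; the exact flags and checks needed may vary):
```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True  # capture .item() as unbacked SymInts

@torch.compile(fullgraph=True)
def f(a, b):
    i0 = a.item()  # unbacked SymInt
    i1 = b.item()  # unbacked SymInt
    torch._check_is_size(i0)
    # Asserting the relation lets the ShapeEnv substitute i0 -> i1 * 12,
    # so later sizes mentioning i0 unify with those mentioning i1.
    torch._check(i0 == i1 * 12)
    return torch.zeros(i0) + torch.zeros(i1 * 12)

print(f(torch.tensor(24), torch.tensor(2)).shape)
```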

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112653
Approved by: https://github.com/aakhundov
2023-11-07 14:31:54 +00:00
493b52b3d9 Grandfather in built-in TorchScript ops to being pt2_compliant (#113061)
I'm seeing ops like torch.ops.aten.mul.complex being used with
torch.compile (though this seems strange to me), but we should
grandfather these in.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113061
Approved by: https://github.com/ezyang
ghstack dependencies: #113049, #113050
2023-11-07 12:55:16 +00:00
85832c0b9b Grandfather in some more pytorch ops to be pt2_compliant (#113050)
We're not directly testing these, but in general the policy is to assume
that PyTorch ops inside the pytorch repo are compliant.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113050
Approved by: https://github.com/ezyang
ghstack dependencies: #113049
2023-11-07 12:55:16 +00:00
a06832f911 Grandfather in c10d_functional ops to pt2_compliant (#113049)
This PR also adds the ability to specify Tags for more `m.def(`
overloads.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113049
Approved by: https://github.com/williamwen42
2023-11-07 12:55:05 +00:00
c6f435befd Don't recompute numel and contiguous in detach (#112689)
When symbolic shapes are involved, `refresh_numel` and `refresh_contiguous` are
fairly expensive since they dispatch to python for SymInt handling. However, in
the case of detach we can just copy the existing values instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112689
Approved by: https://github.com/lezcano, https://github.com/ezyang
2023-11-07 10:55:48 +00:00
52e2b87d00 [Kineto][NCCL][5/n] Populate in/out split size info for all_to_all from CPU to CUDA kernel (#112308)
Summary: This diff populates all_to_all input and out split size from CPU op to GPU kernel when valid.

Test Plan:
**Trace example**:
- For non all_to_all collective functions: https://fburl.com/perfdoctor/4nobsu15
https://pxl.cl/3GNVb

- For all_to_all: https://fburl.com/perfdoctor/f418goys

https://pxl.cl/3H2nd

Differential Revision: D50762093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112308
Approved by: https://github.com/aaronenyeshi
2023-11-07 09:50:37 +00:00
220c3bae6d Pass in parallel strategy to tp_transform API (#112286)
Summary: Support passing in a parallel strategy map and apply the TP transform based on it. This makes it easier to manually select layers from a real model to parallelize and benchmark.

Test Plan: buck test mode/opt  -c fbcode.enable_gpu_sections=true //caffe2/test/distributed/_tensor/experimental:tp_transform

Differential Revision: D50591039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112286
Approved by: https://github.com/wanchaol
2023-11-07 09:28:58 +00:00
f6fb9fd681 use smaller batch size for timm_efficientdet in inference (#113095)
Previously had OOMs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113095
Approved by: https://github.com/xmfan
ghstack dependencies: #112650
2023-11-07 07:08:16 +00:00
65304d8fd0 s390x: fix inductor constructing floats out of bytes (#112723)
This change fixes test_embedding_bag_byte_unpack_cpu from test/inductor/test_torchinductor.py on s390x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112723
Approved by: https://github.com/jansel
2023-11-07 06:51:46 +00:00
ff51f94e32 [Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) (#113094)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094
Approved by: https://github.com/fduwjj
2023-11-07 05:34:26 +00:00
68c4507bc2 [Inductor] Allow None values to be passed in as arguments to triton kernels (#113056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113056
Approved by: https://github.com/jansel
ghstack dependencies: #112752, #113008, #112801
2023-11-07 05:29:42 +00:00
bfa717c6a6 [Inductor] Improve reinplace_scatters pass (#112801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112801
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #112752, #113008
2023-11-07 05:29:42 +00:00
f6008be266 Move all triton related testing utils into shared file (#113008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113008
Approved by: https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #112752
2023-11-07 05:29:29 +00:00
dbf44dffc9 [Inductor] Cache generated user defined triton kernels on tensor dtype and non tensor parameters (#112752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112752
Approved by: https://github.com/jansel
2023-11-07 05:29:16 +00:00
f99b5f1f23 [Inductor][fx pass] Remove split nodes with split section size one (#112922)
Summary: We observe that DSNN has many split nodes with split section size one, which hinder the split-cat merge in a later pass, so we remove such nodes at an early stage.

Test Plan:
# local reproduce with DSNN model
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch-group -c
```
P872705076
diffing: https://www.internalfb.com/intern/diffing/?paste_number=872698775

# unit test

```
buck2 test mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/b248410e-a556-47a2-9293-7f113b49f0d6
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10696049124469023
Network: Up: 80KiB  Down: 47KiB  (reSessionID-a31dec17-d322-4757-ba84-4d262bd139cf)
Jobs completed: 24. Time elapsed: 1:52.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D50990290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112922
Approved by: https://github.com/jackiexu1992
2023-11-07 04:53:32 +00:00
7bd066ab48 Package pybind11/eigen/ (#113055)
Which was added in the pybind11 2.11 release, see https://github.com/pybind/pybind11/tree/v2.11.0/include/pybind11/eigen

Fixes https://github.com/pytorch/pytorch/issues/112841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113055
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2023-11-07 04:27:43 +00:00
10a829b85d Retarget sym_size/sym_stride lowerings to their .int overloads (#113054)
Fixes https://github.com/pytorch/pytorch/issues/112913

The new logging looks like this:

```
[2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg0_1 : [num_users=0] = placeholder[target=arg0_1]
[2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
[2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG] lowering %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, 1), kwargs = {})
[2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG]   via <function make_pointwise.<locals>.inner at 0x7f0abed28ee0>
[2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %sym_stride_int : [num_users=1] = call_function[target=torch.ops.aten.sym_stride.int](args = (%add, 0), kwargs = {}) sym_stride
[2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg1_1, %sym_stride_int), kwargs = {})
[2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG]   via <function mul at 0x7f0abec8bd00>
[2023-11-06 12:48:57,744] [0/0] torch._inductor.graph: [DEBUG] lowering return (mul,)
```

Notice that `sym_stride` no longer is hitting the lowering. This is what the behavior was before I broke it. A better refactor coming soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113054
Approved by: https://github.com/davidberard98
2023-11-07 04:15:38 +00:00
c847fd2ac8 Fix torch.compiler.cudagraph_mark_step_begin example (#112807)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112807
Approved by: https://github.com/eellison
2023-11-07 04:15:31 +00:00
74c24d2367 Fixes a bug in inductor.triton.load (#113047)
Letting CI/CD tell me if there is anything wrong with this

Original bug:
``` Shell
        r1 = rindex
        tmp37 = tl.load(out_ptr2 + (r1 + (8192*x0)), rmask, eviction_policy='evict_first', other=0)
                                                     ^
AssertionError('cannot cast int32[constexpr[1],constexpr[2048]] to <[1, 2048], fp8e4nv>')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113047
Approved by: https://github.com/Skylion007, https://github.com/ipiszy
2023-11-07 04:06:54 +00:00
ddfe572534 [dynamo] Graph break on setattr(Tensor, "data", Tensor) (#113043)
Fixes https://github.com/pytorch/pytorch/issues/113030

Alias information needs to be applied in eager before we can continue to trace the graph.

----

Perhaps this is too strict - couldn't we fx trace through the in-graph (pointer) aliasing, and track mutations through fake tensors instead, and still apply the aliasing mutation epilogue for further mutations outside of graph? 🤔

Regardless, it didn't seem to work too well when I tried this. Seems that `Tensor.__setattr__` doesn't work well in fx graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113043
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2023-11-07 03:56:21 +00:00
5c1ea30ca3 bump torchbench commit (#112650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112650
Approved by: https://github.com/msaroufim, https://github.com/xuzhao9
2023-11-07 03:56:16 +00:00
5cfe973bed [PyTorch FX] ProxyableClassMeta skip map_aggregate if not is_fx_tracing (#112934)
Summary: TorchRec KJT (https://fburl.com/code/yoaqqsgi) and LazyAwaitable (https://fburl.com/code/4bygm7tg) inherit ProxyableClassMeta in order to make torchrec models fx traceable. The issue is that even when we are not fx tracing, it still triggers this `map_aggregate(args, check_proxy)` https://fburl.com/code/mpbmjsqw, which iterates over every input to the KJT and flattens the list/dict to run a function on every element. This is super slow if len(list) is large. This diff skips the map_aggregate when it's not fx tracing.
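
A minimal sketch of the guard (assuming `is_fx_tracing` from `torch.fx._symbolic_trace`; the surrounding names are invented):
```python
from torch.fx._symbolic_trace import is_fx_tracing
from torch.fx.node import map_aggregate

def maybe_check_proxies(args):
    # Only pay for flattening every element when actually tracing; a plain
    # eager call on a huge list is now a cheap early return.
    if not is_fx_tracing():
        return
    map_aggregate(args, lambda a: a)  # placeholder for the real check_proxy
```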

Test Plan:
#facebook
# before: [trace](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu021.odn1.facebook.com/rank-0.Nov_03_16_56_11.243575.pt.trace.json.gz&bucket=aps_traces)
move_id_list features takes ~80ms when profiling with stack, most of the time is `map_aggregate`

{F1140039564}

# after: [trace](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu021.odn1.facebook.com/rank-0.Nov_03_16_27_50.3617247.pt.trace.json.gz&bucket=aps_traces)

now it's less than 3ms, no `map_aggregate`
 {F1140038095}

Differential Revision: D50994285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112934
Approved by: https://github.com/angelayi
2023-11-07 03:16:30 +00:00
4c04ae2451 [ROCm] fix test_softmax_forward_64bit_indexing_cuda OOM (#113093)
TestNNDeviceTypeCUDA.test_softmax_forward_64bit_indexing_cuda started failing for ROCm after #112096 with the message

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 13.35 GiB. GPU 0 has a total capacity of 31.98 GiB of which 3.89 GiB is free. Of the allocated memory 26.69 GiB is allocated by PyTorch, and 18.91 MiB is reserved by PyTorch but unallocated.

This amounts to approximately 41GB. The test is currently decorated with `largeTensorTest("30GB", "cuda")` but this is not sufficient for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113093
Approved by: https://github.com/malfet
2023-11-07 03:00:37 +00:00
8768b87bd1 Remove torch distributed from CODEOWNERS (#112813)
After adding support for labeler, we don't need CODEOWNERS.

This change will cause the distributed team members previously listed in
CODEOWNERS to stop being auto-added as reviewers on PRs touching these
files.  The preceding PR adds labeler support for these same sets of
files, and contains instructions for adding yourself to be cc'd for that
label.

It is preferable to be auto-cc'd rather than auto-tagged as a reviewer, so
that there is more signal in the reviewers list (either someone opted in,
which shows the PR author that someone is likely looking at it, or the PR
author added someone specifically, which is a stronger notification to
the tagged reviewer than the blanket CODEOWNERS behavior).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112813
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-11-07 02:43:04 +00:00
1fea599d9a Revert "Grandfather in c10d_functional ops to pt2_compliant (#113049)"
This reverts commit fe8570a1fe5c6678a4be8deff561dbc48693410e.

Reverted https://github.com/pytorch/pytorch/pull/113049 on behalf of https://github.com/clee2000 due to something in the stack broke distributed and inductor, pretty sure its this one ([comment](https://github.com/pytorch/pytorch/pull/113049#issuecomment-1797298969))
2023-11-07 02:34:13 +00:00
19dbd8aca3 Revert "Grandfather in some more pytorch ops to be pt2_compliant (#113050)"
This reverts commit efae8449a83df2bcd2e5f3c0f531051b6860cf0c.

Reverted https://github.com/pytorch/pytorch/pull/113050 on behalf of https://github.com/clee2000 due to something in the stack broke distributed and inductor, pretty sure its the c10 one ([comment](https://github.com/pytorch/pytorch/pull/113050#issuecomment-1797279756))
2023-11-07 02:30:42 +00:00
d94d72b397 Revert "Grandfather in built-in TorchScript ops to being pt2_compliant (#113061)"
This reverts commit 1d4d5e4319a5ddacdb4e0d1ac944bbb63921fdb1.

Reverted https://github.com/pytorch/pytorch/pull/113061 on behalf of https://github.com/clee2000 due to something in the stack broke distributed and inductor, pretty sure its the c10 one.  Not sure why so many things were flaky on this PR ([comment](https://github.com/pytorch/pytorch/pull/113061#issuecomment-1797251293))
2023-11-07 02:28:14 +00:00
ad844e7919 [inductor] fix out of shared memory issue (#112916)
Fix https://github.com/pytorch/pytorch/issues/112454 .

The current fix is quite simple. The kernel has multiple triton configs. Previously, if any triton config failed to compile, we skipped everything else and failed. Now we just skip the bad configs and pick the best one from the remaining configs.

There are other ways to fix the issues more fundamentally but requires much more work:
1. Horace mentioned an idea to make sure the largest one of size_hints is the rightmost dimension. This way, that largest dimension will be mapped to XBLOCK and we won't scale it up too much, since the threshold for the max grid size for the x dimension is quite large (2**31 - 1). But this may require us to change loop ordering heuristics, which may have other perf impact.
2. The issue happens because the kernel requires 2D tiling, which uses shared memory. We can stop scaling up block size if `XBLOCK * YBLOCK * element_size >= max_shared_memory`. max_shared_memory is around 160K for A100. The tricky part is that we don't know the dtype in the `triton_config` method to decide the `element_size`. From metadata, we can find the dtype for each tensor, but if the kernel uses tensors of mixed types, we won't know what dtype is actually used for the data loaded into the shared memory.
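
A minimal sketch of the chosen fix, with all helper names invented for illustration:
```python
def pick_best_config(configs, compile_fn, bench_fn):
    """Try each candidate config; skip the ones that fail to compile."""
    timings = {}
    for cfg in configs:
        try:
            kernel = compile_fn(cfg)
        except Exception:  # e.g. Triton's out-of-shared-memory error
            continue  # previously this aborted the whole compilation
        timings[cfg] = bench_fn(kernel)
    if not timings:
        raise RuntimeError("all candidate configs failed to compile")
    return min(timings, key=timings.get)
```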

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112916
Approved by: https://github.com/Chillee, https://github.com/eellison, https://github.com/jansel
2023-11-07 01:47:53 +00:00
c608b0eb35 [Dist] Enable FSDP on CPU (#112145)
Differential Revision: [D50688958](https://our.internmc.facebook.com/intern/diff/D50688958/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112145
Approved by: https://github.com/fegin
ghstack dependencies: #112144
2023-11-07 01:37:02 +00:00
5ffa98f7ba [Dist] Add fallback reduce_scatter_base, allgather_base APIs to Gloo (#112144)
Per Ke's suggestion, adding these APIs in ProcessGroupGloo directly to
enable FSDP on CPUs

Differential Revision: [D50636382](https://our.internmc.facebook.com/intern/diff/D50636382/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112144
Approved by: https://github.com/wz337, https://github.com/fegin, https://github.com/wanchaol, https://github.com/XilunWu
2023-11-07 01:37:02 +00:00
e9496fdc34 [pytorch-vulkan] Disable failing test on vulkan_api_test (#112936)
Summary:
`conv2d_pw_prepack` and `conv2d_pw_prepack_bc` have been broken for a long time on Meta's CI.

The cause is unknown yet. The tests pass with a smaller input.

Hence, disable the large tests and enable tests with a smaller tensor.

Test Plan:
Devserver:

```
 LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*"
```

Output: P872944689

```
...
[       OK ] VulkanAPITest.linear_4d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (2 ms)
[ RUN      ] VulkanAPITest.lstm_success
[       OK ] VulkanAPITest.lstm_success (7 ms)
[ RUN      ] VulkanAPITest.lstm_mclareninputs_success
[       OK ] VulkanAPITest.lstm_mclareninputs_success (39 ms)
[ RUN      ] VulkanAPITest.lstm_prepack_success
[       OK ] VulkanAPITest.lstm_prepack_success (3 ms)
[ RUN      ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7627: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 396 tests from VulkanAPITest (23847 ms total)
[----------] Global test environment tear-down
[==========] 396 tests from 1 test suite ran. (23847 ms total)
[  PASSED  ] 395 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
  YOU HAVE 9 DISABLED TESTS
```

Differential Revision: D50997218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112936
Approved by: https://github.com/manuelcandales
2023-11-07 01:33:45 +00:00
4893a2814f [pytree] align function signature between C++ and Python pytree (#112482)
Change the argument name in C++ and Python pytree APIs. Also add a test to ensure the function signatures are the same in the two implementations.

- #112485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112482
Approved by: https://github.com/zou3519
2023-11-07 01:26:41 +00:00
7715b47f44 [fx] Speedup ShapeEnv cache invalidation checks (#112687)
This may seem a bit silly but we spend ~5% of compilation on simply checking if the `ShapeEnv` cache has been invalidated. It isn't necessarily slow, but we call it millions of times per compile so everything adds up.

To improve the situation, I've added a version counter to the shape env that gets incremented whenever the cache key changes. This does require a bit of care in `ShapeEnv` that we don't modify the relevant state without calling `self._update_version_counter()`. However, we already have a similar situation for the translation validation feature which requires `_set_replacement` to be called instead of modifying the replacements directly.
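
A minimal sketch of the scheme (names illustrative, not the actual ShapeEnv internals):
```python
class Env:
    def __init__(self):
        self._version = 0
        self.replacements = {}

    def _update_version_counter(self):
        self._version += 1

    def set_replacement(self, sym, expr):
        # Every state mutation must go through a method that bumps the
        # version, mirroring how _set_replacement is required today.
        self.replacements[sym] = expr
        self._update_version_counter()

def cached(fn):
    memo = {}
    def wrapper(env, *args):
        # Comparing an int is far cheaper than hashing the env's full state.
        key = (env._version, args)
        if key not in memo:
            memo[key] = fn(env, *args)
        return memo[key]
    return wrapper
```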

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112687
Approved by: https://github.com/ezyang
ghstack dependencies: #112933
2023-11-07 01:10:25 +00:00
65ecb36621 Move ShapeEnv config out of dynamo (#112933)
Previously there was a circular dependency between fx and dynamo that happened
to work out since ShapeEnv didn't access the config at module init time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112933
Approved by: https://github.com/ezyang
2023-11-07 01:10:25 +00:00
b4dbb02d46 Adjust _list_with_default to also work with SymInt input (#113073)
Fixes https://github.com/pytorch/pytorch/issues/112496

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113073
Approved by: https://github.com/jbschlosser
2023-11-07 00:59:25 +00:00
8219bf051b [BE]: Apply RUF015 to torch folder (#113025)
Removes unnecessary allocations of iterators. There is a small chance this may have side effects as the entire iterator is no longer consumed, but this is a way more efficient method for retrieving the first element.
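
For reference, the shape of the change (illustrative only):
```python
items = (str(i) for i in range(1_000_000))
# Flagged by RUF015: materializes the whole iterable to take one element.
first = list(items)[0]

items = (str(i) for i in range(1_000_000))
# Preferred: stops after the first element. The iterator is no longer fully
# consumed, which is the side-effect risk mentioned above.
first = next(iter(items))
```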

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-07 00:48:15 +00:00
fb8ffba47f [PyTorch][Vulkan] Reduce 2D float matrix multiplication shader latency by more than 50% on some Android GPUs (#112918)
Summary:
- Introduce improved algorithm for 2d float GEMM [ output = alpha * (input) * (weight) + beta * (bias) ] that shows more than 50% shader latency reduction on Qualcomm GPUs. Does not apply for the quantized [integer] and batch [3d] matrix multiplication cases.
  - At function call of `run_linear_context()`/`run_addmm_context()`, re-pack the input tensor data to be row-wise element-dense in each texel
  - Reducing global I/O reads and writes through "batching" by fetching 4 input and weight texels each, then performing 16 output computations and writes, in each shader invocation
  - Leverage a loop unrolling/coalescing compile-time optimization of the GLSL->SPIR-V compiler using a macro for 4

Test Plan:
# Numerical Validation
- There are two pre-existing failures on trunk related to conv2d, unrelated to this diff's code paths
- `LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin`

```
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 396 tests from VulkanAPITest (38014 ms total)

[----------] Global test environment tear-down
[==========] 396 tests from 1 test suite ran. (38014 ms total)
[  PASSED  ] 393 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] VulkanAPITest.conv2d_pw_prepack
[  FAILED  ] VulkanAPITest.conv2d_pw_prepack_bc
```
# Performance Validation with Matrix Multiplication Benchmark Binary
- build the benchmark binary on both this diff and trunk modified for 100 iterations, `buck2 build -c ndk.debug_info_level=0 -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_mm_perf_test_binAndroid --show-output`

__on local testing against a Samsung Galaxy S22 Ultra **75% reduction**__
- this diff
```
Benchmark                                                                                       Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------------------------
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1          2.08 ms         10.3 ms          100
```
- trunk:
```
Benchmark                                                                                       Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------------------------
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1          9.11 ms         13.8 ms          100
```
__on local testing against our Android chipset of interest **50% reduction**__
- this diff:
```
Benchmark                                                                                       Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------------------------
[...]
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1          40.0 ms         90.6 ms          100

```
- trunk:
```
Benchmark                                                                                       Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------------------------
[...]
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1          81.3 ms          106 ms          100
```
__on local testing against Google Pixel 7 Pro **55% reduction**__
- this diff:
```
Benchmark                                                                                       Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------------------------
[...]
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1          7.38 ms         10.7 ms          100
```
- trunk:
```
Benchmark                                                                                       Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------------------------
[...]
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1          16.2 ms         12.4 ms          100

```

Reviewed By: yipjustin

Differential Revision: D50441864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112918
Approved by: https://github.com/yipjustin
2023-11-07 00:32:04 +00:00
24b61a45c9 [inductor] scale up num_warps for reductions to lower register pressure (#113039)
Recent work (https://github.com/pytorch/pytorch/pull/108193 and https://github.com/pytorch/pytorch/pull/109275) revealed that bigger Triton kernels can regress performance due to increased register pressure, which in turn lowers thread occupancy. Taking a look at the Triton internals, I see an opportunity to reduce the register pressure by decreasing the amount of work each thread does. I'm bumping up `num_warps` to achieve this. The change should only affect reduction cases.

I'm seeing real compilation time reduction with this change which is likely due to smaller LLVM IR:
https://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2023%20Oct%202023%2017%3A57%3A40%20GMT&stopTime=Mon%2C%2006%20Nov%202023%2018%3A57%3A40%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=hoy-reduction&lCommit=f2d31b83aa170914018407d88a76d5951153b316&rBranch=main&rCommit=64f326097be8ac66ff057365f3bed2d64c697563

The slight performance improvement could be noise; if not, the lower register pressure could explain it.

Ideally, we should improve Triton to automatically reroll large kernels into an inner loop without hurting vectorization. That's something I'm considering on the LLVM side.

I'm also seeing that the fused kernel provided in https://github.com/pytorch/pytorch/pull/108193 gets better performance by benefiting from lower register pressure. PTXAS shows a usage of 32 registers, compared to 55 previously.
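
A toy illustration of the tradeoff (numbers made up; 32 threads per warp):
```python
def elems_per_thread(r_numel: int, num_warps: int) -> float:
    # Each thread holds roughly this many partial values live at once,
    # which is what drives register pressure in a reduction.
    return r_numel / (32 * num_warps)

print(elems_per_thread(4096, 4))   # 32.0 -> high register pressure
print(elems_per_thread(4096, 16))  # 8.0  -> lower pressure, better occupancy
```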

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113039
Approved by: https://github.com/shunting314
2023-11-07 00:12:16 +00:00
c2084da14a [NT] Backward support for broadcasting binary ops (#112519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112519
Approved by: https://github.com/jbschlosser
ghstack dependencies: #113031
2023-11-07 00:03:21 +00:00
d5007d8d8e Split out input_metadata.cpp from input_metadata.h (#113031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113031
Approved by: https://github.com/albanD
2023-11-07 00:03:21 +00:00
1d4d5e4319 Grandfather in built-in TorchScript ops to being pt2_compliant (#113061)
I'm seeing ops like torch.ops.aten.mul.complex being used with
torch.compile (though this seems strange to me), but we should
grandfather these in.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113061
Approved by: https://github.com/ezyang
ghstack dependencies: #113036, #113049, #113050
2023-11-06 23:43:31 +00:00
efae8449a8 Grandfather in some more pytorch ops to be pt2_compliant (#113050)
We're not directly testing these, but in general the policy is to assume
that PyTorch ops inside the pytorch repo are compliant.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113050
Approved by: https://github.com/ezyang
ghstack dependencies: #113036, #113049
2023-11-06 23:43:31 +00:00
fe8570a1fe Grandfather in c10d_functional ops to pt2_compliant (#113049)
This PR also adds the ability to specify Tags for more `m.def(`
overloads.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113049
Approved by: https://github.com/williamwen42
ghstack dependencies: #113036
2023-11-06 23:43:23 +00:00
71dca16610 Grandfather autogen'ed ops as pt2_compliant (#113036)
Summary:
I missed this when I grandfathered torchgen'ed aten ops as pt2_compliant.

Test Plan:
New test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113036
Approved by: https://github.com/williamwen42
2023-11-06 23:43:17 +00:00
75adb9f371 Revert "Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)"
This reverts commit f9d47e13813bbefc9f19a6c0430b7122f9d09b91.

Reverted https://github.com/pytorch/pytorch/pull/112893 on behalf of https://github.com/clee2000 due to sorry this seems to have broken inductor f9d47e1381 https://github.com/pytorch/pytorch/actions/runs/6776367936/job/18418174752 ([comment](https://github.com/pytorch/pytorch/pull/112893#issuecomment-1796979811))
2023-11-06 22:49:53 +00:00
eefe327b11 Rename torch.onnx.ExportOutput* to ONNXProgram* (#112263)
In PyTorch 2.1, the torch.export API was introduced and the term "export"
got overloaded due to the already existing torch.onnx.export API.

The torch.onnx.dynamo_export API was introduced in PyTorch 2.0 and it
exposed a torch.onnx.ExportOutput which now can be confused with
torch.export.export output

To prevent such ambiguity and standardize names around the new
torch.export.ExportedProgram, this PR renames torch.onnx.ExportOutput to
torch.onnx.ONNXProgram

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112263
Approved by: https://github.com/BowenBao
ghstack dependencies: #112444
2023-11-06 22:27:15 +00:00
21b6030ac3 Don't set CUDA_HOME when not compiled with CUDA support (#106310)
It doesn't make sense to set this (on import!) as CUDA cannot be used with PyTorch in this case but leads to messages like
> No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
when CUDA happens to be installed which is at least confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106310
Approved by: https://github.com/ezyang
2023-11-06 21:48:49 +00:00
27e31ab6e8 Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-06 21:27:02 +00:00
7b99b3efb1 added 'weights_only' param in torch.load examples (#112860)
Fixes #111876

`torch.load` without setting `weights_only=True` is unsafe, so this updates the examples of `torch.load` to use `weights_only=True` where possible and `weights_only=False` elsewhere, with a warning about the safety risk.
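
For example (the file name is illustrative):
```python
import torch

torch.save({"w": torch.ones(2)}, "ckpt.pt")

# Safe: restricts unpickling to plain tensors and containers.
state = torch.load("ckpt.pt", weights_only=True)

# Unsafe in general: full pickle can execute arbitrary code, so only use
# weights_only=False on checkpoints from a trusted source.
state = torch.load("ckpt.pt", weights_only=False)
```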

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112860
Approved by: https://github.com/kit1980
2023-11-06 21:17:36 +00:00
c83112a31f Add Autocast support to Conv thourgh explicit cast (#112806)
Fix ONNX Runtime failure due to `[ONNXRuntimeError] : 1 : FAIL : Type Error : Type parameter (T) of Optype (Conv) bound to different types (tensor(float) and tensor(float16) in node (Conv_5401).`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112806
Approved by: https://github.com/BowenBao
2023-11-06 21:00:18 +00:00
f9d47e1381 Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: #112611, #112803
2023-11-06 20:48:39 +00:00
81ea7a489a Replaced deprecated pkg_resources.packaging with packaging module (#113023)
Usage of `from pkg_resources import packaging` leads to a deprecation warning:
```
DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
```
and in strict tests where warnings are errors, this leads to CI breaks, e.g.: https://github.com/pytorch/vision/pull/8092

Replacing `pkg_resources.packaging` with `packaging`, as it is now a PyTorch dependency:
fa9045a872/requirements.txt (L19)
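
The shape of the change (illustrative):
```python
# Before (emits the DeprecationWarning):
#   from pkg_resources import packaging
#   packaging.version.parse(torch.__version__)

# After, using the standalone packaging distribution directly:
import torch
from packaging import version

assert version.parse(torch.__version__) >= version.parse("1.0")
```
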
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113023
Approved by: https://github.com/Skylion007
2023-11-06 20:26:32 +00:00
67256d5c1c [aotinductor] Solves a problem where a tensor is returned more than once (#112177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112177
Approved by: https://github.com/zhxchen17
2023-11-06 20:12:25 +00:00
718035791d Prefer e.is_number over not e.free_symbols in SymPy (#112688)
We spend somewhere on the order 1% in `sympy.Expr.free_symbols` as it is called millions of times.
Most of the time we actually just want to know "is this a constant", however `e.is_constant()` is
horribly slow. It turns out though that there is another propery `is_number` that does what we want.

> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.

Even further, we also avoid the overhead of building the unnecessary set object.
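
A quick demonstration of the difference:
```python
import sympy

x = sympy.Symbol("x")

e = sympy.Integer(6) / 2
print(e.is_number)         # True: no free symbols, no undefined functions
print(not e.free_symbols)  # Same answer, but allocates a set first

print((x + 1).is_number)   # False, short-circuiting as soon as x is hit
```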

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
2023-11-06 20:05:13 +00:00
19e9f5cc7b [torchgen] Add support for optional tensor (#112938)
Summary: As titled

Test Plan: rely on CI

Differential Revision: D50997957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112938
Approved by: https://github.com/Skylion007
2023-11-06 20:03:05 +00:00
bdfde62e54 [Inductor CUTLASS backend] Epilogue fusion codegen (Step 1) (#110890)
Summary:

This PR adds epilogue fusion code generation support for the new experimental
[Inductor Cutlass backend]([https://github.com/pytorch/pytorch/pull/108015]).

Details:

A fusion happens on the GEMM template level by taking a Cutlass 3.x GEMM Universal Matmul Kernel template
and adding a custom template functor based on Cutlass new “Epilogue Visitor Trees” (EVT) on top, which represents and
performs the computation of the fused Pointwise / Elementwise computation nodes.

This is the approach dictated by [NVIDIA/cutlass example 49](https://github.com/NVIDIA/cutlass/blob/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu),
which is currently the only documentation and example of Cutlass Epilogue Visitor Trees.

This EVT functor in turn is a hierarchical template expression which represents an abstract syntax tree of the fused computation to perform.
A second codegen task is to create a hierarchical initializer expression, which provides potentially necessary arguments
to each of the functor subexpressions.

Step 1 functionality:

 * End to end code generation is possible using the above approach.
 * Supports simple elementwise expression fusion of chains of elementwise operations (with scalar constants)
   after a matmul.
 * Elementwise operation support includes addition, subtraction, multiplication, division, minimum, maximum etc.
 * Examples / Unit tests include ReLU and ReLU6 fusion.
 * Support for fp16 and fp16 with fp32 accumulation data types.
 * Generates SM90 ( Hopper ) based CUDA Kernels ( as Cutlass up to 3.2.0 only supported EVT for SM90 )

The following is not yet supported, and is left for future work:

 * Full operation support ( e.g. full set of all ops usually handled via V.ops handlers )
 * Cutlass EVT with SM80 support ( possible in Cutlass 3.2.1 according to release notes, but not yet documented )
 * Add support for additional (auxiliary) inputs, which changes the Template Kernels' call signature
 * Add support for additional (auxiliary) outputs ( requires support for full computation graphs )
 * Add support for reduction operations and operations which use different output layouts than the input
 * Add support for additional dtypes ( as far as Cutlass allows )

This PR updates third_party/cutlass to v3.2.2, which has some important improvements and features
for the inductor backend.

See also Cutlass release notes:
https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1 and https://github.com/NVIDIA/cutlass/releases/tag/v3.2.2

Notable changes in Cutlass 3.2.1 include:
 * Cutlass codegen python code has moved into a package with the "cutlass_library" namespace, which allows to
   prevent namespace clashes without resolving to monkey-patching ( which was done earlier ).
 * Support for SM80 epilogue visitor trees ( according to the Release Notes, not tried yet )
 * Small API changes to the cutlass_library API ( requires adapting the inductor backend code )

Notable changes in Cutlass 3.2.2 include:
 * Bugfix that led to CUDA Illegal memory access in some Pytorch unit tests involving flash attention

 Test Plan:
  * CI
  * pytest test/inductor/test_max_autotune.py

Note: So far, the CUTLASS backend is still disabled by default. Benchmarks are planned once more advanced fusions are enabled.

Differential Revision: [D50988161](https://our.internmc.facebook.com/intern/diff/D50988161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110890
Approved by: https://github.com/jansel
ghstack dependencies: #112762
2023-11-06 19:42:10 +00:00
59e003d159 Fixed cat uint8 lowering (#112753)
Description:
- Fixed cat uint8 lowering

Otherwise, it gives the following issue on the repro code:
```python
def func(x):
    batch_shape = x.shape[:1]
    out = torch.cat([x.new_zeros(1).expand(batch_shape + (1,)), x], dim=-1)
    return out

cfunc = torch.compile(func)

x = torch.randint(0, 256, size=(3, 255), dtype=torch.uint8)
out = cfunc(x)
```
Error message:
```
  File "/pytorch/torch/_inductor/lowering.py", line 1037, in <genexpr>
    if all(len(input.layout.size) == 4 for input in inputs):
  File "/pytorch/torch/_inductor/ir.py", line 5795, in __getattr__
    fn = getattr(self.data, name)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AttributeError: 'ExpandView' object has no attribute 'layout'
  target: aten.cat.default
  args[0]: [TensorBox(
    ExpandView(data=StorageBox(
      ComputedBuffer(name='buf0', layout=FlexibleLayout('cpu', torch.uint8, size=[1], stride=[1]), data=Pointwise(
        'cpu',
        torch.uint8,
        def inner_fn(index):
            _ = index
            tmp0 = ops.constant(0, torch.uint8)
            return tmp0
        ,
        ranges=[1],
        origin_node=full,
        origins={full}
      ))
    ), size=[3, 1])
  ), TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cpu', torch.uint8, size=[3, 255], stride=[255, 1]))
  ))]
  args[1]: 1

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Context: compiling is not working for torchvision's `F.equalize` op: https://github.com/pytorch/vision/issues/8056
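
The shape of the fix, as a sketch ( an assumption about the patch, not its exact contents ):
query sizes through the IR node's generic accessor instead of assuming a realized layout.

```python
# Sketch only: ExpandView has no `.layout`, so probing it raises the
# AttributeError above, while every inductor IR node can report its size.

# before ( breaks on unrealized views such as ExpandView ):
#   if all(len(input.layout.size) == 4 for input in inputs):

# after ( works for any IR node ):
#   if all(len(input.get_size()) == 4 for input in inputs):
```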

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112753
Approved by: https://github.com/peterbell10
2023-11-06 19:42:04 +00:00
542fa4a2e7 Revert "Revert "Use OpOverload instead of OpOverloadPacket for size/s… (#113058)
Revert "Revert "Use OpOverload instead of OpOverloadPacket for size/stride/etc slots (#112119)""

This reverts commit a1d1b73a7c2cf6b9a2edb4170ec268dfd90956bd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113058
Approved by: https://github.com/izaitsevfb
2023-11-06 19:38:49 +00:00
118e842fdf [2D][test] Update 2d test to reflect distributed_state_dict API changes (#112967)
As title

Fixes https://github.com/pytorch/pytorch/issues/113033
Fixes https://github.com/pytorch/pytorch/issues/112969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112967
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/huydhn
2023-11-06 19:36:30 +00:00
4d9546cc1b [pytorch-vulkan] conv1d, only handle special case (#112880)
Summary:
Just enough to cover the requirement for our target use-case.

Will add complete implementation later.

Test Plan:
```
(base) yipjustin@yipjustin-mac fbsource % buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="*conv1d*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/27291bfe-940a-4bed-9616-8f3b4f2a3fc7
Network: Up: 20MiB  Down: 142B  (reSessionID-5632e058-9f48-40eb-8157-30e2db104272)
Jobs completed: 6. Time elapsed: 13.5s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (37 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (2 ms)
[----------] 2 tests from VulkanAPITest (39 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (39 ms total)
[  PASSED  ] 2 tests.
```

Differential Revision: D50914117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112880
Approved by: https://github.com/manuelcandales
2023-11-06 19:36:08 +00:00
ab1f6d58bc [c10d] use allocator trace callbacks for NCCL PG register (#112850)
Summary:
We need to register all cache segments allocated by allocator, so that NCCL can apply zero copy algorithms at collective and point-to-point operations.

How to track and register all cache segments:
- It registers a register hook and a deregister hook with the cache allocator as action tracker callbacks, tracking SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When SEGMENT_ALLOC is tracked, the register hook registers the new segment with the PG's communicators on the same device. Similarly, when SEGMENT_FREE is tracked, the deregister hook handles deregistration before cudaFree ( see the sketch below ).
- When a new NCCL communicator is created, it dumps the snapshot from the cache allocator to register all existing cache segments at once.
- When an NCCL communicator is aborted, it deregisters all segments that have been registered by this communicator.
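
Conceptually, the callback logic looks like this ( illustrative Python pseudocode; the real
implementation is C++ inside ProcessGroupNCCL, and every name below is hypothetical ):

```python
# Hypothetical sketch of the described trace-tracker callback; the actual
# hooks are C++ and attached to the CUDA caching allocator.
def on_allocator_trace(entry, comms_by_device):
    if entry.action == "SEGMENT_ALLOC":
        # Register the new segment with every communicator on this device.
        for comm in comms_by_device.get(entry.device, []):
            comm.register_segment(entry.addr, entry.size)
    elif entry.action == "SEGMENT_FREE":
        # Deregister before the allocator hands the segment to cudaFree.
        for comm in comms_by_device.get(entry.device, []):
            comm.deregister_segment(entry.addr)
```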

Test Plan: See test in D50726971

Reviewed By: wconstab

Differential Revision: D50726970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112850
Approved by: https://github.com/wconstab
2023-11-06 19:29:32 +00:00
c6ecd018d5 Fix docstring errors (#112693)
This PR reduces docstring errors from a total of 128 to 0. This can be verified by running `pydocstyle path-to-distributed_c10d.py --count`,

where `path-to-distributed_c10d.py` is `torch/distributed/distributed_c10d.py`

BEFORE the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
128
AFTER the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
0

Fixes #112640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112693
Approved by: https://github.com/H-Huang
2023-11-06 18:45:05 +00:00
5248bc9c8e [LTC] Fix type inference for native_layer_norm_backward (#112948)
## Description
Fix a bug in compute_shape_native_layer_norm_backward function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112948
Approved by: https://github.com/Skylion007
2023-11-06 18:30:08 +00:00
a810126cf7 [FSDP][optim_state_dict] Skip the parameter if the parameter does not belong to the current FSDP instance (#112804)
Skip the FSDP-managed parameter if the parameter is not managed by the current FSDP instance. This can happen when not all FSDP instances have all the parameters, e.g. with FSDP + some MPMD-style parallelism.

Differential Revision: [D50562170](https://our.internmc.facebook.com/intern/diff/D50562170/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112804
Approved by: https://github.com/wz337
2023-11-06 18:23:36 +00:00
5f562afff3 [DTensor] min, max and prod sharding propagation rules (#112403)
* `torch/distributed/_tensor/ops/math_ops.py` and `test/distributed/_tensor/test_math_ops.py`: add min, max and prod sharding propagation rules
* `torch/distributed/_tensor/sharding_prop.py` Validate OutputSpec to provide better errors when provided invalid specs
* `torch/distributed/_tensor/op_schema.py`: import `OpOverload` directly to aid linters

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112403
Approved by: https://github.com/wanchaol
2023-11-06 18:02:39 +00:00
b6e85eb8d5 [quant][pt2] Support quantized conv bias in QAT fusion (#112528)
Summary: Previously QAT fusion assumes bias is not quantized.
This works for the existing XNNPACKQuantizer, but not for custom
quantizers that wish to quantize the bias. This commit supports
this by adding the necessary patterns. This requires refactoring
the code, however, since it previously assumed that there will
only be one pair of q-dq (from conv weight) in the matched
pattern, and this is no longer true.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_bias_derived_qspec

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar

Differential Revision: [D50856377](https://our.internmc.facebook.com/intern/diff/D50856377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112528
Approved by: https://github.com/jerryzh168
2023-11-06 17:58:57 +00:00
e39668770a [CUDA] 64-bit indexing fixes for cross-entropy kernels (#112096)
For #108345, #111484

Addresses the forward kernels implicated in the issues, but will take another look at the backward kernels (in follow-up PRs if necessary).

The spatial softmax kernel is changed to use signed integer indexing rather than unsigned as `ScalarType` only has signed integer types declared for now, but this should be a minor change.

CC @ptrblck @crcrpar (who landed a few related PRs recently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112096
Approved by: https://github.com/mikaylagawarecki
2023-11-06 17:37:08 +00:00
a50f6d3685 Move release docker container builds to ubuntu22.04 (#113032)
Move Official Docker builds for the release to :
nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113032
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-11-06 17:33:40 +00:00
3f62531191 Fix: docstring errors in torch.nn.utils - parametrizations.py/prune.py/weight_norm.py (#113021)
Fixes #112631. The previous PR #112943 had an accidental merge, which is resolved through this PR.

- torch/nn/utils/parametrizations.py
**Before - 6**
```
torch\nn\utils\parametrizations.py:1 at module level:
        D100: Missing docstring in public module
torch\nn\utils\parametrizations.py:23 in private function `_make_orthogonal`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\parametrizations.py:23 in private function `_make_orthogonal`:
        D210: No whitespaces allowed surrounding docstring text
torch\nn\utils\parametrizations.py:178 in public function `orthogonal`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\parametrizations.py:309 in public function `weight_norm`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\parametrizations.py:483 in public function `spectral_norm`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
6
```
**After - 1**
```
torch\nn\utils\parametrizations.py:1 at module level:
        D100: Missing docstring in public module
1
```
- torch/nn/utils/prune.py
**Before - 100**
```
torch\nn\utils\prune.py:1 at module level:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch\nn\utils\prune.py:1 at module level:
        D400: First line should end with a period (not 's')
torch\nn\utils\prune.py:13 in public class `BasePruningMethod`:
        D204: 1 blank line required after class docstring (found 0)
torch\nn\utils\prune.py:21 in public method `__call__`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:21 in public method `__call__`:
        D400: First line should end with a period (not ')')
torch\nn\utils\prune.py:34 in public method `compute_mask`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:34 in public method `compute_mask`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch\nn\utils\prune.py:53 in public method `apply_mask`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:53 in public method `apply_mask`:
        D400: First line should end with a period (not 'g')
torch\nn\utils\prune.py:74 in public method `apply`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:74 in public method `apply`:
        D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:74 in public method `apply`:
        D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:200 in public method `prune`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:200 in public method `prune`:
        D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:200 in public method `prune`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch\nn\utils\prune.py:229 in public method `remove`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:229 in public method `remove`:
        D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:229 in public method `remove`:
        D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
torch\nn\utils\prune.py:256 in public class `PruningContainer`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:264 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:277 in public method `add_pruning_method`:
        D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:297 in public method `__len__`:
        D105: Missing docstring in magic method
torch\nn\utils\prune.py:300 in public method `__iter__`:
        D105: Missing docstring in magic method
torch\nn\utils\prune.py:303 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch\nn\utils\prune.py:307 in public method `compute_mask`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:307 in public method `compute_mask`:
        D400: First line should end with a period (not 's')
torch\nn\utils\prune.py:307 in public method `compute_mask`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\prune.py:335 in private nested function `_combine_masks`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:335 in private nested function `_combine_masks`:
        D400: First line should end with a period (not ':')
torch\nn\utils\prune.py:404 in public class `Identity`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:404 in public class `Identity`:
        D400: First line should end with a period (not 'e')
torch\nn\utils\prune.py:410 in public method `compute_mask`:
        D102: Missing docstring in public method
torch\nn\utils\prune.py:416 in public method `apply`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:416 in public method `apply`:
        D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:416 in public method `apply`:
        D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:442 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:447 in public method `compute_mask`:
        D102: Missing docstring in public method
torch\nn\utils\prune.py:469 in public method `apply`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:469 in public method `apply`:
        D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:469 in public method `apply`:
        D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:486 in public class `L1Unstructured`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:486 in public class `L1Unstructured`:
        D400: First line should end with a period (not 's')
torch\nn\utils\prune.py:498 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:503 in public method `compute_mask`:
        D102: Missing docstring in public method
torch\nn\utils\prune.py:527 in public method `apply`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:527 in public method `apply`:
        D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:527 in public method `apply`:
        D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:564 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:571 in public method `compute_mask`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:571 in public method `compute_mask`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch\nn\utils\prune.py:634 in public method `apply`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:634 in public method `apply`:
        D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:634 in public method `apply`:
        D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:653 in public class `LnStructured`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:653 in public class `LnStructured`:
        D400: First line should end with a period (not 'r')
torch\nn\utils\prune.py:669 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:677 in public method `compute_mask`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:677 in public method `compute_mask`:
        D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch\nn\utils\prune.py:747 in public method `apply`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:747 in public method `apply`:
        D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:747 in public method `apply`:
        D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:779 in public class `CustomFromMask`:
        D101: Missing docstring in public class
torch\nn\utils\prune.py:783 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:786 in public method `compute_mask`:
        D102: Missing docstring in public method
torch\nn\utils\prune.py:793 in public method `apply`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:793 in public method `apply`:
        D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:793 in public method `apply`:
        D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:806 in public function `identity`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:806 in public function `identity`:
        D400: First line should end with a period (not 'e')
torch\nn\utils\prune.py:806 in public function `identity`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\prune.py:839 in public function `random_unstructured`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:839 in public function `random_unstructured`:
        D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:874 in public function `l1_unstructured`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:874 in public function `l1_unstructured`:
        D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:916 in public function `random_structured`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:916 in public function `random_structured`:
        D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:955 in public function `ln_structured`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:955 in public function `ln_structured`:
        D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:1000 in public function `global_unstructured`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1000 in public function `global_unstructured`:
        D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:1120 in public function `custom_from_mask`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1120 in public function `custom_from_mask`:
        D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:1154 in public function `remove`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1154 in public function `remove`:
        D400: First line should end with a period (not 'e')
torch\nn\utils\prune.py:1154 in public function `remove`:
        D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
torch\nn\utils\prune.py:1184 in public function `is_pruned`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1184 in public function `is_pruned`:
        D400: First line should end with a period (not 'r')
torch\nn\utils\prune.py:1211 in private function `_validate_pruning_amount_init`:
        D401: First line should be in imperative mood (perhaps 'Validate', not 'Validation')
torch\nn\utils\prune.py:1243 in private function `_validate_pruning_amount`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1243 in private function `_validate_pruning_amount`:
        D400: First line should end with a period (not 'e')
torch\nn\utils\prune.py:1243 in private function `_validate_pruning_amount`:
        D401: First line should be in imperative mood (perhaps 'Validate', not 'Validation')
torch\nn\utils\prune.py:1265 in private function `_validate_structured_pruning`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1265 in private function `_validate_structured_pruning`:
        D400: First line should end with a period (not '-')
torch\nn\utils\prune.py:1265 in private function `_validate_structured_pruning`:
        D401: First line should be in imperative mood (perhaps 'Validate', not 'Validation')
torch\nn\utils\prune.py:1284 in private function `_compute_nparams_toprune`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1284 in private function `_compute_nparams_toprune`:
        D400: First line should end with a period (not 'a')
torch\nn\utils\prune.py:1308 in private function `_validate_pruning_dim`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1308 in private function `_validate_pruning_dim`:
        D400: First line should end with a period (not ':')
torch\nn\utils\prune.py:1318 in private function `_compute_norm`:
        D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1318 in private function `_compute_norm`:
        D400: First line should end with a period (not 'n')
100
```
**After - 14**
```
torch\nn\utils\prune.py:266 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:299 in public method `__len__`:
        D105: Missing docstring in magic method
torch\nn\utils\prune.py:302 in public method `__iter__`:
        D105: Missing docstring in magic method
torch\nn\utils\prune.py:305 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch\nn\utils\prune.py:411 in public method `compute_mask`:
        D102: Missing docstring in public method
torch\nn\utils\prune.py:445 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:450 in public method `compute_mask`:
        D102: Missing docstring in public method
torch\nn\utils\prune.py:502 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:507 in public method `compute_mask`:
        D102: Missing docstring in public method
torch\nn\utils\prune.py:570 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:677 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:790 in public class `CustomFromMask`:
        D101: Missing docstring in public class
torch\nn\utils\prune.py:794 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\prune.py:797 in public method `compute_mask`:
        D102: Missing docstring in public method
14
```
- torch/nn/utils/weight_norm.py
**Before - 10**
```
torch\nn\utils\weight_norm.py:1 at module level:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch\nn\utils\weight_norm.py:1 at module level:
        D400: First line should end with a period (not '8')
torch\nn\utils\weight_norm.py:12 in public class `WeightNorm`:
        D101: Missing docstring in public class
torch\nn\utils\weight_norm.py:16 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\weight_norm.py:23 in public method `compute_weight`:
        D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:29 in public method `apply`:
        D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:59 in public method `remove`:
        D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:66 in public method `__call__`:
        D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:73 in public function `weight_norm`:
        D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\weight_norm.py:137 in public function `remove_weight_norm`:
        D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
10
```
**After - 6**
```
torch\nn\utils\weight_norm.py:10 in public class `WeightNorm`:
        D101: Missing docstring in public class
torch\nn\utils\weight_norm.py:14 in public method `__init__`:
        D107: Missing docstring in __init__
torch\nn\utils\weight_norm.py:21 in public method `compute_weight`:
        D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:27 in public method `apply`:
        D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:57 in public method `remove`:
        D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:64 in public method `__call__`:
        D102: Missing docstring in public method
6
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113021
Approved by: https://github.com/lezcano
2023-11-06 17:24:32 +00:00
88920b26be [Cmake] Check that gcc-9.4 or newer is used (#112858)
As this is the oldest gcc that is fully compatible with the C++17 standard.
- Replace a number of conditional version checks with the simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As the `-Wsuggest-override` condition was hidden behind an incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved`, `caffe2::python::TensorFeeder::Feed` and `caffe2::NetObserverReporterPrint::report`

Fixes https://github.com/pytorch/pytorch/issues/101839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-11-06 17:19:53 +00:00
77d5f0379e Revert "[HigherOrderOp] remove _deprecated_global_ns (#112757)"
This reverts commit fa81237af74e21e8d5b8e2d0f600ee9056bde4b8.

Reverted https://github.com/pytorch/pytorch/pull/112757 on behalf of https://github.com/PaliC due to breaking a bunch of executorch tests ([comment](https://github.com/pytorch/pytorch/pull/112757#issuecomment-1795503740))
2023-11-06 17:04:19 +00:00
a1d1b73a7c Revert "Use OpOverload instead of OpOverloadPacket for size/stride/etc slots (#112119)"
This reverts commit 2337d8d0625f230f9a0469c5806e282fa4b964e9.

Reverted https://github.com/pytorch/pytorch/pull/112119 on behalf of https://github.com/PaliC due to still breaking trt tests :( refer to diff ([comment](https://github.com/pytorch/pytorch/pull/112119#issuecomment-1795496395))
2023-11-06 17:01:50 +00:00
679ca510b0 Revert "[Cmake] Check that gcc-9.4 or newer is used (#112858)"
This reverts commit ad894cd0728e97c649cd9b33e1f98b18fa12a1da.

Reverted https://github.com/pytorch/pytorch/pull/112858 on behalf of https://github.com/PaliC due to breaking internal tests (check diff for test page) ([comment](https://github.com/pytorch/pytorch/pull/112858#issuecomment-1795485009))
2023-11-06 16:56:09 +00:00
185515368b Add generated opcheck test for if the pt2_compliant_tag is incorrectly applied (#112759)
Summary:
If there are xfails in the failures_dict and the operator has the
pt2_compliant_tag, then we raise an error. These generated tests are separate
from those in the failures dict because we don't actually need any sample
inputs to check this.

Test Plan: - New tests

Differential Revision: D50936201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112759
Approved by: https://github.com/ezyang
2023-11-06 13:45:35 +00:00
376217cc0b [BE]: Apply FURB145 to make code more readable and idiomatic. (#112990)
Testing out some new rules that are in beta; I think I will apply this one codebase-wide once it's out of preview. Replaces the hack of using `[:]` to copy lists with the proper `copy()` method. More efficient and more readable, as the example below shows.
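
For example ( a minimal sketch ):

```python
# What FURB145 rewrites: slice-copy vs. the explicit copy() method.
xs = [1, 2, 3]

ys = xs[:]       # flagged by FURB145
zs = xs.copy()   # preferred: clearer intent, same result
assert ys == zs == xs and ys is not xs
```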
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112990
Approved by: https://github.com/ezyang
2023-11-06 13:15:04 +00:00
fa9045a872 [xla hash update] update the pinned xla hash (#113011)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113011
Approved by: https://github.com/pytorchbot
2023-11-06 10:57:25 +00:00
2bc1378d7b Revert "[aotinductor] Solves a problem where a tensor is returned more than once (#112177)"
This reverts commit a91baaf314999abaaf93260f87b1ee109bb36541.

Reverted https://github.com/pytorch/pytorch/pull/112177 on behalf of https://github.com/PaliC due to breaking internal tests (refer to internal diff) ([comment](https://github.com/pytorch/pytorch/pull/112177#issuecomment-1794153272))
2023-11-06 06:20:32 +00:00
455241bbd3 Add Half for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU (#112138)
Add Half support for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112138
Approved by: https://github.com/cpuhrsch
2023-11-06 06:01:29 +00:00
bd9be877e4 [aotinductor] Move cache_dir to utils.py (#112728)
Summary: Some tests can utilize cache_dir()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112728
Approved by: https://github.com/jansel, https://github.com/chenyang78
ghstack dependencies: #112651
2023-11-06 03:42:10 +00:00
46a34e8c75 Inductor cpp wrapper: fix QMaxPool (#112379)
Based on the `Argument types` section of the `func` documentation under cb942ef2b1/aten/src/ATen/native, a non-inplace `Tensor` type in a schema should be mapped to a C++ argument of type `const Tensor&`.

For `quantized_max_pool1d` and `quantized_max_pool2d`, the `qx` input has `Tensor` type in the schema, so the C++ type is modified to `const Tensor&`:
cb942ef2b1/aten/src/ATen/native/quantized/library.cpp (L222-L223)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112379
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #112373, #112378
2023-11-06 02:07:51 +00:00
3be0e1cd58 c10::DriverAPI Try opening libcuda.so.1 (#112996)
As `libcuda.so` is only installed in dev environments (i.e. when CUDAToolkit is installed), while `libcuda.so.1` is part of the NVIDIA driver.
Also, this will keep it aligned with a5cb8f75a7/aten/src/ATen/cuda/detail/LazyNVRTC.cpp (L16)

Also, change `TORCH_INTERNAL_ASSERT` to `TORCH_CHECK`, as one can legitimately fail to open it if the driver cannot be found.

Fixes https://github.com/pytorch/pytorch/issues/112957
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112996
Approved by: https://github.com/kit1980, https://github.com/Skylion007
ghstack dependencies: #112994, #112995
2023-11-05 23:20:22 +00:00
d0a80f8af1 Better errors in c10::DriverAPI on dl failure (#112995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112995
Approved by: https://github.com/Skylion007
ghstack dependencies: #112994
2023-11-05 23:20:22 +00:00
57191172f8 [BE] Use static local variable instead of call_once (#112994)
See https://en.cppreference.com/w/cpp/language/storage_duration#Static_local_variables

And also, it's weird to mix two paradigms together, as a static local variable is used to initialize the `DriverAPI::get()` singleton a mere 3 lines below
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112994
Approved by: https://github.com/Skylion007
2023-11-05 23:20:11 +00:00
9c1fb2cbb3 [BE]: Enable ruff PIE794 and fix bugs it found in test suite (#112989)
Enables some tests that were incorrectly not being run and enables PIE794 globally. This rule checks whether a class variable is defined twice and flags it, as that is likely a bug; in fact, we found several cases where it was. It does have a couple of false positives, which I flagged upstream and replaced with noqas: https://github.com/astral-sh/ruff/issues/8497
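
A minimal example of what the rule flags:

```python
# PIE794: the second definition silently overwrites the first, which in a
# test class often disables a knob without anyone noticing.
class TestConfig:
    check_gradients = True
    num_iterations = 10
    check_gradients = False  # flagged: redefined class variable
```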

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112989
Approved by: https://github.com/malfet
2023-11-05 22:11:53 +00:00
07123bc198 [ROCm] Build Triton in Centos for ROCm (#112050)
The Triton build for the centos-based ROCm Dockerfile was missing. This brings the centos Dockerfile up to date with the ubuntu Dockerfile. No CI job covers this change; it was independently verified by the ROCm QA team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112050
Approved by: https://github.com/jataylo, https://github.com/malfet
2023-11-05 20:43:56 +00:00
a5cb8f75a7 [dynamo] Replace checkpointing with speculate/restart in graph_break_if_unsupported (#112921)
See comment in #112902 for context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112921
Approved by: https://github.com/voznesenskym
ghstack dependencies: #112902
2023-11-05 17:09:29 +00:00
7818a2887a [dynamo] Replace InstructionTranslator.checkpoint with speculate/restart (#112902)
In my work on making guards installed eagerly (look up the stack), I found that our checkpoint/restore mechanism is very broken.  There is lots of state (especially in shape_env) which we don't checkpoint and restore properly.  We also have lots of mutable state on variable trackers already which is not checkpointed/restored.  (See other PRs in this stack for some spot fixes.)

Since we wanted to get rid of this anyway for making VariableTracker mutable, I figured I would just switch to restarting analysis.

For other usages of copy_graphstate/restore_graphstate:
1) Many usages were pointless and not needed; these are removed in PRs below this.
2) Some other usage (similar to this one) is removed in PRs above this.
3) The tricky one I am not handling is higher_order_ops, which uses checkpoint/restore a lot.    There might be some cases there where this speculate/restart trick won't work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112902
Approved by: https://github.com/voznesenskym
2023-11-05 17:09:29 +00:00
7a18376187 Add Half support for poisson and use float for Half cumulative distribution on CPU (#112124)
Add Half support for poisson and use float for Half cumulative distribution on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112124
Approved by: https://github.com/cpuhrsch
2023-11-05 16:10:27 +00:00
674c104d12 Fix RecursionError in Inductor for large for loops (#112320)
Fixes https://github.com/pytorch/pytorch/issues/111686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112320
Approved by: https://github.com/peterbell10
2023-11-05 13:12:54 +00:00
e64d250210 Add a tool for a semi-automatic optimization of bsr_dense_mm meta parameters. (#112737)
Finding optimal meta parameters for bsr_dense_mm and bsr_scatter_mm triton kernels is a tedious job. This PR introduces a tool (a Python script `torch/sparse/_triton_ops_meta.py`) that finds the optimal set of meta parameters for a given set of matrix multiplication inputs and their block sizes. Currently, such a set is found for square bsr tensor inputs with sizes 256...16384 and square blocksizes 16...128, and dense tensor inputs with sizes 256...131072.
As a result, bsr_dense_mm performance has increased as follows (`NVIDIA A100-SXM4-80GB`):
- for blocksize 16x16, the average/maximum speed up is about 40/60 %.
- for blocksize 32x32, the average/maximum speed up is about 28/45 %.
- for blocksize 64x64, the average/maximum speed up is about 26/43 %.
- for blocksize 128x128, the average/maximum speed up is about 12/28 %.

To enable the performance improvements through meta parameter optimization for other CUDA devices, one must execute `_triton_ops_meta.py`, which will calculate the optimal meta parameters and store the results in a dictionary object defined in `_triton_ops_meta.py`.
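
Exercising the tuned kernel looks roughly like this ( a sketch; requires a CUDA device, and the
module path and dtype choice are assumptions ):

```python
# Sketch: multiply a block-sparse (BSR) matrix by a dense one via the Triton
# kernel, which picks up tuned meta parameters when they are available.
import torch
from torch.sparse._triton_ops import bsr_dense_mm

lhs = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)
bsr = lhs.to_sparse_bsr(blocksize=(32, 32))  # square 32x32 blocks
rhs = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)

out = bsr_dense_mm(bsr, rhs)
```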

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112737
Approved by: https://github.com/cpuhrsch
2023-11-05 12:52:09 +00:00
26b5e27ace Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU (#112132)
Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112132
Approved by: https://github.com/cpuhrsch
2023-11-05 12:31:38 +00:00
64f326097b [dynamo] Refactor handling of state in context managers (#112939)
The prior handling was rather buggy...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112939
Approved by: https://github.com/voznesenskym, https://github.com/yanboliang
ghstack dependencies: #112897, #112898, #112920, #112899
2023-11-05 03:10:30 +00:00
ea4b63db62 Back out "[aotinductor] Add example_value metadata to nodes (#112415)" (#112946)
Summary:
Original commit changeset: 967c6272c8e2

Original Phabricator Diff: D50802786

D50802786 is introding perf regression for AOTInductor internal models.

Differential Revision: D51002032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112946
Approved by: https://github.com/houseroad
2023-11-05 01:27:42 +00:00
3a41fff5c0 [dynamo] Remove empty_checkpoint (#112899)
Refactor to make it easier to remove `self.checkpoint`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112899
Approved by: https://github.com/voznesenskym, https://github.com/yanboliang
ghstack dependencies: #112897, #112898, #112920
2023-11-05 00:44:21 +00:00
d78b5e5403 [dynamo] Remove checkpoint in GenericContextManager (#112920)
Checkpointing here is pointless since we just call `unimplemented()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112920
Approved by: https://github.com/voznesenskym, https://github.com/yanboliang
ghstack dependencies: #112897, #112898
2023-11-05 00:44:21 +00:00
2ba2525d12 [dynamo] Remove checkpoint in conditional (#112898)
Checkpointing here is pointless since we just call `unimplemented()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112898
Approved by: https://github.com/voznesenskym, https://github.com/yanboliang
ghstack dependencies: #112897
2023-11-05 00:44:02 +00:00
a6b42b5ada [dynamo] Remove checkpoint in inline_user_function_return (#112897)
This usage is pointless since if we are throwing an exception the state doesn't matter.

Extra graphs come from fixing an AttributeError("tensor_variable") which previously caused the remainder of the frame to fall back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112897
Approved by: https://github.com/voznesenskym, https://github.com/yanboliang
2023-11-05 00:43:52 +00:00
847c7c6da6 Update ruff to v0.1.4 (#112966)
Updates ruff which fixes some bugs and updates an API to be used more consistently `rule now takes --output-format with the old argname deprecated`. A lot of rule bugfixes and autofixes have been added to the pydocstyle rules which will be useful for the docathon.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112966
Approved by: https://github.com/kit1980, https://github.com/justinchuby
2023-11-05 00:00:11 +00:00
f908b0e9a3 [dynamo] Enable typechecking for hooks.py (#112565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112565
Approved by: https://github.com/Skylion007
ghstack dependencies: #112561, #112562, #112563, #112564
2023-11-04 19:37:06 +00:00
fe41a9ce08 [dynamo] Enable typechecking for resume_execution.py (#112564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112564
Approved by: https://github.com/williamwen42, https://github.com/eellison
ghstack dependencies: #112561, #112562, #112563
2023-11-04 19:37:06 +00:00
3b34c818ac [dynamo] Enable typechecking for test_minifier_common.py (#112563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112563
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #112561, #112562
2023-11-04 19:36:56 +00:00
ca4fe028c8 [dynamo] Enable typechecking for replay_record.py (#112562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112562
Approved by: https://github.com/Skylion007
ghstack dependencies: #112561
2023-11-04 19:36:38 +00:00
b8ac5bbcbd [dynamo] Enable typechecking for bytecode_transformation.py (#112561)
As part of this diff, I have upgraded the `python_version` config setting to 3.11. `bytecode_transformation.py` (and a few other files) have functions using APIs only available in Python 3.11+. Those APIs are gated by a sys.version_info check in their typeshed .pyi files. So setting the min version to 3.11 allows those functions to typecheck properly.

An alternative is to make the relevant types Any:

```
if sys.version_info >= (3, 11):
    _Positions = dis.Positions
else:
    _Positions = Any
```

However, with python_version = 3.8, that means we're not getting any useful typechecking signal when encountering values of type _Positions.

Changing the python_version to 3.11 does mean that we will stop typechecking codepaths that run only on lower versions, but that seems a small price to pay. It does also mean that we won't catch code that uses newer APIs without the appropriate version check, but again, not sure this has much of an impact.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112561
Approved by: https://github.com/ezyang
2023-11-04 19:36:27 +00:00
854882bbf4 Add test for init_process_group timeout (#112803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112803
Approved by: https://github.com/H-Huang
ghstack dependencies: #112611
2023-11-04 19:10:41 +00:00
247b5bdbb5 [dynamo (easy)] Add skip reason to debug logs (#112869)
Fixes https://github.com/pytorch/pytorch/issues/112867

Example logs
```
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: helper (reason: in skipfiles, file: /usr/lib/python3.10/contextlib.py)
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: __init__ (reason: in skipfiles, file: /usr/lib/python3.10/contextlib.py)
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: __enter__ (reason: in skipfiles, file: /usr/lib/python3.10/contextlib.py)
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: backend_cache_wrapper (reason: in skipfiles, file: /home/jonch/Desktop/Programming/mlsys/pytorch/torch/_dynamo/eval_frame.py)
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: _maybe_init_guarded_backend_cache (reason: in skipfiles, file: /home/jonch/Desktop/Programming/mlsys/pytorch/torch/_dynamo/eval_frame.py)
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: innermost_fn (reason: in skipfiles, file: /home/jonch/Desktop/Programming/mlsys/pytorch/torch/_dynamo/eval_frame.py)
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: _set_current_backend (reason: in skipfiles, file: /home/jonch/Desktop/Programming/mlsys/pytorch/torch/_dynamo/eval_frame.py)
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: __init__ (reason: in skipfiles, file: /usr/lib/python3.10/contextlib.py)
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: __enter__ (reason: in skipfiles, file: /usr/lib/python3.10/contextlib.py)
[2023-11-03 12:51:02,230] torch._dynamo.eval_frame: [DEBUG] skipping: enable_dynamic (reason: in skipfiles, file: /home/jonch/Desktop/Programming/mlsys/pytorch/torch/_dynamo/eval_frame.py)
[2023-11-03 12:51:02,247] [0/0] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing fn /home/jonch/Desktop/sdpa.py:1635
[2023-11-03 12:51:02,248] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /home/jonch/Desktop/sdpa.py:1635 in fn (fn)
[2023-11-03 12:51:02,248] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]     def fn(x):
[2023-11-03 12:51:02,313] [0/0] torch._dynamo.output_graph: [DEBUG] create_graph_input L_x_ L['x']
[2023-11-03 12:51:02,314] [0/0] torch._dynamo.variables.builder: [DEBUG] wrap_to_fake L['x'] (3,) [<DimDynamic.STATIC: 2>] [None]
[2023-11-03 12:51:02,314] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /home/jonch/Desktop/sdpa.py:1636 in fn (fn)
[2023-11-03 12:51:02,314] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]         x = x + 1
[2023-11-03 12:51:02,314] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST x []

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112869
Approved by: https://github.com/jansel
2023-11-04 18:08:42 +00:00
d5fff7338e BUG: gracefully fall back to numpy.random if asked in dynamo.config (#109205)
Graph break if `config.use_numpy_random_stream=True` instead of a hard failure in inductor.
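
With the flag set, a compiled function that calls into `np.random` now graph-breaks back to eager
NumPy instead of failing hard in inductor ( a sketch ):

```python
import numpy as np
import torch
import torch._dynamo

# Opt in to NumPy's own random stream; np.random calls then cause a graph
# break ( eager fallback ) rather than a hard inductor failure.
torch._dynamo.config.use_numpy_random_stream = True

@torch.compile
def f(x):
    noise = np.random.randn(*x.shape).astype(np.float32)
    return x + torch.from_numpy(noise)

print(f(torch.zeros(3)))
```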

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109205
Approved by: https://github.com/lezcano
2023-11-04 14:54:05 +00:00
9af3f98faf [DTensor] Fix DTensor.from_local() returns DTensor with wrong size for uneven sharded tensor (#110781)
Fixes #110762

This PR:
fixes the issue described in #110762 by adding kwargs for shape and stride when creating a DTensor using `DTensor.from_local()`. When `shape` and `stride` are provided, we skip the calculation of `tensor_shape` and `tensor_stride` via `compute_global_tensor_info()`, as `compute_global_tensor_info()` always assumes even sharding.
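
A sketch of the new call shape ( assumes a 2-rank job launched with `torchrun --nproc-per-node 2`;
the kwarg names follow this PR's description ):

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DTensor, DeviceMesh, Shard

dist.init_process_group("gloo")
rank = dist.get_rank()
mesh = DeviceMesh("cpu", [0, 1])

# Global tensor is (5, 4), sharded unevenly on dim 0: rank 0 holds 3 rows,
# rank 1 holds 2. Passing shape/stride skips compute_global_tensor_info(),
# which assumes even sharding.
local = torch.randn(3 if rank == 0 else 2, 4)
dt = DTensor.from_local(local, mesh, [Shard(0)], shape=(5, 4), stride=(4, 1))
assert dt.shape == (5, 4)
```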

Test plan:
```
python3 test/distributed/_tensor/test_dtensor.py -k test_from_local_uneven_sharding
python3 test/distributed/_tensor/test_dtensor.py -k test_from_local_uneven_sharding_raise_error
```

cc. @wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110781
Approved by: https://github.com/wanchaol
2023-11-04 11:21:10 +00:00
add78ac425 Fix a type error in AppendOnlyList (#112362)
AppendOnlyList::emplace_back allocates an array and overwrites the first slot, which is unsafe on a non-trivial type. This PR fixes it and adds other checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112362
Approved by: https://github.com/aaronenyeshi
2023-11-04 07:06:42 +00:00
ad894cd072 [Cmake] Check that gcc-9.4 or newer is used (#112858)
As this is the oldest gcc that is fully compatible with the C++17 standard.
- Replace a number of conditional version checks with the simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As the `-Wsuggest-override` condition was hidden behind an incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved`, `caffe2::python::TensorFeeder::Feed` and `caffe2::NetObserverReporterPrint::report`

Fixes https://github.com/pytorch/pytorch/issues/101839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-11-04 05:40:08 +00:00
dfb26d5999 Reland "Symintify repeat_interleave (#109133)" (#112726)
This reverts commit 08dbfecdbdf2af6f66b3226881c71d8977431197.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112726
Approved by: https://github.com/albanD
2023-11-04 05:15:55 +00:00
596dab4277 [DeviceMesh] Remove _validate_mesh from device_mesh.py (#112928)
Plan B for https://github.com/pytorch/pytorch/pull/112839

Motivation for the change:
1. We need to remove `funcol` as a dependency for device_mesh.py to resolve circular dependency issues when introducing device_mesh as an arg for DDP. In the meantime, we should not go from funcol to non-funcol as @voznesenskym suggested. Therefore, we want to remove this all_gather check completely.
2. For large scale, it would not make sense to validate the mesh at global scale anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112928
Approved by: https://github.com/wanchaol
2023-11-04 05:12:27 +00:00
fb044e2b17 [aot_autograd] Check that autocast states are never mutated by graphs passed to AOTAutograd (#112822)
Fixes https://github.com/pytorch/pytorch/issues/112659

As explained in https://github.com/pytorch/pytorch/pull/112396, Dynamo will never pass a graph to AOTAutograd that mutates autocast state.

If it is not needed, we do not want to support mutation wrappers for autocast state like for grad mode (https://github.com/pytorch/pytorch/pull/112396)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112822
Approved by: https://github.com/bdhirsh
2023-11-04 03:29:55 +00:00
0ac748cd29 Make pattern-matcher failure diagnostics lazy (again) and added an error message if format string is too long (#112923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112923
Approved by: https://github.com/eellison
ghstack dependencies: #112476
2023-11-04 02:54:17 +00:00
418c5206ec Make test_distributed_spawn.py tell you how to run it correctly (#112924)
Sample output if incorrect/missing args are specified:

```
RuntimeError: Missing expected env vars for `test_distributed_spawn.py`.  Please
ensure to specify the following:
'BACKEND' = one of ('gloo', 'nccl', 'ucc')
'WORLD_SIZE' = int >= 2
'TEMP_DIR' specifying a directory containing a barrier file named
'barrier'.

e.g.
touch /tmp/barrier && TEMP_DIR=/tmp BACKEND='nccl' WORLD_SIZE=2 python
/data/users/whc/pytorch/test/distributed/test_distributed_spawn.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112924
Approved by: https://github.com/wanchaol
2023-11-04 02:43:43 +00:00
b4ce501137 [Inductor] [Quant] Re-structure Quantization testcase pattern matcher check (#112570)
**Summary**
This Diff re-structures the Quantization testcase pattern matcher check. Instead of checking all the patterns matched in Inductor, we will only check the core pattern match count and node numbers, such as: dequant promotion, QConv/Linear Unary and QConv Binary.

**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_q
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112570
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-11-04 01:11:34 +00:00
042445b7d3 Add new Macro to count ops and time lazy tracing (#112679)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112679
Approved by: https://github.com/alanwaketan
2023-11-04 00:40:29 +00:00
075cb6bab6 [pytorch-vulkan] slices to support zero-size output (#112879)
Summary: With D50030659, we are able to support zero-size tensor. Hence remove the check in slice. Also update related tests.

Test Plan:
```
[yipjustin@189650.od ~/fbsource (876ab81e3)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*slice*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/ops/Slice.cpp
1 additional file change events
Buck UI: https://www.internalfb.com/buck2/85adf6a3-7d17-4685-8d8b-a0b600df0b73
Network: Up: 44KiB  Down: 1.3MiB  (reSessionID-5afd53d4-0303-4f4d-a245-1eb810308fd3)
Jobs completed: 6. Time elapsed: 22.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 1, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *slice*
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.slice_width_success
[       OK ] VulkanAPITest.slice_width_success (160 ms)
[ RUN      ] VulkanAPITest.slice_height_success
[       OK ] VulkanAPITest.slice_height_success (6 ms)
[ RUN      ] VulkanAPITest.slice_feature_success
[       OK ] VulkanAPITest.slice_feature_success (84 ms)
[ RUN      ] VulkanAPITest.slice_batch_success
[       OK ] VulkanAPITest.slice_batch_success (5 ms)
[ RUN      ] VulkanAPITest.slice_zero_sized
[       OK ] VulkanAPITest.slice_zero_sized (0 ms)
[ RUN      ] VulkanAPITest.slice_invalidinputs_exceptions
[       OK ] VulkanAPITest.slice_invalidinputs_exceptions (0 ms)
[----------] 6 tests from VulkanAPITest (257 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (257 ms total)
[  PASSED  ] 6 tests.
```

Differential Revision: D50961979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112879
Approved by: https://github.com/manuelcandales
2023-11-04 00:22:38 +00:00
62cbe86ac0 [torch] Skip the assertion on the return type when the annotation is a forward reference (#112870)
Summary:
The assertion is causing build failures when running Pysa, our security-focused static analyzer.
This is because we run `pyre infer` on the source code before analyzing it, which introduces annotations such as `def foo() -> 'torch._tensor.Tensor'`.
This does not work with the `out_wrapper` decorator which relies on inspecting the signature of the decorated function.
Let's skip the check on the return type if we detect that it was introduced by `pyre infer`.

Test Plan: eyes

Differential Revision: D50976601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112870
Approved by: https://github.com/ZainRizvi
2023-11-04 00:22:13 +00:00
e36dba3a94 [Cutlass 3.2.2 submodule upgrade] Adapt Inductor cutlass backend to Cutlass 3.2.2 (#112762)
The inductor cutlass backend was written against Cutlass version 3.1.x,
there are some incompatible changes in Cutlass 3.2.2 which the
Inductor cutlass backend needs to adapt to.

Test plan:

If third_party/cutlass is upgraded to Cutlass tag v3.2.2,
several tests within test/inductor/test_max_autotune.py start to
fail. With this diff applied, they pass again.

Differential Revision: [D50986555](https://our.internmc.facebook.com/intern/diff/D50986555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112762
Approved by: https://github.com/ipiszy, https://github.com/drisspg
2023-11-04 00:10:50 +00:00
8f10a2321d [pytorch-vulkan] log, log_softmax (#112828)
Summary: tsia

Test Plan:
```
[yipjustin@189650.od ~/fbsource (631468db3)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*softmax*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/ops/Softmax.cpp
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
1 additional file change events
Buck UI: https://www.internalfb.com/buck2/d4f62e52-aba9-448a-a181-cf8881affb14
Network: Up: 0B  Down: 0B
Jobs completed: 4. Time elapsed: 0.5s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *softmax*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.softmax
[       OK ] VulkanAPITest.softmax (467 ms)
[ RUN      ] VulkanAPITest.log_softmax
[       OK ] VulkanAPITest.log_softmax (95 ms)
[----------] 2 tests from VulkanAPITest (563 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (563 ms total)
[  PASSED  ] 2 tests.

  YOU HAVE 1 DISABLED TEST

[yipjustin@189650.od ~/fbsource (631468db3)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*log*"
Buck UI: https://www.internalfb.com/buck2/e8210eb5-fd56-45f7-bf6c-5024931e778e
Network: Up: 0B  Down: 0B
Jobs completed: 4. Time elapsed: 0.2s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.log_softmax
[       OK ] VulkanAPITest.log_softmax (572 ms)
[ RUN      ] VulkanAPITest.unary_op_log
[       OK ] VulkanAPITest.unary_op_log (0 ms)
[ RUN      ] VulkanAPITest.unary_op_log_
[       OK ] VulkanAPITest.unary_op_log_ (59 ms)
[ RUN      ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7677: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 4 tests from VulkanAPITest (633 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (633 ms total)
[  PASSED  ] 3 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 1 DISABLED TEST
```
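
For context, a minimal usage sketch of the newly supported ops (assumes a PyTorch build with Vulkan enabled; `.to("vulkan")` is the standard way to exercise this backend):

```
import torch

# Requires a build with USE_VULKAN=1; otherwise .to("vulkan") will fail.
x = torch.rand(2, 3)
x_vk = x.to("vulkan")

y = torch.log(x_vk)                                # newly added unary op
z = torch.nn.functional.log_softmax(x_vk, dim=-1)  # newly added op

# Copy back to CPU and compare against the reference implementations.
torch.testing.assert_close(y.cpu(), torch.log(x))
torch.testing.assert_close(z.cpu(), torch.nn.functional.log_softmax(x, dim=-1))
```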

Differential Revision: D50961359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112828
Approved by: https://github.com/manuelcandales
2023-11-04 00:08:44 +00:00
df149581bc Tabulate outputs in inference benchmark (#112900)
- Fix error where the script was always compiling the model
- Make `runner.sh` parse outputs into a nice `.md` format (a minimal sketch follows below)
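
A minimal sketch of the tabulation step (the row data and field names are illustrative, not the actual benchmark output):

```
# Hypothetical benchmark rows; the real script parses these from runner output.
rows = [
    {"model": "resnet50", "mode": "eager", "latency_ms": 12.3},
    {"model": "resnet50", "mode": "compiled", "latency_ms": 7.9},
]

headers = list(rows[0])
lines = ["| " + " | ".join(headers) + " |",
         "| " + " | ".join("---" for _ in headers) + " |"]
for r in rows:
    lines.append("| " + " | ".join(str(r[h]) for h in headers) + " |")

print("\n".join(lines))  # ready to paste into a .md file
```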

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112900
Approved by: https://github.com/albanD
ghstack dependencies: #112582, #112863
2023-11-03 23:53:30 +00:00
6ba2748690 [Quant] [PT2] Enable Decomposed quant per tensor/channel to accept bfloat16 input (#112225)
**Summary**
- PR 4 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable decomposed `quant_per_tensor` and `quant_per_channel` to accept bfloat16 input (a usage sketch follows the test plan).

**TestPlan**
```
python -m pytest test_quantized_tensor.py -k test_decomposed_quantize_per_tensor_bfloat16_input
python -m pytest test_quantized_tensor.py -k test_decomposed_quantize_per_channel_bfloat16_input
```
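
A usage sketch of what this enables (the op namespace and signature are taken from the decomposed quant ops under `torch/ao/quantization`, but treat them as an assumption since they are internal):

```
import torch

x = torch.randn(4, 4, dtype=torch.bfloat16)  # bfloat16 input now accepted
# quantize_per_tensor(input, scale, zero_point, quant_min, quant_max, dtype)
xq = torch.ops.quantized_decomposed.quantize_per_tensor(
    x, 0.05, 0, -128, 127, torch.int8
)
print(xq.dtype)  # torch.int8
```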

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112225
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-11-03 23:47:43 +00:00
67e8762e83 [Inductor] Kill has_aliasing (#112875)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112875
Approved by: https://github.com/Chillee
2023-11-03 23:22:22 +00:00
65b74c9254 Make init_process_group timeout kwarg override pg_options (#112611)
This used to be ambiguous: the `pg_options._timeout` value, if passed
in, was being ignored. Make the behavior sane and warn if two values are provided.
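
A sketch of the now-unambiguous call (standard `torch.distributed` API; the warning on conflicting values is what this change adds):

```
from datetime import timedelta

import torch.distributed as dist

# Assumes the usual rendezvous env vars (MASTER_ADDR, MASTER_PORT, RANK,
# WORLD_SIZE) are set by the launcher.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # this kwarg now reliably takes effect;
                                    # passing a pg_options timeout as well warns
)
```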
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112611
Approved by: https://github.com/H-Huang
2023-11-03 23:13:03 +00:00
fa81237af7 [HigherOrderOp] remove _deprecated_global_ns (#112757)
As titled.

Test Plan:
existing test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112757
Approved by: https://github.com/zou3519
2023-11-03 23:03:18 +00:00
55971c5c4e Enable concurrent reader for getRecord function (#112818)
Summary:
Use multiple concurrent readers to access a record starting from different indices. This can provide better performance when the data being accessed is large.
bypass-github-pytorch-ci-checks
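
A generic sketch of the idea using Python threads (not the actual C++ reader; names are illustrative):

```
import threading

def _read_chunk(path, offset, length, out, idx):
    # Each reader opens its own handle so seeks do not interfere.
    with open(path, "rb") as f:
        f.seek(offset)
        out[idx] = f.read(length)

def concurrent_read(path, total_size, num_readers=4):
    chunk = (total_size + num_readers - 1) // num_readers
    out = [b""] * num_readers
    threads = []
    for i in range(num_readers):
        length = max(0, min(chunk, total_size - i * chunk))
        t = threading.Thread(target=_read_chunk,
                             args=(path, i * chunk, length, out, i))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return b"".join(out)
```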

Test Plan:
```
buck2 run @//mode/dev //caffe2/caffe2/serialize:inline_container_test
```

Reviewed By: YazhiGao

Differential Revision: D50957607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112818
Approved by: https://github.com/houseroad, https://github.com/huydhn
2023-11-03 22:55:27 +00:00
57a3af900e Add suggested changes to init.py (#112864)
A follow-up of PR #112617 on issue #112596.

Added suggested changes from the review:
- Be more specific about the type of uniform and normal distribution used.

```py
def xavier_uniform_(tensor: Tensor, gain: float = 1.) -> Tensor:
    r"""Fill the input `Tensor` with values using a Xavier uniform distribution.

    The method is described in `Understanding the difficulty of training...
"""
```

```py
def kaiming_normal_(
    tensor: Tensor, a: float = 0, mode: str = 'fan_in', nonlinearity: str = 'leaky_relu'
):
    r"""Fill the input `Tensor` with values using a Kaiming normal distribution.

    The method is described in `Delving deep into rectifiers: Surpassing...
"""
```
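
For reference, a quick usage sketch of the two initializers whose docstrings changed (standard `torch.nn.init` API):

```
import torch
from torch import nn

w = torch.empty(3, 5)

# Xavier/Glorot: uniform distribution scaled by fan_in + fan_out.
nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain("relu"))

# Kaiming/He: normal distribution scaled by fan_in (the default mode).
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="leaky_relu")
```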

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112864
Approved by: https://github.com/kit1980
2023-11-03 22:46:48 +00:00
973f730dda [DCP] Add test for planner option for load_sharded_optimizer_state_dict (#112891)
Add test for a user-submitted PR: https://github.com/pytorch/pytorch/pull/112259
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112891
Approved by: https://github.com/fegin
2023-11-03 22:37:49 +00:00
63fc48257a Configure labeler for 'module: distributed' (#112812)
To opt in to getting notified based on this label, simply add yourself
to the summary field at the top of this issue:
https://github.com/pytorch/pytorch/issues/24422

Uses same file paths as current CODEOWNERS.

Note: we can easily add sub-labels for components within distributed if we
want.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112812
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-11-03 21:51:49 +00:00
6e1494ec7c correct output dir (#112760)
I was incorrectly overwriting the cudagraphs freezing dir for cudagraphs freezing autotune.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112760
Approved by: https://github.com/desertfire
2023-11-03 21:19:44 +00:00
f58ecd4823 docs: fix docstrings for datapipes and other (#112765)
Fixes #112636
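
For context, the `Dxxx` codes below are pydocstyle rules. A before/after sketch of a typical D205 fix (illustrative, not a specific file from this PR):

```
# Before: D205 -- no blank line between summary line and description.
def reset(self):
    """Reset the DataPipe to its initial state,
    so that it can be iterated again."""

# After: one-sentence summary, blank line, then the description.
def reset(self):  # redefinition is intentional in this sketch
    """Reset the DataPipe to its initial state.

    After this call the pipe can be iterated again from the beginning.
    """
```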

Before: 265
```
torch/utils/data/datapipes/dataframe/structures.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/dataframe/structures.py:8 in public class `DataChunkDF`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/dataframe/structures.py:8 in public class `DataChunkDF`:
        D208: Docstring is over-indented
torch/utils/data/datapipes/dataframe/structures.py:8 in public class `DataChunkDF`:
        D400: First line should end with a period (not ',')
torch/utils/data/datapipes/dataframe/structures.py:13 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/dataframe/structures.py:17 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/datapipe.py:43 in public class `IterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/datapipe.py:119 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:122 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:135 in public method `register_function`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:139 in public method `register_datapipe_as_function`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:161 in public method `__getstate__`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/datapipe.py:161 in public method `__getstate__`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/utils/data/datapipes/datapipe.py:171 in public method `__reduce_ex__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:180 in public method `set_getstate_hook`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:186 in public method `set_reduce_ex_hook`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:191 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:197 in public method `__str__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:203 in public method `__dir__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:208 in public method `reset`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/datapipe.py:208 in public method `reset`:
        D400: First line should end with a period (not ',')
torch/utils/data/datapipes/datapipe.py:217 in public class `DFIterDataPipe`:
        D101: Missing docstring in public class
torch/utils/data/datapipes/datapipe.py:223 in public class `MapDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/datapipe.py:261 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:274 in public method `register_function`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:278 in public method `register_datapipe_as_function`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:293 in public method `__getstate__`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/datapipe.py:293 in public method `__getstate__`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/utils/data/datapipes/datapipe.py:303 in public method `__reduce_ex__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:312 in public method `set_getstate_hook`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:318 in public method `set_reduce_ex_hook`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:323 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:329 in public method `__str__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:335 in public method `__dir__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:392 in public class `DataChunk`:
        D101: Missing docstring in public class
torch/utils/data/datapipes/datapipe.py:393 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/datapipe.py:397 in public method `as_str`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:401 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:404 in public method `raw_iterator`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/callable.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/callable.py:23 in public class `MapperIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/callable.py:23 in public class `MapperIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/callable.py:63 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/callable.py:121 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/callable.py:125 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/callable.py:173 in public class `CollatorIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/callable.py:213 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/combinatorics.py:18 in public class `SamplerIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combinatorics.py:29 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:44 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:47 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:56 in public class `ShufflerIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combinatorics.py:56 in public class `ShufflerIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combinatorics.py:56 in public class `ShufflerIterDataPipe`:
        D400: First line should end with a period (not 'r')
torch/utils/data/datapipes/iter/combinatorics.py:94 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:114 in public method `set_shuffle`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:118 in public method `set_seed`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:122 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:137 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:142 in public method `reset`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:150 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:165 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:179 in public method `__del__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/combining.py:26 in public class `ConcaterIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:26 in public class `ConcaterIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:26 in public class `ConcaterIterDataPipe`:
        D400: First line should end with a period (not 'l')
torch/utils/data/datapipes/iter/combining.py:44 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:51 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:55 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:64 in public class `ForkerIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:92 in public method `__new__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:108 in private class `_ContainerTemplate`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:108 in private class `_ContainerTemplate`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:108 in private class `_ContainerTemplate`:
        D400: First line should end with a period (not 'd')
torch/utils/data/datapipes/iter/combining.py:126 in private method `get_length_by_instance`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/data/datapipes/iter/combining.py:126 in private method `get_length_by_instance`:
        D400: First line should end with a period (not '`')
torch/utils/data/datapipes/iter/combining.py:136 in private class `_ForkerIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:136 in private class `_ForkerIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:136 in private class `_ForkerIterDataPipe`:
        D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/combining.py:275 in private class `_ChildDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:275 in private class `_ChildDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:275 in private class `_ChildDataPipe`:
        D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/combining.py:320 in private method `_set_main_datapipe_valid_iterator_id`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:343 in private method `_check_valid_iterator_id`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/data/datapipes/iter/combining.py:351 in public class `DemultiplexerIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:351 in public class `DemultiplexerIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:351 in public class `DemultiplexerIterDataPipe`:
        D400: First line should end with a period (not 'n')
torch/utils/data/datapipes/iter/combining.py:384 in public method `__new__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:399 in private class `_DemultiplexerIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:399 in private class `_DemultiplexerIterDataPipe`:
        D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/combining.py:534 in public class `MultiplexerIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:534 in public class `MultiplexerIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:534 in public class `MultiplexerIterDataPipe`:
        D400: First line should end with a period (not ',')
torch/utils/data/datapipes/iter/combining.py:549 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:553 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:566 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:572 in public method `reset`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:575 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:585 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:593 in public method `__del__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:599 in public class `ZipperIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:599 in public class `ZipperIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:615 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:622 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:626 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/filelister.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/filelister.py:15 in public class `FileListerIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/filelister.py:36 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/filelister.py:58 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/filelister.py:62 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/fileopener.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/fileopener.py:15 in public class `FileOpenerIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/fileopener.py:15 in public class `FileOpenerIterDataPipe`:
        D400: First line should end with a period (not 'm')
torch/utils/data/datapipes/iter/fileopener.py:42 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/fileopener.py:66 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/fileopener.py:69 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/grouping.py:31 in public class `BatcherIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/grouping.py:31 in public class `BatcherIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/grouping.py:31 in public class `BatcherIterDataPipe`:
        D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/grouping.py:55 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:68 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:79 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:91 in public class `UnBatcherIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/grouping.py:91 in public class `UnBatcherIterDataPipe`:
        D400: First line should end with a period (not 'l')
torch/utils/data/datapipes/iter/grouping.py:112 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:118 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:143 in public class `GrouperIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/grouping.py:143 in public class `GrouperIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/grouping.py:143 in public class `GrouperIterDataPipe`:
        D400: First line should end with a period (not ',')
torch/utils/data/datapipes/iter/grouping.py:185 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:233 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:257 in public method `reset`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/grouping.py:261 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:278 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:294 in public method `__del__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/routeddecoder.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/routeddecoder.py:19 in public class `RoutedDecoderIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/routeddecoder.py:19 in public class `RoutedDecoderIterDataPipe`:
        D400: First line should end with a period (not 'a')
torch/utils/data/datapipes/iter/routeddecoder.py:37 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/routeddecoder.py:53 in public method `add_handler`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/routeddecoder.py:56 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/routeddecoder.py:62 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/selecting.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/selecting.py:21 in public class `FilterIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/selecting.py:46 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/selecting.py:70 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/sharding.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/sharding.py:17 in public class `SHARDING_PRIORITIES`:
        D101: Missing docstring in public class
torch/utils/data/datapipes/iter/sharding.py:30 in public class `ShardingFilterIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/sharding.py:30 in public class `ShardingFilterIterDataPipe`:
        D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/sharding.py:39 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/sharding.py:47 in public method `apply_sharding`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/sharding.py:74 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/sharding.py:79 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/streamreader.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/streamreader.py:10 in public class `StreamReaderIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/streamreader.py:10 in public class `StreamReaderIterDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/streamreader.py:10 in public class `StreamReaderIterDataPipe`:
        D400: First line should end with a period (not 'l')
torch/utils/data/datapipes/iter/streamreader.py:27 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/streamreader.py:31 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/utils.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/utils.py:9 in public class `IterableWrapperIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/utils.py:29 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/utils.py:33 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/utils.py:49 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/callable.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/callable.py:14 in public function `default_fn`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/map/callable.py:20 in public class `MapperMapDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/callable.py:20 in public class `MapperMapDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/map/callable.py:45 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/callable.py:55 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/callable.py:58 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/combinatorics.py:15 in public class `ShufflerIterDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/combinatorics.py:55 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combinatorics.py:68 in public method `set_shuffle`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:72 in public method `set_seed`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:76 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:85 in public method `reset`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:92 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:95 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:110 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/combining.py:12 in public class `ConcaterMapDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/combining.py:12 in public class `ConcaterMapDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/map/combining.py:34 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combining.py:43 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:52 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:58 in public class `ZipperMapDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/combining.py:58 in public class `ZipperMapDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/map/combining.py:76 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combining.py:85 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:94 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/grouping.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/grouping.py:12 in public class `BatcherMapDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/grouping.py:12 in public class `BatcherMapDataPipe`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/map/grouping.py:12 in public class `BatcherMapDataPipe`:
        D400: First line should end with a period (not 's')
torch/utils/data/datapipes/map/grouping.py:34 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/grouping.py:47 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/grouping.py:60 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/utils.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/utils.py:9 in public class `SequenceWrapperMapDataPipe`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/utils.py:32 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/utils.py:45 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/utils.py:48 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/utils/common.py:26 in public function `validate_input_col`:
        D400: First line should end with a period (not 'n')
torch/utils/data/datapipes/utils/common.py:26 in public function `validate_input_col`:
        D401: First line should be in imperative mood (perhaps 'Check', not 'Checks')
torch/utils/data/datapipes/utils/common.py:127 in private function `_check_unpickable_fn`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/common.py:127 in private function `_check_unpickable_fn`:
        D400: First line should end with a period (not 'g')
torch/utils/data/datapipes/utils/common.py:127 in private function `_check_unpickable_fn`:
        D401: First line should be in imperative mood (perhaps 'Check', not 'Checks')
torch/utils/data/datapipes/utils/common.py:156 in public function `match_masks`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:170 in public function `get_file_pathnames_from_root`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:207 in public function `get_file_binaries_from_pathnames`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:220 in public function `validate_pathname_binary_tuple`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:290 in public class `StreamWrapper`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/utils/common.py:290 in public class `StreamWrapper`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/common.py:290 in public class `StreamWrapper`:
        D400: First line should end with a period (not 'y')
torch/utils/data/datapipes/utils/common.py:298 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/common.py:315 in public method `close_streams`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/data/datapipes/utils/common.py:331 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:335 in public method `close`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/common.py:351 in public method `autoclose`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/common.py:351 in public method `autoclose`:
        D400: First line should end with a period (not 's')
torch/utils/data/datapipes/utils/common.py:359 in public method `__dir__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:364 in public method `__del__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:368 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:371 in public method `__next__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:374 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:380 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:383 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/decoder.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/utils/decoder.py:31 in public function `basichandlers`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:87 in public function `handle_extension`:
        D202: No blank lines allowed after function docstring (found 1)
torch/utils/data/datapipes/utils/decoder.py:87 in public function `handle_extension`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/decoder.py:87 in public function `handle_extension`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/utils/data/datapipes/utils/decoder.py:115 in public class `ImageHandler`:
        D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/utils/decoder.py:115 in public class `ImageHandler`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/decoder.py:139 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:143 in public method `__call__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:187 in public function `imagehandler`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:194 in public function `videohandler`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:215 in public function `audiohandler`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:236 in public class `MatHandler`:
        D101: Missing docstring in public class
torch/utils/data/datapipes/utils/decoder.py:237 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:247 in public method `__call__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:253 in public function `mathandler`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:261 in public function `extension_extract_fn`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:270 in public class `Decoder`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/decoder.py:276 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:282 in public method `add_handler`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:292 in public method `decode1`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:309 in public method `decode`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:326 in public method `__call__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/snapshot.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/utils/snapshot.py:11 in private function `_simple_graph_snapshot_restoration`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/snapshot.py:11 in private function `_simple_graph_snapshot_restoration`:
        D400: First line should end with a period (not ',')
torch/utils/data/datapipes/utils/snapshot.py:11 in private function `_simple_graph_snapshot_restoration`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/utils/tensorboard/_convert_np.py:1 at module level:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/tensorboard/_convert_np.py:9 in public function `make_np`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/tensorboard/_convert_np.py:9 in public function `make_np`:
        D400: First line should end with a period (not ':')
265
```

After: 166
```
torch/utils/data/datapipes/dataframe/structures.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/dataframe/structures.py:10 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/dataframe/structures.py:14 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/datapipe.py:120 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:123 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:136 in public method `register_function`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:140 in public method `register_datapipe_as_function`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:173 in public method `__reduce_ex__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:182 in public method `set_getstate_hook`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:188 in public method `set_reduce_ex_hook`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:193 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:199 in public method `__str__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:205 in public method `__dir__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:221 in public class `DFIterDataPipe`:
        D101: Missing docstring in public class
torch/utils/data/datapipes/datapipe.py:266 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:279 in public method `register_function`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:283 in public method `register_datapipe_as_function`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:309 in public method `__reduce_ex__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:318 in public method `set_getstate_hook`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:324 in public method `set_reduce_ex_hook`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:329 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:335 in public method `__str__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:341 in public method `__dir__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:398 in public class `DataChunk`:
        D101: Missing docstring in public class
torch/utils/data/datapipes/datapipe.py:399 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/datapipe.py:403 in public method `as_str`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:407 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:410 in public method `raw_iterator`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/callable.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/callable.py:65 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/callable.py:123 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/callable.py:127 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/callable.py:216 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/combinatorics.py:30 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:45 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:48 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:97 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:117 in public method `set_shuffle`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:121 in public method `set_seed`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:125 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:140 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:145 in public method `reset`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:153 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:168 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:182 in public method `__del__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/combining.py:46 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:53 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:57 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:95 in public method `__new__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:388 in public method `__new__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:556 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:560 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:573 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:579 in public method `reset`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:582 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:592 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:600 in public method `__del__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:624 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:631 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:635 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/filelister.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/filelister.py:37 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/filelister.py:59 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/filelister.py:63 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/fileopener.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/fileopener.py:41 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/fileopener.py:65 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/fileopener.py:68 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/grouping.py:57 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:70 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:81 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:115 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:121 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:190 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:238 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:262 in public method `reset`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/grouping.py:266 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:283 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:299 in public method `__del__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/routeddecoder.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/routeddecoder.py:38 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/routeddecoder.py:54 in public method `add_handler`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/routeddecoder.py:57 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/routeddecoder.py:63 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/selecting.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/selecting.py:47 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/selecting.py:71 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/sharding.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/sharding.py:17 in public class `SHARDING_PRIORITIES`:
        D101: Missing docstring in public class
torch/utils/data/datapipes/iter/sharding.py:40 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/sharding.py:48 in public method `apply_sharding`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/iter/sharding.py:75 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/sharding.py:80 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/streamreader.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/streamreader.py:29 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/streamreader.py:33 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/utils.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/iter/utils.py:30 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/utils.py:34 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/utils.py:50 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/callable.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/callable.py:14 in public function `default_fn`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/map/callable.py:47 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/callable.py:57 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/callable.py:60 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/combinatorics.py:56 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combinatorics.py:69 in public method `set_shuffle`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:73 in public method `set_seed`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:77 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:86 in public method `reset`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:93 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:96 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:111 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/combining.py:36 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combining.py:45 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:54 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:80 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combining.py:89 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:98 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/grouping.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/grouping.py:36 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/grouping.py:49 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/grouping.py:62 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/utils.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/map/utils.py:33 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/map/utils.py:46 in public method `__getitem__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/map/utils.py:49 in public method `__len__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/utils/common.py:157 in public function `match_masks`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:171 in public function `get_file_pathnames_from_root`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:208 in public function `get_file_binaries_from_pathnames`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:221 in public function `validate_pathname_binary_tuple`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:300 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/common.py:331 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:335 in public method `close`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/common.py:356 in public method `__dir__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:361 in public method `__del__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:365 in public method `__iter__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:368 in public method `__next__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:371 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:377 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:380 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/decoder.py:1 at module level:
        D100: Missing docstring in public module
torch/utils/data/datapipes/utils/decoder.py:31 in public function `basichandlers`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:141 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:145 in public method `__call__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:189 in public function `imagehandler`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:196 in public function `videohandler`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:217 in public function `audiohandler`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:238 in public class `MatHandler`:
        D101: Missing docstring in public class
torch/utils/data/datapipes/utils/decoder.py:239 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:249 in public method `__call__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:255 in public function `mathandler`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:263 in public function `extension_extract_fn`:
        D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:279 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:285 in public method `add_handler`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:295 in public method `decode1`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:312 in public method `decode`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:329 in public method `__call__`:
        D102: Missing docstring in public method
torch/utils/data/datapipes/utils/snapshot.py:1 at module level:
        D100: Missing docstring in public module
166
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112765
Approved by: https://github.com/ejguan
2023-11-03 21:01:19 +00:00
132cb57e47 Skip aliasing correction for lift_fresh. (#112202)
Fix: #111506

This PR skips aliasing correction on `lift_fresh` calls. Reasoning is: although unlifted and lifted tensors are technically aliases, they are from different levels of abstraction (`FunctionalTensorWrapper` and `XLATensor`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112202
Approved by: https://github.com/bdhirsh
2023-11-03 20:46:30 +00:00
c799689437 Refactor inference benchmark and add runner script to do sweep (#112863)
- Added `runner.sh` that does a sweep over `batch_size=(1, 32, 64, 128, 256)` and `compile=(True, False)`
- Added GPU utilization as a metric
- Converted the frontend from 2 processes (one putting requests into `request_queue`, one reading from `response_queue` and collecting metrics) to a single process with 3 threads: one putting requests into `request_queue`, one reading from `response_queue` and collecting metrics, and one polling `nvidia-smi` for GPU utilization (see the sketch below)
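
A stdlib-only sketch of that thread layout, with a stand-in backend so it runs end to end (names and counts here are illustrative, not the benchmark's actual code):

```python
import queue
import threading

request_queue: queue.Queue = queue.Queue()
response_queue: queue.Queue = queue.Queue()

def send_requests(n):
    for i in range(n):
        request_queue.put(i)                     # thread 1: feed requests

def collect_metrics(n):
    for _ in range(n):
        _ = response_queue.get()                 # thread 2: gather responses/metrics

def poll_gpu(stop: threading.Event):
    while not stop.wait(0.5):
        pass                                     # thread 3: shell out to nvidia-smi here

def fake_backend(n):                             # stand-in for the model server
    for _ in range(n):
        response_queue.put(request_queue.get())

stop = threading.Event()
workers = [
    threading.Thread(target=send_requests, args=(8,)),
    threading.Thread(target=fake_backend, args=(8,)),
    threading.Thread(target=collect_metrics, args=(8,)),
    threading.Thread(target=poll_gpu, args=(stop,)),
]
for t in workers:
    t.start()
for t in workers[:3]:
    t.join()
stop.set()
workers[3].join()
```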

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112863
Approved by: https://github.com/albanD
ghstack dependencies: #112582
2023-11-03 20:26:43 +00:00
cyy
dc1a3581e4 Remove c10::variant (#112725)
Maybe it's time to remove.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112725
Approved by: https://github.com/albanD
2023-11-03 18:31:58 +00:00
a91baaf314 [aotinductor] Solves a problem where a tensor is returned more than once (#112177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112177
Approved by: https://github.com/zhxchen17
2023-11-03 18:26:08 +00:00
a3db4377eb docs: Fix some docstring errors in torch.nn.utils parametrize/spectral_norm/stateless (#112786)
Fixes https://github.com/pytorch/pytorch/issues/112630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112786
Approved by: https://github.com/lezcano
2023-11-03 18:19:43 +00:00
d084a024ae [easy] skipIfTorchInductor - use condition variable (#112774)
Fixes #112465
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112774
Approved by: https://github.com/jon-chuang, https://github.com/aaronenyeshi
2023-11-03 17:55:32 +00:00
2c3ab60506 [profiler] skip flop compute for Nested tensor (#112767)
Summary:
Since nested tensors don't have size(), the profiler throws an exception in saveExtraArgs() when with_flops is turned on.

It is tricky to support flop computation for nested tensors because they have dynamic shapes, so skip the flop computation for nested tensors for now instead of throwing an exception.

Test Plan:
Used the profiler with an NT; the log shows this warning instead of throwing.
```/torch/nested/_internal/nested_tensor.py:205: UserWarning: Failed to save extra arguments for flops computation of op aten::add with input[0] as nested tensor. (Triggered internally at fbcode/caffe2/torch/csrc/profiler/util.cpp:433.)```
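
A minimal sketch that exercises this path (shapes are illustrative):

```python
import torch
from torch.profiler import ProfilerActivity, profile

nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
with profile(activities=[ProfilerActivity.CPU], with_flops=True):
    nt + nt  # flop counting is now skipped with a warning instead of raising
```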

Differential Revision: D50919789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112767
Approved by: https://github.com/aaronenyeshi
2023-11-03 17:44:00 +00:00
43fb5147e2 [BE] Enable Ruff's Flake8 PYI001 (#112823)
Enable [unprefixed-type-param (PYI001)](https://docs.astral.sh/ruff/rules/unprefixed-type-param/#unprefixed-type-param-pyi001)

Link: #110950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112823
Approved by: https://github.com/Skylion007
2023-11-03 17:25:39 +00:00
e2e5897269 [CI] Do not use packaging in run_tests.py (#112873)
It used to check that CUDA is newer than 11.6, but all supported CUDA versions already are.

Yet another mitigation for the missing `packaging` module on macOS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112873
Approved by: https://github.com/huydhn
2023-11-03 17:22:46 +00:00
feb479757f Make addc[mul|div] support different out dtypes (#112682)
By adding `.cast_common_dtype_to_outputs(true)` to `build_ternary_op`.
According to profiling, this change does not result in an additional kernel invocation on GPUs, i.e. the following script
```python
import torch
def bench_addcdiv(size=(32*1024**2, 5), device="cuda"):
    x=torch.rand(size, device=device, dtype=torch.float)
    y=torch.rand(size, device=device, dtype=torch.double)
    with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
      torch.addcdiv(x, x, x, out=y)
    rc=prof.key_averages()
    print(rc)

if __name__ == "__main__":
    bench_addcdiv()
```
shows that before and after the change it takes roughly the same time to finish the computation.
Before:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        92.99%      20.096ms        92.99%      20.096ms      20.096ms       0.000us         0.00%       0.000us       0.000us             1
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       1.605ms       100.00%       1.605ms       1.605ms             1
                                  cudaDeviceSynchronize         7.01%       1.515ms         7.01%       1.515ms       1.515ms       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 21.611ms
Self CUDA time total: 1.605ms
```
After:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        92.92%      19.996ms        92.92%      19.996ms      19.996ms       0.000us         0.00%       0.000us       0.000us             1
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       1.603ms       100.00%       1.603ms       1.603ms             1
                                  cudaDeviceSynchronize         7.08%       1.523ms         7.08%       1.523ms       1.523ms       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 21.519ms
Self CUDA time total: 1.603ms
```
Add regression test.

Fixes https://github.com/pytorch/pytorch/issues/112490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112682
Approved by: https://github.com/albanD
2023-11-03 17:03:06 +00:00
028e4fc6fa Add packaging to requirements-macOS.txt (#112854)
Fixes https://github.com/pytorch/pytorch/issues/102299 and https://github.com/pytorch/pytorch/issues/112832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112854
Approved by: https://github.com/DanilBaibak, https://github.com/huydhn
2023-11-03 16:55:12 +00:00
44a28a5efa [DCP][test] Make dim_0 size of params scale with world_size in torch/distributed/checkpoint/test_fsdp_optim_state.py (#112825)
Make dim_0 size of params scale with world_size so it can be used to test the impact on performance when scaling up. More context of performance improvement is added in: https://github.com/pytorch/pytorch/pull/111687

For this cherry-pick pair, we remove the `_shard_tensor()` call in `load_sharded_optimizer_state_dict()` in optimizer.py, which is reported to scale poorly with the number of GPUs. The reason is that `_shard_tensor()` calls into `dist.all_gather_object()`, which is extremely expensive in communication when world_size becomes large.

main: https://github.com/pytorch/pytorch/pull/111096
cherry-pick: https://github.com/pytorch/pytorch/pull/111687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112825
Approved by: https://github.com/fegin
2023-11-03 16:37:57 +00:00
fd6e571207 [aot_autograd / dynamo] restore grad_mode and other globals to state prior to tracing; add grad_mode mutations to runtime wrapper (#112396)
Fixes https://github.com/pytorch/pytorch/issues/112072

Grad mode mutations, which are the responsibility of aotautograd, need to be persisted outside of the graph as side-effects in the runtime wrapper.

To facilitate this, and to maintain global state hygiene, we restore the grad mode to its value prior to tracing, for both dynamo (alongside other global states) and aot_autograd.

This is in line with the assumption that aot_autograd should work as though it were called from eager, before the given GraphModule has been run.

It is assumed that other global states (autocast mode, torch function) already maintain hygiene via their context manager APIs.

---

### Future Work?

Should we also do this for:
1. autocast mode
2. torch_function_enabled

Answer: no. (at least at present)

It is assumed that other global states (autocast mode, torch function) already maintain hygiene via their context manager APIs.

Furthermore, mutating this state directly is currently unsupported in dynamo, unlike `set_grad_enabled`

Repro:
```python
import torch
def fn(x):
    x = x + 1
    torch.set_autocast_enabled(True)
    return x + 1

print(torch.compile(fn, fullgraph=True)(torch.zeros(1)))

# torch._dynamo.exc.Unsupported: call_method UserDefinedObjectVariable(set_autocast_enabled) __call__ [ConstantVariable(bool)] {}
```

```python
import torch
def fn(x):
    x = x + 1
    torch.overrides.BaseTorchFunctionMode.__enter__()
    return x + 1, torch._C._is_torch_function_enabled()

print(torch.compile(fn, fullgraph=True)(torch.zeros(1)))

# torch._dynamo.exc.Unsupported: 'call_function TorchFunctionMode.__enter__ in skip_files /home/jonch/Desktop/Programming/mlsys/pytorch/torch/overrides.py, skipped according skipfiles.SKIP_DIRS'
```

~~I believe 1. is clearly yes - even if it is a corner case (autocast only has ctx manager public API, while dynamo will always emit ctx manager exits before compiling the graph, so one needs to use the internal _enter_autocast API to directly perform a mutation).~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112396
Approved by: https://github.com/bdhirsh
2023-11-03 16:14:09 +00:00
001573b687 [Inductor] Support one node creating multiple mutations in scheduler (#112547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112547
Approved by: https://github.com/Chillee
2023-11-03 16:01:31 +00:00
cyy
21bc37fad8 [5/N] Apply clang-tidy to aten/src/ATen/core (#112219)
Enlarge clang-tidy coverage to aten/src/ATen/core/* files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112219
Approved by: https://github.com/Skylion007
2023-11-03 15:51:21 +00:00
3e2c9410e1 Fix docstring errors in memory.py, nvtx.py (#112751)
Fixes #112590

Fixed docstring errors in `torch/cuda/memory.py` and `torch/cuda/nvtx.py`.

memory.py
Before
```
torch/cuda/memory.py:1 at module level:
        D100: Missing docstring in public module
torch/cuda/memory.py:67 in public function `caching_allocator_alloc`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
torch/cuda/memory.py:103 in public function `caching_allocator_delete`:
        D401: First line should be in imperative mood (perhaps 'Delete', not 'Deletes')
torch/cuda/memory.py:122 in public function `set_per_process_memory_fraction`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:148 in public function `empty_cache`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:148 in public function `empty_cache`:
        D400: First line should end with a period (not 'g')
torch/cuda/memory.py:163 in public function `memory_stats`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:163 in public function `memory_stats`:
        D400: First line should end with a period (not 'a')
torch/cuda/memory.py:163 in public function `memory_stats`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:264 in public function `memory_stats_as_nested_dict`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:272 in public function `reset_accumulated_memory_stats`:
        D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:292 in public function `reset_peak_memory_stats`:
        D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
        D400: First line should end with a period (not 'y')
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
        D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
        D400: First line should end with a period (not 'e')
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
        D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:365 in public function `memory_allocated`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:365 in public function `memory_allocated`:
        D400: First line should end with a period (not 'n')
torch/cuda/memory.py:365 in public function `memory_allocated`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
        D400: First line should end with a period (not 'n')
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:405 in public function `memory_reserved`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:405 in public function `memory_reserved`:
        D400: First line should end with a period (not 's')
torch/cuda/memory.py:405 in public function `memory_reserved`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
        D400: First line should end with a period (not 's')
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:443 in public function `memory_cached`:
        D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:452 in public function `max_memory_cached`:
        D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:461 in public function `memory_snapshot`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:474 in public function `memory_summary`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:474 in public function `memory_summary`:
        D400: First line should end with a period (not 'r')
torch/cuda/memory.py:474 in public function `memory_summary`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
        D202: No blank lines allowed after function docstring (found 1)
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
        D400: First line should end with a period (not 's')
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:648 in public function `mem_get_info`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:648 in public function `mem_get_info`:
        D400: First line should end with a period (not 'n')
torch/cuda/memory.py:648 in public function `mem_get_info`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:684 in private function `_record_memory_history`:
        D202: No blank lines allowed after function docstring (found 1)
torch/cuda/memory.py:684 in private function `_record_memory_history`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:684 in private function `_record_memory_history`:
        D400: First line should end with a period (not 'y')
torch/cuda/memory.py:684 in private function `_record_memory_history`:
        D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
torch/cuda/memory.py:742 in private function `_snapshot`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:742 in private function `_snapshot`:
        D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
torch/cuda/memory.py:818 in private function `_dump_snapshot`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:818 in private function `_dump_snapshot`:
        D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
        D400: First line should end with a period (not 'y')
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:894 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/memory.py:904 in public function `change_current_allocator`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:904 in public function `change_current_allocator`:
        D401: First line should be in imperative mood (perhaps 'Change', not 'Changes')
torch/cuda/memory.py:917 in private function `_get_current_allocator`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
58
```
After
```
torch/cuda/memory.py:151 in public function `empty_cache`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:151 in public function `empty_cache`:
        D400: First line should end with a period (not 'g')
torch/cuda/memory.py:439 in public function `memory_cached`:
        D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:448 in public function `max_memory_cached`:
        D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:676 in private function `_record_memory_history`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:676 in private function `_record_memory_history`:
        D400: First line should end with a period (not 'y')
torch/cuda/memory.py:841 in public function `get_allocator_backend`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:841 in public function `get_allocator_backend`:
        D400: First line should end with a period (not 'y')
8
```

nvtx.py
Before
```
torch/cuda/nvtx.py:1 at module level:
        D100: Missing docstring in public module
torch/cuda/nvtx.py:24 in public function `range_push`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:24 in public function `range_push`:
        D400: First line should end with a period (not 'd')
torch/cuda/nvtx.py:35 in public function `range_pop`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:35 in public function `range_pop`:
        D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:43 in public function `range_start`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:43 in public function `range_start`:
        D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:81 in public function `range`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:81 in public function `range`:
        D400: First line should end with a period (not 'g')
9
```
After
```
torch/cuda/nvtx.py:41 in public function `range_start`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:41 in public function `range_start`:
        D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:79 in public function `range`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:79 in public function `range`:
        D400: First line should end with a period (not 'g')
4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112751
Approved by: https://github.com/kit1980
2023-11-03 15:19:17 +00:00
29716e865c Enforce both input tensor shapes of CosineEmbeddingLoss to be equal. (#112782)
Added a test to prevent regressions.
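
A minimal sketch of the kind of mismatch that is now rejected (sizes are illustrative):

```python
import torch

loss = torch.nn.CosineEmbeddingLoss()
x1 = torch.randn(4, 8)
x2 = torch.randn(8)        # broadcastable, but not equal in shape
target = torch.ones(4)
loss(x1, x2, target)       # now raises instead of silently broadcasting
```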

Fixes #112732.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112782
Approved by: https://github.com/lezcano
2023-11-03 15:15:06 +00:00
2337d8d062 Use OpOverload instead of OpOverloadPacket for size/stride/etc slots (#112119)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112119
Approved by: https://github.com/yanboliang
2023-11-03 13:54:41 +00:00
7f143d7ef5 [aotinductor] Allow specifying a .so name in the aot_inductor.output_path config (#112651)
Differential Revision: [D50902585](https://our.internmc.facebook.com/intern/diff/D50902585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112651
Approved by: https://github.com/chenyang78
2023-11-03 12:56:18 +00:00
871e27a61c [Quant] [PT2] Remove the output Annotation of Conv/Linear in x86InductorQuantizer (#112140)
**Summary**
- PR 3 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Remove the output annotation of QConv/QLinear in X86InductorQuantizer.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d
python -m pytest test_mkldnn_pattern_matcher.py -k test_qlinear
python -m pytest test_x86inductor_quantizer.py -k Conv2d
python -m pytest test_x86inductor_quantizer.py -k Linear
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112140
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #112010, #112126
2023-11-03 08:24:55 +00:00
a53d29cc18 Enable oneDNN QLinear FP32/BF16 output (#112126)
**Summary**
- PR 2 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable QLinear (relu) with BFloat16 or Float32 output.

**TestPlan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qlinear_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112126
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #112010
2023-11-03 08:20:54 +00:00
b6fc7af8a0 Enable oneDNN QConv FP32/BF16 output (#112010)
**Summary**

- PR 1 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable QConv (relu, add, add_relu) with BFloat16 or Float32 output.

**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qconv1d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv2d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv3d_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_relu_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112010
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2023-11-03 08:16:45 +00:00
9089242048 Fix typo under test directory (#112346)
This PR fixes typos in comments and messages under the `test` directory, as well as related typos in messages under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112346
Approved by: https://github.com/kit1980, https://github.com/ezyang
2023-11-03 07:53:33 +00:00
94ebf52ea3 [cuda] introduce trace tracker callback in cache allocator (#112238)
Summary:
This patch prototypes a trace tracker callback mechanism based on existing TraceEntry records.

- It allows external of cache allocator to "attach" trace tracker callbacks.
- When a TraceEntry is recorded, it triggers all attached callbacks. Callbacks can selectively behave based on the trace action.
- **RISK**: The attached callback would be called within an allocator call stack (e.g., free during an allocate call). Potential deadlock may occur if other locks are called within the callback and has interdependency w/ the device allocator lock. It is the callback developer's responsibility to avoid any potential deadlock.
- **ADVICE**: The callback mechanism is designed **only for Pytorch internal use**. We should not expose it to Python layer due to Python GIL that would cause a deadlock.

See example in D50726970 that attaches NCCL register/deregister hooks via the trace tracker callback, so that all CUDA segments allocated by the allocator can be registered to NCCL communicators before any NCCL communication happens. This enables fast zero copy algorithms in NCCL.

Differential Revision: D50726971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112238
Approved by: https://github.com/zdevito
2023-11-03 07:38:09 +00:00
53fff56ab8 Graph break cleanly for test_nestedtensor (#112662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112662
Approved by: https://github.com/jbschlosser
2023-11-03 07:20:43 +00:00
88b98191b7 [FSDP][state_dict] Add world_size 1 unittest (#112669)
As title

Differential Revision: [D50754433](https://our.internmc.facebook.com/intern/diff/D50754433/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112669
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-11-03 07:02:43 +00:00
458e7d09fd Add meta func for scaled mm (#112609)
# Summary
Adds a meta implementation for _scaled_mm which is required for dynamic shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112609
Approved by: https://github.com/eellison, https://github.com/malfet
2023-11-03 03:44:22 +00:00
3be99012d4 Switch some more SymInt tests to TORCH_CHECK_ALWAYS_SHOW_CPP_STACKTRACE (#112626)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112626
Approved by: https://github.com/bdhirsh
2023-11-03 03:15:26 +00:00
62c88ba0fc E2E test for FSDP, HSDP, FSDP+TP in Distributed Checkpointing (#112541)
Adds E2E tests for saving/loading distributed checkpoints. Supported so far are:

- FSDP
- HSDP
- FSDP + TP

Each method is also tested using `torch.compile`

To run all tests:
`python test/distributed/checkpoint/e2e/test_e2e_save_and_load.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112541
Approved by: https://github.com/fegin, https://github.com/wz337
2023-11-03 03:04:31 +00:00
4a17693d19 [CODEMOD][caffe2] replace uses of np.float with np.float64 (#112675)
Differential Revision: D50752096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112675
Approved by: https://github.com/Skylion007
2023-11-03 03:00:51 +00:00
8665a51baf Initialize logging facility when running ProcessGroupNCCLTest (#112809)
If code is compiled without `glog`, there is no way to control log levels other than explicitly calling `c10::initLogging()`

Test plan: Run `TORCH_CPP_LOG_LEVEL=0 ./bin/ProcessGroupNCCLTest` and observe extra log messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112809
Approved by: https://github.com/fduwjj
2023-11-03 02:26:13 +00:00
0d95378341 [Profiler][Easy] Make timestamps in memory timelines be in microseconds (us) (#112772)
Summary: Convert the timestamps in memory timelines from ns to us.

Test Plan: CI

Differential Revision: D50937241

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112772
Approved by: https://github.com/anupambhatnagar, https://github.com/davidberard98
2023-11-03 00:41:41 +00:00
2d5fec4d59 Revert "Enable concurrent reader for getRecord function (#111426)"
This reverts commit 12a6f5aa6bf3e11668293c36b436eead2f3b8614.

Reverted https://github.com/pytorch/pytorch/pull/111426 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111426#issuecomment-1791733096))
2023-11-03 00:22:21 +00:00
32039883d1 Set default for IS_FBCODE flag (#112766)
Summary:
If IS_FBCODE is False, then we print an OSS repro if a test fails. We do
set IS_FBCODE manually on most internal tests, but we don't do it for
all of them. This PR changes it so that IS_FBCODE gets set to the
correct default value (and then tests are able to override them if
they'd like).

Test Plan:
- Tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112766
Approved by: https://github.com/williamwen42
2023-11-03 00:01:07 +00:00
13d62e28a3 [Inductor] Add Dynamic shape support to user defined triton kernels (#112523)
1) This PR moves the grid function codegen to wrapper so that we can use
   IndentBuffers as opposed to manually adding tabs for indentation.
2) In Inductor, emits the grid function in the body of the kernel call so
   that it can use free symbols from dynamic shapes (a sketch of such a kernel follows)
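
A minimal sketch of a user-defined Triton kernel whose grid depends on a runtime size (the kernel and block size are invented for the sketch):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    # the grid closure references the runtime size, which may be a free symbol
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

# call with CUDA tensors of varying lengths; the size becomes a free symbol
compiled_add = torch.compile(add, dynamic=True)
```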

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112523
Approved by: https://github.com/Chillee
2023-11-02 23:58:50 +00:00
f6dc09c1b1 [dynamo] Fix typo in higher_order_ops.py (#112750)
"unsupported" was undefined

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112750
Approved by: https://github.com/zou3519
2023-11-02 23:43:17 +00:00
12dab00173 Fix Docstring errors in init.py (#112617)
Fixes #112596

Fix docstring errors in init.py

### Before the change -> 38 errors
```
╭─user@pc ~/Path/to/pytorch  ‹fix/docstring_init›
╰─➤  pydocstyle torch/nn/init.py --count                                                                                                                                             127 ↵
torch/nn/init.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/init.py:68 in public function `calculate_gain`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:123 in public function `uniform_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:123 in public function `uniform_`:
        D400: First line should end with a period (not 'm')
torch/nn/init.py:123 in public function `uniform_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:141 in public function `normal_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:141 in public function `normal_`:
        D400: First line should end with a period (not 'l')
torch/nn/init.py:141 in public function `normal_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:165 in public function `trunc_normal_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:165 in public function `trunc_normal_`:
        D400: First line should end with a period (not 'd')
torch/nn/init.py:165 in public function `trunc_normal_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:187 in public function `constant_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:203 in public function `ones_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:216 in public function `zeros_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:229 in public function `eye_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:229 in public function `eye_`:
        D400: First line should end with a period (not 'y')
torch/nn/init.py:229 in public function `eye_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:249 in public function `dirac_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:249 in public function `dirac_`:
        D400: First line should end with a period (not 'c')
torch/nn/init.py:249 in public function `dirac_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:311 in public function `xavier_uniform_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:311 in public function `xavier_uniform_`:
        D400: First line should end with a period (not 'd')
torch/nn/init.py:311 in public function `xavier_uniform_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:338 in public function `xavier_normal_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:338 in public function `xavier_normal_`:
        D400: First line should end with a period (not 'd')
torch/nn/init.py:338 in public function `xavier_normal_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:376 in public function `kaiming_uniform_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:376 in public function `kaiming_uniform_`:
        D400: First line should end with a period (not 'd')
torch/nn/init.py:376 in public function `kaiming_uniform_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:425 in public function `kaiming_normal_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:425 in public function `kaiming_normal_`:
        D400: First line should end with a period (not 'd')
torch/nn/init.py:425 in public function `kaiming_normal_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:462 in public function `orthogonal_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:462 in public function `orthogonal_`:
        D400: First line should end with a period (not 's')
torch/nn/init.py:462 in public function `orthogonal_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:507 in public function `sparse_`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:507 in public function `sparse_`:
        D400: First line should end with a period (not 'e')
torch/nn/init.py:507 in public function `sparse_`:
        D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
38
```

### After the change -> 0 errors
```
╭─user@pc ~/Path/to/pytorch  ‹fix/docstring_init*›
╰─➤  pydocstyle torch/nn/init.py --count
0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112617
Approved by: https://github.com/mikaylagawarecki
2023-11-02 23:42:17 +00:00
2e29172942 Revert "Add meta func for scaled mm (#112609)"
This reverts commit 75174c379712433af1ff810b36e34573b3d2587e.

Reverted https://github.com/pytorch/pytorch/pull/112609 on behalf of https://github.com/huydhn due to Sorry for reverting this change, but it is failing ROCm jobs 75174c3797 ([comment](https://github.com/pytorch/pytorch/pull/112609#issuecomment-1791704037))
2023-11-02 23:37:16 +00:00
c63693ca27 Revert "[Fix] add validation logics to TCPStore queries (#107607)"
This reverts commit 50a99812172dd7d1e808fad8dc44665c1770df50.

Reverted https://github.com/pytorch/pytorch/pull/107607 on behalf of https://github.com/huydhn due to For some reason, lint job was not run on the PR and now start failing trunk, please rebase and fix lint before relanding 50a9981217 ([comment](https://github.com/pytorch/pytorch/pull/107607#issuecomment-1791702818))
2023-11-02 23:34:08 +00:00
c27a03a4e5 [ONNX] Cast scale back to fp16 after _attention_scale. (#112554)
### **Description**:
The problem is that the graph was cast to `fp32` at a certain point but never reverted to `fp16`, causing the rest of the graph to run on `fp32`. This change aims to fix that issue and improve performance.

### **Changes Made**:
- Modified the ONNX exporter code to ensure that the graph is correctly cast back to `fp16` after a necessary cast to `fp32`.

### **Why This Change is Necessary**:
This change is necessary to ensure that the exported ONNX graph remains in `fp16` where appropriate, leading to significant gains in performance and memory savings. Without this fix, the graph would run entirely in `fp32`, causing suboptimal performance.

### **Testing**:
- Performed extensive testing with various models and scenarios to validate the correctness of the changes.

### **Benchmarking Results**:

Experiments Ran on:
8 GPUS - Tesla V100 - 32GB

**Before Fix: ort + 4 hidden layers + without fix**

- **Train Runtime**: 78.7088 seconds
- **Train Samples per Second**: 10.164
- **Train Steps per Second**: 1.271
- **Train Loss**: 5.624655108451844
- **Epoch**: 0.3

**After Fix: ort + 4 hidden layers + with fix**

- **Train Runtime**: 72.5636 seconds
- **Train Samples per Second**: 11.025
- **Train Steps per Second**: 1.378
- **Train Loss**: 5.6252727746963505
- **Epoch**: 0.3

We can see a 7.79% perf gain after this fix.

- I only ran it on 4 hidden layers due to GPU constraints; the perf gain is going to be much higher on the full model.
- You could see the gain on other models that use _attention_scale as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112554
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-11-02 23:18:53 +00:00
0a92ec9452 warn once for use flash attention and memory efficient attention (#112773)
Summary: these logs can get pretty spammy if we use TORCH_WARN; it could be better to use TORCH_WARN_ONCE

Test Plan: ci

Differential Revision: D50941941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112773
Approved by: https://github.com/drisspg
2023-11-02 22:58:28 +00:00
e9d7fac89c [state_dict][10/N] Let set_state_dict returns IncompatibleKeys (#112414)
load_state_dict returns IncompatibleKeys, so set_state_dict should also return the same information to the users.

Differential Revision: [D50748157](https://our.internmc.facebook.com/intern/diff/D50748157/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112414
Approved by: https://github.com/wz337
ghstack dependencies: #112167, #112203
2023-11-02 22:39:38 +00:00
3904b81420 [pytree] Add back a default serialized name (#112748)
Previously we added a change which required users to pass in a serialized name if they want to serialize a pytree, so that the serialized name does not depend on the Python environment. However, this is currently breaking AOTInductor benchmark tests, as AOTInductor serializes the pytree into the .so for flattening/unflattening the inputs, and the registration for those pytree types in the AOTInductor benchmarks lives in the huggingface repo, so I'm not sure what's a good fix for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112748
Approved by: https://github.com/zhxchen17, https://github.com/malfet
2023-11-02 22:34:42 +00:00
50a9981217 [Fix] add validation logics to TCPStore queries (#107607)
This PR fixes #106294.

Due to the lack of a request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution enforces that the very first query from a client is a validation query with a predefined magic number. If the validation fails, the server will terminate the connection.
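
For illustration only, a toy sketch of the handshake idea; the magic number and wire format here are hypothetical stand-ins, not TCPStore's actual protocol:

```python
import socket
import struct

MAGIC = 0x3C85F7CE  # hypothetical constant, purely for illustration

def serve_one_connection(port):
    with socket.socket() as srv:
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            first = conn.recv(4, socket.MSG_WAITALL)
            if len(first) < 4 or struct.unpack("!I", first)[0] != MAGIC:
                return  # e.g. an nmap probe: terminate the connection
            ...         # proceed with normal query handling

def client_handshake(port):
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(struct.pack("!I", MAGIC))  # validate before any real query
```
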
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607
Approved by: https://github.com/cbalioglu, https://github.com/XilunWu
2023-11-02 22:12:45 +00:00
12a6f5aa6b Enable concurrent reader for getRecord function (#111426)
Summary:
Zion-4s core has poor perf when it comes to reading large tensors (e.g. 300G), whether downloading from manifold or reading from files. In this diff, I changed the getRecord function from single-threaded to multi-threaded by passing multiple readers to getRecord and accessing the same record at different chunks with different readers.
We control the number of additional readers with the `sigrid_model_manager_additional_reader` flag. The default value is 0. When `additional_reader=2`, we allocate `2` extra read client threads.
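
A stdlib-only sketch of the chunked multi-reader idea (not the actual getRecord code; the chunking scheme is invented for illustration):

```python
import concurrent.futures
import os

def read_chunk(path, offset, length):
    with open(path, "rb") as f:   # each reader gets its own file handle
        f.seek(offset)
        return f.read(length)

def read_record(path, additional_readers=2):
    n = additional_readers + 1
    size = os.path.getsize(path)
    chunk = (size + n - 1) // n   # split the record into n contiguous chunks
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        parts = [pool.submit(read_chunk, path, i * chunk, chunk) for i in range(n)]
        return b"".join(p.result() for p in parts)
```
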
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111426
Approved by: https://github.com/jiayisuse
2023-11-02 22:07:04 +00:00
9d0c3e21d0 [state_dict][9/N] Add get and set APIs for model and optimizer state_dict (#112203)
The original get_state_dict and set_state_dict pair is too complicated because of the possible combinations of usages. This PR adds APIs to get/set model_state_dict and optimizer_state_dict separately.
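
A usage sketch of the paired APIs (signatures may differ slightly by torch version; in practice these are typically used with distributed initialized):

```python
import torch
from torch.distributed.checkpoint.state_dict import (
    get_model_state_dict, get_optimizer_state_dict,
    set_model_state_dict, set_optimizer_state_dict,
)

model = torch.nn.Linear(4, 4)
optim = torch.optim.Adam(model.parameters())
model(torch.randn(2, 4)).sum().backward()
optim.step()  # populate optimizer state before snapshotting

msd = get_model_state_dict(model)
osd = get_optimizer_state_dict(model, optim)
set_model_state_dict(model, msd)
set_optimizer_state_dict(model, optim, optim_state_dict=osd)
```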

Differential Revision: [D50713584](https://our.internmc.facebook.com/intern/diff/D50713584/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112203
Approved by: https://github.com/wz337
ghstack dependencies: #112167
2023-11-02 22:03:57 +00:00
0adb28b77d Show CUDAExtension example commands as code (#112764)
The default rendering of these code snippets renders the `TORCH_CUDA_ARCH_LIST` values with typographic quotes which prevent the examples from being directly copyable. Use code style for the two extension examples.

Fixes #112763
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112764
Approved by: https://github.com/malfet
2023-11-02 21:47:50 +00:00
07c9b053f7 Enable planner to be used for loading sharded optimizer state dict (#112259)
This creates a more consistent interface for saving and loading sharded state dicts. A planner can be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function.
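
A hedged usage sketch (the `planner` argument is the new part; the checkpoint path and optimizer key are illustrative):

```python
from torch.distributed.checkpoint import DefaultLoadPlanner, FileSystemReader
from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict

def load_optim_state(model_sd, ckpt_dir):
    return load_sharded_optimizer_state_dict(
        model_state_dict=model_sd,      # from the sharded model
        optimizer_key="optim",
        storage_reader=FileSystemReader(ckpt_dir),
        planner=DefaultLoadPlanner(),   # newly accepted by this change
    )
```
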
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259
Approved by: https://github.com/wz337
2023-11-02 21:40:30 +00:00
b10fa8a447 Adds lucasllc to CODEOWNERS in distributed (#112055)
Adds myself to CODEOWNERS in distributed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112055
Approved by: https://github.com/H-Huang
2023-11-02 21:29:45 +00:00
db7a3cc436 fix missing nvml in c10/cuda/driver_api.cpp issue (#112121)
Since https://github.com/pytorch/pytorch/pull/99699 introduced a dependency on nvml for oom reporting in `c10/cuda/driver_api.h`, `c10/cuda/driver_api.cpp`, and `reportProcessMemoryInfo` from `c10/cuda/CUDACachingAllocator.cpp`, we've seen failures regarding cuda expandable segments and oom reporting in NVIDIA's internal CI, specifically on Jetson devices which don't have nvml support as it is incompatible with Jetson. Example failures using the latest upstream on Orin AGX node:

`python test/test_cuda.py -k test_notifies_oom` generates

```
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 1643, in _worker
    results[t] = torch.nn.functional.conv2d(results[t], weight, padding=0)
RuntimeError: CUDA driver error: out of memory
```

`python test/test_cuda_expandable_segments.py` generates

```
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/test_cuda_expandable_segments.py", line 12, in <module>
    exec(compile(open(filepath).read(), filepath, mode='exec'))
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 66, in <module>
    class TestCuda(TestCase):
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 1609, in TestCuda
    @unittest.skipIf(not TEST_CUDNN, 'CUDNN not available')
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 4628, in wrapped
    self._value = self._cb()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_cuda.py", line 20, in <lambda>
    TEST_CUDNN = LazyVal(lambda: TEST_CUDA and torch.backends.cudnn.is_acceptable(torch.tensor(1., device=CUDA_DEVICE)))
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.
```

This PR intends to fix this issue by adding various dlopen checks to make sure nvml actually exists, and by safely falling back to the older libcuda-based features of cuda expandable segments and oom reporting if nvml is not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112121
Approved by: https://github.com/eqy, https://github.com/ngimel, https://github.com/albanD
2023-11-02 21:28:05 +00:00
4e67c69a7d [TD] Support downgrading test relevance (#112671)
Allow heuristics to actually downgrade the relevance of a test. Note that NONE/UNLIKELY tests will still get executed, but they will be run at the end of the CI

The Relevance chosen affects the outcome when Heuristics offer conflicting predictions. A relevance higher up in this list means higher confidence in the declared relevance:

HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED

Given that we currently assume ordering based on the list in init (since the lists are appended), do a similar thing for UNLIKELY and NONE. For example, with HEURISTICS = [a, b, c, d]:
- currently, everything in b.high is added after a.high
- if b.none includes things in a.high, a.high trumps
- if b.none includes things in a.probable, then b.none trumps, since NONE is stronger than PROBABLE
- if b.unlikely includes things from a.high/a.probable, a.high/a.probable trumps, since HIGH/PROBABLE are stronger than UNLIKELY

A toy illustration of this resolution rule follows.
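
Toy illustration (not the actual TD code), showing that the higher-confidence relevance wins:

```python
from enum import IntEnum

class Relevance(IntEnum):  # listed from lowest to highest confidence
    UNRANKED = 0
    UNLIKELY = 1
    PROBABLE = 2
    NONE = 3
    HIGH = 4

def resolve(predictions):
    # when heuristics disagree, the higher-confidence relevance wins
    return max(predictions, key=lambda r: r.value)

print(resolve([Relevance.PROBABLE, Relevance.NONE]))  # Relevance.NONE
```
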
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112671
Approved by: https://github.com/clee2000
2023-11-02 21:02:40 +00:00
d9ad7ac390 Skip test_fork_wait_4 and test_fork_wait_4_async (#112743)
Fixes #109782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112743
Approved by: https://github.com/jbschlosser
2023-11-02 20:46:29 +00:00
157bda1bf0 Fix pydocstyle errors in torch/nn/module (#112674)
Fixes  #112601

```
pydocstyle torch/nn/modules/module.py  --count
```
On master:
115
After my changes on this PR:
8

The remaining 8 are due to missing docstrings in the magic methods:
```
torch/nn/modules/module.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/modules/module.py:1635 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/nn/modules/module.py:1640 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/nn/modules/module.py:1674 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/nn/modules/module.py:1689 in public method `__setattr__`:
        D105: Missing docstring in magic method
torch/nn/modules/module.py:1748 in public method `__delattr__`:
        D105: Missing docstring in magic method
torch/nn/modules/module.py:2480 in public method `__repr__`:
        D105: Missing docstring in magic method
torch/nn/modules/module.py:2505 in public method `__dir__`:
        D105: Missing docstring in magic method

```

Should I add them too? Happy to do it, I just wasn't sure if you wanted these documented. Please let me know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112674
Approved by: https://github.com/mikaylagawarecki
2023-11-02 20:40:56 +00:00
ac9476ba99 Add .boxed() to c10d::ProcessGroup and c10d::Work's pybind (#111997)
Summary:
When passed from C++ to Python, `c10d::ProcessGroup` and `c10d::Work` are automatically converted to their pybind classes, which can't be used for dispatcher ops. `.boxed()` exposes `c10d::ProcessGroup` and `c10d::Work` as boxed custom class objects to Python.

```python
import tempfile

import torch
import torch.distributed as dist

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmpf:
        dist.init_process_group(
            backend="nccl", init_method=f"file://{tmpf.name}", rank=0, world_size=1
        )
        group = dist.group.WORLD
        print(group)
        print(group.boxed())
```

```
<torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe42fb78d30>
ScriptObject <__torch__.torch.classes.c10d.ProcessGroup>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111997
Approved by: https://github.com/lw
2023-11-02 20:35:20 +00:00
6a3922d523 BUG: compile np.array(list_of_arrays) (#112711)
Add a shortcut for a sequence of arrays only. This removes a graph break on a common pattern of
`np.array([np.cos(theta), np.sin(theta)])` and its ilk.

This PR is a simplified alternative to https://github.com/pytorch/pytorch/pull/112521 --- it still breaks on mixing arrays and scalars or array_likes (e.g. `np.array([[1, 2], np.array([3, 4])])`) and instead adds a simple shortcut.
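
A hedged usage sketch of the pattern this now handles (`rot` is an illustrative function, not from the PR):
```python
import numpy as np
import torch

@torch.compile
def rot(theta):
    # A sequence of ndarrays only: traced without a graph break after this PR.
    return np.array([np.cos(theta), np.sin(theta)])

print(rot(np.linspace(0.0, np.pi, 4)))  # shape (2, 4)
```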

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112711
Approved by: https://github.com/lezcano
2023-11-02 20:18:16 +00:00
c1dc4cda5b Delete unused is_inside_mode (#112677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112677
Approved by: https://github.com/bdhirsh
2023-11-02 19:57:35 +00:00
eadb6aca9d Improve repeat_interleave error message to report repeats/input sizes. (#112729)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112729
Approved by: https://github.com/albanD
2023-11-02 19:50:15 +00:00
50767a075a [export] Clean up verifier [1/n]. (#112505)
Summary: Some adjustments to the verifier so that it's easier to use correctly. We will enable the verifier later, so the current diff is a no-op.

Test Plan: CI

Differential Revision: D50839295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112505
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2023-11-02 19:36:06 +00:00
8198474eb7 Fix scope name when parent scope is empty for torch.onnx.export (#112654)
Prior to this PR, we only checked TorchScript nodes for scope compatibility, skipping their parents' scope reference check.
This PR adds a check not only for the node being traversed, but for its parents as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112654
Approved by: https://github.com/BowenBao
2023-11-02 19:31:32 +00:00
9d09d29297 [DTensor] Add rand_like, randn_like, randint_like ops to shard propagation (#112576)
Add rand_like, randn_like, randint_like ops to shard propagation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112576
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-11-02 18:45:43 +00:00
0bd2955f15 Memory leak from bsr_scatter_mm_indices_data argument cache (#112301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112301
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-11-02 18:43:10 +00:00
75174c3797 Add meta func for scaled mm (#112609)
# Summary
Adds a meta implementation for `_scaled_mm`, which is required for dynamic shapes
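
For context, a minimal sketch of what registering a meta kernel looks like, using a hypothetical custom op (the PR itself adds the meta function for `aten::_scaled_mm`):
```python
import torch

lib = torch.library.Library("demo", "DEF")
lib.define("scaled_mm_like(Tensor a, Tensor b) -> Tensor")

@torch.library.impl(lib, "scaled_mm_like", "CPU")
def scaled_mm_like_cpu(a, b):
    return a @ b

@torch.library.impl(lib, "scaled_mm_like", "Meta")
def scaled_mm_like_meta(a, b):
    # On the meta device only shapes/dtypes are computed, which is
    # exactly what dynamic-shape tracing needs.
    return a.new_empty((a.shape[0], b.shape[1]))

out = torch.ops.demo.scaled_mm_like(
    torch.empty(4, 8, device="meta"), torch.empty(8, 16, device="meta")
)
print(out.shape)  # torch.Size([4, 16])
```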

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112609
Approved by: https://github.com/eellison, https://github.com/malfet
2023-11-02 18:42:41 +00:00
dd957138ec Pin Docker images to main (#112692)
This will help prevent a commit like 77901321d9, pushed to a release branch, from overwriting the Docker images used in main.  In addition, the `DEFAULT_TAG` can easily be updated to `2.1`, for example, when doing a branch-cut release.  This basically pins the Docker images like https://github.com/pytorch/pytorch/pull/111971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112692
Approved by: https://github.com/malfet
2023-11-02 17:39:45 +00:00
543a618ae8 [inductor][fx pass] Fix a split cat bug in the pre grad (#112667)
Summary: blue reels vdd v3 has a unit test failure; this PR fixes the bug

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- --exact 'pytorch/benchmark/fb/test_gpu:run_test_gpu - test_train_blue_reels_vdd_v3_inductor_accuracy (pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu)'
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13229323914259182
Network: Up: 2.5MiB  Down: 8.3MiB  (reSessionID-b3362362-c80a-4ac2-8332-bc1321aaf0bd)
Jobs completed: 6. Time elapsed: 5:13.2s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- --exact 'pytorch/benchmark/fb/test_gpu:run_test_gpu - test_train_blue_reels_vdd_v3_inductor_speedup (pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu)'
```
Buck UI: https://www.internalfb.com/buck2/aa3031a9-3f1b-4f42-a78c-decbf2beb14f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074810906355
Network: Up: 1.3GiB  Down: 40MiB  (reSessionID-801ddf16-ff5d-4135-9758-ff286d1d59aa)
Jobs completed: 69. Time elapsed: 10:12.4s.
Cache hits: 10%. Commands: 61 (cached: 6, remote: 4, local: 51)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D50901626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112667
Approved by: https://github.com/xuzhao9, https://github.com/Skylion007
2023-11-02 17:33:15 +00:00
7cbf9869d5 Add v0 inference benchmark script (#112582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112582
Approved by: https://github.com/albanD
2023-11-02 17:21:15 +00:00
b1f50ead4f [state_dict][8/N] Ignore meta parameters (#112167)
This PR lets `get_state_dict` ignore the parameters that are on the meta device.

This PR also demonstrates a possible use case of ignoring meta parameters -- checkpointing pipeline parallelism.

Differential Revision: [D50672521](https://our.internmc.facebook.com/intern/diff/D50672521/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112167
Approved by: https://github.com/wz337
2023-11-02 17:10:03 +00:00
6929ebf2b0 [quant][docs] Add x86 inductor quant docs (#112648)
Summary: As titled; adds documentation for X86 inductor quantization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112648
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/andrewor14
2023-11-02 17:02:09 +00:00
954cba2ede [optim/dynamo] shortcut adagrad with has_complex (#112722)
Follow-up to https://github.com/pytorch/pytorch/pull/110706; it was missed as it depended on another fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112722
Approved by: https://github.com/albanD
2023-11-02 16:50:45 +00:00
ca33dd780e Revert "[pytree] Add back a default serialized name (#112748)"
This reverts commit ca72d23613f7976b3ad70e54234b125c1b763dde.

Reverted https://github.com/pytorch/pytorch/pull/112748 on behalf of https://github.com/angelayi due to sorry, was trying to fix CI and broke CI ([comment](https://github.com/pytorch/pytorch/pull/112748#issuecomment-1791098635))
2023-11-02 16:47:59 +00:00
82e428723a Followup patch for cpuinfo fix in ppc64le (#112707)
Previously, a crash in PyTorch on Power systems was fixed with #110708.
Even with the fix, the torch_test.py test throws the following error for one of the tests:
"Error in cpuinfo: processor architecture is not supported in cpuinfo"
This is a follow-up patch to fix that error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112707
Approved by: https://github.com/albanD
2023-11-02 16:34:41 +00:00
174aef71af Clarify maximize option in optimizer.py (#112724)
While reading the documentation of the optimizers, I noticed the description of the `maximize` option is misleading. It currently reads as if the parameters would be maximized, which is factually incorrect. This PR proposes a clearer description.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112724
Approved by: https://github.com/albanD
2023-11-02 16:34:37 +00:00
25e17f3522 Revert "Use OpOverload instead of OpOverloadPacket for size/stride/etc slots (#112119)"
This reverts commit dd24e92949ad13960dc91fac93c3be5a43579201.

Reverted https://github.com/pytorch/pytorch/pull/112119 on behalf of https://github.com/ZainRizvi due to Breaking internal tests. See D50912326 ([comment](https://github.com/pytorch/pytorch/pull/112119#issuecomment-1791072363))
2023-11-02 16:32:25 +00:00
1245a7e75b Revert "Remove default timeout from PGNCCL::Options ctor (#112555)"
This reverts commit 85e93632e7804bfe64316cbc491aa803a68b0701.

Reverted https://github.com/pytorch/pytorch/pull/112555 on behalf of https://github.com/wconstab due to This PR is wrong, see above explanation ([comment](https://github.com/pytorch/pytorch/pull/112555#issuecomment-1791063778))
2023-11-02 16:27:33 +00:00
75f6d52971 [DTensor] Fix DeviceMesh.__repr__ to output valid Python syntax (#112401)
Fix `DeviceMesh.__repr__` to output valid Python syntax

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112401
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-11-02 16:18:15 +00:00
ca72d23613 [pytree] Add back a default serialized name (#112748)
Previously we added a change which required users to pass in a serialized name if they want to serialize a pytree, so that the serialized name does not depend on the Python environment. However, this is currently breaking AOTInductor benchmark tests, as AOTInductor serializes the pytree into the .so for flattening/unflattening the inputs, and the registrations for those pytree types used in the AOTInductor benchmarks live in the huggingface repo, so I'm not sure what a good fix is for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112748
Approved by: https://github.com/zhxchen17, https://github.com/malfet
2023-11-02 16:18:03 +00:00
09df6b771b Add a note about performant record_stream use. (#112526)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112526
Approved by: https://github.com/albanD
2023-11-02 15:50:22 +00:00
51a38380d1 Fix torch.load(..., weights_only=True) for NT (#112516)
Found when looking into #112509
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112516
Approved by: https://github.com/soulitzer
2023-11-02 14:41:04 +00:00
85e93632e7 Remove default timeout from PGNCCL::Options ctor (#112555)
Providing this timeout to the Options ctor is overriding user-provided
values in cases where is_high_priority_stream is set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112555
Approved by: https://github.com/fduwjj, https://github.com/H-Huang
2023-11-02 14:16:48 +00:00
a1ab22b81d Reland "Trigger specialization when you call size()/stride() from C++ (#111935)" (#112605)
This reverts commit 22221c6d60613e498aa67b7f7f0f83ec97e35b8a.

Differential Revision: [D50886564](https://our.internmc.facebook.com/intern/diff/D50886564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112605
Approved by: https://github.com/voznesenskym
2023-11-02 13:27:31 +00:00
68dead4a6c [c10d] print NCCL_SUFFIX in NCCL version log at PG init (#112560)
Summary: See title

Test Plan:
- Build with NCCL-EXP that defines NCCL_SUFFIX "meta-exp"
output:
```
I1031 16:04:01.328174 611521 ProcessGroupNCCL.cpp:918] [Rank 1] ProcessGroupNCCL initialization options: NCCL version: 2.18.3-meta-exp, NCCL_ASYNC_ERROR_HANDLING: 3, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: INFO, ID=140577310728192
```

- Build with default NCCL with empty NCCL_SUFFIX
output:
```
I1031 20:35:45.665733 2360419b12 ProcessGroupNCCL.cpp:918] [Rank 1] ProcessGroupNCCL initialization options: NCCL version: 2.18.3, NCCL_ASYNC_ERROR_HANDLING: 3,...
```

Differential Revision: D50863335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112560
Approved by: https://github.com/xw285cornell
2023-11-02 09:56:52 +00:00
0276d5621a Fix typo in compilation_unit.h (#112572)
Fix typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112572
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-11-02 08:26:59 +00:00
ae85ba820f [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.
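
A toy sketch of the greedy-by-size placement described above (illustrative data structures, not the actual inductor code):
```python
from dataclasses import dataclass

ALIGN = 64  # assumed alignment in bytes

@dataclass
class Alloc:
    name: str
    size: int    # (estimated) size in bytes
    start: int   # first timestep the buffer is live
    end: int     # last timestep the buffer is live
    offset: int = -1

def overlaps(a: Alloc, b: Alloc) -> bool:
    # Buffers conflict only if their live ranges intersect in time.
    return a.start <= b.end and b.start <= a.end

def plan(allocs: list[Alloc]) -> int:
    """Place the largest buffers first, each at the lowest aligned offset
    that doesn't collide with an already-placed, time-overlapping buffer.
    Returns the total pool size."""
    placed: list[Alloc] = []
    for a in sorted(allocs, key=lambda x: x.size, reverse=True):
        offset = 0
        for p in sorted((p for p in placed if overlaps(a, p)),
                        key=lambda p: p.offset):
            if offset + a.size <= p.offset:
                break  # fits in the gap before p
            offset = max(offset, p.offset + p.size)
            offset = (offset + ALIGN - 1) // ALIGN * ALIGN  # align up
        a.offset = offset
        placed.append(a)
    return max((p.offset + p.size for p in placed), default=0)
```
The real version additionally emits sympy expressions (including the new `align` Expr) so offsets can be computed at runtime under dynamic shapes.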

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-11-02 07:39:13 +00:00
db66f15785 docs: fix docstrings in distributed.py and others (fixes #112604) (#112657)
Fixes #112604

Fixes docstrings by following `pydocstyle` output.
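
For readers unfamiliar with the codes listed below, a tiny illustrative before/after (not a real function from this PR):
```python
# Before: triggers D205 (needs a blank line between summary and
# description), D400 (summary must end with a period), and D401
# (summary must be in the imperative mood).
def sync_buffers_before(module):
    """Syncs module buffers
    across processes"""

# After: a one-line imperative summary ending in a period.
def sync_buffers_after(module):
    """Sync module buffers across processes."""
```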

- torch/nn/parallel/distributed.py
Before: 84
```
torch/nn/parallel/distributed.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/parallel/distributed.py:92 in private function `_cast_buffers`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`:
        D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
torch/nn/parallel/distributed.py:143 in private function `_find_tensors`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:273 in private method `__init__`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:273 in private method `__init__`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/nn/parallel/distributed.py:287 in private method `main_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:287 in private method `main_hook`:
        D400: First line should end with a period (not 'd')
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
        D400: First line should end with a period (not 'l')
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
        D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`:
        D400: First line should end with a period (not 'n')
torch/nn/parallel/distributed.py:633 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`:
        D401: First line should be in imperative mood (perhaps 'Fire', not 'Fires')
torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`:
        D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`:
        D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
        D400: First line should end with a period (not ':')
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
        D401: First line should be in imperative mood (perhaps 'Initialize', not 'Initialization')
torch/nn/parallel/distributed.py:1146 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1154 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
        D400: First line should end with a period (not 'o')
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
        D401: First line should be in imperative mood (perhaps 'Assign', not 'Assigns')
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
        D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
        D400: First line should end with a period (not 'P')
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
        D401: First line should be in imperative mood; try rephrasing (found 'A')
torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`:
        D403: First word of the first line should be properly capitalized ('Torchdynamo', not 'TorchDynamo')
torch/nn/parallel/distributed.py:1517 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1527 in public method `scatter`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1530 in public method `to_kwargs`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1539 in public method `gather`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1542 in public method `train`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1617 in public method `join`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1617 in public method `join`:
        D400: First line should end with a period (not 'f')
torch/nn/parallel/distributed.py:1617 in public method `join`:
        D401: First line should be in imperative mood; try rephrasing (found 'A')
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
        D400: First line should end with a period (not 'y')
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:1752 in public method `join_device`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1756 in public method `join_process_group`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
        D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
        D401: First line should be in imperative mood (perhaps 'Allow', not 'Allows')
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
        D400: First line should end with a period (not 'a')
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
        D400: First line should end with a period (not 'P')
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
        D400: First line should end with a period (not 'a')
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:2005 in public method `will_sync_module_buffers`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`:
        D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
        D400: First line should end with a period (not 'r')
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
        D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
        D400: First line should end with a period (not 'g')
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
        D400: First line should end with a period (not 'l')
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
        D401: First line should be in imperative mood; try rephrasing (found 'It')
torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`:
        D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
        D400: First line should end with a period (not 'd')
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
        D401: First line should be in imperative mood (perhaps 'Check', not 'Checks')
84
```

After: 12
```
torch/nn/parallel/distributed.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/parallel/distributed.py:618 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/parallel/distributed.py:1133 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1141 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1503 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1513 in public method `scatter`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1516 in public method `to_kwargs`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1525 in public method `gather`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1528 in public method `train`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1734 in public method `join_device`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1738 in public method `join_process_group`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1986 in public method `will_sync_module_buffers`:
        D102: Missing docstring in public method
12
```

- torch/nn/utils/_named_member_accessor.py
Before: 23
```
torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:85 in public function `swap_submodule`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`:
        D400: First line should end with a period (not 's')
torch/nn/utils/_named_member_accessor.py:115 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/utils/_named_member_accessor.py:122 in public method `get_submodule`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:155 in public method `swap_submodule`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:164 in public method `get_tensor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:185 in public method `set_tensor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:194 in public method `del_tensor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:211 in public method `swap_tensor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:224 in public method `get_tensors`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:233 in public method `set_tensors`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:249 in public method `set_tensors_dict`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:261 in public method `del_tensors`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:276 in public method `swap_tensors`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:296 in public method `swap_tensors_dict`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:325 in public method `check_keys`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:340 in public method `named_parameters`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:349 in public method `named_buffers`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:358 in public method `named_tensors`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:368 in public method `named_modules`:
        D200: One-line docstring should fit on one line with quotes (found 3)
23
```

After: 4
```
torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:85 in public function `swap_submodule`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:116 in public method `__init__`:
        D107: Missing docstring in __init__
4
```

- torch/nn/utils/_per_sample_grad.py
Before: 3
```
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
        D400: First line should end with a period (not ')')
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
        D402: First line should not be the function's "signature"
3
```
After: 0
```
0
```

- torch/nn/utils/init.py
Before: 3
```
torch/nn/utils/init.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/utils/init.py:6 in public function `skip_init`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/init.py:6 in public function `skip_init`:
        D400: First line should end with a period (not 'g')
3
```
After: 1
```
torch/nn/utils/init.py:1 at module level:
        D100: Missing docstring in public module
1
```

- torch/nn/utils/memory_format.py
Before: 4
```
torch/nn/utils/memory_format.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
        D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
        D400: First line should end with a period (not '`')
4
```
After: 1
```
torch/nn/utils/memory_format.py:1 at module level:
        D100: Missing docstring in public module
1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112657
Approved by: https://github.com/fduwjj
2023-11-02 05:52:47 +00:00
b07cfd79fe [DeviceMesh] Move DeviceMesh out from torch.distributed._tensor (#112364)
Move DeviceMesh out as a standalone module. Once we make sure everything is migrated and doc is ready, we will make `torch.distributed._device_mesh` public in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112364
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/fduwjj
2023-11-02 04:44:25 +00:00
6f681ab5d9 [torch.compile] autograd.Function with multiple return values (#112475)
Fixes #106389
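
A minimal example of the now-supported pattern (illustrative):
```python
import torch

class SinCos(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.sin(), x.cos()  # multiple return values

    @staticmethod
    def backward(ctx, grad_sin, grad_cos):
        (x,) = ctx.saved_tensors
        # d(sin)/dx = cos, d(cos)/dx = -sin
        return grad_sin * x.cos() - grad_cos * x.sin()

@torch.compile
def f(x):
    a, b = SinCos.apply(x)
    return a + b

print(f(torch.randn(3, requires_grad=True)))
```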

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112475
Approved by: https://github.com/zou3519
2023-11-02 04:43:49 +00:00
59869903b3 Fix mem eff bias bug (#112673)
This fixes #112577
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112673
Approved by: https://github.com/cpuhrsch
2023-11-02 04:40:51 +00:00
40ab6409da [Trivial change] Remove duplicate line in freezing.py (#112538)
## Description

`aten = torch.ops.aten` was being called twice.
Removed one assignment in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112538
Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/eellison
2023-11-02 03:20:18 +00:00
493ae78201 [inductor] nan-checker (#112091)
This PR is split out of https://github.com/pytorch/pytorch/pull/108193. It adds the ability to insert an assertion after each Triton kernel call to make sure no tensor argument is NaN/Inf. It helped me find a few bugs when working on benchmark fusion (due to messing up some kernel/graph-level state when generating kernel code).

Right now we have to disable cudagraphs to enable the nan/inf checks. Otherwise we will see errors like: https://gist.github.com/shunting314/053db66c4f121e5f4c5de159bf0032ed . My best guess is it's due to the GPU->CPU copy during capture for cudagraphs. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @eellison if there is an easy way to make it work with cudagraphs. But even if the nan-checker is not compatible with cudagraphs, it's probably still fine, since it's just for debugging purposes.
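
Conceptually, the generated check boils down to something like this (illustrative, not the actual codegen):
```python
import torch

def assert_no_nan_inf(kernel_name: str, tensor: torch.Tensor) -> None:
    # Runs after a kernel call; trips as soon as a kernel produces bad values.
    assert not tensor.isnan().any().item(), f"{kernel_name} produced NaN"
    assert not tensor.isinf().any().item(), f"{kernel_name} produced Inf"
```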

Test command:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_NAN_ASSERTS=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training --disable-cudagraphs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112091
Approved by: https://github.com/eellison, https://github.com/jansel
2023-11-02 02:32:04 +00:00
01e4984bac Add decomposition for dynamo_export + ExportedProgram and remove None from input (#112444)
This PR introduces the ability to produce GraphModules with Core ATen IR only through decompositions. It also removes `None` from user inputs, as ONNX does not support them

Tests for these features will be executed when #112289 is merged, but for reference, they are as below:

```python
    def test_log_sigmoid(self):
        # This produces the op `torch.ops.aten.log_sigmoid_forward`, instead of the more
        # conventional `torch.ops.aten.log_sigmoid`.
        class Model(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.m = torch.nn.LogSigmoid()

            def forward(self, x):
                return self.m(x)

        input = torch.randn(2)
        self.run_test_with_fx_to_onnx_exporter_and_onnx_runtime(
            Model(), (input,), model_type=self.model_type
        )

    def test_none_input(self):
        class NoneInputModel(torch.nn.Module):
            def forward(
                self, x: torch.Tensor, y: Optional[torch.Tensor], z: torch.Tensor
            ):
                if y is None:
                    return x + z
                return x + y + z

        self.run_test_with_fx_to_onnx_exporter_and_onnx_runtime(
            NoneInputModel(),
            (torch.randn(1, 2), None, torch.randn(1, 2)),
            model_type=self.model_type,
        )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112444
Approved by: https://github.com/BowenBao
2023-11-02 02:30:59 +00:00
6c19de07cd [Quant] [PT2] Add ConvBNAdd(ReLU) Annotation into X86InductorQuantizer (#111281)
**Summary**
This PR adds ConvBNAdd(ReLU) QAT Annotation into `X86InductorQuantizer`.

**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_binary_with_quantizer_api
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_binary_unary_with_quantizer_api
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_add
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_add_relu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111281
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #111280
2023-11-02 02:05:49 +00:00
56ca0043f6 [Quant] [PT2] Enable QAT Quantization flow in X86InductorQuantizer (#111280)
**Summary**
This PR enables PT2 QAT Quantization flow in `X86InductorQuantizer`.
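
A hedged end-to-end sketch of the flow this enables (entry points as of this era; they may have moved in later releases):
```python
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_qat_pt2e

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).train()
example_inputs = (torch.randn(1, 3, 32, 32),)

exported = torch._export.capture_pre_autograd_graph(model, example_inputs)
quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(
    xiq.get_default_x86_inductor_quantization_config(is_qat=True)
)
prepared = prepare_qat_pt2e(exported, quantizer)
# ... run QAT fine-tuning on `prepared` ...
converted = convert_pt2e(prepared)
```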

**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_with_quantizer_api
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary_with_quantizer_api
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_relu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111280
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-11-02 02:03:10 +00:00
8191fb3e06 [Reland2] [inductor][BE] split triton_meta and inductor_meta (#112351)
triton_meta is intended to be passed directly to Triton. Previously we were also putting other metadata into triton_meta, but we should split out the other metadata into a separate dict to avoid possible conflicts in the future.

This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton.

Tests - wait for CI

Differential Revision: [D50864493](https://our.internmc.facebook.com/intern/diff/D50864493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112351
Approved by: https://github.com/eellison
2023-11-02 00:40:12 +00:00
ff35e1e45b [pytree] Add custom treespec fqn field (#112428)
Custom classes that are serialized with pytree are serialized by default with `f"{class.__module__}.{class.__name__}"`. This is a dependency from our serialized program directly into the outer Python environment: if a user moves the class to a different directory, the serialized program cannot be loaded. So, we will require users to pass in an FQN if they want to serialize their custom treespec type.
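
A hedged sketch of a registration with an explicit FQN; the helper name and the `serialized_type_name` keyword are assumptions based on this PR's description, so check `torch.utils._pytree` for the exact API:
```python
import torch.utils._pytree as pytree

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# `serialized_type_name` pins the serialized name so it no longer depends
# on where the class happens to live in the Python environment.
pytree._register_pytree_node(
    Point,
    lambda p: ((p.x, p.y), None),             # flatten
    lambda children, _ctx: Point(*children),  # unflatten
    serialized_type_name="mylib.Point",
)
```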

Differential Revision: [D50886366](https://our.internmc.facebook.com/intern/diff/D50886366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112428
Approved by: https://github.com/suo
2023-11-02 00:26:41 +00:00
131e0f1b75 [export] Separate out graph signature (#112412)
Differential Revision: [D50800524](https://our.internmc.facebook.com/intern/diff/D50800524)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112412
Approved by: https://github.com/zhxchen17
2023-11-02 00:18:28 +00:00
b63335c27a Make ci_expected_accuracy/update_expected.py apply csv linter (#112655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112655
Approved by: https://github.com/desertfire
2023-11-02 00:05:14 +00:00
af1a8f4cb2 Allow passing in dynamic_shapes without original argument name (#112298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112298
Approved by: https://github.com/avikchaudhuri
2023-11-02 00:03:36 +00:00
c1e2ccdb97 AssertionError -> AttributeError in cuBLASModule (#112606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112606
Approved by: https://github.com/eellison
2023-11-01 23:23:10 +00:00
258874888b Refine replacements with equality tests on runtime asserts (#112156)
Just poppin' off some TODOs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112156
Approved by: https://github.com/albanD, https://github.com/aakhundov
ghstack dependencies: #112155
2023-11-01 23:02:17 +00:00
793c62b79c Allow binary pointwise operations to cause refinement on unbacked SymInts (#112155)
To do this, there is a little detour to remove hint caching for unbacked
SymInts; now, we just always attempt to update the hint (using
maybe_evaluate_static; this is much better than the replace we were
doing before) if we don't think we know it.

With this change, we now can generally infer that i0 == 1 is false for
a size-like unbacked SymInt.  So if we write the size match /
broadcasting test very carefully (see comment), we will eventually
end up expect_true(sizeA == sizeB), which is good enough to cause
refinement.  Phew!

I think I still want to set up a replacement if you do i0 == s0, but I'm
going to do that in a follow up.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112155
Approved by: https://github.com/aakhundov, https://github.com/voznesenskym
2023-11-01 23:02:17 +00:00
4f5acf8329 Log non-pt2_compliant ops encountered by Dynamo (#112581)
Summary:
See internal diff for more changes. Whenever we encounter a non-compliant op,
we add it to a set on the OutputGraph. When a compilation event happens, we log
the contents of this set.

I'm planning on flipping the `only_allow_pt2_compliant_ops` config from False
to True after the logging determines that existing models do not use
non-compliant ops.

Test Plan: - Tested the logging internally locally

Differential Revision: D50884828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112581
Approved by: https://github.com/yanboliang
2023-11-01 22:53:16 +00:00
00d6d2f66b [aotinductor] Add example_value metadata to nodes (#112415)
split_cat fx passes expect the `example_value` metadata on every node. However, the graph module from _export_torch_ir does not contain this metadata, causing the split_cat fx passes to not run. So, I added a pass to add this metadata to every node in the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112415
Approved by: https://github.com/frank-wei
2023-11-01 22:44:50 +00:00
f8285b1195 [dynamo] Fix nested torch function mode not setting correct value on exiting (#112621)
Should exit to the dynamo-stubbed value, not the real value, as the real value is never mutated.

Fixes https://github.com/pytorch/pytorch/issues/112620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112621
Approved by: https://github.com/jansel
2023-11-01 22:07:35 +00:00
9e2af971fc [Quantization] Add "quantization_tag" as metadata to fx proxy (#108764)
Summary:
In order to make sure that quantization_tag is preserved through second-stage export, this PR adds it as special metadata that should be preserved.

Since quantization in the export path works on top of the pre-dispatch graph, subsequent post-dispatch op decomposition will decompose ops that the quant workflow tagged. To make sure that the patterns identified by the quantizer remain identifiable even after decompositions are applied, we must preserve "quantization_tag".

This enables backend delegates, that quantized a model for specific
backend, to be able to identify "quantized" patterns.

Test Plan:
metadata porting tests

Differential Revision: [D49056259](https://our.internmc.facebook.com/intern/diff/D49056259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108764
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
2023-11-01 21:41:58 +00:00
e06288f8f1 skip test in test_eager_transforms.py while Triton lacks ARM support (#112092)
Fixes the failure of test_compile_vmap_hessian in test_eager_transforms.py by skipping the test while we wait for ARM support from Triton. cc @ptrblck @eqy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112092
Approved by: https://github.com/eqy, https://github.com/huydhn
2023-11-01 21:33:18 +00:00
5b0840c71b Guarantee expr is a sympy.Expr before xreplace'ing it (#112619)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112619
Approved by: https://github.com/eellison, https://github.com/voznesenskym
2023-11-01 21:26:27 +00:00
5d7f23b1f4 [HighOrderOp] allow aliasing a variable from outer scope in higher order op (#112537)
Fixes #112169

This PR follows voz's idea of disabling rename inside a higher-order operator body, to avoid the confusion between renaming and mutating: we'd like to allow renames and forbid mutation. Specifically, the confusion arises because rename creates a new variable tracker and calls replace_all for MutableLocal. We would either have to (1) look at the fields of the variable tracker to determine whether it's just a name change, (2) pass information into replace_all telling it the operation is a rename so it skips the side-effect check, or (3) make rename mutate the user_code_variable_name on the variable tracker (note: we've been doing this for MutableSideEffects). All three approaches seem undesirable.

We end up disabling rename if dynamo is speculating inside a higher order operator's body.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112537
Approved by: https://github.com/zou3519
2023-11-01 20:59:00 +00:00
9d23440c81 Nvfuser code base nuke (#111447)
Removes the nvFuser code base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111447
Approved by: https://github.com/albanD
2023-11-01 20:53:14 +00:00
5a6f8014c4 Add a decomposition for _weight_norm_interface. (#112193)
Fixes #112086
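
For reference, a hedged sketch of the math being decomposed, w = g * v / ||v||, with the norm taken over every dimension except `dim` (illustrative, not the exact decomposition added here):
```python
import torch

def weight_norm_interface_decomp(v, g, dim: int = 0):
    # Norm over all dimensions except `dim`, matching norm_except_dim.
    reduce_dims = [d for d in range(v.dim()) if d != dim]
    norm = v.norm(2, dim=reduce_dims, keepdim=True)
    return v * (g / norm), norm

v, g = torch.randn(4, 5), torch.randn(4, 1)
w, _ = weight_norm_interface_decomp(v, g)
torch.testing.assert_close(w, torch._weight_norm(v, g, 0))
```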

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112193
Approved by: https://github.com/ezyang
2023-11-01 19:51:11 +00:00
1b86d5ef2f [Ci] Add arm64 libtorch CI config (#112474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112474
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
ghstack dependencies: #112451, #112452
2023-11-01 19:09:34 +00:00
7f77ec37be [Inductor] Clarify mutation related comments (#112466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112466
Approved by: https://github.com/Chillee
2023-11-01 18:39:58 +00:00
dd24e92949 Use OpOverload instead of OpOverloadPacket for size/stride/etc slots (#112119)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112119
Approved by: https://github.com/yanboliang
2023-11-01 18:26:01 +00:00
ab20bab729 [ONNX] Fix partial name matching when searching parameter tensors (#112517)
Now we remove a name from `onnx_input_names` once it's matched by a parameter, so that the same name won't be matched twice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112517
Approved by: https://github.com/thiagocrepaldi
2023-11-01 18:25:26 +00:00
623a311d22 fix torch.distributed.rpc example incorrect usage (#112367)
Fixes #112366
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112367
Approved by: https://github.com/H-Huang
2023-11-01 18:08:32 +00:00
54c7d0d99d [GHF] Bot should reopen PR after revert (#112614)
Fixes https://github.com/pytorch/test-infra/issues/4692
Test plan, see https://github.com/malfet/deleteme/pull/58#issuecomment-1789365259 / https://github.com/malfet/deleteme/actions/runs/6723011476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112614
Approved by: https://github.com/seemethere, https://github.com/ezyang
ghstack dependencies: #112613
2023-11-01 18:03:32 +00:00
4a2242e479 [BE] Use GITHUB_API_URL (#112613)
To avoid hardcoding the same string constant over and over again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112613
Approved by: https://github.com/seemethere
2023-11-01 18:03:32 +00:00
fd209543d5 Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)
Part of #109802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
2023-11-01 16:10:09 +00:00
cce5016653 [Profiler] Manual Submodule Update for Kineto (#112540)
Summary:
Update the submodule of the Kineto project. Includes the following changes:

  - Fix HAS_CUPTI macro uses
  - Added error condition count tracking and prints
  - Collect more info on cudaEventRecord for stream wait sync events
  - Fix CUDA 11.7 support for new cudaLaunchKernelExC
  - Fix newlines in error info logging causing broken JSON
  - Kineto samples programs are fixed and updated
  - ROCm lib path fixed.
  - Clearing rocTracer cached data causing memory leaks
  - Fix int overflow in counter of activities
  - Populate collective metadata from CPU op to GPU kernels
  - Updated TEARDOWN_CUPTI to check value is 1

Test Plan: CI

Differential Revision: D50861994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112540
Approved by: https://github.com/davidberard98
2023-11-01 16:03:04 +00:00
84f59d893a [fx] Cache translation_validation_enabled on ShapeEnv (#112493)
`ShapeEnv` has tons of functionality that is conditioned on this
`translation_validation_enabled()` check, to the point where 8% of
time in `empty_strided` is spent just in that function.

However, it doesn't really make sense for the value of
`translation_validation_enabled()` to change throughout the life of a `ShapeEnv`
so we might as well run the check once and store it in the `ShapeEnv`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112493
Approved by: https://github.com/lezcano
ghstack dependencies: #112418
2023-11-01 14:37:28 +00:00
9e89c36a54 [FakeTensor] Reuse flat_args throughout FakeTensorMode.dispatch (#112418)
This function repeatedly flattens and unflattens the `args, kwargs` pair so we
get a quite significant perf improvement from saving the `flat_args` and
operating directly on those. I see a 15% improvement in dispatch for
`empty_strided`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112418
Approved by: https://github.com/lezcano
2023-11-01 14:37:28 +00:00
a126bbfea3 [AOTInductor] Include AOTI debug folder in package (#112514)
Summary:
Allow user to set debug dir for Inductor

Include AOTInductor debug folder in the package.

```
zipinfo package.zip
Archive:  package.zip
Zip file size: 1325264 bytes, number of entries: 46
-rw----     0.0 fat      212 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/aotinductor_pickle_data.json
-rw----     0.0 fat     6024 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.0/fx_graph_runnable.py
-rw----     0.0 fat     9031 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.0/fx_graph_readable.py
-rw----     0.0 fat     9202 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.0/fx_graph_transformed.py
-rw----     0.0 fat    10865 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.0/ir_pre_fusion.txt
-rw----     0.0 fat    10865 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.0/ir_post_fusion.txt
-rw----     0.0 fat    13553 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.0/output_code.py
-rw----     0.0 fat     5822 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.1/fx_graph_runnable.py
-rw----     0.0 fat     8817 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.1/fx_graph_readable.py
-rw----     0.0 fat     8988 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.1/fx_graph_transformed.py
-rw----     0.0 fat    10858 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.1/ir_pre_fusion.txt
-rw----     0.0 fat    10858 bl stor 80-000-00 00:00 package/data/aotinductor/merge-a100/debug/torchinductor/model___9.1/ir_post_fusion.txt
```

Test Plan: CIs

Reviewed By: chenyang78

Differential Revision: D50815320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112514
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-11-01 08:25:11 +00:00
29f3d392bf Inductor cpp wrapper: support QLinear (#112378)
Align the type of `post_op_args` in the schema of `onednn::qlinear_pointwise` to be the same as other fusion OPs like qconv, conv, conv_transpose, linear by changing from `float[]` to `Scalar?[]`:
cb942ef2b1/aten/src/ATen/native/quantized/library.cpp (L260-L266)

cb942ef2b1/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp (L48-L59)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112378
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #112373
2023-11-01 06:22:16 +00:00
337d69e40a Inductor cpp wrapper: support QConv (#112373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112373
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-11-01 06:15:49 +00:00
e061144aaf [inductor] replace ops.div with ops.truediv (#112243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112243
Approved by: https://github.com/lezcano
ghstack dependencies: #112234
2023-11-01 05:50:51 +00:00
2ed3a73e40 [dynamo] treat torch.device, torch.dtype as constant literal; revise guards to have access to torch module (#112426)
Just like containers of constant literals (e.g., a list or set of them), these are constant literals.

We follow up to https://github.com/pytorch/pytorch/pull/112416, enforcing that we always use `ConstantVariable` to represent these.

Replace https://github.com/pytorch/pytorch/pull/112284, https://github.com/pytorch/pytorch/pull/112332 as incomplete, in case there is no movement there.

Ought to fix: https://github.com/pytorch/pytorch/issues/109910

We remove the old guard special-casing, which fell back on string equality when the `torch` module was not accessible in `eval`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112426
Approved by: https://github.com/ezyang
2023-11-01 05:28:28 +00:00
76918367ff fix(dynamo): Optimizer._init_group did not handle return value (#110709)
blocks: https://github.com/pytorch/pytorch/pull/110706

This causes a bug for all optimizers that use the `_init_group` return value.

`compile` + _init_group ret value is not on testing path. So we also add test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110709
Approved by: https://github.com/ezyang
2023-11-01 05:22:42 +00:00
c73da67d46 new_qtensor support privateuseone allocator. (#111464)
I want to create a quantized tensor through `PerTensorAffineQuantizer`, but I found that it throws an error because of the missing check for PrivateUse1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111464
Approved by: https://github.com/ezyang
2023-11-01 05:16:58 +00:00
748c1a1d81 [dynamo] Be stricter about HigherOrderOperator kwargs (#111938)
kwargs need to be handled carefully in speculate subgraph. We should be clearer about the contract of what the inputs are.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111938
Approved by: https://github.com/zou3519
2023-11-01 04:10:09 +00:00
320ac546ed Clarify difference between share_memory and from_file (#111856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111856
Approved by: https://github.com/albanD
ghstack dependencies: #111688
2023-11-01 03:25:09 +00:00
df0a3c0541 Upload ROCm artifacts from the new workflow to S3 (#112442)
This is raised as a regression after https://github.com/pytorch/pytorch/pull/111394#issuecomment-1785858263.  The jobs are now in a different workflow, so their artifacts weren't uploaded to S3 like other trunk jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112442
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2023-11-01 03:06:15 +00:00
dcd94814a3 [inductor][fx pass] Add split-stack-tanh-unbind pattern detection (#111854)
Summary: We add a new pattern to further close the gap between fxt and pt2

Test Plan:
# unit test
```
buck2 test mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1407375224343119

# icvr local test
[P865759493](https://www.internalfb.com/intern/paste/P865759493/)

before vs after transformation after "merge_getitem_cat_pass":
https://www.internalfb.com/intern/diffing/?paste_number=854132317

# e2e test
The proposal is bundled D50207610, D50397173 and D50100667

### ICVR
baseline:
f489286934
baseline + optimus:
f489287369
proposal:
f492987960

### CMF
baseline:
f489195078
baseline + optimus:
f489215258
proposal:
f492970293

### IG_CTR
baseline:
f489237630
baseline + optimus:
f489238767
proposal:
f492977663

Differential Revision: D50397173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111854
Approved by: https://github.com/jackiexu1992
2023-11-01 03:04:25 +00:00
a1e222ef02 metric table (#109245)
In dynamo/inductor, it sometimes helps to gather metrics/statistics for each model at different levels: model level, graph level, kernel level, or pairs of fusion nodes. This kind of thing would be very easy to do with Scuba, but we only have Scuba in fbcode. This PR builds metric tables to solve part of the problem.

Q: why not log to stdout/err directly?
A: sometimes we need more structured data. E.g., it would be helpful to gather all the stats in a CSV and then do post-processing (like calculating a geomean, etc.). Also, the metric table will tag each row with the model name, which is helpful.

Q: what's the difference with speedup_inductor.csv?
A: speedup_inductor.csv is a special case that gathers statistics at the model level, i.e., we have one row for each model. But recording statistics at a finer-grained level, like per graph, is also helpful.

Example use cases:
- As a followup to the benchmark fusion PR, I want to gather all the 'slow' fusions and analyze them. With the metric table, I can easily log slow fusions for each model into a CSV file. Here is the log gathered for huggingface:
 https://gist.github.com/shunting314/964e73cc98368b301414ec7b7ad4c702 .
- To help understand the effect of the 'loop ordering after fusion' PR, it would be helpful to gather stats like how many fusions happen for each graph. Previously we logged the metric to stderr directly, but logging these metrics in a structured way is more useful.
- Gather the number of registers, register spills, and shared memory usage for each kernel in each model, with runnable kernel code logged (a toy sketch follows this list).
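
To make the idea concrete, here is a toy sketch of such a table (hypothetical names, not this PR's API): each row is tagged with the model name and appended to a per-table CSV for later post-processing.

```python
import csv
import os

class MetricTable:
    """Append structured, model-tagged rows to a per-table CSV."""
    def __init__(self, name, fields):
        self.path = f"metric_table_{name}.csv"
        self.fields = ["model_name"] + fields
        if not os.path.exists(self.path):
            with open(self.path, "w", newline="") as f:
                csv.writer(f).writerow(self.fields)

    def add_row(self, model_name, **metrics):
        row = [model_name] + [metrics[k] for k in self.fields[1:]]
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow(row)

slow_fusion = MetricTable("slow_fusion", ["kernel1", "kernel2", "speedup"])
slow_fusion.add_row("hf_Bert", kernel1="triton_red_0", kernel2="triton_poi_1", speedup=0.93)
```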

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109245
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-11-01 02:33:42 +00:00
5296c14094 Add inverse gamma distribution and fix sign bug in PowerTransform. (#104501)
This PR comprises a few small contributions:

1. `PowerTransform` returned a sign of `+1` irrespective of exponent. However, it should return the sign of the exponent because the gradient has the same sign as the exponent. That issue has been fixed.
2. Added tests to catch errors akin to 1. in the future.
3. Added an `InverseGamma` distribution as a `TransformedDistribution` with `PowerTransform(-1)` and `Gamma` base distribution. The `InverseGamma` is often used as a prior for the length scale of Gaussian processes to aggressively suppress short length scales (see [here](https://betanalpha.github.io/assets/case_studies/gaussian_processes.html#323_Informative_Prior_Model) for a discussion).

Note: I added a `positive` constraint for the support of the inverse gamma distribution because the `PowerTransform(-1)` can fail for `nonnegative` constraints if the random variable is zero.

```python
>>> torch.distributions.InverseGamma(0.5, 1.0).log_prob(torch.zeros(1))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-758aa22deacd> in <module>
----> 1 torch.distributions.InverseGamma(0.5, 1.0).log_prob(torch.zeros(1))

~/git/pytorch/torch/distributions/transformed_distribution.py in log_prob(self, value)
    140         """
    141         if self._validate_args:
--> 142             self._validate_sample(value)
    143         event_dim = len(self.event_shape)
    144         log_prob = 0.0

~/git/pytorch/torch/distributions/distribution.py in _validate_sample(self, value)
    298         valid = support.check(value)
    299         if not valid.all():
--> 300             raise ValueError(
    301                 "Expected value argument "
    302                 f"({type(value).__name__} of shape {tuple(value.shape)}) "

ValueError: Expected value argument (Tensor of shape (1,)) to be within the support (GreaterThan(lower_bound=0.0)) of the distribution InverseGamma(), but found invalid values:
tensor([0.])
```

This differs from the scipy implementation.

```python
>>> scipy.stats.invgamma(0.5).pdf(0)
0.0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104501
Approved by: https://github.com/fritzo, https://github.com/ezyang
2023-11-01 02:26:25 +00:00
0347b36b52 SummaryWriter.add_figure: add type hints (#110021)
Discovered a bug in our code that could have been prevented by type hints, so I added them 😄
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110021
Approved by: https://github.com/ezyang
2023-11-01 02:19:09 +00:00
6dd002f24e avoid readonly arrays (#112524)
Since PyTorch does not have read-only tensors, compiling code with read-only numpy arrays warns about possible UB. Thus we detect read-only arrays, flip them to writeable, and clone the resulting tensor.

BTW, this is a break from numpy semantics: the resulting array is writeable.
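
A sketch of the workaround, assuming a standalone helper (not the PR's actual code path):

```python
import numpy as np
import torch

def to_writeable_tensor(arr: np.ndarray) -> torch.Tensor:
    if not arr.flags.writeable:
        arr.setflags(write=True)              # flip the flag (works when arr owns its data)
        return torch.from_numpy(arr).clone()  # clone so the flipped buffer is not aliased
    return torch.from_numpy(arr)

a = np.arange(4.0)
a.setflags(write=False)
t = to_writeable_tensor(a)
t += 1  # mutating the clone is safe
```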

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112524
Approved by: https://github.com/lezcano
2023-11-01 02:15:03 +00:00
3cee033b98 Reland of a bunch of pattern matcher + indexing fixes (#112476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112476
Approved by: https://github.com/oulgen
2023-11-01 02:13:44 +00:00
ef1f08c5a0 State_dict serialization for meta tensors (#112213)
Summary: Add cases for serializing meta tensors from state_dict

Test Plan: sandcastle

Differential Revision: D50718161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112213
Approved by: https://github.com/zhxchen17, https://github.com/houseroad
2023-11-01 01:07:09 +00:00
41720c2a48 [dynamo] add infinite generators itertools.{count, repeat, cycle} (#110967)
Fixes https://github.com/pytorch/pytorch/pull/110953/files#r1352868935

Depends on: https://github.com/pytorch/pytorch/pull/110953

Why not use these for `repeat(item, count)`:
> These are not preferred as they return an opaque VariableTracker. In particular, one cannot do `enumerate(repeat(1))`. `repeat(1, 10)` benefits from the integration enjoyed by `ListVariableIterator`

Follow ups:
- [ ] make listiterator an IteratorVariable, define iterator integrations on base IteratorVariable where unspecialized https://github.com/pytorch/pytorch/pull/110967#discussion_r1356656469
    - Please make a new issue for this
- [ ] explore integrating cpython itertools test suite https://github.com/pytorch/pytorch/pull/110967#discussion_r1358326402
- [ ] Use something other than `StopIteration` to handle iterator termination https://github.com/pytorch/pytorch/pull/110967#discussion_r1358336038
- [ ] Add test case for consuming iterator simultaneously from two code points https://github.com/pytorch/pytorch/pull/110967/files#r1358325511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110967
Approved by: https://github.com/ezyang
2023-11-01 00:33:17 +00:00
9bfebf754f [dynamo] fix graph break, improve hygiene - enforce using ConstantVariable for torch.device,torch.dtype (#112416)
Fixes https://github.com/pytorch/pytorch/pull/112332/files#r1375690808

Simplify code paths, fix graph break

```
torch._dynamo.exc.InternalTorchDynamoError: TorchVariable() has no type
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112416
Approved by: https://github.com/lezcano
2023-11-01 00:19:52 +00:00
74e6c877e9 Revert "[inductor] Memory planning (#112178)"
This reverts commit f64a97c6f88873363c5b3c4c33f231b5578085b2.

Reverted https://github.com/pytorch/pytorch/pull/112178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems that ROCm will need to be fixed for the new test too f64a97c6f8 ([comment](https://github.com/pytorch/pytorch/pull/112178#issuecomment-1788195311))
2023-11-01 00:03:56 +00:00
333d5821ee [ROCm] Add gcnArchName to collect_env and torch.cuda.get_device_properties (#107477)
Printing just the device name is not helpful when investigating PyTorch issues filed for specific AMD GPUs, as the support/issue might depend on the gfx arch, which is part of the gcnArchName property.

`torch.cuda.get_device_properties(0).gcnArchName` will print the value of the `gcnArchName` property: eg.
```
>>> torch.cuda.get_device_properties(0).gcnArchName
'gfx906:sramecc+:xnack-'
```

```
root@6f064e3c19fb:/data/pytorch/test# python ../torch/utils/collect_env.py
...
GPU models and configuration: AMD Radeon Graphics(gfx906:sramecc+:xnack-)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107477
Approved by: https://github.com/albanD
2023-10-31 23:05:36 +00:00
4daf8afe8e Revert "Fix bug: not creating empty tensor with correct sizes and device. (#106734)" (#112170)
This reverts commit 528a2c0aa97d152b8004254040076b8ae605bf9f.

The PR is wrong, see #110941.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112170
Approved by: https://github.com/albanD
2023-10-31 23:02:33 +00:00
0f4d2904be [dynamo] compiled_autograd support for post_acc_grad hooks (#112326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112326
Approved by: https://github.com/jansel
ghstack dependencies: #112325
2023-10-31 22:53:01 +00:00
16953482d9 Revert "Enable planner to be used for loading sharded optimizer state dict (#112259)"
This reverts commit 6188f2e899e58cc120afd571094a97047bf97681.

Reverted https://github.com/pytorch/pytorch/pull/112259 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal builds. @wz337 can you please help fix this? ([comment](https://github.com/pytorch/pytorch/pull/112259#issuecomment-1788119247))
2023-10-31 22:27:48 +00:00
c8b74fd012 Add assigntome-docathon workflow (#112525)
- Adding a workflow to enable docathon participants to assign issues to themselves.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112525
Approved by: https://github.com/clee2000
2023-10-31 22:23:32 +00:00
9e0cd64c5e [fx] Add Graph option for replace_pattern (#112409)
Summary:
Allow doing pattern replacement with just an fx.Graph instead of a fx.GraphModule,
which can let callers avoid paying the cost of `recompile()` for a small graph if they
don't need the module.

This is a significant speedup if you use hundreds of small patterns for replacement.
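
For contrast, here is the existing GraphModule-based entry point; this PR adds an analogous path that accepts a bare fx.Graph and skips `recompile()` (the new signature is not shown here, only the established API):

```python
import torch
import torch.fx as fx

def pattern(x, y):
    return torch.add(x, y)

def replacement(x, y):
    return torch.sub(x, y)

def f(a, b):
    return torch.add(a, b) * 2

gm = fx.symbolic_trace(f)
fx.replace_pattern(gm, pattern, replacement)  # rewrites the graph, then recompile()s
print(gm.code)  # now computes torch.sub(a, b) * 2
```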

Test Plan: Tested in a diff stacked on top of this: {D50756722}

Reviewed By: SherlockNoMad, angelayi

Differential Revision: D50756723

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112409
Approved by: https://github.com/ZainRizvi
2023-10-31 22:16:20 +00:00
53acdb66f7 [primtorch] aten.normal decomp has wrong return type due to elementwise_type_promotion_wrapper (#112467)
Fixes https://github.com/pytorch/pytorch/issues/112449

elementwise_type_promotion_wrapper will promote `aten.normal` to the dtypes of `mean`, `std` args.

But this is incorrect if we provide the dtype param. Hence, we allow overriding the result_dtype if a specified dtype arg is available.

This problem is unique to `aten.normal`; all other decorated ops do not have a dtype param.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112467
Approved by: https://github.com/lezcano
2023-10-31 20:57:09 +00:00
24f217ee64 [Nested tensor] Add more ops in Python subclass nested tensor (#112302)
Summary: Add dropout, split_with_sizes, and silu operations to the Python-subclass nested tensor

Test Plan: unit tests

Differential Revision: D50676812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112302
Approved by: https://github.com/soulitzer, https://github.com/jbschlosser
2023-10-31 20:57:05 +00:00
17fd4885aa [dynamo] Support custom dict constructor with kwargs (#112513)
Summary:

As of https://github.com/pytorch/pytorch/pull/103192, dynamo
supports code that creates OrderedDict instances using kwargs
for the key-value pairs rather than passing a dict literal.

But custom dicts (for example subclasses of OrderedDict) follow
a different codepath so that we can check for conditions such
as a custom `__init__` that need to force a graph break.

This commit allows kwargs for custom dict constructors: if the
args are empty and the class is not also a dataclass (the case
that, for example, a `transformers.modeling_outputs.ModelOutput`
instance winds up hitting), then we treat the kwargs as the
key-value pairs.

NOTE: For this to behave 100% correctly, we are relying on
the fact that python dicts behave like ordered dicts so that they
preserve the kwargs' ordering. Technically it is not guaranteed that
future versions of Python will respect this; if that behavior changes
we would need to ensure that dynamo uses OrderedDict for kwargs all
the way down in order to handle special cases like OrderedDict where
the kwargs' ordering does matter.
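
A minimal example of the now-supported pattern, assuming the behavior described above (`fullgraph=True` asserts there is no graph break):

```python
import torch
from collections import OrderedDict

class MyDict(OrderedDict):  # custom dict subclass without a custom __init__
    pass

@torch.compile(fullgraph=True)
def f(x):
    d = MyDict(a=x + 1, b=x - 1)  # constructed from kwargs, not a dict literal
    return d["a"] * d["b"]

print(f(torch.ones(2)))
```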

Test Plan:

```
pytest test/dynamo/test_functions.py
```

I also verified that the new test fails without the changes to
`dicts.py`.

Reviewers: yanboliang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112513
Approved by: https://github.com/yanboliang
2023-10-31 20:55:38 +00:00
f74d766632 feat(optim): use has_complex shortcut flag for all applicable optimizers, use _view_as_real auxiliary function (#110706)
Follow up to: https://github.com/pytorch/pytorch/pull/110607

CC: @lezcano @janeyx99
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110706
Approved by: https://github.com/lezcano
2023-10-31 20:33:03 +00:00
90bef4411e [Profiler] Disable CUPTI Teardown when using CUDA Graphs (#112507)
Summary:
CUDA Graph does not work well with CUPTI teardown.
    1) crashes on 1st lazy CUPTI re-init after teardown (CUDA 11)
    2) crashes on 2nd non-lazy CUPTI re-init after teardown (CUDA 12)

Workaround: turn off CUPTI teardown when using CUDA Graphs completely.

Test Plan: CI

Differential Revision: D50811284

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112507
Approved by: https://github.com/davidberard98
2023-10-31 20:17:05 +00:00
bc098c7fc2 Revert "[dynamo] ExecutorchCallDelegateHigherOrderVariable - add sanity check that input and output tensors are disjoint (#111960)"
This reverts commit 25f06ee51b0a113d13612cdc4dc7275250436bd0.

Reverted https://github.com/pytorch/pytorch/pull/111960 on behalf of https://github.com/izaitsevfb due to Breaks internal tests, [T168506136](https://www.internalfb.com/intern/tasks/?t=168506136) ([comment](https://github.com/pytorch/pytorch/pull/111960#issuecomment-1787964742))
2023-10-31 20:14:20 +00:00
b1b3d489f3 Revert "[dynamo] Be stricter about HigherOrderOperator kwargs (#111938)"
This reverts commit eb8af4dc675c625bbe2a28077e5951d4bbe8b862.

Reverted https://github.com/pytorch/pytorch/pull/111938 on behalf of https://github.com/izaitsevfb due to Reverting to unblock the revert of #111960 ([comment](https://github.com/pytorch/pytorch/pull/111938#issuecomment-1787960567))
2023-10-31 20:10:58 +00:00
f64a97c6f8 [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.
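
To illustrate the placement strategy, here is a toy greedy-by-size planner (illustrative only, not inductor's implementation). Each buffer is a `(size, live_start, live_end)` triple; the largest buffer is placed first, at the lowest offset that does not collide with any already-placed buffer whose live range overlaps:

```python
def plan(buffers):
    placed = []  # (offset, size, live_start, live_end)
    for size, start, end in sorted(buffers, key=lambda b: -b[0]):
        offset, conflict = 0, True
        while conflict:
            conflict = False
            for o, s, ps, pe in placed:
                time_overlap = not (end < ps or pe < start)
                space_overlap = offset < o + s and o < offset + size
                if time_overlap and space_overlap:
                    offset = o + s  # bump past the colliding allocation
                    conflict = True
        placed.append((offset, size, start, end))
    return placed  # pool size = max(offset + size) over all entries

print(plan([(2048, 3, 4), (1024, 0, 2), (512, 1, 3)]))
```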

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-31 20:02:30 +00:00
aa649f713f [dynamo, test] remove #ops comparison to fx.symbolic_trace from dynamo standard_test (#112420)
Fix https://github.com/pytorch/pytorch/issues/112230 by removing the comparison of the number of ops in dynamo vs. fx.symbolic_trace. A number of tests in `test_functions.py` fail because the number of ops is no longer the same, but this seems to be acceptable behavior by dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112420
Approved by: https://github.com/jansel, https://github.com/int3
2023-10-31 19:55:47 +00:00
bb45f89cd9 Hackable distributed filesystem reader and writer (#106635)
I propose some changes so that `FileSystemReader` and `FileSystemWriter` can be used on other file systems. The user only needs to provide `path` as a subclass of `Path` that overrides the necessary interfaces.

For example, one can utilize `tf.io.gfile` to implement an interface to save to or load from HDFS. The following code snippet shows a working implementation.

```python
from pathlib import Path
import tensorflow as tf

class GFileWrapper(tf.io.gfile.GFile):
    def __init__(self, path, mode="r") -> None:
        super().__init__(path, mode)

    def write(self, data):
        return super().write(bytes(data))

    # a not quite efficient readinto, but it works
    def readinto(self, buffer):
        # read up to buffer's length
        data = self.read(len(buffer))
        length = len(data)
        buffer[:length] = data
        return length

class HdfsPath(type(Path())):
    def __new__(cls, *pathsegments):
        return super().__new__(cls, *pathsegments)

    @staticmethod
    def _fix_path(path):
        path = str(path)
        if path.startswith("hdfs:/") and not path.startswith("hdfs://"):
            path = path.replace("hdfs:/", "hdfs://")
        return path

    def open(self, mode="r", *args, **kwargs):
        return GFileWrapper(HdfsPath._fix_path(self), mode=mode)

    def mkdir(self, **kwargs) -> None:
        return tf.io.gfile.makedirs(HdfsPath._fix_path(self))

    def rename(self, target):
        return tf.io.gfile.rename(HdfsPath._fix_path(self), HdfsPath._fix_path(target))
```

```python
writer = FileSystemWriter(HdfsPath("hdfs://..."), sync_files=False)
reader = FileSystemReader(HdfsPath("hdfs://..."))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106635
Approved by: https://github.com/fduwjj
2023-10-31 19:36:18 +00:00
1df1ae66cc [DTensor] Assert shard dim is less than tensor ndim (#112404)
Assert that the shard dim is less than the tensor ndim. Previously, an index error on line 154 was thrown and the error message was not clear

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112404
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-10-31 19:36:14 +00:00
6ae21e73d3 [inductor] FX graph cache: Add support for symbolic shapes (#111421)
Summary: Add support for caching graphs that have tensor args with symbolic shapes. The high-level approach is to serialize guards with the on-disk cached object and validate that those guards pass before serving a cached object.
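
A toy illustration of the guard-validation idea (not the real FX graph cache format):

```python
class CachedEntry:
    def __init__(self, artifact, guards):
        self.artifact = artifact
        self.guards = guards  # predicates over the symbolic input shapes

    def lookup(self, shapes):
        if all(guard(shapes) for guard in self.guards):
            return self.artifact
        return None  # guard failure: fall through to a fresh compile

entry = CachedEntry("compiled_kernel", [lambda s: s[0] >= 2, lambda s: s[0] % 2 == 0])
print(entry.lookup((8, 16)))  # "compiled_kernel"
print(entry.lookup((3, 16)))  # None -> recompile
```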

Test Plan: New unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111421
Approved by: https://github.com/ezyang
2023-10-31 19:31:05 +00:00
1483097679 Update how Dynamo decides to graph break on an OpOverloadPacket (#112200)
Previously, under config.only_allow_pt2_compliant_ops, Dynamo graph
breaks when it sees an OpOverloadPacket where any overload is not
PT2 compliant. This is potentially brittle: if someone (unlikely) adds
a new overload to a custom operator, then this would cause a
previously non-graph-breaking call to the OpOverloadPacket to graph break.

In this PR:
- When Dynamo is about to write a call to an operator to the FX graph,
we check if it is PT2 compliant.
- For OpOverload, we check to see if the tag is on it
- For OpOverloadPacket, we do overload resolution and check to see if
  the tag is on the OpOverload that it resolves to.

Test Plan:
- new tests, existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112200
Approved by: https://github.com/bdhirsh
2023-10-31 19:10:37 +00:00
fb0e3a5740 Refactor TD tests to own folder (#112166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112166
Approved by: https://github.com/clee2000
ghstack dependencies: #112161
2023-10-31 18:50:54 +00:00
5f461e9ec1 Revert "Error early when dataclass is not registered (#112211)"
This reverts commit b165abaa3b5b2f81fcd69c1060e651aabe38e574.

Reverted https://github.com/pytorch/pytorch/pull/112211 on behalf of https://github.com/ZainRizvi due to Breaks internal builds. See D50820325 ([comment](https://github.com/pytorch/pytorch/pull/112211#issuecomment-1787794078))
2023-10-31 18:45:25 +00:00
a21851c69d fix(inductor): ForeachKernelSchedulerNode group shape should be opaque for graph debug (#110336)
~~Shape is assumed by `TensorMetadata` to be torch.Shape/tuple, however, some of the scheduler node groups utilize `int`, so convert to tuple.~~

The root cause is actually the `foreach` scheduler node having a silently erroneous group of `int`, when in fact it ought to be an opaque `foreach`.

**Previously:** silent error / confusing shape of (0,)
![image](https://github.com/pytorch/pytorch/assets/9093549/5bc2a3c7-151f-4433-bbf8-044c7b03e989)

**Now:** clear that it is a foreach, which does not have a well-defined shape:
![image](https://github.com/pytorch/pytorch/assets/9093549/8373080d-4519-4e74-8a3b-da463e9968da)

~~Alternate might be to create list of shapes for each of its subnodes. Actually, for debuggability sake, I may prefer this. We can ensure that the recursive generation of this string is only done dynamically in a debug code path. Else, incrementally computing it on initialization of ForeachKernel may also be feasible.~~ This is quite infeasible for 100s of params.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110336
Approved by: https://github.com/mlazos
2023-10-31 18:44:08 +00:00
2e40e09d57 [dynamo] {*}Tensor.__init__ from list of Tensor/ndarray as torch.stack(List[FakeTensor]) (#111741)
Follow up to https://github.com/pytorch/pytorch/pull/111665

Fixes: https://github.com/pytorch/pytorch/issues/106207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111741
Approved by: https://github.com/lezcano
2023-10-31 18:44:04 +00:00
2f51b9223c Make sure namedtuple are preserved when adding backward hooks on Module (#112433)
As per title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112433
Approved by: https://github.com/mikaylagawarecki
2023-10-31 18:40:35 +00:00
94f3df27e4 [aotinductor] reland: return a copy of any constant (#112370)
When the model returns a constant, we cannot "release" its handle,
because the constant doesn't have any handle at all. Instead,
we should allocate a new tensor and then return a copy of the constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112370
Approved by: https://github.com/hl475, https://github.com/desertfire
2023-10-31 18:36:44 +00:00
36164265ae [export oncall] add some examples during oncall (#112445)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112445
Approved by: https://github.com/ydwu4
2023-10-31 18:33:03 +00:00
fbafff3668 [reland][inductor] benchmark fusion (#112450)
reland https://github.com/pytorch/pytorch/pull/108193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112450
Approved by: https://github.com/jansel
2023-10-31 18:17:06 +00:00
481a7a9643 [execution trace] ignore some properties when symbolic size/strides exist (#112458)
Fixes #112235

Otherwise an exception will be thrown when we try to access storage or sizes on a tensor with symbolic size/strides.

Added a test in test/dynamo/test_profiler.py

Differential Revision: [D50821576](https://our.internmc.facebook.com/intern/diff/D50821576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112458
Approved by: https://github.com/aaronenyeshi
2023-10-31 18:13:03 +00:00
a5641bc56b [TD] Enable Test Class granularity on heuristics (#112161)
Changes the heuristic framework to support prioritizing individual test classes within a test file.

Components of this included:
- Updating TestPrioritizations to accept individual test classes being prioritized. Previously, when a heuristic wanted to prioritize a test file, it would pass in the test's name; now, to prioritize a class within a test, it uses the notation "test::classname"
- Changes are fully backwards compatible with existing heuristics
- Test sharding now supports sharding individual tests (for when they're prioritized)
- When a TestClass is prioritized, we pass the appropriate "-k" flags down to pytest (see the sketch after this list)
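
As referenced above, a sketch of how the "file::classname" notation could map to pytest flags (hypothetical helper, not the framework's actual code):

```python
def to_pytest_args(test: str):
    if "::" in test:
        test_file, test_class = test.split("::", 1)
        return [f"{test_file}.py", "-k", test_class]
    return [f"{test}.py"]

print(to_pytest_args("test_ops::TestCommonCUDA"))  # ['test_ops.py', '-k', 'TestCommonCUDA']
```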

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112161
Approved by: https://github.com/huydhn
2023-10-31 18:11:05 +00:00
5cd1208415 [quant][pt2][be] Refactor QAT q-dq patterns (#112279)
Summary: This commit refactors q-dq patterns used in QAT fusion,
reducing code duplication. This is important for future efforts
to support quantizing bias.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112279
Approved by: https://github.com/jerryzh168
ghstack dependencies: #112159
2023-10-31 18:04:23 +00:00
231129ea36 [quant][pt2] Fix QAT conv-bn bias derived qspec (#112159)
Summary: Today, we have special handling for special qspecs like
`SharedQuantizationSpec` or `DerivedQuantizationSpec`, since these
qspecs refer to other nodes in the graph and these node references
need to be updated after replacement (since they referred to nodes
in the original graph that no longer exist in the new graph).

However, we only do the above for special nodes like conv, bn,
getitem, and relu. This doesn't cover the common use case of
having conv bias derive its qparams from those of conv input
activations and conv weight. This commit adds support for this
use case by also replacing the node references for these nodes.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_bias_derived_qspec

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar

Differential Revision: [D50697078](https://our.internmc.facebook.com/intern/diff/D50697078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112159
Approved by: https://github.com/jerryzh168
2023-10-31 18:04:23 +00:00
30237aaeec [MPS] Fix bug when value is of complex (#111937)
When the value of `fill` is complex, the line `value.toDouble() == 0.0` errors out, saying that converting a complex to a double would cause overflow. So we should handle the complex value first, before entering this condition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111937
Approved by: https://github.com/malfet
ghstack dependencies: #111885
2023-10-31 17:50:56 +00:00
3db0095ea2 [reland][quant][pt2e][be] Cleanup observer insertion logic (#111828) (#112453)
Summary: as titled; after the SharedQuantizationSpec bug fix we do some checks beforehand, which simplifies the logic when we insert observers

Test Plan:
contbuild & OSS CI, see bf998a2c5d

Test plan from GitHub:
python test/test_quantization.py TestQuantizePT2E

CIs

Differential Revision: D50816224

Pulled By: jerryzh168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112453
Approved by: https://github.com/andrewor14
2023-10-31 17:33:24 +00:00
a1c56df1f0 [inductor cpp] vectorize support for truediv (#112234)
Ops like group_norm use `ops.truediv`, which doesn't have vectorization support yet. This PR adds that support.

`test_group_norm_vec`
Before:
```c++
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L))
            {
                {
                    #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                    #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
        {
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                    {
                        auto tmp0 = in_ptr0[static_cast<long>(x2 + (1024L*x1) + (32768L*x0))];
                        auto tmp1 = out_ptr0[static_cast<long>(x1 + (32L*x0))];
                        auto tmp3 = out_ptr1[static_cast<long>(x1 + (32L*x0))];
                        auto tmp10 = in_ptr1[static_cast<long>(x1)];
                        auto tmp12 = in_ptr2[static_cast<long>(x1)];
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = c10::convert<float>(1024.0);
                        auto tmp5 = tmp3 / tmp4;
                        auto tmp6 = c10::convert<float>(1e-05);
                        auto tmp7 = tmp5 + tmp6;
                        auto tmp8 = 1 / std::sqrt(tmp7);
                        auto tmp9 = decltype(tmp2)(tmp2 * tmp8);
                        auto tmp11 = decltype(tmp9)(tmp9 * tmp10);
                        auto tmp13 = tmp11 + tmp12;
                        out_ptr2[static_cast<long>(x2 + (1024L*x1) + (32768L*x0))] = tmp13;
                    }
                }
            }
        }
    }
}
```

After:
```c++
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L))
            {
                {
                    #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                    #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
        {
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (1024L*x1) + (32768L*x0)));
                        auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(out_ptr0[static_cast<long>(x1 + (32L*x0))]));
                        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(out_ptr1[static_cast<long>(x1 + (32L*x0))]));
                        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(in_ptr1[static_cast<long>(x1)]));
                        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(in_ptr2[static_cast<long>(x1)]));
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1024.0));
                        auto tmp5 = tmp3 / tmp4;
                        auto tmp6 = at::vec::Vectorized<float>(static_cast<float>(1e-05));
                        auto tmp7 = tmp5 + tmp6;
                        auto tmp8 = tmp7.rsqrt();
                        auto tmp9 = tmp2 * tmp8;
                        auto tmp11 = tmp9 * tmp10;
                        auto tmp13 = tmp11 + tmp12;
                        tmp13.store(out_ptr2 + static_cast<long>(x2 + (1024L*x1) + (32768L*x0)));
                    }
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112234
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-10-31 17:15:21 +00:00
b91fcdf4aa [dynamo] Add support for register_post_accumulate_grad_hook (#112325)
lint

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112325
Approved by: https://github.com/jansel
2023-10-31 17:04:49 +00:00
04024926f4 Use pytree.tree_map_ everywhere (#112417)
Wherever we discard the output of `tree_map` it's better to call `tree_map_`
which doesn't unflatten the mapped results and so is a lot cheaper.
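
A quick contrast of the two calls:

```python
import torch
import torch.utils._pytree as pytree

tree = {"a": torch.ones(2), "b": [torch.zeros(3)]}

# tree_map builds and returns a whole new unflattened tree:
doubled = pytree.tree_map(lambda t: t * 2, tree)

# when the mapped results are discarded (side effects only), tree_map_
# skips the unflatten step and just returns its input:
pytree.tree_map_(lambda t: print(t.shape), tree)
```
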
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112417
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392, #112393, #112394
2023-10-31 15:57:06 +00:00
66c32d099a Use pytree.arg_tree_leaves everywhere (#112394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112394
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392, #112393
2023-10-31 15:57:06 +00:00
046c0c66fd [pytree] Add arg_tree_leaves to optimize flattening function arguments (#112393)
We commonly do some variation of `tree_leaves((args, kwargs))`. This adds a new
function `arg_tree_leaves(*args, **kwargs)` which takes advantage of the known
structure of `args` and `kwargs` to skip their `flatten_fn`.

I see ~1 us improvement per call for args + kwargs, or a 0.5 us improvement
when passing just one of `args` or `kwargs`. For shallow structures, this can be
proportionally quite significant. For example, the empty_strided call I've been
using as a benchmark:
```
args = ((100, 100), (100, 1))
kwargs = dict(device="cuda")
```
Sees a 30% speedup from this.
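
Usage sketch with the benchmark arguments above:

```python
import torch.utils._pytree as pytree

args = ((100, 100), (100, 1))
kwargs = dict(device="cuda")

# the common pattern: wrap args/kwargs in a tuple and flatten everything
leaves = pytree.tree_leaves((args, kwargs))

# the new helper exploits the known top-level structure of *args/**kwargs
# and skips their flatten_fn
leaves = pytree.arg_tree_leaves(*args, **kwargs)
```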

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112393
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392
2023-10-31 15:57:00 +00:00
86196bf116 add batch impl. for inplace index_add operation (#112276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112276
Approved by: https://github.com/zou3519, https://github.com/kshitij12345, https://github.com/malfet
2023-10-31 13:47:53 +00:00
424c093fc7 Fix comment spelling error (#112468)
Fix tiny spelling error in comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112468
Approved by: https://github.com/kit1980
2023-10-31 10:53:12 +00:00
a310cc8968 Add Half support for kthvalue, cross, hist, and logit on CPU (#112135)
Add Half support for kthvalue, cross, hist, and logit on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112135
Approved by: https://github.com/cpuhrsch
2023-10-31 09:12:47 +00:00
8d6b4322d0 [CI] Limit libtorch builds to shared-with-deps (#112452)
As that is the only variant that is being mentioned on  https://pytorch.org/get-started/locally/

And for MacOS those three flavors were just building and uploading the
same thing 3 times over, see [this](https://github.com/pytorch/pytorch/actions/runs/6689661275/job/18176516410) for example:
```
upload: ../../_temp/artifacts/libtorch-macos-2.2.0.dev20231030.zip to s3://pytorch/libtorch/nightly/cpu/libtorch-macos-2.2.0.dev20231030.zip
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112452
Approved by: https://github.com/huydhn
ghstack dependencies: #112451
2023-10-31 08:40:06 +00:00
70b392ae02 [dtensor] enable foreach operators for adam optimizer (#112108)
This PR enables basic foreach ops in DTensor for the Adam optimizer, to improve
performance compared to an optimizer using plain torch.Tensor. Currently, by
default, the optimizer won't do this for tensor subclasses; we will need to
enable it by default in DTensor once all ops are covered, or enable it early
when exploring the new FSDP. We just need to append DTensor to the optimizer
allow list.

Some latency measurements on a 5-layer MLP model:
single tensor adam: 17ms
![Screenshot 2023-10-29 at 10 48 22 PM](https://github.com/pytorch/pytorch/assets/9443650/8937d786-b863-4318-88c2-12e43180ce8d)
foreach multitensor adam: 4ms
![Screenshot 2023-10-29 at 10 50 58 PM](https://github.com/pytorch/pytorch/assets/9443650/de105cc3-8e12-4765-938a-763d8e958194)

so around a 4.25x improvement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112108
Approved by: https://github.com/wz337
2023-10-31 08:09:46 +00:00
e66ec5843f [RESUBMIT] Cleanup error reporting for ProcessGroupNCCL (#112419)
Continuing some of the work from https://github.com/pytorch/pytorch/pull/108191, I realized the majority of errors raised from ProcessGroupNCCL were just generic RuntimeErrors.

In this PR, I've added appropriate error types to all the exceptions raised from ProcessGroupNCCL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112419
Approved by: https://github.com/fduwjj
2023-10-31 05:58:21 +00:00
cb942ef2b1 Revert "add batch impl. for inplace index_add operation (#112276)"
This reverts commit e3c8c63deaf594699d827e84869a3ecd7e2ab494.

Reverted https://github.com/pytorch/pytorch/pull/112276 on behalf of https://github.com/PaliC due to breaking linux binary builds ([comment](https://github.com/pytorch/pytorch/pull/112276#issuecomment-1786455375))
2023-10-31 05:10:47 +00:00
08dbfecdbd Revert "Symintify repeat_interleave (#109133)" (#112245)
This reverts commit 41e5d410cf4bfaaf264cc97b541e00d968be6db2.

Differential Revision: [D50804696](https://our.internmc.facebook.com/intern/diff/D50804696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112245
Approved by: https://github.com/eellison
2023-10-31 03:50:26 +00:00
6cebacdbc0 [vision hash update] update the pinned vision hash (#112455)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112455
Approved by: https://github.com/pytorchbot
2023-10-31 03:32:40 +00:00
710337244d [Inductor] Extend Pattern Matcher to Match Equivalent Function Invocation (#107832)
Fixes #104391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107832
Approved by: https://github.com/jansel
2023-10-31 03:32:33 +00:00
f50ec341bc inductor cpp wrapper: add GIL release and acquire (#111888)
Support multi-instance inference (in different threads of the same process) as in https://github.com/pytorch/pytorch/issues/93524#issuecomment-1421816158.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111888
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-10-31 03:23:30 +00:00
bb97ce4c7f [ivalue] operator<<: don't error on invalid IValue tags (#112232)
While running the profiler, we observed a scenario where we encounter IValues with invalid tags. Specifically, we try to convert the IValue to a string here:

d3bf6803b6/torch/csrc/profiler/util.cpp (L306-L308)

and in the scenario with invalid IValues, an exception gets thrown here, in `operator<<`.

d3bf6803b6/aten/src/ATen/core/ivalue.cpp (L864)

IMO, `<<` shouldn't error if the IValue is bad; instead, we should just print that the IValue tag is invalid.

Differential Revision: [D50760040](https://our.internmc.facebook.com/intern/diff/D50760040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112232
Approved by: https://github.com/albanD
2023-10-31 02:15:43 +00:00
c3113514e9 Fix regression from pointwise + multi-level reduction fusion (#112297)
In https://github.com/pytorch/pytorch/pull/111122, an optimization is introduced for reduction + pointwise + multi-level reduction fusion. The main idea of this optimization is to have the first-level reduction of the multi-level reduction reuse the reduction sizes of the first reduction kernel, so that there are better chances that the first reduction kernel and the first-level reduction of the multi-level reduction kernel can be fused. However, it introduces a bug for the pointwise + multi-level reduction pattern, where the first-level reduction kernel wrongly reuses the reduction ranges (which are `[]`) from the previous pointwise kernel. This PR fixes that issue.

Test plan:
`python timm_models.py --training --amp --performance --only=dm_nfnet_f0 --inductor`
Results before this PR: 0.869x
Results after this PR: 1.232x

Benchmark results:
![Screenshot 2023-10-30 at 2 30 10 PM](https://github.com/pytorch/pytorch/assets/10527447/c7b241c0-92a4-49ff-96fb-2805c8fcc45a)

<img width="1491" alt="Screenshot 2023-10-30 at 3 10 06 PM" src="https://github.com/pytorch/pytorch/assets/10527447/608d26ea-dcc5-4f2a-8700-4a928701392b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112297
Approved by: https://github.com/jansel
2023-10-31 01:47:46 +00:00
6ab1121bdc Enable Mypy checking for scheduler.py (#105600)
ATT, add type annotations and type assertions to pass Mypy checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105600
Approved by: https://github.com/int3
2023-10-31 01:47:13 +00:00
0ce8cf7c7a Update small wheel nccl-version to 2.19.3 (#112293)
To keep it in sync with https://github.com/pytorch/pytorch/pull/110827

Added check to `scripts/generate_binary_build_matrix.py` to validate submodule and small wheel nccl versions are the same

Step one in addressing https://github.com/pytorch/pytorch/issues/112285
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112293
Approved by: https://github.com/huydhn
2023-10-31 01:20:01 +00:00
236eff9531 [BE] Refactor repeated asserts in test_foreach.py (#112348)
The tested conditions in `test_binary_op_list_error_cases` look almost identical, although they cover both method and in-place variants. Use a for loop to make the distinction a bit more explicit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112348
Approved by: https://github.com/albanD
ghstack dependencies: #112349
2023-10-31 01:11:44 +00:00
e3c8c63dea add batch impl. for inplace index_add operation (#112276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112276
Approved by: https://github.com/zou3519, https://github.com/kshitij12345
2023-10-31 00:59:18 +00:00
2f09da3a21 [dtensor] Introduce full_tensor API to DTensor (#112224)
This PR introduces a `full_tensor` API to DTensor. There were so many
call sites exercising the `redistribute(replicate)` path that I feel
it deserves a separate API; it's mostly just syntactic sugar
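
A single-process sketch of the sugar (assumed minimal gloo setup; real use is multi-rank):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Replicate, Shard, distribute_tensor

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

mesh = DeviceMesh("cpu", [0])
dt = distribute_tensor(torch.arange(8.0), mesh, [Shard(0)])

full_a = dt.redistribute(mesh, [Replicate()]).to_local()  # the spelled-out path
full_b = dt.full_tensor()                                 # the new sugar

assert torch.equal(full_a, full_b)
dist.destroy_process_group()
```
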
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112224
Approved by: https://github.com/wz337
2023-10-31 00:44:09 +00:00
e2cd69a770 [CI] Call upload step upload (#112451)
Rather than `build`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112451
Approved by: https://github.com/huydhn
2023-10-31 00:37:14 +00:00
b8a10a8a2d Add batch decomposition for torch.unsafe_chunk (#110862)
This updates the docs as well to show `torch.unsafe_chunk`. Or should the `unsafe_*` functions not appear in the docs?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110862
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-10-31 00:37:08 +00:00
40569b28f4 Constrain fx_stride order for scaled_mm (#112430)
# Summary
cuBLASLt requires row_major @ col_major order for scaled_mm. Without adding this, it is possible for the inputs to not respect this constraint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112430
Approved by: https://github.com/eellison
2023-10-31 00:02:35 +00:00
12a9e09200 [inductor] Fix bug handling output_strides in fx graph cache (#112041)
Summary: The current implementation is not properly attaching output strides to the tracing context when an fx graph is loaded from the cache. That bug leads to assertion failures like `AssertionError: expected size 3==3, stride 1==9 at dim=1`. This change saves the output strides in the serialized object cached on disk and inserts them into the tracing context whether the graph is loaded from the cache or freshly compiled.

Test Plan:
* New unit test using resnet18 (which repros the problem)
* Ran the timm benchmark suite with `--training`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112041
Approved by: https://github.com/ezyang
2023-10-30 23:49:10 +00:00
cf3aa985a9 Don't rewrite assert in pytest (#112436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112436
Approved by: https://github.com/angelayi
2023-10-30 23:20:02 +00:00
479f5eb029 [dynamo] Remove dead code - real_value_tensor_positive_aliases (#111911)
(Legality) It is currently impossible (and should remain impossible) to access the same **static** tensor value from a **different source**, since dedup guards ensure all static tensors are unique.

As for `getattr(nn.Module, tensor)` source collisions, we will never instantiate a `nn.Module getattr` source for a static tensor, due to:
- side-effect tracking (as long as we track all static tensors - see also https://github.com/pytorch/pytorch/pull/112025 for extra sanity check)
- See: c8a5bb451e/torch/_dynamo/variables/builder.py (L227)

(no worse) In any case, this field is currently unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111911
Approved by: https://github.com/voznesenskym
2023-10-30 23:10:52 +00:00
6188f2e899 Enable planner to be used for loading sharded optimizer state dict (#112259)
This creates a more consistent interface for saving and loading sharded state dicts. A planner can be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259
Approved by: https://github.com/wz337
2023-10-30 22:51:09 +00:00
ac71fea1a8 [test][functorch] fix function name in factory_fns (#112315)
This PR fixes an incorrect function name in `factory_fns`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112315
Approved by: https://github.com/zou3519
2023-10-30 22:08:07 +00:00
31c0ef934b [pytree] Remove LeafSpec construction cost in tree_flatten (#112392)
On my machine, `pytree.LeafSpec()` takes ~600ns but since every leaf spec is the
same, we can just use a global constant.
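
A sketch of the optimization (not the PR's exact code):

```python
import torch.utils._pytree as pytree

# every leaf spec is identical, so one module-level instance can be shared
# instead of paying ~600ns per construction
_LEAF_SPEC = pytree.LeafSpec()

def leaf_spec():
    return _LEAF_SPEC
```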

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112392
Approved by: https://github.com/lezcano
ghstack dependencies: #112391
2023-10-30 21:45:45 +00:00
0f2b7a99e3 [pytree] Avoid constructing intermediate lists in tree_{flatten,leaves} (#112391)
Instead of concatenating lists of child nodes, this appends the leaf nodes
directly onto the list of leaves to be returned, which gives a small perf
improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112391
Approved by: https://github.com/zou3519
2023-10-30 21:45:45 +00:00
da90c31593 [export] Upstream unflattener. (#112189)
Summary: Provide a way for users to get the original module structure back after exporting.

Test Plan: caffe2/test:test_export -- -r unflatten

Differential Revision: D50708490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112189
Approved by: https://github.com/suo, https://github.com/angelayi
2023-10-30 21:27:11 +00:00
67638d4dad torch.compile: fix bug of fallback_randn when 'generator' is None (#112240)
When I run Stable Diffusion in [Huggingface/Diffusers](https://github.com/huggingface/diffusers), an error occurred:
```
LoweringException: AssertionError: should have been handled in replace_random.py.
   target:  aten.randn.generator
   args[0]:  [1, 4, 64, 64]
   kwargs: {'generator': None, 'dtype': torch.float16, 'layout': torch.strided, 'device': device(type='cuda', index=0), 'pin_memory': False}
```
It looks like a bug in dynamo, and you can reproduce it like this:
```python
import torch
def model(shape, generator):
      return torch.randn(shape, generator=generator, device="cuda:0")
model = torch.compile(model)
x = model((1, 3, 64, 64), None)
print(x)
```
The error occurs because 'None' is passed into 'generator', and dynamo processes `torch.randn` into the fx node `torch.ops.aten.randn.generator`.
aten.randn.generator is not handled by decomposition; it is handled by lowering in [torch/_inductor/lowering.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/lowering.py#L1815), where randn.generator is processed like this:
```python
@register_lowering(aten.randn)
def randn(*args, **kwargs):
    if kwargs.get("generator", None) is not None:
        return fallback_randn_generator(*args, **kwargs)
    elif config.fallback_random:
        return fallback_randn_default(*args, **kwargs)
    raise AssertionError("should have been handled in replace_random.py")
```
As you can see, because 'generator' is None, it will not step into `fallback_randn_generator`, and of course, if you don't enable `config.fallback_random`, it will not step into `fallback_randn_default` either. Actually, if 'generator' is None, it could also be processed as `aten.randn.default`. And then an AssertionError is thrown; I will not discuss here how to fix this and will open an issue instead.

Actually, `config.fallback_random` offers a way to debug randn in [config.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py#L190), so I tried to enable `config.fallback_random` to debug my model. But when I enabled it via:
```python
# fallback to eager for random/dropout, this is slow but useful for debugging
fallback_random = True
```
Another error occurs!
```python
LoweringException: RuntimeError: Unknown keyword argument 'generator' for operator 'aten::randn'. Schema: aten::randn(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor
```
Obviously, `aten::randn` does not support `kwargs:{generator: None}`, so it should be popped before kwargs is fed into `fallback_randn_default`.
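
A sketch of the implied fix (hypothetical, mirroring the lowering quoted earlier; not an actual PR diff):

```python
@register_lowering(aten.randn)
def randn(*args, **kwargs):
    if kwargs.get("generator", None) is not None:
        return fallback_randn_generator(*args, **kwargs)
    # drop the None generator: aten::randn's schema has no 'generator' kwarg
    kwargs.pop("generator", None)
    if config.fallback_random:
        return fallback_randn_default(*args, **kwargs)
    raise AssertionError("should have been handled in replace_random.py")
```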

That's all I'm going to say. Thanks for reading carefully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112240
Approved by: https://github.com/jansel
2023-10-30 21:10:54 +00:00
9f1ccd4dac Fix internal test listing errors (#112300)
For some reason, fbcode internal tests have listing errors when a test is skipped and has ", " in its name. This fix replaces the shape list with a string to avoid internal test listing errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112300
Approved by: https://github.com/chenyang78
2023-10-30 20:47:40 +00:00
80de49653a Prevent OOB access in foreach_list variants (#112349)
By checking that list sizes are the same before computing forward gradients.

Before the change
```cpp
::std::vector<at::Tensor> _foreach_add_List(c10::DispatchKeySet ks, at::TensorList self, at::TensorList other, const at::Scalar & alpha) {
  auto self_ = unpack(self, "self", 0);
  auto other_ = unpack(other, "other", 1);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self, other );

  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]) || isFwGradDefined(other[i]);
  }
  ...
```
after the change:
```cpp
::std::vector<at::Tensor> _foreach_add_List(c10::DispatchKeySet ks, at::TensorList self, at::TensorList other, const at::Scalar & alpha) {
    auto self_ = unpack(self, "self", 0);
    auto other_ = unpack(other, "other", 1);
    [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self, other );

    TORCH_CHECK(
        self.size() == other.size(),
          "Tensor lists must have the same number of tensors, got ",
        self.size(),
          " and ",
        other.size());
    std::vector<bool> _any_has_forward_grad_result(self.size());
    for (const auto& i : c10::irange(self.size())) {
      _any_has_forward_grad_result[i] = isFwGradDefined(self[i]) || isFwGradDefined(other[i]);
    }

```
Add regression test
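A sketch of what such a regression test can exercise, assuming the new TORCH_CHECK fires before the out-of-bounds access (the exact message may differ):
```python
import torch

a = [torch.randn(2, requires_grad=True) for _ in range(3)]
b = [torch.randn(2) for _ in range(2)]  # deliberately one tensor short
try:
    torch._foreach_add(a, b)
except RuntimeError as e:
    print(e)  # e.g. "Tensor lists must have the same number of tensors, got 3 and 2"
```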

Fixes https://github.com/pytorch/pytorch/issues/112305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112349
Approved by: https://github.com/Chillee
2023-10-30 20:43:03 +00:00
a14f8e09bb [dynamo] torch._dynamo.optimize to torch.compile in cudagraph trees tests (#112314)
This somehow fixes test issues later on.  @eellison figured this one out and will try to figure out why.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112314
Approved by: https://github.com/eellison
2023-10-30 20:16:57 +00:00
69b9e54d45 Add openvino backend into torch.compile docs (#112321)
The torch.compile [docs page](https://pytorch.org/docs/stable/torch.compiler.html) lists commonly used backends for torch.compile. Recently, the OpenVINO backend for torch.compile was released. This PR adds the openvino backend to that docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112321
Approved by: https://github.com/msaroufim
2023-10-30 20:13:41 +00:00
4fbf884f58 [fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-buffer-overflow-far-from-bounds (size 4) in c10::IValue::IValue() (#110453)
Summary: This diff fixes an OOB read found by fuzzing in torch/../jit/mobile

Test Plan:
CI and
```
arc lionhead crash reproduce 853835926354224
```
doesn't crash anymore.

Differential Revision: D49537377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110453
Approved by: https://github.com/davidberard98
2023-10-30 20:08:22 +00:00
4b8a5e1854 [dynamo] Remove VariableTracker.as_specialized (#112363)
My local testing can't seem to find this function actually doing anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112363
Approved by: https://github.com/yanboliang
2023-10-30 20:07:55 +00:00
b97afc4018 Support 'BaseOutput' and subclasses from 'diffusers' in dynamo (#111978)
Extending the workarounds for `transformers` `ModelOutput` to cover `diffusers` `BaseOutput`. Together with https://github.com/huggingface/diffusers/pull/5459 it should unblock export for `diffusers` models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111978
Approved by: https://github.com/jansel
2023-10-30 19:53:31 +00:00
d713b8dd5d Revert "[inductor] Fix bug handling output_strides in fx graph cache (#112041)"
This reverts commit 3d2041b34210bef3902f6ba86881b38ac0fbc57e.

Reverted https://github.com/pytorch/pytorch/pull/112041 on behalf of https://github.com/ZainRizvi due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/112041#issuecomment-1785929233))
2023-10-30 19:50:23 +00:00
fc0b0820fc Revert "Readded device_assert skipping in index and index_put (and also added (#112093)"
This reverts commit b110d87ac271db01fd1d24a6595cf9633ac1ce43.

Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/ZainRizvi due to Stack breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1785922905))
2023-10-30 19:45:41 +00:00
4439b906c4 Revert "Some cleanups in pattern matcher (#112101)"
This reverts commit f7dc0ae16c4637be0a7f20a1d9cd4311e9a6d3e8.

Reverted https://github.com/pytorch/pytorch/pull/112101 on behalf of https://github.com/ZainRizvi due to Stack breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/112101#issuecomment-1785920248))
2023-10-30 19:43:40 +00:00
052f7a3edc Revert "Added patterns for randperm + index_add (#112102)"
This reverts commit 1ff0b82be977107ab67ad2817ea76d46d3478d8f.

Reverted https://github.com/pytorch/pytorch/pull/112102 on behalf of https://github.com/ZainRizvi due to Stack breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/112102#issuecomment-1785916704))
2023-10-30 19:41:29 +00:00
013f622dd2 grid_sample: support bfloat16 (#112331)
This adds bfloat16 support to `torch.nn.functional.grid_sample`. This is particularly important when doing feature sampling, such as for rendering techniques used in PyTorch3D or for camera projections to voxel grids as in SimpleBEV.

Related to #57707

Test plan:

```
pytest test/test_nn.py -k grid_sample
pytest test/test_ops.py -k grid_sample
```
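A quick usage sketch of the newly supported dtype (shapes are illustrative):
```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 3, 8, 8, dtype=torch.bfloat16)
# grid holds normalized sampling coordinates in [-1, 1]
grid = torch.rand(1, 4, 4, 2, dtype=torch.bfloat16) * 2 - 1
out = F.grid_sample(inp, grid, align_corners=False)
print(out.shape, out.dtype)  # torch.Size([1, 3, 4, 4]) torch.bfloat16
```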
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112331
Approved by: https://github.com/zou3519
2023-10-30 19:31:41 +00:00
3b58755c1c Fix FakeTensor tolist when size is not symbolic (#112206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112206
Approved by: https://github.com/ezyang
ghstack dependencies: #112205
2023-10-30 19:25:10 +00:00
0cda4c8abe Replay view with view_func instead of as_strided in meta_utils for NT (#112205)
Currently meta_utils relies on as_strided when handling the view case (recursively meta-ify the base, and then do as_strided to simulate the view), but NestedTensor does not support as_strided today (though maybe it could?), so what we want to do instead is call `Tensor._view_func`. Conveniently, `_view_func` IS always available for nested tensors.

A detail to note is that _view_func actually incurs a guard because it needs to perform some metadata checks to make sure the view is still valid. This PR adds Tensor._unsafe_view_func which can avoid that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112205
Approved by: https://github.com/jbschlosser
2023-10-30 19:25:10 +00:00
503955f5ec [Pytorch][Vulkan] layer_norm (#112322)
Summary:
Generalize [layer_norm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) to all tensors of 2d to 4d. Using the mean and var operators in this diff stack, we can compute the layer_norm directly and remove the old shader file `layernorm.glsl`.
```
(input - input.mean(normalized_shape, keepdim=True)) / torch.sqrt(input.var(normalized_shape, correction=0, keepdims = True) + eps) * weight + bias
```
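For reference, a small PyTorch check of that composition against `torch.nn.functional.layer_norm` (shapes and eps are illustrative):
```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4)
eps = 1e-5
ref = F.layer_norm(x, normalized_shape=(4,), eps=eps)
mean = x.mean(-1, keepdim=True)
var = x.var(-1, correction=0, keepdim=True)
manual = (x - mean) / torch.sqrt(var + eps)
print(torch.allclose(ref, manual, atol=1e-6))  # True
```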

Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (0a5028d8c)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
  Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.layer_norm_invalid_inputs
[       OK ] VulkanAPITest.layer_norm_invalid_inputs (69 ms)
[ RUN      ] VulkanAPITest.layer_norm_2d
[       OK ] VulkanAPITest.layer_norm_2d (288 ms)
[ RUN      ] VulkanAPITest.layer_norm_3d
[       OK ] VulkanAPITest.layer_norm_3d (302 ms)
[ RUN      ] VulkanAPITest.layer_norm_4d
[       OK ] VulkanAPITest.layer_norm_4d (8 ms)
[----------] 4 tests from VulkanAPITest (668 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (668 ms total)
[  PASSED  ] 4 tests.
```

Reviewed By: yipjustin

Differential Revision: D50436726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112322
Approved by: https://github.com/yipjustin
2023-10-30 19:21:20 +00:00
33c41daf60 Fix scatter_mm kernel failure on non-contiguous tensor arguments (#112337)
This PR fixes
```
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
```
that appears when using large non-contiguous tensor arguments in `scatter_mm` kernel launch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112337
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #112154, #112076
2023-10-30 19:16:05 +00:00
cf6041e942 Use weakref in storing tensors as keys (follow-up to #111470) (#112076)
This PR addresses the discussion items in https://github.com/pytorch/pytorch/pull/111470#discussion_r1369008167, that is,
- use weakref when storing tensors as keys,
- add `storage_offset` to the key data,
- and revise the description of the `TensorAsKey` utility.
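A generic sketch of the weakref-keyed pattern from the first two items (an illustration, not the actual `TensorAsKey`; the key fields are assumptions):
```python
import weakref
import torch

class TensorAsKeySketch:
    """Key a cache on a tensor's storage/layout without keeping it alive."""

    def __init__(self, t: torch.Tensor):
        self._ref = weakref.ref(t)
        # storage_offset is part of the key so overlapping views don't collide
        self._key = (t.data_ptr(), t.storage_offset(),
                     tuple(t.shape), tuple(t.stride()), t.dtype)

    def __hash__(self):
        return hash(self._key)

    def __eq__(self, other):
        # Keys become invalid once either tensor has been garbage collected.
        if self._ref() is None or other._ref() is None:
            return False
        return self._key == other._key
```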

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112076
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #112154
2023-10-30 19:16:05 +00:00
e5c8ac8544 Eliminate try-catch block around triton::_triton_bsr_dense_mm_out call. (#112154)
As in the title.

Currently, the try-catch block hides failures from triton kernel launches that are unrelated to the exceptions the block is meant to ignore. When a triton kernel launch fails (e.g. due to bugs in triton or lack of resources), ignoring such failures leads to hard-to-explain, unrelated errors in subsequent code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112154
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-10-30 19:16:05 +00:00
21330e5ba1 [pytree] align __all__ for C++ and Python pytree (#112110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112110
Approved by: https://github.com/zou3519
2023-10-30 18:32:25 +00:00
219763c38d Support calling user defined triton kernels with kernel.run (#112292)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112292
Approved by: https://github.com/jansel
ghstack dependencies: #112290
2023-10-30 17:51:23 +00:00
1250032c2e [Inductor] Add triton.autotune support for user defined triton kernels with complex grids (#112290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112290
Approved by: https://github.com/jansel
2023-10-30 17:48:27 +00:00
5a1a9dc354 [inductor][fx pass] Add new split cat pattern detection (#110923)
Summary: We add a new pattern to merge getitem_cat to enable further split merges

Test Plan:
### test mcf model
Patch D49972740
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split-only -c
```

P850153017

### unit test
```
buck2 test mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes -- test_getitem_cat_merge
```
Buck UI: https://www.internalfb.com/buck2/eb7411a5-a6bd-46bc-bf66-756341e3ce10
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13792273864439068
Network: Up: 48KiB  Down: 15KiB  (reSessionID-39ca57cc-5743-423e-b94f-9d0f642010f8)
Jobs completed: 8. Time elapsed: 1:44.7s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

### before vs after transformation
https://www.internalfb.com/intern/diffing/?paste_number=847958889

Differential Revision: D50100667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110923
Approved by: https://github.com/yanboliang
2023-10-30 17:46:13 +00:00
31c223a52c Forward fix a dynamo tracing rule test failure due to landing race (#112368)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112368
Approved by: https://github.com/Chillee, https://github.com/malfet
2023-10-30 17:34:22 +00:00
a8c74e8225 torch.export: cannot instantiate Dim from REPL (#111231)
Summary:
```
In [1]: import torch
   ...: torch.export.Dim('foo', min=1, max=16)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 2
      1 import torch
----> 2 torch.export.Dim('foo', min=1, max=16)

File /..../torch/export/__init__.py:319, in Dim(name, min, max)
    317 assert _max > _min, f"Cannot create Dim with inconsistent min={min}, max={max}"
    318 dim = _Dim(name, (int,), {"min": _min, "max": _max})
--> 319 dim.__module__ = inspect.getmodule(inspect.stack()[1][0]).__name__  # type: ignore[union-attr]
    320 return dim

AttributeError: 'NoneType' object has no attribute '__name__'
```

Test Plan: Repeat above repro

Differential Revision: D50275165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111231
Approved by: https://github.com/avikchaudhuri, https://github.com/angelayi
2023-10-30 17:15:32 +00:00
92cc52ab0e [CPU SDP] Remove mem efficient attn checks in CPU (#112375)
It doesn't seem like memory-efficient attention can be used on CPU, as we don't check for it when iterating backends in `select_sdp_backend_cpp`. So this removes some of the logic around mem-efficient attention selection.

Created from CodeHub with https://fburl.com/edit-in-codehub

Differential Revision: [D50775562](https://our.internmc.facebook.com/intern/diff/D50775562/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50775562/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112375
Approved by: https://github.com/drisspg
2023-10-30 16:43:20 +00:00
255a4d0bd3 Fix doc of fullgraph parameter in torch.compile (#111906)
The docstring currently states the opposite of what this parameter is doing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111906
Approved by: https://github.com/pmeier, https://github.com/zou3519
2023-10-30 15:17:59 +00:00
f77b9bf3ba [xla hash update] update the pinned xla hash (#112374)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112374
Approved by: https://github.com/pytorchbot
2023-10-30 13:42:07 +00:00
e36dacaeed [Docs] fix typo in example of torch.linalg.solve_triangular (#112361)
Fixes #112359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112361
Approved by: https://github.com/IvanYashchuk
2023-10-30 10:33:14 +00:00
29844adbe0 Add Half support for logspace and range on CPU (#112131)
Add Half support for logspace and range on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112131
Approved by: https://github.com/cpuhrsch
2023-10-30 07:18:47 +00:00
cyy
0d669f06a6 Update Android to R21e (#109355)
R19c is too old; R21e is the LTS version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109355
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-10-30 06:49:32 +00:00
bbd5b935e4 Use pytree.tree_leaves everywhere (#112324)
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.
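For example:
```python
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": (3,)}
leaves, _spec = pytree.tree_flatten(tree)  # before
leaves = pytree.tree_leaves(tree)          # after
print(leaves)  # [1, 2, 3]
```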

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327, #112323
2023-10-30 03:39:04 +00:00
a0bf137a78 [pytree] Add optimized tree_leaves implementation (#112323)
pytree is used in many hot paths for dynamo tracing and in many cases we don't
care about the tree spec and just want the flattened list. This improves
`pytree.tree_leaves` to not construct the spec which gives a noticeable
performance improvement when multiplied by the many times it gets called
during tracing.

Concretely, I see a 2x speedup compared to `tree_flatten` in this benchmark:
```python
import torch.utils._pytree as pytree
%timeit pytree.tree_flatten([((100, 100), (100, 1)), dict(device="cuda")])[0]
%timeit pytree.tree_leaves([((100, 100), (100, 1)), dict(device="cuda")])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112323
Approved by: https://github.com/lezcano, https://github.com/XuehaiPan
ghstack dependencies: #112327
2023-10-30 03:39:04 +00:00
bfbc2e3ca8 [fx] Cache _torchscript_schema_to_signature (#112327)
This function is called in `normalize_function` which is in a fairly hot path for
`FakeTensor` dispatch. In this simple benchmark I see `normalize_function`
improve from 92 us to 17 us just by caching this signature object.

```python
import torch
from torch._subclasses import FakeTensorMode
from torch.fx.operator_schemas import normalize_function
aten = torch._ops.ops.aten
%timeit normalize_function(
    aten.empty_strided.default, args=((100, 100), (100, 1)),
    kwargs=dict(device="cuda"), normalize_to_only_use_kwargs=True)
```
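A minimal sketch of the caching idea, assuming the conversion is a pure function of a hashable schema (the toy parser below is a stand-in, not the real one):
```python
import functools
import inspect

def parse_schema(schema: str) -> inspect.Signature:
    # Stand-in for the real conversion; building Signature objects is the
    # expensive part worth caching.
    args_src = schema.split("(", 1)[1].rsplit(")", 1)[0]
    names = [piece.split("=")[0].split()[-1]
             for piece in (p.strip() for p in args_src.split(","))
             if piece and piece != "*"]
    return inspect.Signature(
        [inspect.Parameter(n, inspect.Parameter.POSITIONAL_OR_KEYWORD) for n in names])

@functools.lru_cache(maxsize=None)
def cached_parse_schema(schema: str) -> inspect.Signature:
    return parse_schema(schema)

sig = cached_parse_schema("aten::add(Tensor a, Tensor b, *, Scalar alpha=1) -> Tensor")
print(sig)  # (a, b, alpha)
```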
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112327
Approved by: https://github.com/lezcano
2023-10-30 03:38:52 +00:00
919c9b713e [Typo fixed] in triton_heuristics.py (#112350)
Fixes Typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112350
Approved by: https://github.com/Skylion007
2023-10-29 22:44:27 +00:00
088d1648ec [test][fx] fix incorrect method call in test case (#112336)
This PR fixes the incorrect method name in function call in the test case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112336
Approved by: https://github.com/jon-chuang, https://github.com/kit1980
2023-10-29 19:49:13 +00:00
a9ebee30fa Make numpy core tests Dynamo traceable. (#112141)
A follow-up to https://github.com/pytorch/pytorch/pull/112084: convert vendored numpy/core submodule tests dynamo-traceable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112141
Approved by: https://github.com/lezcano
2023-10-29 19:28:53 +00:00
ccab8ce745 Make numpy fft and linalg tests Dynamo traceable (#112146)
Follow up https://github.com/pytorch/pytorch/pull/112141 and make numpy vendored tests of fft and linalg modules dynamo-traceable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112146
Approved by: https://github.com/lezcano
2023-10-29 19:27:38 +00:00
cyy
740d636165 Add clang-tidy checks in torch/csrc/autograd (#112313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112313
Approved by: https://github.com/Skylion007
2023-10-29 18:55:11 +00:00
ace2713d1e Revert "Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)"
This reverts commit f1785373c08b9e8383b7eec3391d57053209b525.

Reverted https://github.com/pytorch/pytorch/pull/111377 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111377#issuecomment-1784179040))
2023-10-29 17:41:55 +00:00
ae72607e5f Add way to determine which overload an OpOverloadPacket will resolve to (#112199)
The types are a bit weird (we accept and return a string) because there
is not really a notion of OpOverloadPacket vs OpOverload in C++.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112199
Approved by: https://github.com/ezyang
ghstack dependencies: #112198
2023-10-29 15:36:14 +00:00
235a04c0de Add getAllSortedOperatorsFor helper function (#112198)
I need this for later. This roughly returns all the OpOverloads
for an OpOverloadPacket in the order that the OpOverloadPacket decides
to resolve them in.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112198
Approved by: https://github.com/ezyang
2023-10-29 15:36:14 +00:00
f5088d2e45 [dynamo] fix None routing bug during var_getattr on UDO (#111614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111614
Approved by: https://github.com/jansel
2023-10-29 01:57:43 +00:00
b165abaa3b Error early when dataclass is not registered (#112211)
Partially fixes: https://github.com/pytorch/pytorch/issues/112043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112211
Approved by: https://github.com/angelayi
2023-10-28 19:36:02 +00:00
eb8af4dc67 [dynamo] Be stricter about HigherOrderOperator kwargs (#111938)
kwargs need to be handled carefully in speculate subgraph. We should be clearer about the contract of what the inputs are.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111938
Approved by: https://github.com/zou3519
2023-10-28 18:54:33 +00:00
c14c4efc0e [Inductor] Add triton.autotune support for user defined triton kernels with constant/simple grids (#112228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112228
Approved by: https://github.com/jansel
2023-10-28 17:30:35 +00:00
12c1465d76 [DeviceMesh] Make mesh_resources private (#112294)
This is to prepare moving DeviceMesh as a standalone distributed package.

`_mesh_resources` should only be used in torch.distributed package.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112294
Approved by: https://github.com/fegin
2023-10-28 17:28:46 +00:00
a7a0955790 [pytree][BE] reorganize imports and format code style and update type hints (#112268)
Reland PR:

- #112109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112268
Approved by: https://github.com/Skylion007
2023-10-28 16:30:24 +00:00
0948550c53 [dynamo] Remove mutation in AutogradFunctionContextVariable (#112216)
AutogradFunctionContextVariable was mutating self._saved_tensors, which is generally not allowed since VariableTracker objects should be read-only and are frequently copied via apply/clone.  This was causing some test failures up the PR stack.

This moves the mutation into a separate object that is not copied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112216
Approved by: https://github.com/voznesenskym
ghstack dependencies: #112122
2023-10-28 06:46:48 +00:00
c7b78fb76c [dynamo] Replace recursively_contains with parents_tracker (#112122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112122
Approved by: https://github.com/voznesenskym
2023-10-28 06:46:48 +00:00
a380bf3297 [dynamo, test] skip flaky dynamo-wrapped tests (#112310)
ghstack-source-id: 7a87e33e7513e7924e4513b6473284562989ed4c
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112309

Skip flaky tests reported by
- https://github.com/pytorch/pytorch/issues/111825
- https://github.com/pytorch/pytorch/issues/111826
- https://github.com/pytorch/pytorch/issues/111909
- https://github.com/pytorch/pytorch/issues/112142
- https://github.com/pytorch/pytorch/issues/112220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112310
Approved by: https://github.com/xmfan
2023-10-28 04:14:57 +00:00
31f605344f [Resubmit][S372460 follow up] Reduce embedding feature validation failure carry-on impact (#111838)
Summary:
## Context
The embedding feature validation for GatherRangesToDense was added in the previous diff: D18031155. The logic checks the mismatched ranges or empty ranges over the whole model lifecycle; once the ratio exceeds some threshold, it triggers an ENFORCE failure (exception).
In the current implementation, it may have carry-on impact. The mismatch ratio is equal to:
```
ratio = mismatched_ranges_from_t0_to_t1 / total_ranges_from_t0_to_t1
```
If mismatched_ranges_from_t0 somehow increases a lot (a bad-value spike) at t1, the ratio becomes much larger than the threshold. It may then take until t2 for the ratio to drop back below the threshold; however, the requests between t1 and t2 may all be good requests, so the spike has a carry-on impact.
Instead, we propose a new strategy: when the exception happens at t1, we clean up all the history counters for the bad feature, giving a clean run for the next phase until the next exception. This gets rid of the carry-on impact.
In this logic, we clean up the counters for the bad feature J.
more context: https://docs.google.com/document/d/1tYHISyiLf-PVKPVGlZRZ0iq2Hvog3g5BjCKMCDLZHHo/edit
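A minimal sketch of the reset-on-trip policy (names and the exception type are illustrative):
```python
class RangeMismatchStats:
    """Reset history when the ratio check trips, so a spike at t1 does not
    keep failing otherwise-good requests between t1 and t2."""

    def __init__(self, threshold: float, min_observations: int = 100):
        self.threshold = threshold
        self.min_observations = min_observations
        self.mismatched = 0
        self.total = 0

    def observe(self, mismatched: int, total: int) -> None:
        self.mismatched += mismatched
        self.total += total
        if (self.total >= self.min_observations
                and self.mismatched / self.total > self.threshold):
            self.mismatched = 0  # clean run until the next exception
            self.total = 0
            raise RuntimeError("mismatched range ratio exceeded threshold")
```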

Test Plan:
Hardcode a much smaller threshold (0.0001) to force-trigger the exception, then deploy on some hosts in prod tiers

```
EPHEMERAL_PACKAGE=d44a3de1305c3b4c30fd62bc354a1285 tw update fbcode/tupperware/config/admarket/sigrid/predictor/prod.tw tsp_cln/admarket/sigrid_predictor_v2_dh_t1_elastic_ha --tasks=100-199 --fast --force
```

## 1) totalRanges can be correctly logged
```
I1018 20:37:09.012523  2074 gather_ranges_to_dense_op.h:69 req:00f00000001abbf1] In GatherRangesToDenseOp:
  Lifetime empty ranges for each feature is 12354.
  Lifetime mismatched ranges for each feature is 526.
  With a total of 87503 examples for each feature.
```

## 2) exception can be still triggered
```
E1018 21:08:42.007398   668 LoggingPredictorService.cpp:701 req:001000000013df51] getRequestPrecomputedDataOnePass failure on model 481948521_146: [enforce fail at gather_ranges_to_dense_op.h:215] std::max(totalRangesTemp, minObservation_) * maxMismatchedRatio_ >= mismatchedRangesTemp. 0.1 vs 1. Ratio of range length mismatch for feature at index 0 is 0.00813008 (1/123) which exceeds 1e-05. The incorrect lengths include: 15 (Error from operator:

```

Differential Revision: D50570811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111838
Approved by: https://github.com/malfet
2023-10-28 03:50:33 +00:00
fdcd927d8a [vision hash update] update the pinned vision hash (#112306)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112306
Approved by: https://github.com/pytorchbot
2023-10-28 03:40:44 +00:00
a2dcf26df4 [c10d] Pass avoidRecordStreams into collective() function (#112195)
Even after PR #111431, the `collective(...)` function still uses the member variable `avoidRecordStreams_` internally and does not respect each collective call's preference, as `avoidRecordStreams_` is only controlled by an environment variable.

As a fix, we pass `avoidRecordStreams` into the collective() function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112195
Approved by: https://github.com/awgu
2023-10-28 03:28:51 +00:00
25f06ee51b [dynamo] ExecutorchCallDelegateHigherOrderVariable - add sanity check that input and output tensors are disjoint (#111960)
Fixes https://github.com/pytorch/pytorch/issues/111917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111960
Approved by: https://github.com/zou3519
2023-10-28 02:48:43 +00:00
3080fd8383 [profiler] add send/recv src/dst info (#111811)
Summary: There is an ask to add src/dst to the nccl trace. This feels like the easiest way to do it; adding it to metadata seems to require plumbing through a few stacks, so it would be more work.

Test Plan: {F1128545195}

Reviewed By: davidberard98

Differential Revision: D50560692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111811
Approved by: https://github.com/davidberard98, https://github.com/aaronenyeshi, https://github.com/fduwjj
2023-10-28 02:48:23 +00:00
2c7c2b7827 [torch op][xs] verbose error message for type mismatch in toList() (#110872)
Summary:
Currently the error message doesn't give details on the nature of the mismatch:
  Output annotation element type and runtime tensor element type must match for tolist()

After the update, the error becomes actionable:
  RuntimeError: Output annotation element type and runtime tensor element type must match for tolist(): Long vs Int

Test Plan: existing unit tests

Reviewed By: iseeyuan

Differential Revision: D50082858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110872
Approved by: https://github.com/houseroad
2023-10-28 02:47:47 +00:00
2225e6361d Support for as_nested_tensor() with jagged layout + fixed nested_tensor() semantics (#112304)
This PR:
* Adds support for the `layout` kwarg to `torch.nested.as_nested_tensor()`
* Fixes `torch.nested.nested_tensor()`
    * It should accept a list of lists of scalars
    * It should not preserve autograd history
* Adds extensive testing for these two functions

Semantics for the two functions follow those of the strided layout:
* `torch.nested.nested_tensor(tensor_list, layout=torch.jagged)`: Creates a new jagged layout NT **with no autograd history**
    * `tensor_list` can be a list of Tensors or list of lists of scalars
* `torch.nested.as_nested_tensor(tensor_list, layout=torch.jagged)`: Creates a new jagged layout NT **preserving autograd history of `tensor_list`**
    * `tensor_list` must be a list of Tensors
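A usage sketch of the two entry points under the jagged layout:
```python
import torch

ts = [torch.randn(2, 3, requires_grad=True),
      torch.randn(4, 3, requires_grad=True)]

nt_new = torch.nested.nested_tensor(ts, layout=torch.jagged)      # no autograd history
nt_view = torch.nested.as_nested_tensor(ts, layout=torch.jagged)  # preserves history

print(nt_new.requires_grad, nt_view.requires_grad)  # False True
```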
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112304
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-10-28 02:34:27 +00:00
8d44999183 Revert "[Inductor] Add triton.autotune support for user defined triton kernels with constant/simple grids (#112228)"
This reverts commit dbb31a2984fa616b4bb6fac7abb2a06ec0533eb1.

Reverted https://github.com/pytorch/pytorch/pull/112228 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm test in trunk dbb31a2984 ([comment](https://github.com/pytorch/pytorch/pull/112228#issuecomment-1783660326))
2023-10-28 01:51:32 +00:00
668c3b3f3b Add embedding op to jagged NT (#112288)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112288
Approved by: https://github.com/cpuhrsch
2023-10-28 01:29:17 +00:00
1ff0b82be9 Added patterns for randperm + index_add (#112102)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112102
Approved by: https://github.com/lezcano
ghstack dependencies: #112093, #112101
2023-10-28 01:26:52 +00:00
a1a765c195 Mirror of Xformers Fix (#112267)
# Summary
See https://github.com/fairinternal/xformers/pull/850 for more details
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112267
Approved by: https://github.com/cpuhrsch
2023-10-28 00:06:11 +00:00
46a6435203 Make numpy/lib vendored tests dynamo traceable (#112147)
Follow up https://github.com/pytorch/pytorch/pull/112146 and  #112141 : make numpy/lib vendored tests dynamo traceable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112147
Approved by: https://github.com/lezcano
2023-10-27 23:53:32 +00:00
128f4db77e A small fix in "do_bench_using_profiling" (#112223)
This is a small fix in "do_bench_using_profiling()".
When CUDA kernels are executed on a non-default CUDA stream and cuda.synchronize() is called, a CUDA kernel named "Context Sync" is launched on the default stream to wait until all other streams finish. This kernel shows up with "CUDA time" but is not a real kernel to profile. This fix excludes "Context Sync" when calculating total kernel time.
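A hedged sketch of the filtering (the event attribute names follow the torch profiler's event API, but treat them as assumptions):
```python
def kernel_time_excluding_context_sync(events) -> float:
    # Sum device time over profiled events, skipping the synthetic
    # "Context Sync" kernel launched by cuda.synchronize().
    return sum(
        e.device_time_total
        for e in events
        if e.device_time_total > 0 and "Context Sync" not in e.key
    )
```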

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112223
Approved by: https://github.com/int3, https://github.com/chenyang78
2023-10-27 23:08:38 +00:00
3d2041b342 [inductor] Fix bug handling output_strides in fx graph cache (#112041)
Summary: The current implementation does not properly attach output strides to the tracing context when an fx graph is loaded from the cache. That bug leads to assertion failures like `AssertionError: expected size 3==3, stride 1==9 at dim=1`. This change saves the output strides in the serialized object cached on disk and inserts them into the tracing context whether the graph is loaded from the cache or freshly compiled.

Test Plan:
* New unit test using resnet18 (which repros the problem)
* Ran the timm benchmark suite with `--training`

Differential Revision: [D50756653](https://our.internmc.facebook.com/intern/diff/D50756653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112041
Approved by: https://github.com/ezyang
2023-10-27 22:30:46 +00:00
dbb31a2984 [Inductor] Add triton.autotune support for user defined triton kernels with constant/simple grids (#112228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112228
Approved by: https://github.com/jansel
2023-10-27 21:40:22 +00:00
c67236a05d Revert "[dynamo] Be stricter about HigherOrderOperator kwargs (#111938)"
This reverts commit edafe2ddb99dd721021262fdfd58c3f796c7da0c.

Reverted https://github.com/pytorch/pytorch/pull/111938 on behalf of https://github.com/izaitsevfb due to Fails meta internal executorch tests with `torch._dynamo.exc.InternalTorchDynamoError: name 'p_kwargs' is not defined` ([comment](https://github.com/pytorch/pytorch/pull/111938#issuecomment-1783538268))
2023-10-27 21:37:48 +00:00
089e7aa4ac Revert "[dynamo] ExecutorchCallDelegateHigherOrderVariable - add sanity check that input and output tensors are disjoint (#111960)"
This reverts commit 27cf49549a35dd78475098b7de02c0a5ab1367ea.

Reverted https://github.com/pytorch/pytorch/pull/111960 on behalf of https://github.com/izaitsevfb due to Fails internal executorch tests with module 'torch.utils._pytree' has no attribute 'tree_flatten_only' ([comment](https://github.com/pytorch/pytorch/pull/111960#issuecomment-1783532843))
2023-10-27 21:32:30 +00:00
061bf1a153 [5/N] Make torch context manager a TorchCtxManagerClassVariable (#111622)
The major change in this PR is to make torch context manager classes a separate ```TorchCtxManagerClassVariable```, since we have a dynamo implementation for these ctx managers.

I considered wrapping them as ```UserDefinedClassVariable``` and dispatching at ```USCVariable.call_function```, but it is almost the same amount of work and this way is clearer.

This is on the way to moving ```TorchVariable``` to ```TorchFunctionVariable```, which will only handle the functions that are allowed in graph (e.g., ```torch.sin```) or constant folded (e.g., ```torch.is_floating_point```). All other torch functions will go through skip/inline rules and be wrapped as ```UserFunctionVariable``` (for inlined) or ```SkipFilesVariable``` (for skipped).
The next steps:
* Wrap torch modules, classes, objects as regular ```PythonModuleVariable```, ```UserDefinedClassVariable``` and ```UserDefinedObjectVariable```.
* Generate the allow in graph torch functions list and wrap them as ```TorchFunctionVariable```.
* Finally merge ```skipfiles.check``` and ```is_allowed``` into one function ```allow_skip.check(fn)``` which would return an Enum of allow, skip, and inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111622
Approved by: https://github.com/jansel
2023-10-27 21:26:54 +00:00
1460e5b7f5 updated aarch64 maintainers in docs (#112047)
This PR adds a new section for maintainers of `aarch64`.

Adding @snadampal to the list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112047
Approved by: https://github.com/atalman
2023-10-27 21:09:36 +00:00
f7dc0ae16c Some cleanups in pattern matcher (#112101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112101
Approved by: https://github.com/eellison
ghstack dependencies: #112093
2023-10-27 21:04:39 +00:00
6d685ff54f [BE] Remove float8 from vec is_floating_type definition (#112196)
As it's not supported yet, and it's also not clear, how support should look like

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112196
Approved by: https://github.com/drisspg
2023-10-27 20:48:36 +00:00
ca2106e871 [pytorch-vulkan] floor-divide for tensor, tensor (#112190)
Summary: tsia

Test Plan:
## Compile on Mac and run on Android

```
buck2 build -c ndk.static_linking=true -c pt.enable_qpl=0  --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_api_test_binAndroid  --show-output && adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_api_test_binAndroid__/pt_vulkan_api_test_binAndroid /data/local/tmp
```

Run on android
```
$ adb shell /data/local/tmp/pt_vulkan_api_test_binAndroid
...
[ RUN      ] VulkanAPITest.lstm_prepack_success
[       OK ] VulkanAPITest.lstm_prepack_success (11 ms)
[ RUN      ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7667: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 396 tests from VulkanAPITest (29980 ms total)
[----------] Global test environment tear-down
[==========] 396 tests from 1 test suite ran. (29980 ms total)
[  PASSED  ] 395 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
  YOU HAVE 7 DISABLED TESTS

```

All Passed.
Full Output: P865232089

Reviewed By: copyrightly

Differential Revision: D50677361

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112190
Approved by: https://github.com/manuelcandales
2023-10-27 20:20:41 +00:00
1774704fc1 [dynamo] Simplify add_dict in preparation to refactor it with call_set (#110523)
The previous implementation had a fair amount of repeated code, and did
things like calling `add_options` where options was always empty (which
is fine, as the guards are already set within ConstDictVariable).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110523
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #110522
2023-10-27 20:17:10 +00:00
1dcbd1c088 [dynamo] [easy] Move Set to dicts.py (#110522)
A set is more of a dict than a list if you ask me.
This comes before the refactor where we implement sets and dicts via the
same logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110522
Approved by: https://github.com/jansel
2023-10-27 20:17:10 +00:00
b9cb4103d7 Fix iphoneos compilation (#111502)
Summary: As title

Test Plan: buck build @//arvr/mode/iphoneos/mac/opt //xplat/third-party/XNNPACK:ukernels_asm_aarch64

Reviewed By: mcr229

Differential Revision: D50423968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111502
Approved by: https://github.com/mcr229
2023-10-27 20:00:41 +00:00
328a4c5475 [BE] Enhance OpInfo.supported_dtype (#111995)
The current implementation is prone to errors, as it accepts any object but does not raise an error if device_type is not recognized.

Remediate this by accepting both device types and device identifiers (either a `torch.device` instance or a "{device_type}:{ordinal}" string).

Fixes https://github.com/pytorch/pytorch/issues/111179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111995
Approved by: https://github.com/albanD
2023-10-27 19:42:01 +00:00
192e795f3f Change save -> load in comment (#112217)
Change save -> load in comment because this is the load_state_dict API

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112217
Approved by: https://github.com/wz337
2023-10-27 19:39:02 +00:00
c120e5606e Use ops_and_refs in test_ops.py instead of _ops_and_refs (#112022)
`ops_and_refs` and `_ops_and_refs` have the same definition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112022
Approved by: https://github.com/lezcano
2023-10-27 18:37:05 +00:00
c7dcba9276 Remove passing disable_fastpath in kwargs (#112250)
Fixes an issue that came up in https://github.com/pytorch/pytorch/pull/112030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112250
Approved by: https://github.com/lezcano
2023-10-27 18:29:20 +00:00
b110d87ac2 Readded device_assert skipping in index and index_put (and also added (#112093)
copy to noop pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093
Approved by: https://github.com/oulgen, https://github.com/lezcano
2023-10-27 18:23:49 +00:00
baf3e054e3 Fixed an error in the comment of file torch.utils.data.dataloader.py#944 . (#112244)
Fixes #ISSUE_NUMBER
@ssnl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112244
Approved by: https://github.com/albanD
2023-10-27 18:16:58 +00:00
33daaeb6b5 Automated submodule update: FBGEMM (#112118)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 6c2be8831a

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112118
Approved by: https://github.com/malfet
2023-10-27 18:14:54 +00:00
700071869a [no-ci][EZ] Update RELEASE.md (#112253)
Reflect default branch renames from master to main

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112253
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-10-27 18:12:15 +00:00
cb48ef21cc [no-ci] Clarify revert handling in release branches (#112262)
Changes that have been reverted on trunk must be reverted in release branches as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112262
Approved by: https://github.com/huydhn
2023-10-27 18:11:29 +00:00
a26cb0a3f2 [dynamo] Enable typechecking for testing.py (#112129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112129
Approved by: https://github.com/Skylion007
ghstack dependencies: #111894, #111992, #112031, #112127, #112128
2023-10-27 18:00:56 +00:00
d3bf6803b6 [dynamo] add sanity check that we do not wrap tracked tensors (#112025)
Identified as a result of https://github.com/pytorch/pytorch/pull/111911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112025
Approved by: https://github.com/ezyang
2023-10-27 17:15:03 +00:00
d97332f839 Add cuda status checks to FA templates (#112229)
# Summary
cuda status checks were accidentely removed on latest update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112229
Approved by: https://github.com/Skylion007
2023-10-27 16:54:23 +00:00
63c089b09d [c10] Move profiler clock to libc10 for timestamps (#111972)
Summary:
Move the profiler's Approximate Clock from libtorch to libc10. The main reason is to allow c10 features to get timestamps.

The clock uses TSC when available, for performance. The CUDA Caching Allocator's implementation of memory snapshots will add timestamps to memory events with this same clock in a subsequent diff.

Test Plan: CI

Differential Revision: D50601935

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111972
Approved by: https://github.com/davidberard98
2023-10-27 16:18:40 +00:00
fdbb73fa4e Check both ops and refs in test_strided_layout (#112160)
Trying #112023 again to see if CLA issue is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112160
Approved by: https://github.com/lezcano, https://github.com/Neilblaze
2023-10-27 15:35:34 +00:00
bd0ea72b28 torch.library: Create helper function is_functional_schema (#111660)
I will need this again soon.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111660
Approved by: https://github.com/soulitzer
2023-10-27 15:20:25 +00:00
7df675743c Stop using defaultdict for deferred_runtime_asserts (#112172)
In the ShapeEnv record/replay machinery we do equality tests on this dict, but `{i0: []}` is considered not equal to `{}`, and you can unpredictably end up with the former just by doing reads from the dict. Using a real dict removes this wobbliness.
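A two-line illustration of the wobbliness:
```python
from collections import defaultdict

d1, d2 = defaultdict(list), defaultdict(list)
d1["i0"]          # a mere read inserts the key
print(d1 == d2)   # False: {'i0': []} != {}
```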

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112172
Approved by: https://github.com/ysiraichi, https://github.com/Skylion007
2023-10-27 15:05:42 +00:00
9f7bff1171 Add timeout for master store if clients do not join (#111805)
Currently, if the master store does not have all clients join within the `timeout`, it just continues silently, which could lead to errors down the road. However, if a client does not connect to the master within the specified time, an exception is raised. This change makes the master store error out if not all clients have joined, making server and client behavior consistent with each other.

Since this changes the default behavior of the master store, I am open to suggestions.

Example:

```python
import torch.distributed as dist
import torch.multiprocessing as mp
from datetime import timedelta

def main(rank, world_size):
    if rank == 0:
        print("creating store")
        # world size is 2 so this eventually times out
        store = dist.TCPStore("localhost", 1234, 2, True, timeout=timedelta(seconds=5))
        print("finished creating store")

if __name__ == "__main__":
    world_size = 2
    mp.spawn(main, (world_size,), nprocs=world_size)
```

Previous
```
print("creating store")
print("finished creating store")
```

Now
```
print("creating store")
torch.distributed.DistStoreError: Timed out after 6 seconds waiting for workers. 1/2 workers joined.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111805
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-10-27 14:44:43 +00:00
cf5479b57e [MPS] Make the device in MPSGenerator consistent with MPSAllocator (#112188)
1b702b185e/aten/src/ATen/mps/MPSAllocator.mm (L751-L760)

The device in an MPS tensor is actually allocated with a device index, so this PR makes the device generated by `MPSGenerator` consistent with that.

Fixes https://github.com/pytorch/pytorch/issues/110820#issuecomment-1752088865
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112188
Approved by: https://github.com/malfet, https://github.com/kulinseth
2023-10-27 09:37:41 +00:00
7265c22a5d [AOTInductor] Enforce no_grad for Run entries (#111613)
Summary:
Always enter no_grad mode in AOTInductor run entries.

```
// AOTInductor uses at::addmm_out, which doesn't supports
// arguments that requires gradient. For this reason, we
// enforce no_grad context for run APIs.
```

Test Plan:
buck2 test mode/dev-nosan caffe2/test/inductor:test_aot_inductor

and OSS CI

Differential Revision: D50432042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111613
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2023-10-27 09:14:19 +00:00
2a86bcbac2 [FSDP][state_dict] Cleanup the usage of _get_pg_default_device (#112168)
_get_pg_default_device is not suitable for the FSDP use case. We should always use the compute_device when communicating.

Differential Revision: [D50698730](https://our.internmc.facebook.com/intern/diff/D50698730/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112168
Approved by: https://github.com/wz337
2023-10-27 08:09:08 +00:00
46667c97fd [Pytorch][Vulkan] var.dim (#111965)
Summary:
We implement [`torch.var`](https://pytorch.org/docs/stable/generated/torch.var.html) for tensors of 2d to 4d.

By using the `mean`, `sub` and `pow` ops, we can compute the variance as below without adding a new shader.
```
at::Tensor self_mean = self.mean(opt_dim, true);
at::Tensor output = (self.sub(self_mean).pow(2)).mean(opt_dim, keepdim);
```
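A quick CPU sanity check of that composition (biased variance, i.e. correction=0):
```python
import torch

x = torch.randn(3, 4, 5)
dim = (1, 2)
manual = (x - x.mean(dim, keepdim=True)).pow(2).mean(dim, keepdim=True)
print(torch.allclose(manual, x.var(dim, correction=0, keepdim=True)))  # True
```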

Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (2da0640c6)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*var*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
  Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *var*
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.var_2d_unbiased
[       OK ] VulkanAPITest.var_2d_unbiased (322 ms)
[ RUN      ] VulkanAPITest.var_2d_biased
[       OK ] VulkanAPITest.var_2d_biased (0 ms)
[ RUN      ] VulkanAPITest.var_3d_unbiased
[       OK ] VulkanAPITest.var_3d_unbiased (2 ms)
[ RUN      ] VulkanAPITest.var_3d_biased
[       OK ] VulkanAPITest.var_3d_biased (2 ms)
[ RUN      ] VulkanAPITest.var_4d_unbiased
[       OK ] VulkanAPITest.var_4d_unbiased (175 ms)
[ RUN      ] VulkanAPITest.var_4d_biased
[       OK ] VulkanAPITest.var_4d_biased (5 ms)
[----------] 6 tests from VulkanAPITest (508 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (508 ms total)
[  PASSED  ] 6 tests.
```

Reviewed By: yipjustin

Differential Revision: D50398925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111965
Approved by: https://github.com/yipjustin
2023-10-27 07:56:01 +00:00
20fc2b4186 [dynamo] Enable typechecking for compiled_autograd.py (#112128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112128
Approved by: https://github.com/Skylion007
ghstack dependencies: #111894, #111992, #112031, #112127
2023-10-27 06:18:58 +00:00
632ac01bef [dynamo] Enable typechecking for exc.py (#112127)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112127
Approved by: https://github.com/Skylion007
ghstack dependencies: #111894, #111992, #112031
2023-10-27 06:18:58 +00:00
6a99291546 Removing sdpa conv layout constraint (#112045)
Previously, layout optimization with sdpa would cause failures because we would pass a non-dense last dim to sdpa. Those layout constraints have been added in prior PRs. Now we can do conv layout optimization with sdpa.

Improves twins_pcpvt_base 1.4622 → 1.5351, xcit_large_24_p8_224 3.0681 → 3.1839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112045
Approved by: https://github.com/shunting314
ghstack dependencies: #111976, #111721
2023-10-27 05:40:43 +00:00
572b66331e [PyTorch][ET] collect comms in ET for send/recv (#111985)
Summary: collect send/recv comms op in Execution Trace

Test Plan:
run param comms with arbitrary collective size to collect operator
send
```
{
      "name": "record_param_comms", "id": 153, "rf_id": 141, "parent": 152, "fw_parent": 0, "seq_id": -1, "scope": 0, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": [[[21,22,0,262144,4,"cuda:0"]],215038,139890792374272,1,"send",[],[]], "input_shapes": [[[262144]],[],[],[],[],[],[]], "input_types": ["GenericList[Tensor(float)]","Int","Int","Int","String","GenericList[]","GenericList[]"],
      "outputs": [[[21,22,0,262144,4,"cuda:0"]]], "output_shapes": [[[262144]]], "output_types": ["GenericList[Tensor(float)]"]
   },
```
recv
```
{
      "name": "record_param_comms", "id": 172, "rf_id": 160, "parent": 171, "fw_parent": 0, "seq_id": -1, "scope": 0, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": [[[138,139,0,262144,4,"cuda:0"]],215042,139890792374272,1,"recv",[],[]], "input_shapes": [[[262144]],[],[],[],[],[],[]], "input_types": ["GenericList[Tensor(float)]","Int","Int","Int","String","GenericList[]","GenericList[]"],
      "outputs": [[[138,139,0,262144,4,"cuda:0"]]], "output_shapes": [[[262144]]], "output_types": ["GenericList[Tensor(float)]"]
    },
```

Differential Revision: D50624443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111985
Approved by: https://github.com/fduwjj
2023-10-27 05:24:04 +00:00
7e5e951dfe [tp] update node meta with partitioned val (#112080)
Test Plan:
buck run mode/opt scripts/feikou/di:export_dummy_model -- --world-size=4

buck run mode/opt scripts/feikou/di:run_model -- --num_gpus=4 --num_iters=1

In sigmoid:
Non-DI:
```
V1025 13:57:16.341391 2225036 run_model.cpp:84] Non-ditributed run outputs:[ 0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:16.341391 2225036 run_model.cpp:84]  0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:16.341391 2225036 run_model.cpp:84]  0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:16.341391 2225036 run_model.cpp:84]  0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:16.341391 2225036 run_model.cpp:84]  0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:16.341391 2225036 run_model.cpp:84] [ CUDAFloatType{5,6} ]]
```
DI:
```
V1025 13:57:26.352564 2226855 run_model.cpp:278] [Rank 3] output wait_tensor_9:  0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:26.352564 2226855 run_model.cpp:278]  0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:26.352564 2226855 run_model.cpp:278]  0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:26.352564 2226855 run_model.cpp:278]  0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:26.352564 2226855 run_model.cpp:278]  0.8350  0.5399  1.0196  0.9286  1.1265  1.0324
V1025 13:57:26.352564 2226855 run_model.cpp:278] [ CUDAFloatType{5,6} ]
```

Differential Revision: D50663481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112080
Approved by: https://github.com/wanchaol
2023-10-27 05:08:33 +00:00
033680c9af [tp] fix PrepareModuleInput for multiple inputs (#112204)
Not all inputs need sharding annotations and conversion to DTensors. If the user annotates only one input and marks the rest as None, we should skip creating DTensors for those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112204
Approved by: https://github.com/fduwjj
2023-10-27 05:08:05 +00:00
a6e556f8b0 Support calling __torch_function__ attribute access (#111737)
Triggers `__torch_function__` tracing on attribute/method/property access matching the eager behavior for non-overridden attributes/methods/properties that are present on `torch.Tensor`.

Some caveats:
1. for methods there doesn't seem to be a way to check if the original implementation of a method is overridden via monkey patching or not. For example:
```
class LocalSubclass(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        return super().__torch_function__(func, types, args, kwargs)

x = torch.ones(2, 2).as_subclass(LocalSubclass)

> x.sigmoid
<built-in method sigmoid of LocalSubclass object at 0x7f8d305bb5e0>
```
There isn't a way to verify that this built-in method is equivalent to the base `torch.Tensor` implementation as each instance will have a different built-in method object that can't be traced back to the original `torch.Tensor` impl. You can check that the class itself has the original implementation via
```
> inspect.getattr_static(LocalSubclass, "sigmoid")
<method 'sigmoid' of 'torch._C.TensorBase' objects>
```
But we can't detect if the user dynamically patches an object with a built-in method called sigmoid which does something completely different.

2. If a user overrides a method but calls the original implementation we will still graph break. This will require modifying `SuperVariable` (and any other way to get the original impl) to handle tensor subclasses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111737
Approved by: https://github.com/jansel, https://github.com/ezyang
2023-10-27 04:57:19 +00:00
589625cbae Add bandwidth to extern kernel calc (#110539)
Summary: Modify the result of get_estimated_runtime() for ExternKernelSchedulerNode to count both bytes and FLOPs and return the maximum of the two.

Reviewed By: xmfan

Differential Revision: D48987490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110539
Approved by: https://github.com/xw285cornell
2023-10-27 04:46:24 +00:00
c84dbd2c03 [2D] Enable 2D optimizer set_state_dict() (#111778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111778
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #111774
2023-10-27 04:33:00 +00:00
aa9e65d8f5 [DCP] Add fsspec.transaction context when writing checkpoint to storage (#112191)
Summary: Adding fsspec.transaction to safeguard checkpointing writing. With the context, it should only commit if there was no exception and discard otherwise.

Test Plan:
```
command: buck test @//mode/dev-nosan  //caffe2/test/distributed/checkpoint/fb:test_fsspec_filesystem -- --print-passing-details
```

Reviewed By: rohan-varma

Differential Revision: D50701929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112191
Approved by: https://github.com/rohan-varma
2023-10-27 04:27:29 +00:00
7cb72704cc Constrain sdpa to fx strides (#111721)
Fix for https://github.com/pytorch/pytorch/issues/109607. sdpa requires last dimension strides to be 1. Add constraint so that we run the op with the strides we observed in tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111721
Approved by: https://github.com/drisspg, https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #111976
2023-10-27 03:23:27 +00:00
94e90c199c [dtensor] fix pointwise op linearity with strategy (#112107)
This PR fixes pointwise op strategy linearity and switches the linear pointwise ops to use the strategy. It also adds tests showing that with the new approach we can enable full-shard (S(0), S(0))-like operations.

Why is this useful? For 2-D-parallel-like patterns where the named parameters are possibly fully sharded on all devices, [S(0), S(0)], [S(1), S(0)], etc. need to work. Since we don't use the sharding rules anymore, this is now possible.

@awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112107
Approved by: https://github.com/wz337
2023-10-27 02:41:45 +00:00
64fd027f2e Revert "[inductor] benchmark fusion (#108193)"
This reverts commit 73cc5d1cdda118007ccdb0be8d775ba76726596e.

Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/izaitsevfb due to Trying to unblock the revert of #108690, please rebase and reland. ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1782157638))
2023-10-27 01:40:06 +00:00
0a3199dd7e Revert "Readded device_assert skipping in index and index_put (and also added (#112093)"
This reverts commit e38347f490ae14bf96913a19e7dab9b5e752c276.

Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/izaitsevfb due to Sorry, trying to resolve a conflict with intern, and unblock the revert of #108690 ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1782154814))
2023-10-27 01:37:33 +00:00
797d7100de Revert "[quant][pt2e][be] Cleanup observer insertion logic (#111828)"
This reverts commit bf998a2c5d549cf4856c7becfca4a169bf68b709.

Reverted https://github.com/pytorch/pytorch/pull/111828 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111828#issuecomment-1782154648))
2023-10-27 01:35:27 +00:00
ac4cc5dbea [Dynamo] Do not crash if numpy is not installed (#112175)
`s/isinstance(value, np.generic)/np is not None and isinstance(value, np.generic)/`

Found while looking at https://github.com/pytorch/pytorch/pull/110512
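
In effect, a guard like the following (a standalone sketch, not the actual Dynamo code):

```python
# Standalone sketch of the guard: `np` is None when numpy is absent, and the
# `and` short-circuits before touching np.generic.
try:
    import numpy as np
except ImportError:
    np = None

def is_numpy_scalar(value) -> bool:
    return np is not None and isinstance(value, np.generic)
```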

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112175
Approved by: https://github.com/ev-br, https://github.com/kit1980
2023-10-27 00:39:28 +00:00
22221c6d60 Revert "Trigger specialization when you call size()/stride() from C++ (#111935)"
This reverts commit 5846705e36795d76941e18073e49c6edba90c994.

Reverted https://github.com/pytorch/pytorch/pull/111935 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111935#issuecomment-1782107024))
2023-10-27 00:23:03 +00:00
1569df7f01 Don't search getitem for batch fusions (#112088)
Batch mm fusion regressed optimizer compile time by about 1 minute; excluding getitem from the search solves this problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112088
Approved by: https://github.com/yanboliang
2023-10-27 00:13:55 +00:00
5b71834785 Avoid c++ exception and stack trace (#111438)
Summary:
Raising an exception here causes pybind11's dispatcher to kick in, which causes aiplatform's logic to kick in (aiplatform::error_reporting::util::printAddressesWithBestEffortLocationInfo), which ultimately uses `folly::symbolizer::Symbolizer::symbolize` to build up the stack trace.  In 3.8 this uses about 3.62% of the CPU time per pyperf (https://fburl.com/scuba/pyperf_experimental/on_demand/oi554uvy).  In Cinder 3.8, for some reason, this is worse, using 5.94% of the CPU.

This exception is happening when doing a hasattr() on `prims` for things like `bitwise_left_shift` which don't exist: https://www.internalfb.com/code/fbsource/[2d695f650d00]/fbcode/caffe2/torch/_inductor/lowering.py?lines=590

That exception is ultimately going to be swallowed anyway, and the stack trace has no meaningful value. Furthermore, because this is an expected outcome in the code rather than some random C++ exception, the stack trace is less valuable as well.

This change returns (None, None) on the failure case instead of a valid op/overload list, avoiding the exception and reclaiming the 3.62%-5.94% of CPU time.
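
For illustration, a hedged sketch of the pattern (the function and names here are hypothetical, not the actual lowering code):

```python
# Hypothetical sketch of the pattern: return a sentinel instead of raising,
# so hasattr()-style probes for missing ops stay cheap.
def find_op_and_overloads(namespace, name):
    packet = getattr(namespace, name, None)
    if packet is None:
        # Previously this raised, paying for a pybind11 exception plus
        # folly symbolizer stack-trace construction on every miss.
        return None, None
    return packet, packet.overloads()
```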

Test Plan: Existing CI and perf run: https://fburl.com/scuba/pyperf_experimental/on_demand/oi554uvy

Differential Revision: D50018789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111438
Approved by: https://github.com/davidberard98
2023-10-26 23:55:34 +00:00
acd02a60d5 Add a test making sure we are not importing SymPy when importing torch (#112038)
As per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112038
Approved by: https://github.com/malfet, https://github.com/peterbell10
ghstack dependencies: #112035, #112036, #112037
2023-10-26 23:32:27 +00:00
47ccf04885 Split SymNode into its own file (#112037)
This PR:

- Moves TrueDiv, LShift, RShift, IsNonOverlappingAndDenseIndicator to `_sympy.functions.py`
- Moves SymNode to `fx.experimental.sym_node`.
  - This file does not have any SymPy dependencies at import time
  - It installs the magic methods in Sym{Bool,Int,Float}.
  - N.b. With this split, we may be able to move Sym{Bool,Int,Float} to this file, and remove quite a few of the hacks around these classes
- Imports `sym_node` in `torch/__init__.py` rather than the whole `symbolic_shapes.py`.
  This breaks the import-time dependency between torch and SymPy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112037
Approved by: https://github.com/peterbell10
ghstack dependencies: #112035, #112036
2023-10-26 23:32:27 +00:00
deac5357db Make proxy_tensor.py not depend on SymPy (#112036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112036
Approved by: https://github.com/malfet, https://github.com/peterbell10
ghstack dependencies: #112035
2023-10-26 23:32:19 +00:00
4f7f46ee35 Move SymDispatchMode to its own file (#112035)
This is just code movement + a getter and a setter to break the
dependency of SymDispatchMode, and in turn, ProxySymDispatchMode on
sympy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112035
Approved by: https://github.com/peterbell10
2023-10-26 23:32:11 +00:00
55ab9932f5 Revert "Constrain sdpa to fx strides (#111721)"
This reverts commit 8a7c3cec78686e661b3781b916a8aae59083f90a.

Reverted https://github.com/pytorch/pytorch/pull/111721 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking ROCm job in trunk 8a7c3cec78 ([comment](https://github.com/pytorch/pytorch/pull/111721#issuecomment-1782064133))
2023-10-26 23:27:57 +00:00
4a94f77c8e Revert "Make numpy/lib vendored tests dynamo traceable (#112147)"
This reverts commit 190b6e4ba88f6cf00d0bd08d6212a3fe6bb76eaa.

Reverted https://github.com/pytorch/pytorch/pull/112147 on behalf of https://github.com/huydhn due to Sorry for reverting this again, but this is failing in trunk 190b6e4ba8 ([comment](https://github.com/pytorch/pytorch/pull/112147#issuecomment-1782056995))
2023-10-26 23:23:49 +00:00
73cc5d1cdd [inductor] benchmark fusion (#108193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193
Approved by: https://github.com/jansel
2023-10-26 22:18:37 +00:00
e660bd1422 Re-enable some embedded bag tests (#111712)
They were temporary disabled in 2019 by  https://github.com/pytorch/pytorch/pull/26599

As suggested, increased relative tolerance from 0 to 2% when tests are using float16 dtype

### 🤖 Generated by Copilot at 1e49d84

> _`TestEmbeddingNN`_
> _CUDA tests restored_
> _Bug fixed in autumn breeze_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111712
Approved by: https://github.com/huydhn
2023-10-26 22:16:38 +00:00
190b6e4ba8 Make numpy/lib vendored tests dynamo traceable (#112147)
Follow-up to https://github.com/pytorch/pytorch/pull/112146 and #112141: make numpy/lib vendored tests dynamo traceable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112147
Approved by: https://github.com/lezcano
2023-10-26 21:41:22 +00:00
abe172e268 Revert "Cleanup error reporting for ProcessGroupNCCL (#111979)"
This reverts commit b29c658265d6b95d8ec77f7052eff4f25190fbfc.

Reverted https://github.com/pytorch/pytorch/pull/111979 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing multigpu test in trunk b29c658265 ([comment](https://github.com/pytorch/pytorch/pull/111979#issuecomment-1781919184))
2023-10-26 21:29:40 +00:00
d91a18c433 Grandfather in torchgen'ed aten ops to torch.Tag.pt2_compliant_tag (#112053)
In torchgen, we add the pt2_compliant_tag to all aten ops.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112053
Approved by: https://github.com/soulitzer
2023-10-26 21:21:09 +00:00
27cf49549a [dynamo] ExecutorchCallDelegateHigherOrderVariable - add sanity check that input and output tensors are disjoint (#111960)
Fixes https://github.com/pytorch/pytorch/issues/111917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111960
Approved by: https://github.com/zou3519
2023-10-26 21:13:05 +00:00
73f36e44fb [aotinductor] Add a debug compile flag (#112021)
Summary: When the debug compile flag is specified, model.so is compiled with "-O0 -g".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112021
Approved by: https://github.com/chenyang78
ghstack dependencies: #111823
2023-10-26 21:11:08 +00:00
f66cc67562 [aotinductor] Fix duplicated unbacked symbol declarations (#111823)
Summary: For https://github.com/pytorch/pytorch/issues/111711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111823
Approved by: https://github.com/ezyang, https://github.com/aakhundov
2023-10-26 21:11:08 +00:00
f839a5627b Add bf16 support to replicate padding (#112099)
Fixes #99433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112099
Approved by: https://github.com/mikaylagawarecki
2023-10-26 20:30:49 +00:00
8a7c3cec78 Constrain sdpa to fx strides (#111721)
Fix for https://github.com/pytorch/pytorch/issues/109607. sdpa requires last dimension strides to be 1. Add constraint so that we run the op with the strides we observed in tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111721
Approved by: https://github.com/drisspg, https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #111976
2023-10-26 20:21:55 +00:00
1b702b185e [pytorch-vulkan] disable one zero-dim tensor test to fix test (#112087)
Summary:
D50347338 has a bug on Android (not Mac, not Devserver).

This diff disables the test for the time being while I identify the actual cause.

Test Plan:
##  Compile on devserver

```
[yipjustin@129360.od ~/fbsource (e415d865c)]$ buck2 build -c ndk.static_linking=true -c pt.enable_qpl=0  --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_api_test_binAndroid  --show-output
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/99d47e63-ed6e-4db9-bee2-24909d647b78
Network: Up: 3.2KiB  Down: 67KiB  (reSessionID-459e359b-773c-48a4-b129-81fde7c5e876)
Jobs completed: 4664. Time elapsed: 7.3s.
Cache hits: 100%. Commands: 38 (cached: 38, remote: 0, local: 0)
BUILD SUCCEEDED
fbsource//xplat/caffe2:pt_vulkan_api_test_binAndroid buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_api_test_binAndroid__/pt_vulkan_api_test_binAndroid
```

## Run test.
adb shell /data/local/tmp/pt_vulkan_api_test_binAndroid | pastry

Result: P864940908
```
...
[       OK ] VulkanAPITest.lstm_success (7 ms)
[ RUN      ] VulkanAPITest.lstm_mclareninputs_success
[       OK ] VulkanAPITest.lstm_mclareninputs_success (56 ms)
[ RUN      ] VulkanAPITest.lstm_prepack_success
[       OK ] VulkanAPITest.lstm_prepack_success (7 ms)
[ RUN      ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7568: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 391 tests from VulkanAPITest (30715 ms total)
[----------] Global test environment tear-down
[==========] 391 tests from 1 test suite ran. (30715 ms total)
[  PASSED  ] 390 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
  YOU HAVE 7 DISABLED TESTS

```

Reviewed By: liuk22

Differential Revision: D50668570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112087
Approved by: https://github.com/izaitsevfb, https://github.com/SS-JIA
2023-10-26 19:48:40 +00:00
5e5329155e [aotinductor] only include -lc10 for non-fbcode case (#112125)
Summary: otherwise, we would break internal uses

Differential Revision: D50681467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112125
Approved by: https://github.com/swolchok, https://github.com/desertfire, https://github.com/SherlockNoMad
2023-10-26 19:47:08 +00:00
3a284dae30 Revert "Do not materialize entire randperm in RandomSampler (#103339)"
This reverts commit d80174e2db679365f8b58ff8583bdc4af5a8b74c.

Reverted https://github.com/pytorch/pytorch/pull/103339 on behalf of https://github.com/kit1980 due to Cause issues on MPS, and also fails without numpy ([comment](https://github.com/pytorch/pytorch/pull/103339#issuecomment-1781705172))
2023-10-26 18:53:14 +00:00
b7affa2ac3 Add unit test for ONNX models with torch.distributions.normal.Normal (#111498)
Fixes #111034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111498
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2023-10-26 17:57:34 +00:00
8bc0b382fa [HigherOrderOp] Move map_impl to torch.ops.higher_order (#111404)
The purpose of this pr is as titled. Because of some misusage of ghstack, ghimport, and export to github from internal, the stack of https://github.com/pytorch/pytorch/pull/111092 is a mess. I'll try to land them one by one. This is a replacement for #111092 and #111400.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111404
Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519
2023-10-26 16:59:10 +00:00
f6f81a5969 Update get-workflow-job-id to also return job name (#112103)
Then we can use this job name in `filter-test-configs` if it's available.  This addresses the issue in which `filter-test-configs` on GitHub runners (MacOS x86) couldn't find the runner log to get the job name.  This is expected because GitHub runners are isolated, so a job should not be able to access runner logs, which could contain information from other jobs.

This allows all missing features depending on running `filter-test-configs` on GitHub runners:
* Rerun disabled tests and memory leak check. For example, this would help avoid closing https://github.com/pytorch/pytorch/issues/110980#issuecomment-1779806466 early with the disabled test running properly on MacOS x86
* MacOS x86 jobs can now be disabled or marked as unstable

I keep the current logic to parse the log as a fallback because it's working fine on self-hosted runners.  That also handles the case where `get-workflow-job-id` fails.  Also I move the rest of `get-workflow-job-id` up before the test step like https://github.com/pytorch/pytorch/pull/111483

### Testing

Spot checks some jobs to confirm they have the correct names:

* MacOS M1 test job https://github.com/pytorch/pytorch/actions/runs/6648305319/job/18065275722?pr=112103#step:10:8
* MacOS x86 build job https://github.com/pytorch/pytorch/actions/runs/6648306305/job/18065138137?pr=112103#step:9:14
* Linux test job https://github.com/pytorch/pytorch/actions/runs/6648300991/job/18065354503?pr=112103#step:13:7
* Windows test job https://github.com/pytorch/pytorch/actions/runs/6648305319/job/18065599500?pr=112103#step:12:7
* MacOS x86 test job https://github.com/pytorch/pytorch/actions/runs/6648306305/job/18066312801#step:10:8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112103
Approved by: https://github.com/clee2000
2023-10-26 16:42:46 +00:00
485cc0faae Revert "[inductor] benchmark fusion (#108193)"
This reverts commit ec0cdcdf6a816eadb4d868284eea86732f50da2e.

Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1781473282))
2023-10-26 16:41:20 +00:00
7da713bbaf Convert evaluate_expr GuardOnDataDependentSymNode into graph break (#111919)
Extracted this failure from
https://github.com/pytorch/pytorch/pull/110155

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111919
Approved by: https://github.com/lezcano
2023-10-26 16:28:00 +00:00
036abd43b3 [dynamo] Preserve node names in export (#111947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111947
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2023-10-26 16:11:35 +00:00
b126adcdee [aotinductor] Pass TorchIR to AOTInductor (#110020)
Updates `_export.aot_compile` to pass a torch IR graph to inductor, allowing inductor to now run the pre_grad_passes, and reuse more of inductor's code.
Also updates the API to return only the `so_path`, rather than also returning the exported program. The pytree call spec is now serialized and placed inside the generated model code. When calling the model, because there is no C++ pytree implementation linked yet, we can access the call specs through `get_call_spec()` and call pytree flatten/unflatten in Python.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110020
Approved by: https://github.com/desertfire
2023-10-26 15:54:31 +00:00
ed2cc4dd59 TST: make torch_np added tests dynamo traceable (#112149)
Follow-up to https://github.com/pytorch/pytorch/pull/112146, https://github.com/pytorch/pytorch/pull/112141 and https://github.com/pytorch/pytorch/pull/112147: make torch_np added tests dynamo traceable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112149
Approved by: https://github.com/lezcano
2023-10-26 15:36:36 +00:00
42e4c648a2 New @decorateIf decorator for param-specific conditional decoration (#112033)
Adds a new decorator `@decorateIf(decorator, predicate_fn)`. Examples:
```python
from torch.testing._internal.common_utils import decorateIf
...

@decorateIf(unittest.skip, lambda params: params["x"] == 2)
@parametrize("x", range(5))
def test_foo(self, x):
    ...

@parametrize("x,y", [(1, 'foo'), (2, 'bar'), (3, 'baz')])
@decorateIf(
    unittest.expectedFailure,
    lambda params: params["x"] == 3 and params["y"] == "baz"
)
def test_bar(self, x, y):
    ...

@decorateIf(
    unittest.expectedFailure,
    lambda params: params["op"].name == "add" and params["dtype"] == torch.float16
)
@ops(op_db)
def test_op_foo(self, device, dtype, op):
    ...

@decorateIf(
    unittest.skip,
    lambda params: params["module_info"].module_cls is torch.nn.Linear and \
        params["device"] == "cpu"
)
@modules(module_db)
def test_module_foo(self, device, dtype, module_info):
    ...
```

Follow-up for per-param decoration based on https://github.com/pytorch/pytorch/issues/79161#issuecomment-1152487359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112033
Approved by: https://github.com/clee2000, https://github.com/pmeier
2023-10-26 14:39:59 +00:00
7671be8108 [aotinductor] allow generating default args in fbcode (#112085)
Summary:
Previously, we wanted to maintain forward compatibility by skipping
default args in the serialized artifacts in fbcode. However, some of our shim
interfaces require default values to be set. Discussed with Sherlock offline,
and we decided to allow serializing default args into the C++ wrapper code
for now. We will refine this part if we see a real FC requirement.

Test Plan: ci

Differential Revision: D50638663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112085
Approved by: https://github.com/SherlockNoMad
2023-10-26 14:17:54 +00:00
c8a5bb451e Do not import sympy within torch._prims_common (#112034)
This is the first of a few PRs that avoid importing SymPy at import time.
The pitch here is that we (almost!) do not have SymPy on our API, so
this should be feasible.

This should speed-up torch imports by a good 15% as per
https://dev-discuss.pytorch.org/t/delving-into-what-happens-when-you-import-torch/1589

In this PR we just move a few global imports into local imports.
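
A minimal sketch of the pattern, with an assumed helper name for illustration:

```python
# Minimal sketch of the pattern: the sympy import moves from module scope into
# the function body, so `import torch` no longer pays for it. The helper name
# is ours, for illustration.
def is_sympy_expr(x) -> bool:
    import sympy  # deferred: only imported when this code path actually runs
    return isinstance(x, sympy.Basic)
```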
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112034
Approved by: https://github.com/ezyang
2023-10-26 12:53:25 +00:00
d6724a51f9 [dynamo] md5 hash non compile_ignored configs (#111298)
fixes: https://github.com/pytorch/pytorch/issues/111235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111298
Approved by: https://github.com/ezyang
ghstack dependencies: #111303
2023-10-26 10:59:10 +00:00
1c89ea7f72 Add Half support for softmax and log_softmax on CPU (#103315)
Add Half support for softmax and log_softmax on CPU.
Note: This introduces a correctness issue with MPS https://github.com/pytorch/pytorch/issues/111416 and https://github.com/pytorch/pytorch/issues/111479.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103315
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/malfet
2023-10-26 08:38:54 +00:00
fbff99ffea Add regex matching to Inductor all2all collective unit tests (#112077)
Fixes #111776

Support check_regex in FileCheck() by adding `find_regex` to `struct TORCH_API StringCordView`.
The callsite accepts std::regex RE syntax.

However, I haven't figured out submatch ID yet.
For example, "buf5[0], buf6_inputs[0]" is still considered a match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112077
Approved by: https://github.com/yf225
2023-10-26 08:29:30 +00:00
395614c1a4 keep sync bn training flag same with converted bn's training flag (#111998)
When converting BatchNorm to SyncBatchNorm, we need to keep the SyncBatchNorm's training flag consistent with the original BatchNorm's flag. The motivation: the given model may have the training flag set on some BatchNorm layers but not on others, and after converting to SyncBatchNorm we hope not to change this behavior. A short sketch follows.
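
A minimal sketch of the intended behavior after this change:

```python
# Per-layer training flags should survive the conversion.
import torch

model = torch.nn.Sequential(torch.nn.BatchNorm1d(4), torch.nn.BatchNorm1d(4))
model[0].eval()  # only the first BN is in eval mode

synced = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
assert synced[0].training is False
assert synced[1].training is True
```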

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111998
Approved by: https://github.com/mikaylagawarecki
2023-10-26 08:18:08 +00:00
e38347f490 Readded device_assert skipping in index and index_put (and also added (#112093)
copy to noop pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093
Approved by: https://github.com/oulgen, https://github.com/lezcano
ghstack dependencies: #111990
2023-10-26 07:54:44 +00:00
d090c18fca [dynamo] annotate config with @compile_ignored (#111303)
Fixes: #111221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111303
Approved by: https://github.com/ezyang
2023-10-26 05:41:29 +00:00
89bd17552d [dynamo] Enable typechecking for funcname_cache.py (#112031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112031
Approved by: https://github.com/Skylion007
ghstack dependencies: #111894, #111992
2023-10-26 04:54:16 +00:00
413baa1b25 [dynamo] Enable typechecking for codegen.py (#111992)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111992
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #111894
2023-10-26 04:54:16 +00:00
e67d2c9825 [dynamo] Enable typechecking for allowed_functions.py (#111894)
Motivation: MYPYNOFOLLOW currently typechecks almost all inductor files
and some dynamo files as well. However, it has `follow_imports=skip`
enabled which greatly nerfs its effectiveness. I would like to enable
import following for all the files currently checked by MYPYNOFOLLOW.
But that leads to a lot of new errors in other files.

I can exclude errors from files in other directories, but it is somewhat
difficult to do that for dynamo and inductor files themselves. Thus I am
making sure all the dynamo files typecheck first.

Note on changes: I could not type the return value of
`make_function_id_set` since it was returning a class defined in the
function body. Thus I deleted `make_function_id_set` and replaced it
with a direct construction of the `FunctionIdSet` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111894
Approved by: https://github.com/Skylion007, https://github.com/eellison
2023-10-26 04:54:16 +00:00
b61efe1c2b Fix `torch.[size|stride](dim=None)` invocation (#111991)
Per documentation, one should be able to explicitly pass the dim argument as None to get the tensor size across all dimensions (or strides), but before this change it was incorrectly interpreted as a named-tensor call.

Modify `size` and `stride` signatures generated by `gen_pyi.py` to highlight that overload with `None` will return a Tuple, but one with `dim: _int` returns `int`.

Add regression test to validate the behavior, and remove the check for asserts from two named tensors tests (NamedTensors are dead, aren't they?)
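
A short sketch of the now-valid calls, assuming the documented behavior:

```python
# Passing dim=None explicitly returns the full size/stride tuples.
import torch

t = torch.randn(2, 3)
assert t.size(dim=None) == torch.Size([2, 3])  # Tuple overload
assert t.size(dim=0) == 2                      # int overload
assert t.stride(dim=None) == (3, 1)            # same pattern for stride
```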

Fixes https://github.com/pytorch/pytorch/issues/111944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111991
Approved by: https://github.com/zou3519
2023-10-26 04:14:35 +00:00
ec0cdcdf6a [inductor] benchmark fusion (#108193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193
Approved by: https://github.com/jansel
2023-10-26 04:14:22 +00:00
edafe2ddb9 [dynamo] Be stricter about HigherOrderOperator kwargs (#111938)
kwargs need to be handled carefully in speculate subgraph. We should be clearer about the contract of what the inputs are.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111938
Approved by: https://github.com/zou3519
2023-10-26 03:51:30 +00:00
2aaa7e542c AOTAutograd: avoid intermediate_base logic when all aliased outputs came from a multi_output_view (#111411)
Partially addresses https://github.com/pytorch/pytorch/issues/111081

This fixes the majority of the slowness from https://fb.workplace.com/groups/1405155842844877/permalink/7491314274228973/. In particular, the type of example that suffers the most perf-wise in AOTAutograd looks like this:
```
@torch.compile
def f(x):
    intermediate = x.mul(2)
    outs = intermediate.unbind(0)
    return *outs

x = torch.randn(50, 50, requires_grad=True)
outs = f(x)
sum(outs).sum().backward()
```

There are 50 output tensors in the above function, that all alias each other. AOTAutograd will dutifully exercise its intermediate base [logic](https://github.com/pytorch/pytorch/blob/main/torch/_functorch/aot_autograd.py#L294), and try to regenerate the aliases outside of the compiled `autograd.Function` at runtime, to ensure that the autograd engine is aware of the aliasing.

In this case, this will result in **50 AsStridedBackward nodes in the backward**, because we will fall back to using as_strided to generate each of those 50 outputs. The current PR as is (somewhat unsafely) ensures that the backward graph consists of a single `UnbindBackward`, or a call to `aten.cat()`.

I left a long comment in the code describing the situation, but the core idea is that **autograd does not let you mutate grad_fn of tensor aliases that come from multi-output views**. So if we have `k` outputs that alias each other, but `k-1` of them are aliases that came from multi-output views, then in eager mode, it would not be possible to mutate one of the aliases in a way that would change the grad_fn of any of the other aliases, without causing an error in the backward. So the claim I'm making is that if we hide this aliasing from the autograd engine, then it is impossible for the user to perform any mutations that would cause autograd metadata to diverge between torch.compile and eager in a way that isn't an error in eager mode.

To be fair, I think that taking the approach outlined in https://docs.google.com/document/d/1DlfFq8TKbuAn2zyJxLfoW-X1qkkm5PLdHFtySo03QAk/edit would also help us avoid the as_strided calls in this particularly egregious case, **and** keep the autograd error messages. This relies on both pre-dispatch functionalization being fully hardened **and** adding some pretty invasive changes to AOTAutograd though, and is probably at least several months out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111411
Approved by: https://github.com/ezyang
2023-10-26 02:54:50 +00:00
28c0b07d19 [ROCm] remove HCC references (#111975)
- rename `__HIP_PLATFORM_HCC__` to `__HIP_PLATFORM_AMD__`
- rename `HIP_HCC_FLAGS` to `HIP_CLANG_FLAGS`
- rename `PYTORCH_HIP_HCC_LIBRARIES` to `PYTORCH_HIP_LIBRARIES`
- workaround in tools/amd_build/build_amd.py until submodules are updated

These symbols have had a long deprecation cycle and will finally be removed in ROCm 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111975
Approved by: https://github.com/ezyang, https://github.com/hongxiayang
2023-10-26 02:39:10 +00:00
f1785373c0 Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)
Part of #109802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD
2023-10-26 02:39:06 +00:00
7a3a00bb0b [inductor] Remove redundant views (#111773)
As a follow-up to https://github.com/pytorch/pytorch/pull/110740, this patches enables removing redundant complex views to allow more operation fusing.

E.g,  given

```
@torch.compile
def foo(X, Y):
    Z = X + Y
    A = X + Y
    return A + Z
```

the generated code is:

```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tmp3 = tmp2 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)
''')

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (3, ), (1, ))
    assert_size_stride(arg1_1, (3, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0) # no-op to ensure context
        # Source Nodes: [A], Original ATen: [aten.add]
        buf0 = aten.view.dtype(arg0_1, torch.float32)
        del arg0_1
        buf1 = buf0
        del buf0
        # Source Nodes: [A], Original ATen: [aten.add]
        buf2 = aten.view.dtype(arg1_1, torch.float32)
        del arg1_1
        buf3 = buf2
        del buf2
        buf4 = empty_strided((6, ), (1, ), device='cuda', dtype=torch.float32)
        # Source Nodes: [add_2], Original ATen: [aten.add]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_0.run(buf1, buf3, buf4, 6, grid=grid(6), stream=stream0)
        del buf1
        del buf3
        # Source Nodes: [add_2], Original ATen: [aten.add]
        buf5 = aten.view.dtype(buf4, torch.complex64)
        del buf4
        buf6 = buf5
        del buf5
        return (buf6, )
```

whereas previously the generated code was:

```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (3, ), (1, ))
    assert_size_stride(arg1_1, (3, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0) # no-op to ensure context
        # Source Nodes: [A], Original ATen: [aten.add]
        buf0 = aten.view.dtype(arg0_1, torch.float32)
        buf1 = buf0
        del buf0
        # Source Nodes: [A], Original ATen: [aten.add]
        buf2 = aten.view.dtype(arg1_1, torch.float32)
        buf3 = buf2
        del buf2
        buf4 = empty_strided((6, ), (1, ), device='cuda', dtype=torch.float32)
        # Source Nodes: [A], Original ATen: [aten.add]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_0.run(buf1, buf3, buf4, 6, grid=grid(6), stream=stream0)
        del buf1
        del buf3
        # Source Nodes: [A], Original ATen: [aten.add]
        buf5 = aten.view.dtype(buf4, torch.complex64)
        buf6 = buf5
        del buf5
        # Source Nodes: [add_2], Original ATen: [aten.add]
        buf7 = aten.view.dtype(buf6, torch.float32)
        del buf6
        buf8 = buf7
        del buf7
        # Source Nodes: [Z], Original ATen: [aten.add]
        buf9 = aten.view.dtype(arg0_1, torch.float32)
        del arg0_1
        buf10 = buf9
        del buf9
        # Source Nodes: [Z], Original ATen: [aten.add]
        buf11 = aten.view.dtype(arg1_1, torch.float32)
        del arg1_1
        buf12 = buf11
        del buf11
        buf13 = buf4; del buf4  # reuse
        # Source Nodes: [Z], Original ATen: [aten.add]
        triton_poi_fused_add_0.run(buf10, buf12, buf13, 6, grid=grid(6), stream=stream0)
        del buf10
        del buf12
        # Source Nodes: [Z], Original ATen: [aten.add]
        buf14 = aten.view.dtype(buf13, torch.complex64)
        buf15 = buf14
        del buf14
        # Source Nodes: [add_2], Original ATen: [aten.add]
        buf16 = aten.view.dtype(buf15, torch.float32)
        del buf15
        buf17 = buf16
        del buf16
        buf18 = buf13; del buf13  # reuse
        # Source Nodes: [add_2], Original ATen: [aten.add]
        triton_poi_fused_add_0.run(buf8, buf17, buf18, 6, grid=grid(6), stream=stream0)
        del buf17
        del buf8
        # Source Nodes: [add_2], Original ATen: [aten.add]
        buf19 = aten.view.dtype(buf18, torch.complex64)
        del buf18
        buf20 = buf19
        del buf19
        return (buf20, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111773
Approved by: https://github.com/jansel
2023-10-26 02:37:17 +00:00
64d75f72d4 [fx] Add a faster method for inserting positional argument. (#111974)
Summary:
Traditionally, when a user wants to update the arguments of an FX node, the only way is to call the setter of the .args property on the node. This may be problematic when we insert a lot of arguments: because of the semantics of the setter method, it has worst-case O(n) complexity.

Adding a new insert_arg provides two benefits:
1. The operation is guaranteed to be O(1) cost.
2. The user can express the intention more directly, instead of writing code like `node.args = (arg,) + node.args` (see the sketch below).
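
A minimal sketch of the two idioms (the helper name is ours; `Node.insert_arg` is the API this diff adds):

```python
# `Node.insert_arg` (this diff) is O(1); assigning through the `.args` setter
# rebuilds the whole argument tuple, worst-case O(n).
from torch.fx import Node

def prepend_arg(node: Node, arg) -> None:
    # old idiom from the summary above:
    #   node.args = (arg,) + node.args
    node.insert_arg(0, arg)
```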

Test Plan: caffe2/test:fx -- -r test_insert_arg

Reviewed By: suo

Differential Revision: D50574435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111974
Approved by: https://github.com/angelayi
2023-10-26 02:30:42 +00:00
b29c658265 Cleanup error reporting for ProcessGroupNCCL (#111979)
Continuing some of the work from https://github.com/pytorch/pytorch/pull/108191, I realized the majority of errors raised from ProcessGroupNCCL were just generic RuntimeErrors.

In this PR, I've added appropriate error types to all the exceptions raised from ProcessGroupNCCL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111979
Approved by: https://github.com/fduwjj
2023-10-26 01:39:54 +00:00
74adb4cccc Updated flop counter to accept pytree inputs/outputs (#111990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111990
Approved by: https://github.com/ezyang
2023-10-26 01:25:27 +00:00
d641450180 Revert "[cpu][inductor] improve cpu vec implementations of log (#111898)"
This reverts commit b5703203647644176220676af0e8e5f23de8d45a.

Reverted https://github.com/pytorch/pytorch/pull/111898 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111898#issuecomment-1780263780))
2023-10-26 01:12:19 +00:00
3831cf4891 TST: make test_multiarray traceable by Dynamo (#112084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112084
Approved by: https://github.com/lezcano
ghstack dependencies: #112081, #112082, #112083
2023-10-26 01:03:45 +00:00
a4e4f41cce MAINT: graph break on numpy.__version__ (#112083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112083
Approved by: https://github.com/lezcano
ghstack dependencies: #112081, #112082
2023-10-26 01:03:45 +00:00
7352c88f58 TST: add x{pass,fail}IfTorchDynamo (#112082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112082
Approved by: https://github.com/lezcano
ghstack dependencies: #112081
2023-10-26 01:03:45 +00:00
5b7caf31c1 CI: remove numpy_torch_interop from CI (#112081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112081
Approved by: https://github.com/lezcano
2023-10-26 01:03:45 +00:00
d8e19bb03a Revert "[2D] Enable 2D optimizer set_state_dict() (#111778)"
This reverts commit 52eec50d31976519a5b1b75993d4945927bcc92f.

Reverted https://github.com/pytorch/pytorch/pull/111778 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing multigpu test in trunk 52eec50d31 ([comment](https://github.com/pytorch/pytorch/pull/111778#issuecomment-1780227820))
2023-10-26 00:18:30 +00:00
0ed461ae4c [dynamo] Ensure Dynamo uses this graph's fakes for Tensor example_values (#111954)
Fixes https://github.com/pytorch/pytorch/issues/111869, Fixes (detailed list of cases handled): https://github.com/pytorch/pytorch/pull/111913#discussion_r1370267313, fully fixes: https://github.com/pytorch/pytorch/issues/111873

Adds sanity checks ensuring that Dynamo uses this graph's fakes for Tensor `example_values`

Handles the main (and only?) entrypoints for new `FakeTensor`s in a Dynamo graph:
- `wrap_fx_proxy_cls`
- `VariableTracker.wrap_tensor`

Ensures that `get_fake_value` returns a fake except when we know we are going to properly wrap non-fakes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111954
Approved by: https://github.com/ezyang
2023-10-25 23:54:18 +00:00
17b732eb04 increase CPU memory requirement for test_nll_loss_large (#110963)
Running `python test_nn.py -v -k test_nll_loss_large_tensor` on a machine with small host RAM availability (e.g. ~50GB) fails with a `SIGKILL` even though the currently specified memory requirements for CPU (and GPU) are set to 48GB and are thus met.

Profiling the peak memory usage via:
```
\time -v python test_nn.py -v -k test_nll_loss_large_tensor
```
and adding `print(torch.cuda.memory_summary())` at the end of the test shows a higher host RAM usage of >100GB and a device memory usage of ~32GB.
```
	Command being timed: "python test_nn.py -v -k test_nll_loss_large_tensor"
	User time (seconds): 81.66
	System time (seconds): 229.02
	Percent of CPU this job got: 671%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.30
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 118150096
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 90280839
	Voluntary context switches: 1669
	Involuntary context switches: 1214548
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```
```
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| Active memory         |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  32774 MiB |  32774 MiB |  81938 MiB |  49164 MiB |
|       from large pool |  32772 MiB |  32772 MiB |  81930 MiB |  49158 MiB |
|       from small pool |      2 MiB |      2 MiB |      8 MiB |      6 MiB |
|---------------------------------------------------------------------------|
...
```

We haven't seen this issue before as the majority of our runners have sufficient host RAM and I just ran into it by chance.

CC @atalman @malfet @crcrpar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110963
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy, https://github.com/malfet
2023-10-25 23:45:47 +00:00
8516b4d7da Automated submodule update: FBGEMM (#106168)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 3579b4d627

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106168
Approved by: https://github.com/huydhn
2023-10-25 23:32:30 +00:00
2971bdd6fc Ignore Dims of value 1 in Require_Stride_order (#111976)
Ignore dims of value 1 in require_stride_order since they don't affect layout. Previously, unsqueezed dims would always cause a copy because the stride of 0 would throw off the sorted stride order. This was causing perf problems with require_stride_order in the next commit in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111976
Approved by: https://github.com/Chillee
2023-10-25 23:14:25 +00:00
4851c973ae Update FlashAttentionV2 kernel to 02ac572 (#111886)
# Summary
We were restricted from updating to the newest version of FlashAttention because of the changes to is_causal described here: https://github.com/pytorch/pytorch/issues/108108

Prior to this PR we landed https://github.com/pytorch/pytorch/pull/111007, which enabled us to update beyond 9e5e8bc91e on FlashAttentionV2.

With this PR we have updated to commit 02ac572f3f, i.e. tag 2.3.2.

## Plans
Following this PR I plan to work more on https://github.com/pytorch/pytorch/issues/110681 in order to expose a CausalVariant attn_mask, w/ the potential for also exposing a kvcache attn_mask.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111886
Approved by: https://github.com/cpuhrsch
2023-10-25 23:07:56 +00:00
ec18ef62f4 Native c10d_functional ops (#110570)
This PR introduces a native version of c10d_functional ops. The main goal is to add collective support in AOTInductor and allow collective ops to work in multi-threaded native runtimes.

The native version also incorporated API improvements we wished to implement in Python c10d_functional:

- Removed `ranks` and `group_size` from collective op signatures which were proven to be redundant.
- Use tensor storage as opposed to `void*` to resolve in-flight work.

The native process group registration/resolution mechanism is only used for native c10d_functional in this PR. It will become the single source of truth in upcoming PRs.

The upcoming PRs will implement Inductor/AOTInductor support for c10d_functional, after which native c10d_functional will replace Python c10d_functional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110570
Approved by: https://github.com/wanchaol
2023-10-25 22:56:06 +00:00
7fe51e3e9b Add cudagraph_mark_step_begin in torch.compiler, reference in error message (#111722)
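A hedged usage sketch (requires a CUDA device; the trivial compiled function is only for illustration):

```python
# Marks the boundary between iterations so cudagraph-backed compiled
# functions know a new step has begun.
import torch

@torch.compile(mode="reduce-overhead")
def step(x):
    return x * 2

for _ in range(3):
    torch.compiler.cudagraph_mark_step_begin()
    out = step(torch.randn(8, device="cuda"))
```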
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111722
Approved by: https://github.com/ezyang, https://github.com/msaroufim
2023-10-25 21:53:21 +00:00
f2a0bef35a [export] Upstream support of (tensor, tensor list) in op returns. (#111857)
Summary:
Upstreaming from internal to oss.
Diff: D49710320

Test Plan: buck2 build mode/opt sigmoid/inference/test_gpu:package_gen

Differential Revision: D50577490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111857
Approved by: https://github.com/SherlockNoMad
2023-10-25 21:38:12 +00:00
e5049648be Add a "pt2 compliant" tag; add config to graph break on non-pt2_compliant ops (#111933)
This PR:
- adds the pt2 compliant tag. This tag specifies that the operator works
  with the PT2 compilation APIs. A custom op author should test their
  ops with opcheck if they choose to add this tag.
- adds a config for Dynamo to allow only pt2 compliant ops into the
  graph and graph break on all other OpOverload/OpOverloadPacket (see the sketch below).
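
A hedged sketch of flipping the new config; the exact flag name is assumed from this description and may differ:

```python
# Assumed flag name, per the PR description above.
import torch._dynamo.config as dynamo_config

dynamo_config.only_allow_pt2_compliant_ops = True  # graph break on untagged ops
```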

Bikeshedding help wanted on the name of the tag. It should be easily
grep-able so we can set up rules for it.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111933
Approved by: https://github.com/ezyang
ghstack dependencies: #111912, #111915, #111948
2023-10-25 21:20:59 +00:00
6365992f92 [opcheck] Add way to initialize blank failures dict (#111948)
Summary:

Fixes #111926. The workflow is:
- create a blank file with the correct name
- run a test with PYTORCH_OPCHECK_ACCEPT=1

Test Plan:
- tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111948
Approved by: https://github.com/ezyang
ghstack dependencies: #111912, #111915
2023-10-25 21:20:59 +00:00
3219b728b6 [torch.library] Clarify torch.library.define's schema (#111915)
Unlike the previous torch.library.define, this schema doesn't take a
name (the name is a part of the qualname). We separated out the qualname
from the schema in the new APIs so that they're all consistent with each
other (they all accept the qualname separately).

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111915
Approved by: https://github.com/suo, https://github.com/ezyang
ghstack dependencies: #111912
2023-10-25 21:20:54 +00:00
2d04be9a00 [torch.library] Add mechanism to add tags during define (#111912)
We extend torch.library.Library.define and torch.library.define
with a tags argument.
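
A hedged usage sketch (the op name and schema below are made-up examples; the exact keyword shape of the extended API may differ):

```python
# Defines a custom op and attaches a tag at definition time.
import torch

torch.library.define(
    "mylib::my_sin",
    "(Tensor x) -> Tensor",
    tags=(torch.Tag.pt2_compliant_tag,),
)
```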

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111912
Approved by: https://github.com/ezyang
2023-10-25 21:20:48 +00:00
ed15fa7cc2 [Kineto][NCCL][3/n] Get the NCCL communication info from PARAM_COMMS_INFO (#111846)
This diff enables the functionality to get the NCCL communication metadata from `c10::DebugInfoKind::PARAM_COMMS_INFO` available in `ThreadLocalDebugInfo`.

To keep the overhead lightweight and avoid comparing the function name on each op, we add the method `bool isNcclMeta()`, which is decided during initialization.

Differential Revision: [D50439211](https://our.internmc.facebook.com/intern/diff/D50439211/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111846
Approved by: https://github.com/aaronenyeshi
ghstack dependencies: #111842, #111843
2023-10-25 20:35:06 +00:00
1623cc5815 [easy] Make test_mandelbrot_numpy deterministic (#112042)
It fails for me locally, and I'm not the only one:
https://dev-discuss.pytorch.org/t/main-failing-unit-test-dynamicshapesmisctests/1607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112042
Approved by: https://github.com/peterbell10
2023-10-25 20:29:50 +00:00
b33220063d [TD] Historical edited files and profiling heuristics (#111510)
Adds files for the heuristics and run them in trial mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111510
Approved by: https://github.com/ZainRizvi
2023-10-25 19:54:17 +00:00
36b3e1789a Docker release build don't include build suffix in the release (#112046)
This build is used in releases as far as I know. For releases we don't need the suffix.

Test in Release:
```
python3 .github/scripts/generate_pytorch_version.py
2.1.1+cpu
python3 .github/scripts/generate_pytorch_version.py --no-build-suffix
2.1.1
```

Test with nightly:
```
python3 .github/scripts/generate_pytorch_version.py --no-build-suffix
2.2.0.dev20231025
```

With suffix:
```
python3 .github/scripts/generate_pytorch_version.py
2.2.0.dev20231025+cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112046
Approved by: https://github.com/huydhn
2023-10-25 19:40:01 +00:00
b54ab57522 Document torch.from_file and fix UntypedStorage.from_file docs (#111688)
Fixes https://github.com/pytorch/pytorch/issues/37439

Also threads through filename so it is accessible via `t.storage().filename`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111688
Approved by: https://github.com/albanD
2023-10-25 19:28:11 +00:00
f3b42ab5b9 feat(dynamo): remove inconsistent tracing histories by acknowledging possibility of inconsistent side-effects (#110804)
Fixes https://github.com/pytorch/pytorch/issues/110765

CC @voznesenskym  @yanboliang @Fidget-Spinner @anijain2305 @soulitzer @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110804
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2023-10-25 19:27:11 +00:00
cb4e62a498 Fix broken lint on trunk (#112051)
Forward fix lint error introduced by https://github.com/pytorch/pytorch/pull/111146/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112051
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/malfet
2023-10-25 19:18:54 +00:00
b365acba28 [ONNX] A better way to safe guard 2GB model serialization (#111984)
Summary
- faster than the previous try-catch.
- more stable than the previous try-catch. In some circumstances, serializing models > 2GB into a single protobuf file ends up with a corrupted file without raising an exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111984
Approved by: https://github.com/justinchuby
2023-10-25 19:18:37 +00:00
6b7b90462f [aotinductor] Turn clang warning ignored-optimization-argument into error (#112008)
Now we compile the generated wrapper C++ code with clang in fbcode.
When the Model's run_impl function is too large, clang will issue
a warning like:

  Function foo is too big to optimize [-Wignored-optimization-argument]

and compile the code without any optimization.

I think we may want to be more proactive in such cases. If the
generated C++ code is too complex or too large to be optimized,
we would like to be notified loudly with errors, so that we
can figure out ways to address the issue.

Later, if we feel that turning this warning into an error is too
aggressive, we can add a config to disable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112008
Approved by: https://github.com/desertfire, https://github.com/htyu
2023-10-25 19:14:27 +00:00
7e654c8f88 Revert "WIP / TST: allow testing torch._numpy under Dynamo (#110401)"
This reverts commit 5ed4a423ded14138f1a724eff15ccd14648f6c49.

Reverted https://github.com/pytorch/pytorch/pull/110401 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing dynamo job in trunk 5ed4a423de ([comment](https://github.com/pytorch/pytorch/pull/110401#issuecomment-1779811943))
2023-10-25 18:21:16 +00:00
e9804aaacc Fix unit tests and add logging for Inductor intra-graph reordering (#111981)
1. Fix code to make unit tests pass (incl. collect_env issue called out by @int3  in https://github.com/pytorch/pytorch/pull/108091#discussion_r1362901686).
2. Add logging for Inductor intra-graph reordering passes (`TORCH_LOGS="overlap"`), for easier debugging. Example log:
```
[rank0]:[2023-10-24 16:28:26,446] [0/0] torch._inductor.comms.__overlap: [DEBUG] ==== Visualize overlap before reordering pass <function reorder_compute_for_overlap at 0x7fa68c5568e0> ====
[rank0]:[2023-10-24 16:28:26,446] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf0)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf1)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] InPlaceHint (size=[4, 4], stride=[4, 1]) (buf2)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] AllReduce (size=[4, 4], stride=[4, 1]) (buf3)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] Wait (size=[4, 4], stride=[4, 1]) (buf4)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf5)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] InPlaceHint (size=[4, 4], stride=[4, 1]) (buf6)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] AllReduce (size=[4, 4], stride=[4, 1]) (buf7)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] Wait (size=[4, 4], stride=[4, 1]) (buf8)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf9)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf10)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf11)
[rank0]:[2023-10-24 16:28:26,447] [0/0] torch._inductor.comms.__overlap: [DEBUG] Est. runtime (ms): 0.000228

[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] ==== Visualize overlap after reordering pass <function reorder_compute_for_overlap at 0x7fa68c5568e0> ====
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] InPlaceHint (size=[4, 4], stride=[4, 1]) (buf2)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] AllReduce (size=[4, 4], stride=[4, 1]) (buf3)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] | ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf0)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] | ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf1)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] | ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf9)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] Wait (size=[4, 4], stride=[4, 1]) (buf4)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf5)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] InPlaceHint (size=[4, 4], stride=[4, 1]) (buf6)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] AllReduce (size=[4, 4], stride=[4, 1]) (buf7)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] Wait (size=[4, 4], stride=[4, 1]) (buf8)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] ComputedBuffer (size=[4, 4], stride=[4, 1]) (buf10)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] ExternKernelOut (extern_kernels.mm) (size=[4, 4], stride=[4, 1]) (buf11)
[rank0]:[2023-10-24 16:28:26,448] [0/0] torch._inductor.comms.__overlap: [DEBUG] Est. runtime (ms): 0.000217
```
The `| SomeComputeOp` means the compute op is overlapped with the comm op above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111981
Approved by: https://github.com/wanchaol
2023-10-25 18:19:43 +00:00
9d4dbebc34 Add support to ExportedProgram as input to torch.onnx.dynamo_export (#111497)
Fixes #109889

This PR adds `torch.export.export` as another `FXGraphExtractor` implementation. `torch.onnx.dynamo_export` automatically uses this new FX tracer when a `torch.export.ExportedProgram` is specified as `model`

The implementation is backward compatible; non-`ExportedProgram` models are handled exactly the same way as before.
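
A hedged end-to-end sketch of the new input path (the tiny module and input are placeholders):

```python
# Export first, then hand the ExportedProgram to the ONNX dynamo exporter.
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

example_input = torch.randn(2, 3)
ep = torch.export.export(M(), (example_input,))
# dynamo_export now accepts the ExportedProgram directly as `model`
onnx_program = torch.onnx.dynamo_export(ep, example_input)
```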
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111497
Approved by: https://github.com/BowenBao
2023-10-25 18:11:19 +00:00
07ccaabee7 Make profiler function will be ignored warn only once (#111921)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111921
Approved by: https://github.com/mlazos, https://github.com/oulgen
2023-10-25 17:45:31 +00:00
2b952834c7 [pytorch][PR] [Inductor][FX passes] Pre grad batch relu fusion (#111146)
Summary: We detect independent relu operators and fuse them in the pre-grad pass.

Test Plan:
### unit test
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498608558485

### Inline cvr
f479655232
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch_group
```
before vs after transformation
https://www.internalfb.com/intern/diffing/?paste_number=851907099

```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch_group -c
```

P852036786

Differential Revision: D50207610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111146
Approved by: https://github.com/yanboliang
2023-10-25 17:37:39 +00:00
721b1a6683 s390x vectorization: implement atanh for complex vectorized data (#111653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111653
Approved by: https://github.com/ezyang
2023-10-25 17:36:34 +00:00
49489d478b Update onnx 1.15.0rc2 submodule (#111964)
Update ONNX submodule to the latest RC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111964
Approved by: https://github.com/thiagocrepaldi
2023-10-25 16:41:45 +00:00
5ce8002d24 Revert "Remove deprecated fbgemm operators (#104535)"
This reverts commit 57c7aa12dbf71617bd21fe7e076df8e823b5b7bb.

Reverted https://github.com/pytorch/pytorch/pull/104535 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/104535#issuecomment-1779650412))
2023-10-25 16:34:16 +00:00
5846705e36 Trigger specialization when you call size()/stride() from C++ (#111935)
This should be the last of the "it used to work with static shapes but
it doesn't work with dynamic shapes" hard errors.  Now we will just
specialize if you hit it from C++.

The strategy here is a bit clever.  We shunt the size() call to Python
binding if an error would have occurred.  Importantly, we already have
logic to make sure the newly allocated ints stay live for the duration
of the ArrayRef access.

storage_offset is intentionally omitted because there are some problems
with it.  I will fix them next.

This should let us get rid of the aotautograd_static test configuration.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111935
Approved by: https://github.com/zou3519
2023-10-25 16:17:55 +00:00
5ed4a423de WIP / TST: allow testing torch._numpy under Dynamo (#110401)
Use conditional imports: when running under dynamo, import the original NumPy, not torch._numpy. That is what we want to trace, not our implementation.

With this, the test suite passes with and without `PYTORCH_TEST_WITH_DYNAMO=1` (modulo a couple of test modules which are not meant to be compiled, e.g. `test_nep50_examples`). There are two new decorators, `x{fail,pass}ifTorchDynamo`, the `xpass` in most cases indicates a graph break and a fallback to eager for things we do not implement.
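
A sketch of the conditional-import pattern described above (the exact gating used by the test suite may differ; the env-var check here mirrors `PYTORCH_TEST_WITH_DYNAMO=1`):

```python
import os

if os.getenv("PYTORCH_TEST_WITH_DYNAMO") == "1":
    import numpy as np         # under dynamo, trace the real NumPy
else:
    import torch._numpy as np  # otherwise exercise our implementation

x = np.arange(6).reshape(2, 3)
print(np.sum(x, axis=0))
```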

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110401
Approved by: https://github.com/lezcano
2023-10-25 16:02:16 +00:00
6fd3659391 Make require_stride_order peek into AliasedLayout (#111681)
Summary:

`require_stride_order` doesn't know how to handle storage with `AliasedLayout`. It always resorts to a copy even when the view refers to a storage with `FixedLayout`. This causes an unnecessary allocation + copy for collective outputs. Peeking into `AliasedLayout` in `require_stride_order` seems to be the proper way to address the issue.

Original program:
```python
import tempfile

import torch
import torch.distributed as dist
from torch.distributed._functional_collectives import *  # noqa
from torch._inductor.utils import run_and_get_triton_code

def func(arg: torch.Tensor) -> torch.Tensor:
    buf0 = arg + 42
    out0 = torch.ops.c10d_functional.all_reduce(buf0, "avg", "default", [0], 1)
    out0 = torch.ops.c10d_functional.wait_tensor(out0)
    return out0

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmpf:
        dist.init_process_group(
            backend="nccl", init_method=f"file://{tmpf.name}", rank=0, world_size=1
        )
        device = torch.device("cuda:0")

        compiled = torch.compile(func)
        print(run_and_get_triton_code(compiled, torch.rand(4, 4, device=device)))

        torch.cuda.synchronize()
        dist.destroy_process_group()
```

Before:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (4, 4), (4, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0) # no-op to ensure context
        buf0 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        # Source Nodes: [buf0], Original ATen: [aten.add]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_0.run(arg0_1, buf0, 16, grid=grid(16), stream=stream0)
        del arg0_1
        buf1 = buf0; del buf0  # reuse
        buf2_pg = c10d._find_or_create_pg_by_ranks_and_tag('default', [0], 1)
        buf2 = buf1
        buf2_work = dist.all_reduce(buf2, async_op=True, group=buf2_pg, op=fun_col_impl._str_to_reduce_op('avg'))
        fun_col_impl._register_tensor_work(buf2, buf2_work)
        buf1 = _wait_tensor(buf1)
        buf3 = buf1
        buf4 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        # Source Nodes: [out0_1], Original ATen: [c10d_functional.wait_tensor]
        triton_poi_fused_wait_tensor_1.run(buf3, buf4, 16, grid=grid(16), stream=stream0)
        del buf1
        del buf3
        return (buf4, )
```

After:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (4, 4), (4, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0) # no-op to ensure context
        buf0 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        # Source Nodes: [buf0], Original ATen: [aten.add]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_0.run(arg0_1, buf0, 16, grid=grid(16), stream=stream0)
        del arg0_1
        buf1 = buf0; del buf0  # reuse
        buf2_pg = c10d._find_or_create_pg_by_ranks_and_tag('default', [0], 1)
        buf2 = buf1
        buf2_work = dist.all_reduce(buf2, async_op=True, group=buf2_pg, op=fun_col_impl._str_to_reduce_op('avg'))
        fun_col_impl._register_tensor_work(buf2, buf2_work)
        buf1 = _wait_tensor(buf1)
        buf3 = buf1
        del buf3
        return (buf1, )
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/111681
Approved by: https://github.com/jansel
2023-10-25 15:44:09 +00:00
ac08b10d60 [pytorch] bfloat16 support in erfinv (#111257)
Differential Revision: D50280766
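
A quick usage check, assuming a build that includes this change (erfinv expects inputs in (-1, 1)):

```python
import torch

x = torch.linspace(-0.9, 0.9, 5, dtype=torch.bfloat16)
print(torch.erfinv(x))
print(torch.special.erfinv(x))  # same op via the special namespace
```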

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111257
Approved by: https://github.com/jianyuh
2023-10-25 15:43:48 +00:00
247f39f603 Revert "Fix inconsistency of max_split_size between DeviceStats and CUDAAllocatorConfig (#111555)"
This reverts commit 0b424ee0b7bfe09e0a438a63e8336e95eea85901.

Reverted https://github.com/pytorch/pytorch/pull/111555 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111555#issuecomment-1779438172))
2023-10-25 14:44:18 +00:00
8253e0524c Add "device not supported" assert to inductor (#112001)
Fixes #111999

Adds an assert that provides a more informative error message

For example, when running a compiled function with mps (currently unsupported):
```
...
  File "/Users/andrew.hu/Desktop/pytorch/torch/_inductor/graph.py", line 927, in init_wrapper_code
    assert wrapper_code_gen_cls is not None, f"Device {device_type} not supported"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: Device mps not supported
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112001
Approved by: https://github.com/peterbell10
2023-10-25 14:19:37 +00:00
88244cd7a9 [torchx] Do not terminate parent process if exit code from child isn't valid (#111961)
Summary:
There's no reason to terminate the parent process trying to find the name of the signal received by the child process.
Let's make sure this is handled properly, which will then ensure that the parent process can process child failures.

Test Plan: Unit tests.

Reviewed By: aaronenyeshi

Differential Revision: D50615419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111961
Approved by: https://github.com/aaronenyeshi
2023-10-25 07:13:28 +00:00
28ebe5df7a yolov3: reduce batch size due to OOM (#111959)
yolov3 w/ cudagraphs (known to use more memory) is failing perf test due to OOM (https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Mon,%2016%20Oct%202023%2020:19:47%20GMT&stopTime=Mon,%2023%20Oct%202023%2020:19:47%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=main&lCommit=0b424ee0b7bfe09e0a438a63e8336e95eea85901&rBranch=main&rCommit=29048be41ca3aa8974795d93b9ea9fd6dee415fc)

I'm reducing the batch size from 16 to 8 to keep the same batch size for all yolov3 HUD benchmarks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111959
Approved by: https://github.com/xuzhao9
2023-10-25 06:18:53 +00:00
5120c97f32 Revert "Add support to ExportedProgram as input to torch.onnx.dynamo_export (#111497)"
This reverts commit 4f42edfb6e5b703eec2a14b8933090646702c5a2.

Reverted https://github.com/pytorch/pytorch/pull/111497 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it is failing ONNX test in trunk 4f42edfb6e, possibly a landrace ([comment](https://github.com/pytorch/pytorch/pull/111497#issuecomment-1778519212))
2023-10-25 05:07:00 +00:00
52eec50d31 [2D] Enable 2D optimizer set_state_dict() (#111778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111778
Approved by: https://github.com/fegin
ghstack dependencies: #111774
2023-10-25 04:27:13 +00:00
d8a9b6640e [Kineto][NCCL][2/n] Add records NCCL meta to more collective functions (#111843)
This diff records NCCL metadata for more commonly used collective functions.

NOTE: the coalesced NCCL are not covered: https://fburl.com/code/ihgqqvg8 and how to support them needs further discussion.

Differential Revision: [D50439232](https://our.internmc.facebook.com/intern/diff/D50439232/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111843
Approved by: https://github.com/aaronenyeshi, https://github.com/fduwjj
ghstack dependencies: #111842
2023-10-25 03:49:09 +00:00
43d0ae4822 [Kineto][NCCL][1/n] Add the world size info in NCCL metadata (#111842)
This diff adds the world size info in NCCL metadata, as we need the information to calculate the algorithmic bandwidth and bus Bandwidth.

Differential Revision: [D50439185](https://our.internmc.facebook.com/intern/diff/D50439185/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111842
Approved by: https://github.com/aaronenyeshi, https://github.com/fduwjj
2023-10-25 03:48:55 +00:00
bf998a2c5d [quant][pt2e][be] Cleanup observer insertion logic (#111828)
Summary:
As titled: after the SharedQuantizationSpec bug fix we do some checks beforehand, which simplifies the logic when we insert observers.

Test Plan:
python test/test_quantization.py TestQuantizePT2E

CIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111828
Approved by: https://github.com/kimishpatel
ghstack dependencies: #111827
2023-10-25 03:48:36 +00:00
8dc4887e84 [2D] Enable 2D optimizer get_state_dict() (#111774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111774
Approved by: https://github.com/fegin
2023-10-25 03:44:14 +00:00
6625269e14 [vision hash update] update the pinned vision hash (#111982)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111982
Approved by: https://github.com/pytorchbot
2023-10-25 03:39:09 +00:00
cyy
f9cc7f6a1c Enable Wno-unused-private-field,Wunused-lambda-capture and fix CUDA warnings (#110856)
This PR enables Wno-unused-private-field,Wunused-lambda-capture  and some CUDA warnings were fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110856
Approved by: https://github.com/albanD, https://github.com/malfet
2023-10-25 03:39:05 +00:00
9e6c97890b Dynamo runner: add FSDP handcrafted module wrapping policy (#111505)
The default size-based auto-wrap policy may not be representative of actual usage of the models. We add support for a few handpicked models, and fall back to the size-based policy.

sample command:
`PYTHONPATH=~/benchmark/ python benchmarks/dynamo/torchbench.py -dcuda --training --backend=inductor --multiprocess --performance --only nanogpt --fsdp`

1.257x
1.256x
1.257x
1.252x
1.257x
1.262x
1.258x
1.272x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111505
Approved by: https://github.com/H-Huang, https://github.com/xuzhao9
2023-10-25 03:05:31 +00:00
a29a844938 [Inductor] Support top level constants in user defined triton kernels (#111970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111970
Approved by: https://github.com/jansel
ghstack dependencies: #111956
2023-10-25 02:43:51 +00:00
bb550b25c9 [Inductor] Support user defined triton kernels calling other triton kernels and activation functions (#111956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111956
Approved by: https://github.com/jansel
2023-10-25 02:39:43 +00:00
b570320364 [cpu][inductor] improve cpu vec implementations of log (#111898)
Fixes #110611.

TorchInductor's current `log` implementations call `sleef` functions in `aten::Vec`, which show worse performance than ATen's `log` implementations that invoke `MKL` functions. The reason is that the `sleef` algorithms sacrifice performance in order to achieve higher precision. This PR changes TorchInductor's `log` implementations from the `sleef` functions with a `1.0` ULP error bound to the ones with a `3.5` ULP error bound.

**Performance**
Machine: ICX

The original perf number, perf with `Sleef_logf16_u10`:
```bash
numactl -C0 python test.py
log
eager:    368.8463559374213
compiled: 616.8672097846866
logit
eager:    565.499295014888
compiled: 1010.4096410796046
```

Perf with `Sleef_logf16_u35`:
```bash
numactl -C0 python test.py
log
eager:    364.8629770614207
compiled: 360.2141812443733
logit
eager:    562.3160391114652
compiled: 545.2622110024095
```

**Accuracy**
error_bound | tol=1e-6 | tol=1e-7
-- | -- | --
1.0 ULP | PASS | FAIL
3.5 ULP | PASS | FAIL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111898
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-10-25 01:26:39 +00:00
e574a8ab55 [dynamo] Add sanity checks to ensure no double-wrapping of FakeTensors produced by the current graph (#111913)
Partially fixes: https://github.com/pytorch/pytorch/issues/111873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111913
Approved by: https://github.com/ezyang
2023-10-25 01:18:32 +00:00
4f42edfb6e Add support to ExportedProgram as input to torch.onnx.dynamo_export (#111497)
Fixes #109889

This PR adds `torch.export.export` as another `FXGraphExtractor` implementation. `torch.onnx.dynamo_export` automatically uses this new FX tracer when a `torch.export.ExportedProgram` is specified as `model`

The implementation is backward compatible, thus non-`ExportedProgram` models are handled exactly the same way as before
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111497
Approved by: https://github.com/BowenBao
2023-10-25 00:17:43 +00:00
6e2dfb360b [quant][be] Clean up prepare code (#111827)
Summary:
att

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111827
Approved by: https://github.com/andrewor14
2023-10-25 00:14:59 +00:00
3acaf8564d [easy] use number of param bytes as the chunk size if it's not provided (#111844)
Summary: ATT

Test Plan: CI

Differential Revision: D50572228

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111844
Approved by: https://github.com/zyan0, https://github.com/houseroad
2023-10-24 23:56:33 +00:00
ad4971c0b1 Delete deepcopied model after use in benchmark to reduce memory consumption (#111868)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111868
Approved by: https://github.com/msaroufim, https://github.com/thiagocrepaldi
ghstack dependencies: #111867, #111593
2023-10-24 23:44:14 +00:00
a8760f1b42 [Quantization] Add a test for QAT + PTQ selective quantization in (#111689)
xnnpack quantizer

Summary:
For some workflows you want to quantize some parts of the model via QAT
and then continue eager mode training. After training, you want to
export the whole model and perform PTQ on the rest.

Test Plan:
test added


Differential Revision: [D50510480](https://our.internmc.facebook.com/intern/diff/D50510480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111689
Approved by: https://github.com/jerryzh168
2023-10-24 23:25:38 +00:00
192477b5ba Enable flake8-bugbear B020 lint (#110823)
Fixes part of https://github.com/pytorch/pytorch/issues/106571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110823
Approved by: https://github.com/Skylion007
2023-10-24 22:43:47 +00:00
b600aed237 [TD] Make test class times available during CI (#111836)
Makes the test class durations uploaded by https://github.com/pytorch/test-infra/pull/4670 available during CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111836
Approved by: https://github.com/clee2000
2023-10-24 21:40:10 +00:00
1dd57082a4 [inductor] Decompose boolean min/max into all/any (#110311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110311
Approved by: https://github.com/lezcano
ghstack dependencies: #110310
2023-10-24 21:33:53 +00:00
46e80ce58a [ATen] Support multi dim any and all reductions (#110310)
This adds a new overload to `all` and `any` with support for multiple reduction dims.
```
all.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
any.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
```
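
Usage sketch of the new overload, assuming a build with this change:

```python
import torch

x = torch.rand(2, 3, 4) > 0.5
print(torch.any(x, dim=(0, 2)).shape)                # torch.Size([3])
print(torch.all(x, dim=(0, 2), keepdim=True).shape)  # torch.Size([1, 3, 1])
```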
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110310
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/justinchuby
2023-10-24 21:33:53 +00:00
9849ef1253 Remove requires_grad_info from AOTDispatch (#110773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110773
Approved by: https://github.com/bdhirsh
2023-10-24 21:31:34 +00:00
5344468712 Revert "[dynamo] Properly track user-defined types for type() (#110794)"
This reverts commit ad4ccf96896bdf0f098bd9192f8c5a019fddf4c6.

Reverted https://github.com/pytorch/pytorch/pull/110794 on behalf of https://github.com/ezyang due to looks like this actually fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/110794#issuecomment-1778002262))
2023-10-24 20:42:26 +00:00
4839f319da Apply same 'pick_grad' on generating fp64 reference outputs (#111593)
To lower memory consumption for inference mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111593
Approved by: https://github.com/msaroufim, https://github.com/thiagocrepaldi
ghstack dependencies: #111867
2023-10-24 20:16:53 +00:00
ec2e0712db [ONNX] Enable onnx inlining in benchmark for >2GB models (#111867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111867
Approved by: https://github.com/thiagocrepaldi
2023-10-24 20:16:53 +00:00
5da903ff78 [qnnpack] suppress empty translation unit warning (#111475)
Summary: Spotted this while compiling on a Mac M1. The code in these files is gated behind #ifdef and requires SSE, so when building for ARM these files become empty.

Test Plan: CI

Differential Revision: D50407334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111475
Approved by: https://github.com/digantdesai
2023-10-24 20:08:58 +00:00
b0087b4cf7 Revert "record_function: remove legacy internal operators (#72303)"
This reverts commit 0be84bb41e6f527229b9f50ce9937038a0c14ffe.

Reverted https://github.com/pytorch/pytorch/pull/72303 on behalf of https://github.com/izaitsevfb due to Apparently _record_function_enter is still used internally at Meta in several places and in lots of internal tests. ([comment](https://github.com/pytorch/pytorch/pull/72303#issuecomment-1777942975))
2023-10-24 20:01:14 +00:00
e72fcd382b [aotinductor] Fix a problem when the generated graph is empty (#111822)
Summary: For https://github.com/pytorch/pytorch/issues/111691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111822
Approved by: https://github.com/chenyang78
2023-10-24 20:00:27 +00:00
b01e87d0c0 [BE][EZ] Use setup-ssh actions from test-infra (#111922)
I thought I had migrated all the actions to this one, but overlooked the Windows binary builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111922
Approved by: https://github.com/atalman
2023-10-24 19:55:58 +00:00
ddcf9c050b [Inductor] Support calling user defined kernels with different type of arguments (#111939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111939
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #111770, #111808
2023-10-24 19:49:48 +00:00
4ac848cf77 [dynamo] Perf (MapHigherOrderVariable): do not unnecessarily get_real_value (#111920)
`get_real_value` will run the real tensor computation via the fx graph, which could be really expensive.

Let's just do the sensible thing by running the fx graph on the fake value

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111920
Approved by: https://github.com/ezyang, https://github.com/zou3519
2023-10-24 19:44:25 +00:00
3c46e859aa [TD] Enable trial mode for new heuristics (#111858)
This lets one get metrics from a new heuristic and evaluate its results without having it actually reorder the tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111858
Approved by: https://github.com/clee2000
2023-10-24 19:13:07 +00:00
7bec7d95e4 Automate release only changes, binary_linux_test.sh (#111862)
Automates following release only change:
https://github.com/pytorch/pytorch/pull/108688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111862
Approved by: https://github.com/osalpekar
2023-10-24 18:59:34 +00:00
d92459617e Automate passing conda-pytorchbot-token-test for release (#111821)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at a3b51df</samp>

This pull request adds support for testing binary uploads to Anaconda Cloud using different tokens and channels based on the branch name. It modifies the `.github/workflows/_binary-upload.yml` workflow and several other workflows that use the `.github/templates/upload.yml.j2` template. It also adds a new secret variable `conda-pytorchbot-token-test` to store the test token.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111821
Approved by: https://github.com/osalpekar, https://github.com/huydhn
2023-10-24 18:58:47 +00:00
cd034e1793 [HigherOrderOp] don't manually set input for cond (#111611)
We set manually_set_graph_inputs to False for CondHigherOrder. After that, it became necessary to deduplicate the inputs. We'll add pytree tests in a follow-up PR.

Test Plan:
existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111611
Approved by: https://github.com/zou3519
ghstack dependencies: #111610
2023-10-24 18:56:23 +00:00
a0043d4840 [PyTorch] AOTI: cache dtypes and device types at DSO load (#111820)
Calling the `aoti_torch_{device_type,dtype}` functions on
each iteration can impose high costs on overhead-bound CPU models
because they can't be inlined across a DSO boundary. If we call them
on load, we can use simple load instructions at run time.

Differential Revision: [D50563682](https://our.internmc.facebook.com/intern/diff/D50563682/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111820
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #111815, #111816
2023-10-24 18:37:26 +00:00
de2b41bbbf [PyTorch] AOTI: override VecISA selection in fbcode (#111816)
The OSS selection mechanism does not work internally, and doesn't make sense when the machine building the .so and the machine executing it may be different anyway.

Differential Revision: [D50140024](https://our.internmc.facebook.com/intern/diff/D50140024/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111816
Approved by: https://github.com/jansel, https://github.com/msaroufim
ghstack dependencies: #111815
2023-10-24 18:37:26 +00:00
6afd00a318 [PyTorch] AOTI: use array of constants (#111815)
We continue to allow the user to set clients with a map, but under the hood we use an array of constants.

model_container thought it was OK to hand over the map, assuming we just
kept a pointer, and then mutate the map later; I had to fix that. I
hope there aren't other sites that do the same thing...

Differential Revision: [D50111512](https://our.internmc.facebook.com/intern/diff/D50111512/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111815
Approved by: https://github.com/jansel, https://github.com/desertfire
2023-10-24 18:37:18 +00:00
b70efde3ad [easy] Reapply D49842542 (remove pessimizing move) (#111910)
This fixes a pessimizing move; for some reason the linked diff was
allowed to land with this change applied only to the internal fork of pytorch.

Differential Revision: [D50599188](https://our.internmc.facebook.com/intern/diff/D50599188/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111910
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-10-24 17:51:16 +00:00
b89c2202bc [pytorch-vulkan] Support zero-dim (#111680)
Summary:
1. Add zero-dim (Tensor with 1 element) support.
2. New operator `_local_scalar_dense` that maps a zero-dim tensor to a Scalar (illustrated after this list)
3. `sum_dim`:
3.1. Add zero-dim support.
3.2. Fix bug in negative indices when handling multi-dim reduction call
3.3. Add unit tests to cover the new cases
4. Add `aten::sum` support.
5. Fix bug in `add_tensor` (and other binary ops): when `other` is zero-dim, we now use broadcast instead.
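
For reference, a CPU illustration of the semantics the Vulkan backend now supports (the Vulkan dispatch itself is not shown):

```python
import torch

t = torch.tensor(3.0)      # zero-dim tensor holding one element
s = t.item()               # routed through _local_scalar_dense
total = torch.sum(t)       # aten::sum on a zero-dim tensor
y = torch.rand(2, 2) + t   # zero-dim `other` broadcasts in binary ops
```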

Test Plan:
## Devserver

Full Paste: P858982150

```
[yipjustin@31799.od ~/fbsource (8593e7559)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan  -c pt.has_backtraces=1    //xplat/caffe2:pt_vulkan_api_test_bin  --
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/90cad0ff-ac98-4dbf-8d6f-0e419c06208d
Network: Up: 43KiB  Down: 1.4MiB  (reSessionID-dfc3a318-fd1a-4ad6-b077-c454ebb4c6a8)
Jobs completed: 6. Time elapsed: 26.4s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 1, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 385 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 385 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.zero_size_tensor
[       OK ] VulkanAPITest.zero_size_tensor (9 ms)
[ RUN      ] VulkanAPITest.zero_dim_tensor_1
[       OK ] VulkanAPITest.zero_dim_tensor_1 (84 ms)
[ RUN      ] VulkanAPITest.zero_dim_tensor_2
[       OK ] VulkanAPITest.zero_dim_tensor_2 (22 ms)
[ RUN      ] VulkanAPITest.local_scalar_dense
[       OK ] VulkanAPITest.local_scalar_dense (10 ms)
...
[       OK ] VulkanAPITest.lstm_prepack_success (2 ms)
[ RUN      ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7484: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 385 tests from VulkanAPITest (46915 ms total)
[----------] Global test environment tear-down
[==========] 385 tests from 1 test suite ran. (46915 ms total)
[  PASSED  ] 382 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] VulkanAPITest.conv2d_pw_prepack
[  FAILED  ] VulkanAPITest.conv2d_pw_prepack_bc
 2 FAILED TESTS
  YOU HAVE 7 DISABLED TESTS
```

## M1 MAC

P859975219
```
buck run //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64   --target-platforms ovr_config//platform/macos:arm64-fbsource -- --gtest_filter="*"
Using additional configuration options from .buckconfig.local
Building: finished in 0.2 sec (100%) 269/2875 jobs, 0/2875 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 384 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 384 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.zero_size_tensor
[       OK ] VulkanAPITest.zero_size_tensor (40 ms)
[ RUN      ] VulkanAPITest.zero_dim_tensor_1
[       OK ] VulkanAPITest.zero_dim_tensor_1 (7 ms)
[ RUN      ] VulkanAPITest.zero_dim_tensor_2
[       OK ] VulkanAPITest.zero_dim_tensor_2 (1 ms)
[ RUN      ] VulkanAPITest.local_scalar_dense
[       OK ] VulkanAPITest.local_scalar_dense (0 ms)
[ RUN      ] VulkanAPITest.copy_to_texture
[       OK ] VulkanAPITest.copy_to_texture (45 ms)
...
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 384 tests from VulkanAPITest (5127 ms total)

[----------] Global test environment tear-down
[==========] 384 tests from 1 test suite ran. (5127 ms total)
[  PASSED  ] 382 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[  FAILED  ] 1 test, listed below:
[  FAILED  ] VulkanAPITest.normal_large

 1 FAILED TEST
  YOU HAVE 5 DISABLED TESTS
```

Differential Revision: D50347338

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111680
Approved by: https://github.com/SS-JIA
2023-10-24 17:29:56 +00:00
062850f4b9 Remove TorchText from RELEASE.MD (#111940)
TorchText development has been paused, so it should no longer be considered a requirement for the release process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111940
Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/kit1980
2023-10-24 17:28:33 +00:00
f97c2dabd9 Move negative index checking to common.py - Fix issue 97365 (#108690)
Fixes https://github.com/pytorch/pytorch/issues/97365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108690
Approved by: https://github.com/lezcano
2023-10-24 17:27:54 +00:00
f32eb9bc55 fix missing non-contiguous output handling for add op (#111758)
Patch for https://github.com/pytorch/pytorch/pull/104689, which is missing similar handling for the add op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111758
Approved by: https://github.com/karthiknagasub, https://github.com/ezyang
2023-10-24 17:27:50 +00:00
0c64ac0d3a Add tests for strided layout in factory functions (#111463)
Fixes #111222
This pull request adds tests for factory functions that create tensors with a strided layout. The tests are added to the `test_ops.py` file and check the behavior of the `empty`, `zeros`, `ones`, and `rand` factory functions when used with the `layout=torch.strided` argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111463
Approved by: https://github.com/lezcano
2023-10-24 17:05:44 +00:00
fb7047e1a1 Place local_used_map_dev_ on CPU for MTIA (#111581)
Summary:
The dist backend used on MTIA doesn't support int32 allreduce for now. The local_used_map_dev_ has to be placed on CPU.

Test Plan: See diff D50387636

Differential Revision: D50460304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111581
Approved by: https://github.com/fduwjj
2023-10-24 17:02:44 +00:00
ad3572a5dc Unify torch.SymInt and torch.types.SymInt (#110573)
Per @ezyang, this should be fine

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110573
Approved by: https://github.com/ezyang
2023-10-24 16:17:23 +00:00
099efd8346 Fix reduction + () + multi-level reduction optimization (#111781)
In https://github.com/pytorch/pytorch/pull/111122, an optimization is introduced for reduction() + () + multi-level reduction. In this case, we make the first-level reduction ranges of a multi-level reduction the same as the previous reduction's ranges, so that Inductor has a better chance of fusing the first reduction and the first-level reduction of the multi-level reduction kernel together.

There is a corner case where the multi-level reduction kernel has `keepdim=True`. In this case, the ranges of the multi-level reduction kernel are not empty, and the dim info needs to be used to create the inner loader of the first-level reduction kernel. To keep the logic simple, for now we simply disable the optimization when `keepdim=True`.

Differential Revision: [D50544876](https://our.internmc.facebook.com/intern/diff/D50544876)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111781
Approved by: https://github.com/malfet, https://github.com/jansel
2023-10-24 15:42:21 +00:00
a887ad0b60 Add continue-on-error if ssh step is failing (#111916)
This is a debugging step and should not cause the whole workflow to fail. Hence adding `continue-on-error`, which prevents a job from failing when a step fails; it is set to true to allow the job to pass when this step fails.
Failure:
https://github.com/pytorch/pytorch/actions/runs/6627941257/job/18003997514?pr=111821

Example:
```
Run seemethere/add-github-ssh-key@v1
  with:
    GITHUB_TOKEN: ***
    activate-with-label: true
    label: with-ssh
    remove-existing-keys: true
  env:
    ALPINE_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine
    ANACONDA_USER: pytorch
    AWS_DEFAULT_REGION: us-east-1
    BUILD_ENVIRONMENT: windows-binary-conda
    GITHUB_TOKEN: ***
    PR_NUMBER:
    SHA1: e561cd9d253d840834d8bbef4ec98ad868ba01e4
    SKIP_ALL_TESTS: 1
    PYTORCH_ROOT: C:\actions-runner\_work\pytorch\pytorch/pytorch
    BUILDER_ROOT: C:\actions-runner\_work\pytorch\pytorch/builder
    PACKAGE_TYPE: conda
    DESIRED_CUDA: cu118
    GPU_ARCH_VERSION: 11.8
    GPU_ARCH_TYPE: cuda
    DESIRED_PYTHON: 3.9
ciflow reference detected, attempting to extract PR number
Error: The request could not be processed because too many files changed
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111916
Approved by: https://github.com/malfet
2023-10-24 14:53:40 +00:00
1ddbdb5144 Optest: Allow parametrized names for xfails checks (#111797)
CC @zou3519

This is hopefully a fix for https://github.com/pytorch/vision/pull/8058/files#r1368570541. It seems to work for me locally, but maybe there's a more elegant way of handling this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111797
Approved by: https://github.com/zou3519
2023-10-24 11:35:27 +00:00
4f79161452 Add tensor parallel sharding APIs for torch export (#111236)
Add libraries to apply tensor parallel transformation to an exported program.

Differential Revision: [D50214796](https://our.internmc.facebook.com/intern/diff/D50214796/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111236
Approved by: https://github.com/wanchaol
2023-10-24 10:07:14 +00:00
ebcc42ea10 [Dist] Fix coalescing manager + DETAIL debug mode (#111878)
Fix https://github.com/pytorch/pytorch/issues/109520 by adding it to
ProcessGroupWrapper.

Differential Revision: [D50583403](https://our.internmc.facebook.com/intern/diff/D50583403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111878
Approved by: https://github.com/fegin, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-24 07:47:39 +00:00
babb6c6ac4 nccl flight recorder (#110960)
Keep a buffer of the last 16384 nccl work actions, including the stack
trace that launched the event.

When torch._C._distributed_c10d._dump_nccl_trace() is called, it can dump these to
a pickled archive.

For each action we get:
process_group_id, seq_id, collective_name, size_of_first_tensor, stack trace

state - issued, started, completed (based on cuda events and queried if
necessary when the dump is requested)

I tested that it is possible to query event state when the streams are
otherwise stuck.
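
A hedged sketch of inspecting a dump; the per-entry layout is assumed from the field list above:

```python
import pickle
import torch

dump = torch._C._distributed_c10d._dump_nccl_trace()
for entry in pickle.loads(dump):
    # pg id, seq id, collective name, first-tensor size, stack trace, state
    print(entry)
```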

Differential Revision: [D50138956](https://our.internmc.facebook.com/intern/diff/D50138956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110960
Approved by: https://github.com/wconstab
2023-10-24 07:12:21 +00:00
9dfaba6f10 [dynamo] add repro for functorch/fx interop issue (allow_in_graph) (#111746)
Fixes https://github.com/pytorch/pytorch/issues/109025 by adding repro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111746
Approved by: https://github.com/voznesenskym
2023-10-24 07:03:15 +00:00
4b804dac33 [MPS] Add complex support for fill (#111885)
Fixes #110537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111885
Approved by: https://github.com/malfet
2023-10-24 06:41:10 +00:00
0ad91c2bfb Add an explicit _shutdown method to ProcessGroupNCCL (#111392)
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.

However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor.

As a result, in this PR I'm adding an explicit shutdown method that users can call to clean up all resources.
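
A hedged usage sketch; the exact access path to the NCCL backend object is assumed for illustration:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
# ... run collectives ...
backend = dist.group.WORLD._get_backend(torch.device("cuda"))  # access path assumed
backend._shutdown()  # explicitly release NCCL resources
dist.destroy_process_group()
```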
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111392
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-24 05:47:12 +00:00
6d78f34a06 fix regression which creates a new fake tensor (#111864)
Fixes regression identified here: ccd6b373b5 (r1369334484)

Now that `get_fake_value` will identify aliases, we should not try to wrap the fake value again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111864
Approved by: https://github.com/eellison
2023-10-24 05:11:48 +00:00
0e0f6a248d Fix num_batches_tracked of BatchNorm when load_state_dict (#110850)
Fixes #110361

as the title shown

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110850
Approved by: https://github.com/mikaylagawarecki
2023-10-24 04:20:38 +00:00
30cbd2ea37 Add Benchmark for freezing + max autotune, turn on in weekly run (#111853)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111853
Approved by: https://github.com/desertfire
2023-10-24 04:13:56 +00:00
cbc6213f5d [inductor] Defer memory operation lowering to wrapper (#111402)
Right now, memory ops are being lowered to strings partly in
scheduler.codegen() and partly in wrapper.codegen(). But that makes
static memory planning (which is done entirely in `wrapper.codegen()`)
difficult to implement as information is "lost" by that point.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111402
Approved by: https://github.com/jansel
2023-10-24 03:47:56 +00:00
6977ba6e3c [inductor] decomposition for complex addition (#110740)
Tracks https://github.com/pytorch/pytorch/issues/98161

Complex number support in PyTorch isn't ideal today, as complex operations are mostly handled by the ATen runtime, except for `torch.angle` which is handled in [105609](https://github.com/pytorch/pytorch/pull/105609). In general, a better way to handle this could be to decompose complex operations first so that more opportunities for fusion are unveiled, and then to have Triton take care of non-contiguous (strided) tensor operations more efficiently. This change adds support for decomposing complex additions.

```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```
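
Conceptually, the decomposition expresses complex addition on the real/imaginary parts, exposing plain elementwise ops to the fuser; a minimal eager-mode sketch:

```python
import torch

def complex_add(a, b):
    return torch.complex(a.real + b.real, a.imag + b.imag)

x = torch.randn(6, dtype=torch.complex64)
y = torch.randn(6, dtype=torch.complex64)
torch.testing.assert_close(complex_add(x, y), x + y)
```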

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110740
Approved by: https://github.com/jansel
2023-10-24 03:41:24 +00:00
b3bb94b980 [dynamo] Update test_invoke_in_pt2_compiled_autograd (#111817)
Summary: For some reason this test seems to only run in fbcode, not OSS

Test Plan: CI

Differential Revision: D50562753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111817
Approved by: https://github.com/izaitsevfb
2023-10-24 03:30:36 +00:00
a469aca1cc Exposes a fast_fp8_accum option to _scaled_mm (#111847)
# Summary
Adds the option to use fast_accumulation_mode for the fp8 matmul in scaled_mm

Information can be found here: https://docs.nvidia.com/cuda/cublas/#cublasltmatmuldescattributes-t
It defaults to 0 (off).
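
A hedged sketch of turning the option on (requires an fp8-capable GPU; the keyword name and the returned (out, amax) pair are assumed for this version of the private API):

```python
import torch

a = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn).t()  # column-major operand
out, amax = torch._scaled_mm(a, b, out_dtype=torch.float16, use_fast_accum=True)
```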

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111847
Approved by: https://github.com/ipiszy, https://github.com/malfet
2023-10-24 03:26:53 +00:00
702aaf8aea [sparse] semi-structured sparse + torch.compile support (#111049)
Summary:

This PR adds in torch.compile support for semi-structured sparsity,
using the subclass tracing @bdhirsh added.

Based on whether we are using cuSPARSELt or CUTLASS, we return a
different representation of the inner tensors.
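
A hedged sketch of the end-to-end flow (requires a CUDA build with the cuSPARSELt or CUTLASS kernels available):

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Build a weight with a valid 2:4 pattern (two zeros in every group of four).
w = torch.randn(128, 128, dtype=torch.float16, device="cuda")
w = w.reshape(-1, 4)
w[:, :2] = 0
w = w.reshape(128, 128)

lin = torch.nn.Linear(128, 128, bias=False).half().cuda()
lin.weight = torch.nn.Parameter(to_sparse_semi_structured(w))

x = torch.randn(64, 128, dtype=torch.float16, device="cuda")
y = torch.compile(lin)(x)
```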

Test Plan:
```
python test/test_sparse_semi_structured.py -k compile
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111049
Approved by: https://github.com/cpuhrsch
2023-10-24 02:23:20 +00:00
5eac44bc72 Ignore beartype if its version is 0.16.0 (#111859)
With this fix, 'beartype' 0.16.0 should be ignored and not crash PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111859
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-10-24 02:11:26 +00:00
9132734a35 Use Dr.CI GitHub checkrun summary when querying its API fails (#111628)
This will allow the internal SandCastle job to access Dr.CI classification results via the GitHub checkrun summary and correctly ignore unrelated failures.

### Testing

Adding `TestBypassFailuresOnSandCastle` where Dr.CI API returns nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111628
Approved by: https://github.com/clee2000
2023-10-24 01:32:30 +00:00
e62c887bab Revert "[inductor][BE] split triton_meta and inductor_meta (#111397)"
This reverts commit 070b94dc08c73e133c5231ec6acbe407ae1580f3.

Reverted https://github.com/pytorch/pytorch/pull/111397 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111397#issuecomment-1776282039))
2023-10-24 00:52:24 +00:00
0a26e5fd8f Use 'device' argument in test_sparse.py::TestSparseAnyCUDA::test_as_sparse_gradcheck_* (#111584)
Argument "device" was missed.
So, "test_sparse.py::TestSparseAnyCUDA::test_as_sparse_gradcheck_*_cuda" was always run on the default device ("cpu") if another default torch device was not configured before.
This fix will probably detect a number of issues on various devices which were previously missed.
Should fix failed rocm CI jobs with "##[error]The action has timed out."  and speedup test execution

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111584
Approved by: https://github.com/soulitzer
2023-10-24 00:03:50 +00:00
b969c675f5 Add batched dimensions support to the second operand of bsr_scatter_mm (#111796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111796
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110396, #111470, #111489, #111760
2023-10-23 23:52:49 +00:00
6382011843 Add NVIDIA A100 optimized meta parameters to bsr_dense_mm (#111760)
As in the title.

The figures below illustrate the performance differences of bsr_dense_mm with optimized parameters and bsr_dense_mm with default parameters (GPU: NVIDIA A100-SXM4-80GB). The first figure represents the performance equilibrium point in BSR tensor sparsity at which value bsr_dense_mm have the same performance characteristics as torch.matmul. The second figure represents speedups from using optimized meta parameters in bsr_dense_mm at its performance equilibrium points with respect to bsr_dense_mm with default meta parameters.

In sum, this PR speeds up `bsr_dense_mm` about 50 % depending on the bsr tensor shape and blocksize and lowers the performance equilibrium points of BSR tensor sparsity and strided tensor for matmul operations.

<img src="https://github.com/pytorch/pytorch/assets/402156/6fe9d35f-dd21-4aa0-bb01-6ee257254453" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/506921c6-3770-4209-ad3d-498d2ae4989d" width="48%">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111760
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110396, #111470, #111489
2023-10-23 23:52:49 +00:00
f3d08ab271 Use more performant bsr_scatter_mm within bsr_dense_mm when blocksize is 16. (#111489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111489
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110396, #111470
2023-10-23 23:52:49 +00:00
6078ed95cc Use lru_cache to cache indices data for bsr_scatter_mm. (#111470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111470
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110396
2023-10-23 23:52:49 +00:00
b56699b699 Add post grad graph logging (#111808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111808
Approved by: https://github.com/Chillee
ghstack dependencies: #111770
2023-10-23 23:24:04 +00:00
0ea9646cdd Rewrite torch.library's documentation (#111310)
We mention the higher-level torch.library APIs and put the original docs
into a low-level API section.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111310
Approved by: https://github.com/soulitzer
ghstack dependencies: #111380, #111659
2023-10-23 23:02:41 +00:00
66b74d231a Change torch.library.impl to accept a device string (#111659)
torch.library.impl now accepts a device string (e.g. "cpu", "cuda"). It
still accepts DispatchKey strings, but we no longer document this, because
using arbitrary DispatchKeys is more for the power users.

We map the device string to a DispatchKey and then register the impl for
said DispatchKey. A user may also specify multiple device strings at once
or specify "types=default" to get a CompositeExplicitAutograd registration.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111659
Approved by: https://github.com/soulitzer
ghstack dependencies: #111380
2023-10-23 23:02:41 +00:00
6463f2b51c Rename name->qualname in torch.library.impl_abstract (#111380)
See title. Makes this consistent with torch.library.{define, impl, impl_device}, where we have named the same argument `qualname`. This is not BC-breaking because we have not released a version of PyTorch with impl_abstract in it yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111380
Approved by: https://github.com/soulitzer
2023-10-23 23:02:36 +00:00
0be84bb41e record_function: remove legacy internal operators (#72303)
These operators have not been used since #76420 but were preserved for TorchScript backward compatibility

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72303
Approved by: https://github.com/albanD
ghstack dependencies: #104535
2023-10-23 22:55:05 +00:00
4ed4753ac3 [inductor][easy] skip test_extension_backend.py in fbcode (#111591)
Summary: It's currently failing. We should skip it in fbcode because cpp extensions don't work right now.

Differential Revision: D48852412

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111591
Approved by: https://github.com/desertfire
2023-10-23 22:37:13 +00:00
d22e5e4b52 Fix DDP notes (#111833)
To include `import os` otherwise sample is not syntactically correct Reported in https://github.com/pytorch/pytorch.github.io/pull/1490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111833
Approved by: https://github.com/wanchaol
2023-10-23 22:05:36 +00:00
070b94dc08 [inductor][BE] split triton_meta and inductor_meta (#111397)
triton_meta is intended to be passed directly to triton. Previously we were also putting other metadata into triton_meta, but we should split out the other metadata into a separate dict to avoid possible conflicts in the future.

This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton.

Tests - wait for CI

Differential Revision: [D50442547](https://our.internmc.facebook.com/intern/diff/D50442547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111397
Approved by: https://github.com/shunting314, https://github.com/eellison
2023-10-23 21:38:21 +00:00
73170b23d4 Add compile support for NT unbind (#111531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111531
Approved by: https://github.com/ezyang
2023-10-23 21:16:20 +00:00
4d45c21c3f [Export] Don't serialize missing args with default value (#111715)
Summary: Per https://docs.google.com/document/d/1FzWm-sHYwmRi3x_g036kOxd99KaYquUsA-L5JwOn8ys/edit

I wonder if this would break executorch? @larryliu0820
I see exir/serialize.py using export's GraphModuleSerializer.

Test Plan: Existing CIs

Differential Revision: D50519217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111715
Approved by: https://github.com/zhxchen17
2023-10-23 21:09:15 +00:00
185e76238d [2D][Documentation] Add some comments to _chunk_dtensor (#111775)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111775
Approved by: https://github.com/awgu
2023-10-23 20:43:03 +00:00
3b5b7ebd09 [ci] Save various json files from test infra into folder (#111516)
We pull a lot of files from https://github.com/pytorch/test-infra/blob/generated-stats/stats and name them separately when we add them to the artifacts in the build, so stick them in a folder and just add that instead.

Slow test and disabled test jsons remain as they were since they are pulled during the test step and do not need to be included in the artifacts during build since they are not used for sharding.

Sanity checked that test times could be found for linux, mac, windows, and rocm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111516
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-10-23 20:38:25 +00:00
e509b162ed Disable FlashAttenion for is_causal=True when seqlen q not equal kv (#111007)
# Summary:
This pull request **removes** support for non-square sequence lengths in causal attention when using FlashAttention V2.

### Why we are doing this
```
// FlashAttention 2 updated the default mask meaning for causal in this PR:
// 9e5e8bc91e it is now aligned to lower_right which would be a BC break
// for non-square masks. We will not support non-square masks for causal w/ FAV2
```

 For more context see:
 https://github.com/pytorch/pytorch/issues/108108

 ### Followup
 A large number of people will likely want to use FAV2 with lower_right causal attention for non-equal sequence lengths. See this RFC: https://github.com/pytorch/pytorch/issues/110681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111007
Approved by: https://github.com/cpuhrsch
2023-10-23 20:33:37 +00:00
98e749a306 [Pytorch][CPU] Switch building compiler to Clang (#111537)
Summary:
The slimdsnn model is currently built with GCC. After a stack of backporting (D50338220), I see that Clang-15 generates code which is 10% faster than GCC's.

There are likely further improvements available in the internal Clang, as the top-of-tree Clang in LLVM upstream generates even better code.

Test Plan:
Before:

   buck2 run mode/{opt,inplace} //accelerators/workloads/models/slimdsnn:slimdsnn_dso_benchmark -- --iterations=100000000

   Starting benchmark, 100000000 iterations...
   Batch=1 latency: 0.643 us

After:

   buck2 run mode/{opt,inplace} //accelerators/workloads/models/slimdsnn:slimdsnn_dso_benchmark -- --iterations=100000000

   Starting benchmark, 100000000 iterations...
   Batch=1 latency: 0.593  us

Reviewed By: bertmaher

Differential Revision: D50399150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111537
Approved by: https://github.com/bertmaher
2023-10-23 20:26:46 +00:00
6c384cf4a6 Don't DCE unbacked SymInt if it is returned as shape constant buffer (#111803)
Also adds some logging for the inductor scheduler

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111803
Approved by: https://github.com/jansel
2023-10-23 19:57:38 +00:00
0b602b13c8 [small] fix tcpstore doc arg (#111807)
incorrect arg name `wait_for_worker` -> `wait_for_workers`
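
Usage with the corrected keyword:

```python
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                      timeout=timedelta(seconds=30), wait_for_workers=True)
store.set("key", "value")
print(store.get("key"))
```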
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111807
Approved by: https://github.com/awgu, https://github.com/fduwjj
2023-10-23 19:51:09 +00:00
d4708a6da7 Add scatter_mm and bsr_scatter_mm operations. (#110396)
This PR introduces the `scatter_mm` operation (computing `mm` over arbitrary pairs of tensors given in batches), which is used to implement `bsr_scatter_mm`, an equivalent of `bsr_dense_mm` (the `mm` operation on BSR and strided tensors). The implementation is provided both in Triton (when tensor dimensions are multiples of 16) and in PyTorch (otherwise); a usage sketch follows the summary below.

The figures below illustrate the performance differences of `bsr_scatter_mm` and `bsr_dense_mm` (GPU: `NVIDIA GeForce RTX 2060 SUPER`). The first figure represents the performance equilibrium point in BSR tensor sparsity at which value `bsr_scatter_mm` or `bsr_dense_mm` have the same performance characteristics as `torch.matmul`. The second figure represents speedups from using `bsr_scatter_mm` at its performance equilibrium points with respect to `bsr_dense_mm`.

<img src="https://github.com/pytorch/pytorch/assets/402156/526d182e-937f-4812-a6c4-904f52d6d5ab" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/ccb606ab-1f3f-4133-887c-b56285f4f168" width="48%">

The same figures for GPU card `NVIDIA A100-SXM4-80GB`:

<img src="https://github.com/pytorch/pytorch/assets/402156/25466f1d-df34-4d1c-a975-afb478e4d9f0" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/6ada91f0-a20f-4f0d-8a48-1f4ccc60d08e" width="48%">

In sum:
- `bsr_scatter_mm` is about 2x faster than `bsr_dense_mm` for small block sizes of 16 and 32 and large tensors [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- `bsr_scatter_mm` is up to 2x faster than `bsr_dense_mm` for small block sizes of 16 and large tensors [GPU: `NVIDIA A100-SXM4-80GB`].
- `bsr_dense_mm` is up to 20 % faster than `bsr_scatter_mm` for block sizes of 64 or larger [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- However, `bsr_dense_mm` fails with `OutOfResources` exception for block sizes of 256 or larger whereas `bsr_scatter_mm` succeeds.
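
A hedged usage sketch; the private module path is assumed and may change:

```python
import torch
from torch.sparse._triton_ops import bsr_dense_mm  # module path assumed

a = torch.randn(256, 256, device="cuda", dtype=torch.float16)
bsr = a.to_sparse_bsr((32, 32))
b = torch.randn(256, 256, device="cuda", dtype=torch.float16)
out = bsr_dense_mm(bsr, b)
torch.testing.assert_close(out, a @ b, rtol=1e-2, atol=1e-2)
```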

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110396
Approved by: https://github.com/cpuhrsch
2023-10-23 19:45:30 +00:00
3b9246ba18 Add CSR tensor with non-contiguous values support to CuSparseSpMatCsrDescriptor (#111742)
Fixes https://github.com/pytorch/pytorch/issues/111574

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111742
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-10-23 19:20:11 +00:00
335582584f [inductor] Adding a way to force fusion of int_mm with mul (#111413)
Summary: When doing quantization, int_mm -> mul or int_mm -> mul -> to(dtype)
is an extremely common op pattern that is currently not handled well by
inductor. Ideally, since the output of int_mm has dtype int32, we'd prefer
to only realize a smaller dtype like bf16 or float16. Currently inductor
doesn't have a way to force this; in many cases the mul gets fused with a
bunch of subsequent pointwise and reduction ops from the dequant and
following ops, creating an increase in memory overhead and a general
slowdown compared to the fused version.

As an external benchmark, for SAM this seems to improve our e2e image encoder
times by 3-5% depending on batch size, and reduces memory usage by 20%.
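
The targeted pattern, sketched in eager form (CUDA int8 matmul; the shape constraints of torch._int_mm are assumed satisfied here):

```python
import torch

a = torch.randint(-128, 127, (32, 64), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (64, 32), dtype=torch.int8, device="cuda")
scale = torch.rand(32, 32, device="cuda")

def int_mm_mul(a, b, scale):
    return (torch._int_mm(a, b) * scale).to(torch.bfloat16)

out = torch.compile(int_mm_mul)(a, b, scale)
```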

Test Plan: python test/inductor/test_pattern_matcher.py -k
"int_mm_mul"


Pull Request resolved: https://github.com/pytorch/pytorch/pull/111413
Approved by: https://github.com/jansel
2023-10-23 19:18:50 +00:00
e264b42a2e [re-land][inductor] Refactor and optimize allocation calls (#111117) (#111511)
Summary:
This is a re-land of https://github.com/pytorch/pytorch/pull/111117 with
updates to our internal tests included.

This splits out changes from
https://github.com/pytorch/pytorch/pull/102625 to make things easier to
review.

This diff creates a `make_allocation()` method that extracts the logic
from `make_buffer_allocation()` while allowing us to allocate non-buffer
objects. In particular, we will use this to allocate memory pools during
memory planning.

This diff also includes a small optimization -- if the desired
allocation is contiguous, then we emit a call to `empty()` instead of
`empty_strided()` with its superfluous stride argument.

Test Plan: contbuild & OSS CI, see 9ce0ae836d

Differential Revision: D50429424

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111511
Approved by: https://github.com/jansel
2023-10-23 19:18:32 +00:00
184aee12cc make no-inline calls to throw exceptions (#111787)
Previously, we threw runtime_error exceptions, built up with some string
operations, upon failures. However, inlining such calls into
the main run function causes exponential compilation-time
behavior in the host compiler, which may spend an hour running
call-graph-related passes for some large models.
This PR replaces the relevant code with no-inline calls.
With this change, we reduced the compilation time from more than
an hour down to a couple of minutes for some large models.
Note that these non-inlined calls have little impact on the
model inference runtime, because they are on the error-handling
paths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111787
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-23 19:12:04 +00:00
36d34ce951 [dynamo] support comparing LHS constant with tensor (#111492)
Fixes https://github.com/pytorch/pytorch/issues/108582

Depends on https://github.com/pytorch/pytorch/pull/111557 for fixing broken integration tests. (due to this PR unblocking an in-graph set membership)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111492
Approved by: https://github.com/Skylion007
2023-10-23 19:05:14 +00:00
59ae0d9f9d Allow setting logger output format with TORCH_LOGS_FORMAT (#111770)
Setting TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" will dump only the log level and message contents.
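
The format string uses Python's standard `logging` syntax, so the setting above is roughly equivalent to installing a formatter like this (a sketch, not the actual torch._logging internals):

```
import logging

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logging.getLogger("torch._dynamo").addHandler(handler)
```
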
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111770
Approved by: https://github.com/jansel
2023-10-23 18:42:27 +00:00
01a2c801d4 Pass BUILD_ENVIRONMENT to MPS tests (#111595)
- Pass `GIT_DEFAULT_BRANCH` and `TEST_CONFIG` as well.
- Unify `_mac-test.yml` and `_mac-test-mps.yml` further by passing runner type via the matrix and uploading results using the same pattern (before the change MacOS12 and MacOS13 results on PRs were overwritten)
- Add `Cleanup disk space` step to `_mac-test-mps.yml` job

Should fix the
```
Warning:  Gathered no stats from artifacts for build env None build env and None test config. Using default build env and default test config instead.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111595
Approved by: https://github.com/atalman
2023-10-23 18:37:15 +00:00
0247dce6cb [Pytorch][Vulkan] mean.dim (#111609)
Summary:
We implement [`torch.mean(input, dim, keepdim)`](https://pytorch.org/docs/stable/generated/torch.mean.html) for 2-d to 4-d tensors.

Since 0-dim tensors aren't supported yet, we only support `dim.size() < input.dim()` for now (see the sketch after this list). We will support the following cases in future work:
- `dim.size() == input.dim()`
- `input.dim() == 1`
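
A CPU-side sketch of the supported case (`dim.size() < input.dim()`); running the same call on the Vulkan backend assumes a Vulkan-enabled build:

```
import torch

x = torch.rand(2, 3, 4)                      # 3-d input
y = torch.mean(x, dim=[1, 2], keepdim=True)  # dim.size() == 2 < x.dim() == 3; y.shape == (2, 1, 1)
```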

Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (970fcd90c)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*mean*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
  Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *mean*
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.mean_invalid_inputs
[       OK ] VulkanAPITest.mean_invalid_inputs (46 ms)
[ RUN      ] VulkanAPITest.mean_dim_2d
[       OK ] VulkanAPITest.mean_dim_2d (127 ms)
[ RUN      ] VulkanAPITest.mean_dim_3d
[       OK ] VulkanAPITest.mean_dim_3d (103 ms)
[ RUN      ] VulkanAPITest.mean_dim_4d
[       OK ] VulkanAPITest.mean_dim_4d (89 ms)
[ RUN      ] VulkanAPITest.mean_dim_keepdim_2d
[       OK ] VulkanAPITest.mean_dim_keepdim_2d (66 ms)
[ RUN      ] VulkanAPITest.mean_dim_keepdim_3d
[       OK ] VulkanAPITest.mean_dim_keepdim_3d (127 ms)
[ RUN      ] VulkanAPITest.mean_dim_keepdim_4d
[       OK ] VulkanAPITest.mean_dim_keepdim_4d (4 ms)
[----------] 7 tests from VulkanAPITest (564 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test suite ran. (564 ms total)
[  PASSED  ] 7 tests.
```

Reviewed By: yipjustin

Differential Revision: D50312990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111609
Approved by: https://github.com/yipjustin
2023-10-23 18:34:53 +00:00
39c09d4da6 Revert "Revert "Nvfuser code removal (#111093)"" (#111604)
This reverts commit 715dfced72657e5adacd5bef16e3d458cd94851b.

The original PR #111093 is reverted due to broken internal build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111604
Approved by: https://github.com/davidberard98
2023-10-23 18:32:41 +00:00
ce48d36324 [aotinductor] Update test utility to use AOTIModelRunner (#111657)
Summary: Use the AOTIModelRunner provided by libtorch instead of the custom-written RAIIModelContainer for testing. This change also makes running AOTInductor benchmarks on CPU possible.

Differential Revision: [D50560764](https://our.internmc.facebook.com/intern/diff/D50560764)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111657
Approved by: https://github.com/chenyang78
2023-10-23 18:21:27 +00:00
4b6b8fcf6d Disable dynamo when running generated opcheck tests (#111685)
Summary: Use `TORCHDYNAMO_DISABLE=1` when running generated opcheck tests. Enable some `fbgemm::pack_segments` tests that errored out (with error `RuntimeError: expected int but got s0*s1**2`) because dynamo was being run in the opcheck tests.

Test Plan: `parsh -v --build-flags mode/dev-nosan //deeplearning/fbgemm/fbgemm_gpu:sparse_ops_test` then `run_tests("test_pack_segments")`

Differential Revision: D50508958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111685
Approved by: https://github.com/zou3519
2023-10-23 18:21:16 +00:00
e644b03775 [Forward fix] torch.fx.passes.shape_prop should not be skipped (#111771)
Summary: As title

Test Plan: All failures in T167831495 passed

Differential Revision: D50542953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111771
Approved by: https://github.com/aakhundov
2023-10-23 18:05:26 +00:00
4b324a8717 Add Half support for aminmax on CPU (#106853)
Add Half support for aminmax on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106853
Approved by: https://github.com/cpuhrsch
2023-10-23 17:43:47 +00:00
ad4ccf9689 [dynamo] Properly track user-defined types for type() (#110794)
Closes https://github.com/pytorch/pytorch/issues/110315.

Thanks to @ezyang for the easy repro!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110794
Approved by: https://github.com/ezyang
2023-10-23 17:34:23 +00:00
a22e238db0 Additional lint fixes (#111793)
Follow up to https://github.com/pytorch/pytorch/pull/111367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111793
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-10-23 17:18:26 +00:00
f3d02d9ae6 Add support for sym_ite (#111440)
This PR supports sym_ite. This is useful for converting SymBool to SymInt in e.g. #109916. Internally, it uses sympy.Piecewise. We cannot use sympy.ITE because it expects the arguments and output to all be boolean type, but we want to return SymInt type when converting a SymBool to SymInt. So we use sympy.Piecewise to denote the symbolic relationship.

Note that this pr uses the range analysis for sympy.Piecewise implemented in https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/value_ranges.py.

Test Plan:
See added test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111440
Approved by: https://github.com/ezyang
2023-10-23 16:17:43 +00:00
09040f6fbb bypass nvml for torch.cuda.device_count() if rocm (#110418)
This is a quick fix to suppress printing "UserWarning: Can't initialize NVML" when calling torch.cuda.device_count() if the [NVIDIA Management Library](https://developer.nvidia.com/nvidia-management-library-nvml) (nvml module) is installed with ROCm.
Fixes https://ontrack-internal.amd.com/browse/SWDEV-414997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110418
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/kit1980
2023-10-23 16:15:48 +00:00
236472b32a Allow to specify specific files for debug info (#111748)
Building with `USE_CUSTOM_DEBINFO=torch/csrc/Module.cpp python setup.py develop` for example will provide debug info only for this file.
This allows to enable debug symbols very fast from a non-debug build by doing a clean then develop (as long as you have ccache) and avoid very large binaries that take a very long time to load in gdb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111748
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/malfet
2023-10-23 14:00:54 +00:00
024ffd342a [ATen] Make _unsafe_index CompositeExplicitAutograd (#111795)
The ATen implementation for this function simply calls `at::index` so there's
no reason this shouldn't be composite.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111795
Approved by: https://github.com/lezcano
2023-10-23 13:34:29 +00:00
f3991df408 [caffe2] avoid variable shadowing (#111476)
Summary:
Some builds use -Wshadow and currently there is a compiler warning when building that file.

Code inspection shows that `torch::autograd::impl::get_view_autograd_meta` simply extracts information from the passed object, which is `const`. Therefore the returned views should be the same all the time, and we can fetch the view only once.

Test Plan:
CI

NOTE: please advise for a more comprehensive test plan.

Differential Revision: D50407625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111476
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-10-23 13:22:11 +00:00
cyy
e676ec2fe7 Fix undefined __assert_fail on FreeBSD (#111761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111761
Approved by: https://github.com/Skylion007
2023-10-23 12:46:03 +00:00
1eb6c4314b [xla hash update] update the pinned xla hash (#111788)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111788
Approved by: https://github.com/pytorchbot
2023-10-23 10:59:39 +00:00
fb8876069d Support tracing base torch_function impl (#111731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111731
Approved by: https://github.com/jansel
ghstack dependencies: #111730
2023-10-23 07:11:32 +00:00
cyy
0b424ee0b7 Fix inconsistency of max_split_size between DeviceStats and CUDAAllocatorConfig (#111555)
CUDAAllocatorConfig uses a size_t max_split_size and initializes it to `std::numeric_limits<size_t>::max()`; the value is then assigned to max_split_size of DeviceStats, which is of type int64_t, so that the command
```
python3 -c "import torch;y=torch.empty(3,device='cuda');print(torch.cuda.memory_stats(0)['max_split_size'])"
```
returned -1.

After skimming through the code and reading the doc at https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html, it was clear that negative values of max_split_size make no sense and we should use size_t instead. The error has now been fixed and the command returns `std::numeric_limits<size_t>::max()`.

This issue was found in revert of #111137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111555
Approved by: https://github.com/colesbury
2023-10-23 06:55:29 +00:00
f7401de1bb Add mha to Autocast CPU (#107674)
Fixes #106751.

This PR adds `_native_multi_head_attention` to Autocast CPU policy.

Behavior: within the scope of `torch.cpu.amp.autocast(dtype=torch.bfloat16)`, `_native_multi_head_attention` will be forced to run with the bf16 data type.
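
A minimal sketch of that behavior through the public `nn.MultiheadAttention` module; whether the `_native_multi_head_attention` fast path is actually taken depends on the usual fast-path conditions (e.g. eval mode, no need for attention weights):

```
import torch

mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True).eval()
x = torch.randn(2, 4, 8)
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out, _ = mha(x, x, x)  # runs in bf16 under the autocast scope
```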

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107674
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jon-chuang, https://github.com/drisspg
2023-10-23 06:02:55 +00:00
1d9a7f9e43 [Reland] TensorWithTFOverride inheritance from TensorVariable (#111766)
Accidentally merged https://github.com/pytorch/pytorch/pull/111730 with ghstack, so relanding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111766
Approved by: https://github.com/jansel
2023-10-23 04:33:16 +00:00
c65c0682b1 [dynamo] Expand _nonvar_fields names (#111749)
This should be a small compile time optimization, since we won't need to
walk these fields in apply().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111749
Approved by: https://github.com/yanboliang
2023-10-23 02:58:16 +00:00
2b2b6caf8f [inductor] Implement clone removal for user defined triton kernel via reinplace_scatters (#111627)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111627
Approved by: https://github.com/jansel
ghstack dependencies: #111434
2023-10-22 22:28:00 +00:00
d118531733 Use \odot everywhere instead of mixing \odot and * for the Hadamard product (#111763)
This pull request addresses an inconsistency in the representation of the Hadamard product across PyTorch documentation. Currently, the notation varies among different modules:

- In `torch.nn.LSTM` documentation the Hadamard product is represented with $\odot$
- In `torch.nn.GRU` documentation the Hadamard product is represented with $*$
- In `torch.nn.LSTMCell` documentation the Hadamard product is represented with $*$
- In `torch.nn.GRUCell` documentation the Hadamard product is represented with $*$
- In `torch.ao.nn.quantized.dynamic.GRU` documentation the Hadamard product is represented with $*$

This PR proposes consistently representing the Hadamard product throughout the documentation to enhance clarity and align with established standards.
The notation $\odot$ will be uniformly adopted, following the convention in the [Deep Learning Book](https://www.deeplearningbook.org/contents/linear_algebra.html).

**Changes Made:**

- Modified `torch.nn.GRU` documentation to represent the Hadamard product with $\odot$
- Modified `torch.nn.LSTMCell` documentation to represent the Hadamard product with $\odot$
- Modified `torch.nn.GRUCell` documentation to represent the Hadamard product with $\odot$
- Modified `torch.ao.nn.quantized.dynamic.GRU` documentation to represent the Hadamard product with $\odot$
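
For context, with the unified notation the LSTM cell update from the `torch.nn.LSTM` docs reads:

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t)$$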

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111763
Approved by: https://github.com/albanD
2023-10-22 21:01:35 +00:00
5af97fedd2 [dynamo] Fix context wrapping grad mode variable (#111534)
Fixes https://github.com/pytorch/pytorch/issues/111528

Makes use of `ContextWrappingVariable` so that the function will enter the grad mode whenever it is called, and exit once it is done calling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111534
Approved by: https://github.com/jansel
2023-10-22 20:55:48 +00:00
798efab532 Fix S367052 to unblock ICVR MC3 (#109937)
Summary: Somehow "getitem" started to receive a Tensor starting from ads_ranking:996, which broke the SDD-pipelining FX transformer. We need to skip the Tensor node in annotation.

Test Plan:
N4326037
with ads_ranking kernel
# Before
ads_ranking:v996
 {F1100009226}
# With this diff
 {F1100009310}

Differential Revision: D49567615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109937
Approved by: https://github.com/xush6528
2023-10-22 20:24:26 +00:00
c4ab229a82 [dynamo] Implement set.__contains__ for Tensor as object match of FakeTensor (#111738)
Fixes https://github.com/pytorch/pytorch/issues/111556

Dynamo implementation of `set.__contains__` previously used `__eq__` match.

But this is wrong when `__eq__` match does not imply `__hash__` match, as is the case for `torch.Tensor`, leading to inconsistent results. See: https://github.com/pytorch/pytorch/issues/111542

Hence we implement it as a Tensor object match, i.e. a match on the proxy node's `'example_value'` FakeTensor.
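
For reference, eager Python set membership on tensors is effectively object identity (Tensor hashes are id-based, and the identity shortcut fires before `__eq__`), which is the behavior matched here:

```
import torch

t = torch.tensor([1.0])
s = {t}
print(t in s)                    # True: same object
print(torch.tensor([1.0]) in s)  # False: equal values, different object
```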

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111738
Approved by: https://github.com/lezcano
2023-10-22 17:40:34 +00:00
977d3bcc46 [Inductor] Support user defined triton kernels in inductor (#111434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111434
Approved by: https://github.com/jansel
2023-10-22 17:04:19 +00:00
e2e1189f41 [dynamo] Fix guard for ndarray calling torch.as_tensor(None) (#111665)
Fixes https://github.com/pytorch/pytorch/issues/111662

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111665
Approved by: https://github.com/lezcano
2023-10-22 15:16:21 +00:00
8e60d646b9 [dynamo][stream]support device-agnostic stream in dynamo and capture stream/event method in fx graph (#108312)
This PR implements two things:
1. support for device-agnostic stream and runtime APIs captured by dynamo.
2. support for stream methods (including events) captured by dynamo.

Here are the details for the first item.
Previously the stream captured in dynamo was tightly bound to CUDA. Here we implement a global singleton container named `StreamMethodContainer` for different backends to register their associated stream methods with dynamo. When the backend's package is imported, the stream operations can be registered directly by calling

```
device_stream_method = {'current_stream': method_1,
                        'create_stream_context': method_2,
                        'set_stream': method_3,
                        'set_stream_by_id': method_4}
torch._dynamo.stream.register_stream_method(device_name, device_stream_method)
```

Stream methods need to be passed to this API according to the precise semantics represented by the dict keys in `device_stream_method`. After registration, these methods can be used by dynamo to capture the stream operations in users' scripts, for example getting the current stream or setting a specific stream. Additionally, the wrapped stream variable and the stream context variable are now device-agnostic, and the proxy functions of these variables are assigned from the associated methods in the container. All of this is illustrated below.

![image](https://github.com/pytorch/pytorch/assets/74231238/37ac7350-c539-4167-9886-c3744ecab65d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108312
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-10-22 13:22:58 +00:00
57c7aa12db Remove deprecated fbgemm operators (#104535)
These operators are not used and have been deprecated since #72690 (Feb 2022). Additionally, the `torch.jit.quantized` interface has been deprecated since #40102 (June 2020).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104535
Approved by: https://github.com/ezyang
2023-10-22 06:10:09 +00:00
bf01a7b023 [3/N] Merge skipfiles.check rules (#111451)
The major change in this PR is to consolidate the skipfiles.check rules; the main thing done is merging the original ```FILE_INLINELIST``` and ```SUBMOD_INLINELIST``` into a new ```MOD_INLINELIST``` and a legacy ```LEGACY_MOD_INLINELIST```.
Let's use the following example to illustrate the expected behavior of this force-inline list:
fa995626a8/torch/_dynamo/skipfiles.py (L344-L369)

The handling logic is:
* If f2 is inlined, we check both ```MOD_INLINELIST``` and ```LEGACY_MOD_INLINELIST``` to consult the force-inline rules for f3.
* If f2 is skipped, we check only ```LEGACY_MOD_INLINELIST``` for the inline rules for f3.

The reason behind this design is: if f2 is skipped and we always trace all recursively called functions, we end up in very low-level functions (e.g., ```super().__init__```), which causes graph breaks. We treat this as a signal that all functions f2 recursively calls should be skipped as well if f2 is skipped. This is also a feature that many PyTorch developers requested: they just want to skip all recursive functions if they mark the upper-level functions as skipped.

For PyTorch developers, we should only use ```MOD_INLINELIST``` going forward. Most of the modules in ```LEGACY_MOD_INLINELIST``` are legacy workarounds from when we didn't have a good skip/inline API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111451
Approved by: https://github.com/ezyang
2023-10-22 04:35:15 +00:00
61461f39d1 [dtensor] handle negative dim and fix TP regression (#111750)
TP style still has some regression due to negative dim specifications;
fix it by allowing the DTensor API to handle negative dims and normalize them.

i.e. TP uses `Shard(-1)` and then tries to redistribute `Shard(1) -> Shard(-1)`. This should ideally be a no-op, but currently it runs a decompose-sharding phase which turns this transformation into `Shard(1) -> Replicate -> Shard(-1)`; that is wrong and triggers unnecessary allgathers.
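
A minimal sketch of the normalization; the helper name is hypothetical, since DTensor has its own utilities for this:

```
def normalize_dim(dim: int, ndim: int) -> int:
    """Map a possibly negative dim into [0, ndim)."""
    return dim + ndim if dim < 0 else dim

assert normalize_dim(-1, 2) == 1  # Shard(-1) on a 2-d tensor is the same as Shard(1)
```
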
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111750
Approved by: https://github.com/rohan-varma
2023-10-22 04:25:45 +00:00
1d291e1f19 [dtensor] hide xla imports to avoid warning (#111751)
xla imports throw warnings when xla is not installed; we should only
import xla when needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111751
Approved by: https://github.com/rohan-varma
2023-10-22 04:09:10 +00:00
c9ca0dde0d python_arg_parser + dynamic shapes: fix segfault coercing symint to intlist (#111642)
Fixes https://github.com/pytorch/pytorch/issues/104812.

As of https://github.com/pytorch/pytorch/pull/111216, the python arg parser will now guard and cast symints from dynamo into ints when it is forced to (e.g. when we pass a symint to an op that only accepts ints).

But the python arg parser also has logic to try to coerce ints into int[] - we need the same logic for symint -> int[].
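
For reference, the int -> int[] coercion in question is what lets a bare int stand in for a size list in calls like this; the fix extends the same coercion to SymInt under dynamo:

```
import torch

x = torch.ones(2, 3)
y = x.reshape(6)  # a bare int is coerced to int[1] by the python arg parser
```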

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111642
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #111553
2023-10-22 02:27:14 +00:00
62942b075c dynamo: graph break on resize_ (#111553)
AOTAutograd's handling for resize_() isn't fully robust (and on top of that, functionalization can potentially give up and raise an error if the tensor you're resizing has outstanding views).

So given that, and given that resize_() is rare, I updated dynamo to graph break on resize_() instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111553
Approved by: https://github.com/ezyang
2023-10-22 02:27:14 +00:00
f0cde8613c Revert "Use fmt::format in NCCLUtils and ProcessGroupNCCL instead of c10::str (#107268)"
This reverts commit 6c56e1ce2b8d850eb8f51731ecc8be415160e02b.

Reverted https://github.com/pytorch/pytorch/pull/107268 on behalf of https://github.com/jansel due to Breaks build on Ubuntu 23.04 ([comment](https://github.com/pytorch/pytorch/pull/107268#issuecomment-1773960355))
2023-10-22 01:03:30 +00:00
cc776d2186 [PyTorch Pinned Allocator] Create per thread task pool for mapping memory space (#111545)
Differential Revision: D50443865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111545
Approved by: https://github.com/zdevito
2023-10-22 00:23:49 +00:00
7bd004297a [inductor] Move inductor ops to CompositeExplicitAutograd (#111702)
Relands #111274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111702
Approved by: https://github.com/voznesenskym
ghstack dependencies: #111700, #111701
2023-10-21 17:31:43 +00:00
1a528c826e [Compiled Autograd] Error if tensor_post_acc_grad_hooks is set (#111701)
Relands #111273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111701
Approved by: https://github.com/voznesenskym
ghstack dependencies: #111700
2023-10-21 17:31:43 +00:00
a1154e673b [Compiled Autograd] Turn accumulate_grad into an op (#111700)
Relands #111271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111700
Approved by: https://github.com/voznesenskym
2023-10-21 17:31:09 +00:00
cyy
39f484646b [4/N] Apply clang-tidy to aten/src/ATen/core (#111406)
Applies clang-tidy to more aten/src/ATen/core/* files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111406
Approved by: https://github.com/Skylion007
2023-10-21 15:14:00 +00:00
47eed65481 [dynamo] Add is_ support for Tensors, force get_fake_value to reuse previously computed example_value if available (#111565)
Use FakeTensor id match as equivalent to object identity match

cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111565
Approved by: https://github.com/ezyang
2023-10-21 13:56:30 +00:00
9455af58b5 [easy][dynamo] Cleanup guard builder selection (#111723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111723
Approved by: https://github.com/jon-chuang, https://github.com/jansel
2023-10-21 10:48:32 +00:00
cc28b9c10a Fixed a memory leak in PyTorchFileReader (#111703)
Fixes #111330.

This PR prevents `PyTorchFileReader` from leaking memory when initialized with an already opened file handle instead of a file name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111703
Approved by: https://github.com/Skylion007
2023-10-21 10:11:43 +00:00
344fc98991 [dynamo] fix: SetVariable should test Tensor identity based example_value FakeTensor, not fx.Node (#111696)
FX Node changes after in-place op. FakeTensor remains the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111696
Approved by: https://github.com/ezyang
2023-10-21 08:49:21 +00:00
d054078b74 Fix missing guards from logs (#111698)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111698
Approved by: https://github.com/suo, https://github.com/voznesenskym
2023-10-21 07:17:09 +00:00
9c9f66c042 [TorchFix] Update old pretrained TorchVision API in tests (#111708)
For TorchVision models, `pretrained` parameters have been deprecated in favor of "Multi-weight support API" - see https://pytorch.org/vision/0.15/models.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111708
Approved by: https://github.com/NicolasHug
2023-10-21 07:05:33 +00:00
920c9adcc6 [MetaTensor] fix inplace copy for meta tensor (#111705)
Fixes #105685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111705
Approved by: https://github.com/ezyang
2023-10-21 06:02:37 +00:00
5737545467 [vision hash update] update the pinned vision hash (#111720)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111720
Approved by: https://github.com/pytorchbot
2023-10-21 05:09:06 +00:00
3c4581d613 Remove outdated declarations from setup.py (#110660)
`-Wno-deprecated-declarations` should not be needed now that Python 2 is no longer supported.

The Clang issue behind `-Wno-missing-braces` was fixed in 2018.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110660
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/malfet
2023-10-21 04:55:44 +00:00
c84c86f018 SymIntify convolution (#111599)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111599
Approved by: https://github.com/wanchaol, https://github.com/bdhirsh
2023-10-21 03:03:20 +00:00
0a147fd112 Pointwise fuse cat with pointwise inputs or outputs and <= 4 inputs (#111233)
Improves perf of llama_v2 locally from 1.55 -> 1.57

The initial heuristic is to lower to pointwise if the number of inputs is <= 4 and all the inputs are pointwise or cannot be memory-planned away, or if all the outputs are pointwise.

The perf run was +3% on inference. There are definitely instances where we should be lowering to foreach kernels, but they are less flexible for fusion. The motivating example was:

```
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    iota =  torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False)

    # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0)
    position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]);  unsqueeze = None

    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

Also not sure if I should be more worried about concatting reduction->pointwise inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233
Approved by: https://github.com/Chillee
2023-10-21 02:34:05 +00:00
03da0694b7 Fix buffer overflow in torch.sort (#111672)
By updating the fbgemm submodule.
Add a regression test for it (though it can probably be limited to just CPU, as the reproducer only works if num_threads is 1).

Also, update call sites of `fbgemm::GenerateEmbeddingSpMDM` to pass `isbf16` twice, to match API changes introduced in https://github.com/pytorch/FBGEMM/pull/1851

Fixes https://github.com/pytorch/pytorch/issues/111189 and https://github.com/pytorch/pytorch/issues/111710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111672
Approved by: https://github.com/Skylion007
2023-10-21 02:30:11 +00:00
62df159c3f move tf override tensor to torch_function.py (#111714)
Moves TensorWithTFOverride to torch_function.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111714
Approved by: https://github.com/eellison, https://github.com/voznesenskym
2023-10-21 02:29:01 +00:00
5034e98393 Fix create source distribution step for release (#111697)
This is fixing following failure in the release branch:
```
cp: cannot create directory '/tmp/pytorch-release/2.1': No such file or directory
```
Link: https://github.com/pytorch/pytorch/actions/runs/6591657669/job/17910724990

cp will report that error if the parent directory (pytorch-release in this case) does not exist.
This works in main since ``PT_RELEASE_NAME: pytorch-main``; however, for release it's ``PT_RELEASE_NAME: pytorch-release/2.1``.

Test:
```
export tag_or_branch=release/2.1
tag_or_branch="${tag_or_branch//\//_}"
echo $tag_or_branch
release_2.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111697
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2023-10-21 01:57:23 +00:00
8376079b97 [DTensor][XLA] Support Xla backend in distribute_tensor API (#110275)
This addresses #92909 and enables XLA backend support for the `distribute_tensor` API.

Test plan: added a unit test case & tested with CloudTPU. The CI should skip this unless it's an XLA workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110275
Approved by: https://github.com/wanchaol, https://github.com/alanwaketan, https://github.com/JackCaoG
2023-10-21 01:17:15 +00:00
ff864efd53 [DCP][Test] Add use_dtensor subtests for test_state_dict FSDP test (#111615)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111615
Approved by: https://github.com/fegin
2023-10-21 00:44:41 +00:00
cb2fef1f47 [DCP][Test] Update fine-tune e2e test to use init_device_mesh and DTensor state_dict (#111598)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111598
Approved by: https://github.com/fegin
2023-10-21 00:37:00 +00:00
7709382b50 Fix regression in torch.equal behavior for NaNs (#111699)
`torch.equal(x, x)` should return False if `x` is a tensor of floats, one of which is NaN.
This renders some of the optimizations proposed in https://github.com/pytorch/pytorch/pull/100024 invalid, though as a result `torch.equal` will become much slower for identical floating-point tensors.

Add regression test that calls torch.equal for tensor containing NaN
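
The fixed behavior, for illustration:

```
import torch

x = torch.tensor([1.0, float("nan")])
print(torch.equal(x, x))  # False: NaN compares unequal to itself
```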

Fixes https://github.com/pytorch/pytorch/issues/111251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111699
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-10-21 00:02:45 +00:00
aa24459595 [NCCL][CUDA][CUDA Graphs] Flush enqueued work before starting a graph capture 2 (#110665)
Alternative to #104487.

Several have chimed in that #104487 introduces a dependency from torch (c10d) to ATen, which is considered backward and messy. This alternative switches the dependency relationship at the cost of requiring graphs to potentially do some polling before the capture.

CC @huydhn @malfet @Aidyn-A @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110665
Approved by: https://github.com/kwen2501
2023-10-20 23:57:43 +00:00
f9d45f63dd [torch] Add LOAD_METHOD_SUPER and LOAD_ATTR_SUPER (#111707)
Summary:
Cinder has two new opcodes which optimize `super()` in classes. This implements
the opcodes for `torch._dynamo`.

Test Plan:
```
buck2 test mode/opt-split-dwarf aps_models/ads/icvr/... -c fbcode.use_cinder_fast_test=true
```
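
For reference, the kind of call these opcodes specialize is an ordinary zero-argument `super()` method call (a sketch):

```
class Base:
    def forward(self, x):
        return x

class Child(Base):
    def forward(self, x):
        # zero-arg super(): compiled to LOAD_METHOD_SUPER/LOAD_ATTR_SUPER under Cinder
        return super().forward(x)
```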

Differential Revision: D50516475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111707
Approved by: https://github.com/jansel
2023-10-20 23:50:42 +00:00
9b499b417e [BE]: Apply subprocess check to github scripts (#111684)
Add subprocess checks to raise exceptions in GitHub scripts.
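
A minimal sketch of the change pattern; the command is illustrative:

```
import subprocess

# before: a non-zero exit status is silently ignored
subprocess.run(["git", "fetch", "origin"])
# after: check=True raises CalledProcessError on failure
subprocess.run(["git", "fetch", "origin"], check=True)
```
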
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111684
Approved by: https://github.com/albanD
2023-10-20 23:37:57 +00:00
43c211facb [quant][pt2e] Actually support transitive sharing for SharedQuantizationSpec (#111172)
Summary:
Previously we did not actually support this; this PR adds the support.

Next:
* clean up the insert-observer logic
* add an allow_transitive_sharing boolean flag to let people turn this off for certain edges

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_shared_qspec_transitivity

Differential Revision: [D50250789](https://our.internmc.facebook.com/intern/diff/D50250789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111172
Approved by: https://github.com/kimishpatel
2023-10-20 23:25:17 +00:00
1ad0f0b308 [BE]: remove unnecessary enumerate calls (#111690)
Remove unnecessary enumerate calls; these are entirely automated fixes, so probably reasonably low risk.
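
The shape of the automated fix, for illustration:

```
items = ["a", "b", "c"]

# before: the index from enumerate is never used
for _i, item in enumerate(items):
    print(item)

# after: iterate directly
for item in items:
    print(item)
```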

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111690
Approved by: https://github.com/malfet
2023-10-20 23:20:29 +00:00
c2a248bdb3 Revert "[ROCm] Unskip functorch tests that now work (#110760)"
This reverts commit 71b35862d3f4ebf0285370d2224b0d0efb118321.

Reverted https://github.com/pytorch/pytorch/pull/110760 on behalf of https://github.com/izaitsevfb due to Lint failure ([comment](https://github.com/pytorch/pytorch/pull/110760#issuecomment-1773490896))
2023-10-20 23:04:49 +00:00
e9422b1fb0 Fix test listing error (#111630)
Summary: Fix fbcode internal test listing error

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --run-disabled

Differential Revision: D50485766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111630
Approved by: https://github.com/desertfire
2023-10-20 23:00:18 +00:00
101210e2ce [dynamo] cast single-elem tensors to float and int (#111518)
Fixes https://github.com/pytorch/pytorch/issues/109538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111518
Approved by: https://github.com/ezyang
2023-10-20 22:53:58 +00:00
079394e9d6 [documentation] adding desc for adaptive_autorange (#111612)
Summary: The missing description prevented `adaptive_autorange` from showing up in the docs.

Test Plan: no functional changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111612
Approved by: https://github.com/cpuhrsch
2023-10-20 22:38:39 +00:00
4c6e85365f Add NVIDIA license to comm_analysis.py (#111670)
We adapted the cost model from NCCL code, so we should apply their license here as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111670
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2023-10-20 21:34:35 +00:00
71b35862d3 [ROCm] Unskip functorch tests that now work (#110760)
This PR unskips some of the working tests that were skipped as a result of https://github.com/pytorch/pytorch/issues/96560.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110760
Approved by: https://github.com/zou3519
2023-10-20 21:33:56 +00:00
303c54dbd9 [dynamo] share a subgraph tracer across fwd and bwd in autograd.Function (#111588)
Fixes https://github.com/pytorch/pytorch/issues/111031

The current design of autograd.Function tracing in dynamo is that we:

1) speculate fwd, and if it's fine,
2) speculate bwd, and if it's fine,
3) install the .apply in the graph alongside fwd guards

The mechanism for doing so involves creating HOPs for fwd, bwd, and apply. The speculation for fwd and bwd create their own subtracer. This is fine, until a proxy created in fwd is used in bwd.

For a simple example, consider:

```
 class Foo(Function):
            @staticmethod
            def forward(ctx, x):
                ctx.x0 = x.size(0)
                return x * 2

            @staticmethod
            def backward(ctx, grad_out):
                return grad_out * ctx.x0
```
the value stored at `x0` is a proxy, but it is a proxy belonging to the fwd speculation subtracer. Rather than teaching the bwd subtracer about it, we choose to create a single subtracer that covers both fwd and bwd speculation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111588
Approved by: https://github.com/zou3519
2023-10-20 21:32:02 +00:00
bdba54fb4d [HigherOrderOp] use assertExpectedInline for control flow tests (#111610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111610
Approved by: https://github.com/zou3519
2023-10-20 21:07:00 +00:00
8ffbc36f8f [Pytorch][Vulkan] Fix the implementation of aten::sum.dim_IntList (#111586)
Summary:
The existing implementation of `aten::sum.dim_IntList` performs the following steps:
- store the items of the argument `opt_dim` in a `std::set<int64_t> dims_set;`
- iterate through `dims_set` in reverse order (i.e. from largest to smallest) and compute the sum for one designated dim in `sum_dim`

But when `opt_dim` contains negative items and `keepdim==false`, the dimension iteration over the set gets messed up. For example, the existing implementation fails at the test case `test_sum_dim({10, 7, 5}, {-1, -2});`.

We fix the issue by invoking `int64_t dim_normalized = utils::normalize(d, self.dim());` to get a normalized dim in the range [0, `self.dim()` - 1].

Moreover, the existing TORCH_CHECK of the condition
```
d >= -self.dim() - 1 && d <= self.dim()
```
is wrong and fixed by
```
d >= -self.dim() && d < self.dim()
```
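
For reference, the CPU equivalent of the previously failing Vulkan test case `test_sum_dim({10, 7, 5}, {-1, -2})`:

```
import torch

x = torch.rand(10, 7, 5)
y = torch.sum(x, dim=[-1, -2])  # negative dims, keepdim defaults to False; y.shape == (10,)
```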

Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (04b08a835)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*sum*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *sum*
[==========] Running 8 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 8 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.cumsum
[       OK ] VulkanAPITest.cumsum (105 ms)
[ RUN      ] VulkanAPITest.sum_invalid_inputs
[       OK ] VulkanAPITest.sum_invalid_inputs (0 ms)
[ RUN      ] VulkanAPITest.sum_dim_2d
[       OK ] VulkanAPITest.sum_dim_2d (145 ms)
[ RUN      ] VulkanAPITest.sum_dim_3d
[       OK ] VulkanAPITest.sum_dim_3d (91 ms)
[ RUN      ] VulkanAPITest.sum_dim_4d
[       OK ] VulkanAPITest.sum_dim_4d (89 ms)
[ RUN      ] VulkanAPITest.sum_dim_keepdim_2d
[       OK ] VulkanAPITest.sum_dim_keepdim_2d (63 ms)
[ RUN      ] VulkanAPITest.sum_dim_keepdim_3d
[       OK ] VulkanAPITest.sum_dim_keepdim_3d (135 ms)
[ RUN      ] VulkanAPITest.sum_dim_keepdim_4d
[       OK ] VulkanAPITest.sum_dim_keepdim_4d (4 ms)
[----------] 8 tests from VulkanAPITest (637 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (637 ms total)
[  PASSED  ] 8 tests.
```

Reviewed By: yipjustin

Differential Revision: D50442152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111586
Approved by: https://github.com/yipjustin
2023-10-20 20:33:06 +00:00
e4e7d34fe9 [pt2][quant] Clean up QAT get conv-bn-relu nodes (#111515)
Summary: Reduces duplicate code to map original matched nodes
to replacement nodes.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT

Reviewers: jerryzh168

Subscribers: jerryzh168, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111515
Approved by: https://github.com/jerryzh168
2023-10-20 20:01:38 +00:00
cc37d8d3f8 [Easy] Fixed typo in init_device_mesh note (#111658)
It has been a while since I landed a PR...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111658
Approved by: https://github.com/H-Huang, https://github.com/wz337
2023-10-20 19:49:38 +00:00
14c2f296e0 Don't suppress original error message for data-dependent value (#111596)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111596
Approved by: https://github.com/suo
2023-10-20 19:38:50 +00:00
ba04d84089 S390x inductor support (#111367)
Use arch compile flags. They are needed for vectorization support on s390x.
Implement new helper functions for inductor.

This change fixes multiple tests in test_cpu_repro.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111367
Approved by: https://github.com/ezyang
2023-10-20 19:38:46 +00:00
8d03a0dd75 [ez] Remove extraneous files (#111668)
Accidentally added by https://github.com/pytorch/pytorch/pull/111504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111668
Approved by: https://github.com/atalman
2023-10-20 19:34:59 +00:00
fdc29f58c6 [TP] Refactor style to make it work with torch.compile (#111625)
We are refactoring the parallel style to address the following:
1. Further simplify the code logic to make it more readable for users.
2. Remove the tuple check so that we can work with dynamo for now. Ideally dynamo needs to support this case, and we will fix it in parallel.
3. Add tests for the newly added parallel style in UT and torch compile tests so that we can catch regressions due to code changes.
4. Move the placements early-return check into DTensor since it is bypassed by dynamo.
5. Remove PairwiseParallelStyle from unit tests in favor of the new Col/Rowwise parallel styles.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111625
Approved by: https://github.com/wanchaol
2023-10-20 19:20:43 +00:00
d1afb7d43d add Half support for multinomial on CPU (#104178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178
Approved by: https://github.com/jgong5, https://github.com/kulinseth, https://github.com/cpuhrsch
2023-10-20 19:16:04 +00:00
d1110a18de [Dynamo]make sure resume function have valid names (#111635)
An ongoing effort for https://github.com/pytorch/pytorch/issues/111633 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111635
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-10-20 18:54:52 +00:00
a55ecec195 [dynamo][__torch_function__ 2/n] Refactor TensorWithTFOverrideVariable (#109556)
This is purely a refactor that preserves the existing behavior and tests.

The main contributions of the PR are to refactor the dispatch of `__torch_function__`, enabling calls with TF override objects in any argument position and matching the eager dispatch behavior.

This will allow for the following in upcoming PRs:

1) have TensorWithTFOverrideVariable inherit from TensorVariable
2) enable tracing through the base `__torch_function__` implementation.

Note: this depends on https://github.com/pytorch/pytorch/pull/109542

towards tracing for https://github.com/pytorch/pytorch/issues/93723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109556
Approved by: https://github.com/jansel, https://github.com/ezyang
2023-10-20 18:53:38 +00:00
11a3c7696b [dynamo - testing] Add repro for higher order op list inputs (#111647)
Add repro from https://github.com/pytorch/pytorch/issues/110118 now that it has been fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111647
Approved by: https://github.com/ezyang
2023-10-20 18:23:23 +00:00
9656ef88b6 [sigmoid] Switch to oss serializer. (#111455)
Differential Revision: D50348807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111455
Approved by: https://github.com/tugsbayasgalan
2023-10-20 18:19:05 +00:00
974c47a20e remove flatten.using_ints, linalg_*, linear, log_softmax.int, logdet, special_* from xfail list (#110985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110985
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2023-10-20 18:15:39 +00:00
8df42f9220 [PyTorch][Vulkan] Allow 0-size tensors to be represented in PyTorch Vulkan (#111512)
Summary:
0-size tensors are allowed in PyTorch (e.g. a tensor with size {2, 1, 0}). However, this currently causes issues with PyTorch Vulkan as the Vulkan API would raise an error when attempting to allocate a resource with no memory.

This diff fixes the behaviour by adding support for `VulkanImage` and `VulkanBuffer` objects that do not have any associated memory.

Test Plan:
Tested locally with `vulkan_api_test` on Mac as a sanity test.
```
buck run //xplat/caffe2:pt_vulkan_api_test_bin --target-platforms ovr_config//platform/macos:x86_64-fbsource -- --gtest_filter="*"
```

But given how foundational of a change this is, more extensive testing should be done in order to be safe.

Reviewed By: yipjustin

Differential Revision: D50030659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111512
Approved by: https://github.com/yipjustin
2023-10-20 17:46:13 +00:00
2452e65960 [BE] More nested namespaces (#111575)
### <samp>🤖 Generated by Copilot at bb8fede</samp>

Simplify the syntax of various namespace definitions and declarations in `aten/src/ATen/cpu` and `aten/src/ATen/metal` files to improve code readability and consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111575
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-10-20 16:43:57 +00:00
a267d95c2a Reland: Add lazy_clone_storage to create COW storages (#111579)
Relands #110192

NOTE: COW storages do not actually copy on write yet, they just have the COW deleter and deleter context applied to them

Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111579
Approved by: https://github.com/ezyang
2023-10-20 15:49:59 +00:00
619ae87a1d Disable inductor layout_opt on ROCm (#111474)
Previously we disabled this option on non-MI200 GPUs (https://github.com/pytorch/pytorch/pull/107812) due to worse NHWC conv performance on some cards. This PR disables the feature for all GPUs to make behavior uniform for ROCm, and due to the perf regressions noted here: https://github.com/pytorch/pytorch/pull/110319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111474
Approved by: https://github.com/jithunnair-amd, https://github.com/eellison
2023-10-20 09:31:01 +00:00
3ca81aed42 Add sdpa to Autocast CPU (#111558)
Fixes #111276

This PR adds sdpa to Autocast CPU policy.

Behavior: Within the scope of `torch.cpu.amp.autocast(dtype=torch.bfloat16)`, `scaled_dot_product_attention` will be forced to run with bf16 data type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111558
Approved by: https://github.com/colesbury
2023-10-20 05:30:09 +00:00
6c56e1ce2b Use fmt::format in NCCLUtils and ProcessGroupNCCL instead of c10::str (#107268)
Fixes #64604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107268
Approved by: https://github.com/fduwjj
2023-10-20 05:26:51 +00:00
37253c0cd5 Update RUFF to 0.1.1 (#111618)
Updates ruff to the latest version with some bugfixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111618
Approved by: https://github.com/colesbury
2023-10-20 04:46:24 +00:00
ff835fb464 [AOTInductor] Disable NonABI tests in fbcode (#111616)
Summary: NonABI mode is not intended to be used in fbcode.

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:test_aot_inductor

Differential Revision: D50478575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111616
Approved by: https://github.com/desertfire, https://github.com/khabinov
2023-10-20 04:37:05 +00:00
e24fdfa177 [vision hash update] update the pinned vision hash (#111624)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111624
Approved by: https://github.com/pytorchbot
2023-10-20 04:09:18 +00:00
93a9b1314b Make step() faster by passing in a tensor vs scalar 1 (#111084)
This is the culminated result of https://github.com/pytorch/pytorch/pull/110954#issuecomment-1758520411.

We are making the code slightly more complicated to gain some perf by minimizing calls to `.copy_()` and `.to()`.

### Code
```
import torch
with torch.cuda.device(0):
    steps = [torch.zeros((), device="cpu", dtype=torch.float32) for i in range(1000)]

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ]
    ) as p:
        # New code:
        # step_device = steps[0].device
        # one = torch.tensor(1.0, device=step_device) if str(step_device) == "cpu" else 1
        # torch._foreach_add_(steps, one, 1.0)

        # Old code:
        torch._foreach_add_(steps, 1)

    print(p.key_averages().table(sort_by="cpu_time_total"))
```

### Profiles
**with old code**
```
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
      aten::_foreach_add_        35.31%      52.089ms        99.99%     147.495ms     147.495ms             1
               aten::add_        25.05%      36.949ms        64.68%      95.406ms      95.406us          1000
                 aten::to         3.97%       5.852ms        39.63%      58.457ms      58.457us          1000
           aten::_to_copy        10.11%      14.917ms        35.66%      52.605ms      52.605us          1000
              aten::copy_        21.65%      31.939ms        21.65%      31.939ms      31.939us          1000
      aten::empty_strided         3.90%       5.749ms         3.90%       5.749ms       5.749us          1000
    cudaDeviceSynchronize         0.01%      18.000us         0.01%      18.000us      18.000us             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 147.513ms
```

**with new code**
```
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
      aten::_foreach_add_        55.06%      49.963ms        99.86%      90.625ms      90.625ms             1
               aten::add_        44.81%      40.662ms        44.81%      40.662ms      40.662us          1000
            aten::detach_         0.01%       8.000us         0.05%      45.000us      45.000us             1
                  detach_         0.04%      37.000us         0.04%      37.000us      37.000us             1
              aten::empty         0.03%      30.000us         0.03%      30.000us      30.000us             1
                 aten::to         0.03%      23.000us         0.03%      23.000us      23.000us             1
    cudaDeviceSynchronize         0.02%      22.000us         0.02%      22.000us      22.000us             1
         aten::lift_fresh         0.01%       6.000us         0.01%       6.000us       6.000us             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 90.751ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111084
Approved by: https://github.com/albanD
ghstack dependencies: #111079
2023-10-20 01:34:08 +00:00
ca7d084ff9 Add ScalarTensor or 0dim overload for _foreach_add (#111079)
Adding a Tensor overload will allow us to:
- optimize in more cases than before
- increase coverage for scalarTensor instead of just scalars in our foreach APIs

The main complication in this PR was that add.Tensor has a scalar overload, so I've now built out support for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111079
Approved by: https://github.com/albanD
2023-10-20 01:34:07 +00:00
935f697754 remove movedim.intlist, tensor_split*, to.* from xfail list (#110999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110999
Approved by: https://github.com/kshitij12345
2023-10-19 23:54:45 +00:00
652f4c656e Freeze fuse two mms (#111232)
Improves llama_v2 perf locally from 1.48x -> 1.55x.

A good future rewrite would be to unify the freezing batching with the other batching rules that @yanboliang & co were working on. I want to wait for the forthcoming pre-dispatch changes to settle down first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111232
Approved by: https://github.com/Chillee
2023-10-19 22:52:34 +00:00
cb856b08b2 [BE]: Attach cause to some exceptions and enable RUFF TRY200 (#111496)
Did some easy fixes from enabling TRY200. Most of these seem like oversights instead of intentional. The proper way to silence intentional errors is with `from None` to note that you thought about whether it should contain the cause and decided against it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
2023-10-19 21:56:36 +00:00
c90f8c883d [ONNX][s390x] byteswap data when serializing to external files during onnx exporting (#111543)
This patch is a complement to #107963: it byteswaps data when exporting to ONNX, swapping bytes from big endian to little endian when writing the external files of a big ONNX model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111543
Approved by: https://github.com/justinchuby
2023-10-19 21:44:39 +00:00
8899abde32 [PyTorch][ET] Improve Process Groups Mapping Info Collection (#110908)
Summary:
Process Groups Mapping info collection was introduced in D46321690.

Improve the mapping info collected there:
- replace pg_id (a unique ID for the PG object) with pg_names (a unique name for each pg, shared by all ranks)
- add process-group count info with group_count
- reduce the length of pg_config_info to avoid it being truncated (max length of 4096, now doubled) by
  - replacing ranks (a map from global ranks to group ranks) with the list of global ranks of a pg, since we currently don't use the group rank id
  - using an empty rank list to indicate that all ranks are involved in a pg, and adding a group_size field to show how many ranks are involved

Test Plan:
Tested in HPC
```
buck2 run mode/opt //hpc/torchrec/models/ads:cmf_10x_launcher -- launcher=local data_loader=random data_loader.num_batches=100 checkpoint=model_store max_ind_range=10 launcher.num_trainers=8
```
Example output in ET
```
{
"name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": ["[{\"pg_name\": \"0\", \"backend_id\": 140688385794048, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"1\", \"backend_id\": 140688386762752, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"2\", \"backend_id\": 140682531798720, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"faa29c0b1e06cd7abc873bd561414911_0\", \"backend_id\": 140672678002688, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"3\", \"backend_id\": 140672678007616, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"faa29c0b1e06cd7abc873bd561414911_1\", \"backend_id\": 140672678012544, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}]"], "input_shapes": [[]], "input_types": ["String"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
```

Before the change, pg_config_info for >128 ranks would be truncated, e.g.
```
"inputs": ["[{\"pg_id\": 140321146893696, \"backend_id\": 140321113854976, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\": 114, \"115\": 115, \"116\": 116, \"117\": 117, \"118\": 118, \"119\": 119, \"120\": 120, \"121\": 121, \"122\": 122, \"123\": 123, \"124\": 124, \"125\": 125, \"126\": 126, \"127\": 127}}, {\"pg_id\": 140321074662400, \"backend_id\": 140321100033024, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\": 114, \"115\": 115, \"116\": 116, \"117\": 117, \"118\": 118, \"119\": 119, \"120\": 120, \"121\": 121, \"122\": 122, \"123\": 123, \"124\": 124, \"125\": 125, \"126\": 126, \"127\": 127}}, {\"pg_id\": 140321154994304, \"backend_id\": 140319780290048, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 
5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\""], "input_shapes": [[]], "input_types": ["String"],

```
After the change, the length is reduced:
```
"inputs": ["[{\"pg_name\": \"0\", \"backend_id\": 140551405059072, \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"1\", \"backend_id\": 140551399745536, \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"2\", \"backend_id\": 140578999821184, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"ea2f9024c70c8b9a25bc06a4723e5805_0\", \"backend_id\": 140559197777152, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"3\", \"backend_id\": 140549119076736, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"ea2f9024c70c8b9a25bc06a4723e5805_1\", \"backend_id\": 140571995143424, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}]"], "input_shapes": [[]], "input_types": ["String"],
```

Reviewed By: louisfeng, fduwjj

Differential Revision: D50048147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110908
Approved by: https://github.com/fduwjj
2023-10-19 21:37:19 +00:00
675df7520a [tgif][multiforward] allow codegen to generate different func name (#111446)
Summary: see Shiyan's design doc for ATM TS publish weights dedupe https://fb.quip.com/HnUVAjUMaXMQ

Test Plan: tested in N4454041 (after D50341352) that the multiforward method works for the TS model

Differential Revision: D45750812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111446
Approved by: https://github.com/842974287
2023-10-19 21:19:30 +00:00
f0fac6a94f Update gloo submodule commit to include recent ROCm6.0 related updates (#111465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111465
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-10-19 21:18:23 +00:00
7a3c3d63bf fix gloo cuda sparse_allreduce dispatch (#111485)
Fixes #111422

allreduce_sparse_cuda gets dispatched to allreduce_sparse, which doesn't exist for gloo. However, gloo has an existing implementation, so this just fixes the dispatching to point to it.

The reason CI didn't catch this is that we were calling the backend directly. Added a test which calls the public API (dist.XYZ) and goes through the dispatcher.
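
A hedged sketch of the public-API path the new test exercises (hypothetical helper; assumes a gloo process group initialized via torchrun and a CUDA build):

```python
import torch
import torch.distributed as dist

def sparse_allreduce_demo():
    # Going through dist.all_reduce hits the dispatcher, which is where the
    # sparse-CUDA-to-gloo routing was broken before this fix.
    t = torch.sparse_coo_tensor([[0, 1]], [1.0, 2.0], (4,), device="cuda")
    dist.all_reduce(t)
    return t
```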

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111485
Approved by: https://github.com/fduwjj
2023-10-19 21:15:45 +00:00
dc31dbbcab Optimize reduction + amax fusion (#111122)
This PR optimizes fusion for cases like layer_norm + fp8 quant (which includes the amax computation and the fp8 cast) when amax is split into multiple reduction kernels.

Benchmark:
```
python test/inductor/test_fp8.py -k test_layernorm_fp8_quant_benchmark

Before this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.13262102689486555ms, Eager: 0.8211962616822429ms, LN only Inductor: 0.09606276150627614ms.

After this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.08281274131274131ms, Eager: 0.8217452830188678ms, LN only Inductor: 0.09586902286902287ms.
```

LN + fp8 quant is even faster than LN itself. The reason could be that LN + fp8 outputs fp8 while LN outputs fp16.
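
For orientation, a minimal sketch of the fused pattern (shapes, dtypes, and the scaling scheme are assumptions, not taken from the test):

```python
import torch

def ln_fp8_quant(x, weight, bias):
    y = torch.nn.functional.layer_norm(x, x.shape[-1:], weight, bias)
    amax = y.abs().amax()  # the extra reduction that used to split into kernels
    scale = torch.finfo(torch.float8_e5m2).max / amax.clamp(min=1e-12)
    return (y * scale).to(torch.float8_e5m2), amax
```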

From Inductor nightly benchmark test:
There are perf differences in cuda_graph / cuda_graph_dynamic / default runs, but no difference in inductor_max_autotune. So it seems to me that the perf differences are most likely fluctuations.

![Screenshot 2023-10-18 at 4 58 55 PM](https://github.com/pytorch/pytorch/assets/10527447/6640474a-1e1d-4d33-97e9-0a60d0bc9f1f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111122
Approved by: https://github.com/jansel
2023-10-19 20:53:50 +00:00
786c51d626 Symintify torch.diff (#111530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111530
Approved by: https://github.com/bdhirsh, https://github.com/ezyang
ghstack dependencies: #111529
2023-10-19 20:38:57 +00:00
74f6f7adcf Fix NT subclass test typo (#111529)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111529
Approved by: https://github.com/jbschlosser
2023-10-19 20:07:04 +00:00
79529ef657 [dynamo] fix graph break when listlike of tensor contains const (#111572)
Fixes https://github.com/pytorch/pytorch/pull/111557#discussion_r1365620968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111572
Approved by: https://github.com/voznesenskym, https://github.com/lezcano
2023-10-19 19:51:28 +00:00
2a40b7efcb Add Half support for addcmul, addcdiv, cumsum, and topk on CPU (#103319)
Add Half support for addcmul, addcdiv, cumsum, and topk on CPU.
Note: This PR will introduce the issue https://github.com/pytorch/pytorch/issues/111454.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103319
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-10-19 17:47:45 +00:00
715dfced72 Revert "Nvfuser code removal (#111093)"
This reverts commit 572628e52054b0e061fbaeb0497267380fe45180.

Reverted https://github.com/pytorch/pytorch/pull/111093 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, @albanD please help to support the author with the next steps to get this diff merged ([comment](https://github.com/pytorch/pytorch/pull/111093#issuecomment-1771434853))
2023-10-19 17:39:49 +00:00
ca5f6f7af3 [MPS] Skip virtualized devices (#111576)
Skip devices that do not support `MTLGPUFamilyMac2`, for example something called "Apple Paravirtual device", which started to appear in GitHub CI, from https://github.com/malfet/deleteme/actions/runs/6577012044/job/17867739464#step:3:18
```
Found device Apple Paravirtual device isLowPower false supports Metal false
```

As the first attempt to allocate memory on such a device fails with:
```
RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 1.70 GB). Tried to allocate 0 bytes on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
```

Fixes https://github.com/pytorch/pytorch/issues/111449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111576
Approved by: https://github.com/atalman, https://github.com/clee2000, https://github.com/huydhn
2023-10-19 17:19:35 +00:00
0617f7fa75 [ez] Remove unused code in upload_test_stats (#111504)
This is code related to parallelism and test times that isn't used, so remove it.

Tested by running locally with `python3 -m tools.stats.upload_test_stats --workflow-run-id 6551035874 --workflow-run-attempt 1 --head-branch main --head-repository "pytorch/pytorch"` and commenting out parts for uploading to s3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111504
Approved by: https://github.com/huydhn
2023-10-19 16:09:15 +00:00
4e310fd875 [Autograd] Track when mutations are for triton kernels (#111500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111500
Approved by: https://github.com/bdhirsh
2023-10-19 15:34:34 +00:00
971f67c988 Allow SymInt to specialize to FLOAT (#111219)
Fixes https://github.com/pytorch/pytorch/issues/111200

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111219
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #111216
2023-10-19 12:55:18 +00:00
40c44c2307 Force specialization on INT_LIST (#111216)
Follow up on https://github.com/pytorch/pytorch/pull/95479

Fixes https://github.com/pytorch/pytorch/issues/111198

Fixes https://github.com/pytorch/pytorch/issues/111197

Fixes https://github.com/pytorch/pytorch/issues/111188

Fixes https://github.com/pytorch/pytorch/issues/111201

Fixes https://github.com/pytorch/pytorch/issues/111202

I can also do this for some other types, will do this stacked on top.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111216
Approved by: https://github.com/voznesenskym
2023-10-19 12:55:18 +00:00
aa3243bceb [vmap] symintify : is_same_size and split_with_sizes (#111491)
Partial : https://github.com/pytorch/pytorch/issues/111312

Reference: Point 1 of https://github.com/pytorch/pytorch/issues/111312#issuecomment-1769079147

Should this have a test?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111491
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2023-10-19 11:04:40 +00:00
03e28bde2e [tp] fix torch compile regression (#111521)
The most recent refactor of TP
(https://github.com/pytorch/pytorch/pull/111160) breaks the torch.compile
path, so revert to the previous behavior by:
1. using the old default prepare_input/output
2. adding colwise/rowwise parallel tests instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111521
Approved by: https://github.com/fduwjj
2023-10-19 10:27:10 +00:00
eqy
894b9957c8 [DOCS][CUDA] Update TF32 docs for sm90 (#111337)
For #110252.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111337
Approved by: https://github.com/msaroufim
2023-10-19 09:36:13 +00:00
503f44fbb8 Fix: preserve input's NaN values to prevent undefined behavior in the matrix_exp function (#111539)
Currently, if the input matrices (small batches) contain NaN values, `matrix_exp` keeps producing a "normal" result without any NaN values in it, which can cause problems we may not notice. This PR prevents such undefined behavior by "bringing back" those NaN values.
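
A hypothetical repro (not from the PR) of the behavior expected after the fix:

```python
import torch

a = torch.randn(2, 3, 3)
a[0, 0, 0] = float("nan")
out = torch.matrix_exp(a)
assert out[0].isnan().any()      # the NaN input must surface in the output
assert not out[1].isnan().any()  # the clean batch element stays finite
```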

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111539
Approved by: https://github.com/lezcano
2023-10-19 09:07:36 +00:00
90e2117a99 Allow optimizer state conversion to accommodate optimizers that have no tensor state (e.g. SGD) (#111501)
Fixes #111499

This PR slightly alters the new fused `all_gather` `optim_state_dict` implementation to support optimizers without tensor state (e.g. SGD) in a `use_orig_params=True` context.

The principal change is to short-circuit `_allgather_orig_param_states` if an empty `state_buffers` dict is returned after completing `_convert_all_state_info` here:
93e5065ba0/torch/distributed/fsdp/_optim_utils.py (L1481-L1484)

To allow `_convert_all_state_info` to accommodate optimizers with no tensor state, I also change the scope of `dtype` and make the return type `Optional`.

As discussed in the issue this PR fixes, I'm [extending](93e5065ba0/test/distributed/fsdp/test_fsdp_optim_state.py (L1915I)) `test_state_dict_with_none_tensor_state` to test with both Adam and SGD optimizers to validate scalar and non-tensor states continue to be restored for both optimizer types.

Thanks to the distributed team as always for their adroit design and exceptionally valuable contributions to the open source ML community. Hope you all feel appreciated commensurate with the compounding progress your work enables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111501
Approved by: https://github.com/fegin
2023-10-19 06:47:04 +00:00
5ce2ab8466 [cuda] Preserve operations order between vectorized and non-vectorized in ln grad input (#111488)
The vectorized implementation in https://github.com/pytorch/pytorch/pull/111021 changed the order of arithmetic instructions in `layer_norm_grad_input`, causing non-bitwise-identical results when compared to the non-vectorized implementation. At merging, all accuracy checks passed, including internal inductor ones.

There are CI periodic inductor dynamo tests (e.g. `pit_b_224`) that run eager mode models several times and compare results. If the input buffers are aligned to the vector length, the vectorized implementation will be used. If not, the default one will be used. If the 2 eager runs end up having different buffer alignments, 2 implementations will be called and then the results would be very close but not bitwise identical. The tests check for bitwise identical results and in some cases they may fail.

This fix makes sure that the operation order between non-vectorized and vectorized is the same and the 2 implementations **should** produce bitwise identical results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111488
Approved by: https://github.com/malfet
2023-10-19 06:00:15 +00:00
b2b5f1377b [caffe2] replace numpy.object with object (#111494)
Reviewed By: florazzz

Differential Revision: D50380126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111494
Approved by: https://github.com/Skylion007
2023-10-19 04:37:00 +00:00
e3463fe4ca [ONNX] Benchmark to store test data along exported model (#111095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111095
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-10-19 03:20:52 +00:00
71d7173ab3 Introduce is_big_gpu condition for test_max_autotune (#111467)
Fixes https://github.com/pytorch/pytorch/issues/111527

Other test files that rely on `max_autotune` mode being enabled already condition the UT suite on this check (e.g. test_select_algorithm). Proposing to add this condition for test_max_autotune.

Currently we are observing failures in these UTs on the ROCm runners, but on MI200+ these tests pass again (context: https://github.com/pytorch/pytorch/pull/111381#issuecomment-1768048732)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111467
Approved by: https://github.com/shunting314
2023-10-19 03:05:22 +00:00
4ec777e9a5 [BE] Clean up trymerge code handling broken trunk failures (#111520)
This is the final part of https://github.com/pytorch/pytorch/pull/110054.  The broken trunk classification has been done on Dr.CI, so we can just check for that in trymerge for consistency when ghstack is used.

* [x] https://github.com/pytorch/pytorch/pull/110054
* [x] https://github.com/pytorch/pytorch/pull/110133
* [x] This PR to clean up the broken trunk logic.

One important change is that `get_classifications` doesn't need to query the jobs from Rockset for the head and merge base SHA anymore, saving a query there.  The function looks a lot simpler now.

### Testing

https://github.com/pytorch/pytorch/pull/111253 had 1 broken trunk failure as detected by Dr.CI from the base commit 3eb5cae3af (valid) while trymerge didn't detect that because ghstack base commit be8e517174 didn't have the same failure (miss).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111520
Approved by: https://github.com/clee2000
2023-10-19 02:30:56 +00:00
4f0cf1e1ff Mark more decomp tests as slow (#111524)
Something is broken with automatic slow detection, so let's do it manually

Those tests were previously classified as slow, see:
```
test_decomp.py::TestDecompCUDA::test_quick_core_backward_baddbmm_cuda_float64 SKIPPED [0.0003s] (test is slow; run with PYTORCH_TEST_WITH_SLOW to enable test) [ 53%]
test_decomp.py::TestDecompCUDA::test_quick_core_backward_clamp_max_cuda_float64 SKIPPED [0.0002s] (test is slow; run with PYTORCH_TEST_WITH_SLOW to enable test) [ 53%]
test_decomp.py::TestDecompCUDA::test_quick_core_backward_clamp_min_cuda_float64 SKIPPED [0.0002s] (test is slow; run with PYTORCH_TEST_WITH_SLOW to enable test) [ 53%]
```
from https://ossci-raw-job-status.s3.amazonaws.com/log/17792633247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111524
Approved by: https://github.com/kit1980, https://github.com/izaitsevfb, https://github.com/huydhn
2023-10-19 02:29:59 +00:00
18cc8a92ac [ProcessGroupNCCL] Avoid recording stream for synchronous ops (#111431)
For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way.

To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`).

This mechanism change is for libraries like FSDP, which use `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. Therefore, this change is limited to these two collectives for now.
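
A minimal sketch of the synchronous usage in question (hypothetical helper; assumes an NCCL process group initialized e.g. via torchrun):

```python
import torch
import torch.distributed as dist

def sync_all_gather(local: torch.Tensor) -> torch.Tensor:
    out = torch.empty(dist.get_world_size() * local.numel(),
                      dtype=local.dtype, device=local.device)
    # async_op=False: the NCCL stream joins back to the current stream here,
    # so input/output keep their original stream ownership.
    dist.all_gather_into_tensor(out, local, async_op=False)
    return out
```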

Cc: @awgu @janeyx99 @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431
Approved by: https://github.com/H-Huang
2023-10-19 00:41:09 +00:00
a7883ee470 Bump urllib3 from 2.0.6 to 2.0.7 in /tools/build/bazel (#111435)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.0.6 to 2.0.7.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.0.6...2.0.7)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-18 17:14:06 -07:00
547a116fcf Fix redundant asserts (#111445)
Fixes: https://github.com/pytorch/pytorch/issues/109852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111445
Approved by: https://github.com/zhxchen17
2023-10-18 23:57:31 +00:00
ba2ba9621c More NT subclass op support for SAM (#111253)
With this PR, we have full op support for SAM without needing to unwrap subclass into jagged buffer -> run ops -> rewrap manually. Specifically, this was previously happening in the MaskDecoder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111253
Approved by: https://github.com/soulitzer, https://github.com/cpuhrsch
2023-10-18 21:21:28 +00:00
53c1dca6a3 [Reland] Add a workflow to release Android binaries (#110976)
This adds 2 jobs to build PyTorch Android with and without lite interpreter:

* Keeps the list of currently supported ABIs: armeabi-v7a, arm64-v8a, x86, x86_64
* Passes all the tests on the emulator
* Ran the test app on an emulator and on my Android phone (`arm64-v8a`) without any issue
![Screenshot_20231010-114453](https://github.com/pytorch/pytorch/assets/475357/57e12188-1675-44d2-a259-9f9577578590)
* Ran on AWS https://us-west-2.console.aws.amazon.com/devicefarm/home#/mobile/projects/b531574a-fb82-40ae-b687-8f0b81341ae0/runs/5fce6818-628a-4099-9aab-23e91a212076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110976
Approved by: https://github.com/atalman
2023-10-18 21:17:11 +00:00
a771fde8b1 Update the magma to version 2.7.2 (#111442)
- 2.7.2 version + few ROCm related commits: https://bitbucket.org/icl/magma/pull-requests/37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111442
Approved by: https://github.com/Skylion007, https://github.com/jithunnair-amd
2023-10-18 21:09:05 +00:00
102fbd402c [ci] Move step to get workflow job id before test step in linux (#111483)
We’ve been struggling to get the job id since 9/28/2023 12:03 pm. Before this we had almost 0 problems getting the job id, but after, we get a lot of `Recieved status code '502' when attempting to retrieve https://api.github.com/repos/pytorch/pytorch/actions/runs/6551579728/jobs?per_page=100:\n", 'Bad Gateway\n\nheaders=Server: GitHub.com\nDate: Tue, 17 Oct 2023 20:32:52 GMT\nContent-Type: application/json\nContent-Length: 32\nETag: "652eed15-20"\nVary: Accept-Encoding, Accept, X-Requested-With\nX-GitHub-Request-Id: EC62:7EE0:166AAF5:2D51A8E:652EEF6A\nconnection: close\n\n` ex https://github.com/pytorch/pytorch/actions/runs/6551579728/job/17793898278#step:18:22

Recently, it has been happening around 1/4 of the time, possibly more. I think this happens almost only on linux.

I believe this is somehow caused by a test, since distributed tests seem to be disproportionately affected, so I moved the step to get the job id before the test step. This also has the benefit that the test step can now get the job id if we want it.

Regardless of whether this works or not, it's a pretty harmless change that might make things easier in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111483
Approved by: https://github.com/huydhn
2023-10-18 20:54:06 +00:00
9c7391ea36 Revert " [1/N] Apply clang-tidy to c10 cuda files (#111137)"
This reverts commit 43b023694eea4348fa28e8028fa7445d6375860c.

Reverted https://github.com/pytorch/pytorch/pull/111137 on behalf of https://github.com/malfet due to Was reverted internally due to the failures in torch.cuda.memory_stats(device=0) (presumably) ([comment](https://github.com/pytorch/pytorch/pull/111137#issuecomment-1769274103))
2023-10-18 20:32:53 +00:00
7fabb73dae Add ciflow/rocm label to run ROCm jobs (#111394)
Fixes https://github.com/pytorch/test-infra/issues/4516.  As this is not part of trunk, it won't block regular merge.  On the other hand, we can still add `ciflow/rocm` to run it on PR.

~~I'll add an auto label rule for this after this is merged and the label becomes available~~ Here it is https://github.com/pytorch/test-infra/pull/4647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111394
Approved by: https://github.com/ZainRizvi
2023-10-18 20:28:13 +00:00
16cb3bdd57 Skip test_quick_core_backward_baddbmm_cuda_float64 (#111493)
As it's painfully slow (10+ min on A100):
```shell
$ time python3 test_decomp.py -v -k test_quick_core_backward_baddbmm_cuda_float64
Fail to import hypothesis in common_utils, tests are not derandomized
test_quick_core_backward_baddbmm_cuda_float64 (__main__.TestDecompCUDA) ... ok

----------------------------------------------------------------------
Ran 1 test in 897.523s

OK

real	15m4.773s
user	15m0.207s
sys	0m6.492s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111493
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-10-18 20:09:14 +00:00
93e5065ba0 [CODEMOD][caffe2] replace numpy.bool with bool (#111432)
Test Plan:
numpy.bool has long been deprecated and was removed starting with numpy-1.20.0 [1]. This replaces all references with the equivalent `bool` type using the following one-liner:
```
rg -l 'np\.bool' caffe2 | grep '\.py$' | xargs perl -pi -e 's,\bnp\.bool\b,bool,'
```
1. https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Differential Revision: D50372711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111432
Approved by: https://github.com/Skylion007
2023-10-18 18:56:40 +00:00
fa995626a8 [ROCm] Bump kineto submodule commit to clear kineto cache to avoid memory leaks (#110849)
Fixes #103999

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110849
Approved by: https://github.com/Skylion007
2023-10-18 17:34:03 +00:00
256a5ff49d int4 mm kernel enhancement (#111460)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111460
Approved by: https://github.com/Chillee
2023-10-18 17:19:52 +00:00
b72a1402f5 [AOTInductor] ProxyExecutor skips serializing missing args with default value (#111425)
Summary: In AOTInductor's ABI-compatible mode, we don't serialize missing args with default values.

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Differential Revision: D50345729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111425
Approved by: https://github.com/angelayi
2023-10-18 17:10:42 +00:00
543dc75746 [Reland] horizontal concat fusion (#111437)
Reland https://github.com/pytorch/pytorch/pull/108115

The main fix is to disallow nop nodes from being included in foreach scheduler nodes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111437
Approved by: https://github.com/yanboliang
2023-10-18 17:09:01 +00:00
3eb5cae3af Revert "[Compiled Autograd] Turn accumulate_grad into an op (#111271)"
This reverts commit 04b04c068659127a53d659c44b0dd75fa9fd5887.

Reverted https://github.com/pytorch/pytorch/pull/111271 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111271#issuecomment-1768527932))
2023-10-18 14:02:34 +00:00
0be90c5d7f Revert "[Compiled Autograd] Error if tensor_post_acc_grad_hooks is set (#111273)"
This reverts commit cba0dd0fdcdc550005976fd4af6fd3c70f4ddb3c.

Reverted https://github.com/pytorch/pytorch/pull/111273 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111273#issuecomment-1768522328))
2023-10-18 14:00:30 +00:00
a389e2c7c7 Revert "[inductor] Move inductor ops to CompositeExplicitAutograd (#111274)"
This reverts commit 8b46a106f254fd860a4b7b99c8bb640ba58cb176.

Reverted https://github.com/pytorch/pytorch/pull/111274 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111274#issuecomment-1768517555))
2023-10-18 13:57:23 +00:00
ed7739d690 Revert "[aot_inductor] return a copy of any constant (#111356)"
This reverts commit 71e1f34923af186dff46a8641c977a1cf507e06c.

Reverted https://github.com/pytorch/pytorch/pull/111356 on behalf of https://github.com/jeanschmidt due to Breaking internal ci ([comment](https://github.com/pytorch/pytorch/pull/111356#issuecomment-1768503640))
2023-10-18 13:51:30 +00:00
08f580d498 Revert "[inductor] Refactor and optimize allocation calls (#111117)"
This reverts commit 9ce0ae836d6801a39776897b9e891cd978b28aea.

Reverted https://github.com/pytorch/pytorch/pull/111117 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111117#issuecomment-1768489865))
2023-10-18 13:45:02 +00:00
a4391f085b Add regression test for cuda_stream type checks (#111430)
Reported in https://github.com/pytorch/pytorch/issues/111268
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111430
Approved by: https://github.com/huydhn
ghstack dependencies: #111428
2023-10-18 07:24:01 +00:00
e2f1d03d73 [BE] Use C10_UNUSED (#111439)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 21e87dc</samp>

> _We're sailing on the sea of code, with warnings to avoid_
> _We use the `C10_UNUSED` macro for variables unexploited_
> _We heave and ho and pull and push, and make the code more neat_
> _We sing this shanty as we go, to keep us in good spirits_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111439
Approved by: https://github.com/huydhn
2023-10-18 04:54:47 +00:00
1ac36dbd2a [aotinductor] Make writing of the weight files to be conditional (#111379)
Summary: Since we cache the AOTInductor-generated library file, we should not need to write the weights as a binary file if the library file already exists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111379
Approved by: https://github.com/chenyang78
2023-10-18 04:52:36 +00:00
108378e2af Fix: torch.matrix_exp performance issue (#105225) (#110848)
Fixes #105225

- New implementation for `compute_T18_scale_square` method.
- Always use the highest degree for large batch sizes (size > 1).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110848
Approved by: https://github.com/lezcano
2023-10-18 04:43:25 +00:00
a9b3afd3d8 [aotinductor] Refactor the generated result (#111080)
Summary: Return the compiled library path as a string instead of wrapping it in a callable.

Differential Revision: [D50246941](https://our.internmc.facebook.com/intern/diff/D50246941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111080
Approved by: https://github.com/jansel, https://github.com/chenyang78
2023-10-18 04:35:34 +00:00
e9a51a6a07 [BE] Revive test_typing (#111428)
`test_typing.py` was written to use `pytest` in https://github.com/pytorch/pytorch/pull/54234 which unfortunately rendered it incompatible with run_test.py, and therefore it was not running in CI all this time.

In this PR, same functionality is re-written using unittest framework, and `parametrize` from `torch.testing._internal._common_utils`.

Validated `test_typing.py` with ufmt

Disabled `fail/bitwise_ops.py` and `pass/jit.py`, as they regressed at some point, as well as one of the examples in `namedtuple.py`, since the `torch.linalg.qr` type is no longer revealed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111428
Approved by: https://github.com/clee2000
2023-10-18 02:19:49 +00:00
572628e520 Nvfuser code removal (#111093)
Removes the existing integration code & build of nvfuser in TorchScript.

Note that I intentionally left out the part where we wipe the `third_party/nvfuser` repo; I'll do that in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111093
Approved by: https://github.com/albanD
2023-10-18 01:00:47 +00:00
0b14ec8ca6 [ONNX] Add dynamo_onnx_aot_inline to bench (#110183)
An option that applies onnx.inliner after model export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110183
Approved by: https://github.com/thiagocrepaldi
2023-10-18 00:43:04 +00:00
eafce2394d [pytorch-vulkan] aten::floor_divide (#110785)
Summary:
As titled; tensor-scalar only.

This diff does not include the element-wise tensor-tensor operation.

Test Plan:
```
[yipjustin@33167.od ~/fbsource (9cfca7c97)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*floor_divide_scalar*"
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Buck UI: https://www.internalfb.com/buck2/bcac40be-79af-47c5-bd3f-95c11179aa68
Network: Up: 29MiB  Down: 264MiB  (reSessionID-2fef8b89-76b0-4496-bb27-b10d42cf7ef4)
Jobs completed: 5196. Time elapsed: 45.9s.
Cache hits: 81%. Commands: 2070 (cached: 1672, remote: 375, local: 23)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *floor_divide_scalar*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.floor_divide_scalar
[       OK ] VulkanAPITest.floor_divide_scalar (150 ms)
[ RUN      ] VulkanAPITest.floor_divide_scalar_inplace
[       OK ] VulkanAPITest.floor_divide_scalar_inplace (39 ms)
[----------] 2 tests from VulkanAPITest (189 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (190 ms total)
[  PASSED  ] 2 tests.
```

Differential Revision: D50001740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110785
Approved by: https://github.com/SS-JIA
2023-10-17 23:02:35 +00:00
2dc1726ab7 Compile NestedTensor with AOTAutograd (#110529)
This PR has a number of changes that improve subclass support for AOTAutograd/Inductor in general:
- Previously, if a subclass did extra aliasing between graph outputs/inputs, the partitioner would complain because grad_outputs are the outputs reused as-is. Now we do a view_as(self) to work around this.
- Use dense -> dense metadata when working with fwd_output_strides during backward. This is important since the stride information comes from inductor, which sees the dense-to-dense graph.
- Inductor requires the inputs to the compiled backward to match the expected strides computed during compilation. We make sure to make the inner tensors of the subclass contiguous (previously, we only made the subclass itself contiguous)

Changes specific to NestedTensor relevant to compilation:
- Properly handle the case where `__tensor_unflatten__` is passed non-symbolic dense tensors and with meta extracted from fake subclasses.
- Skip var_to_range logic for singleton int
- Skip size hint logic in inductor for singleton int

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110529
Approved by: https://github.com/bdhirsh
2023-10-17 21:17:10 +00:00
e708de83b9 [4/N] Reorder VariableBuilder._wrap (#111409)
Reorganize the priority inside of ```VariableBuilder._wrap```:
* is_allowed returning True -> TorchVariable
* skipfiles.check returning True -> SkipFilesVariable
* UserFunctionVariable/UserMethodVariable (this means both is_allowed and skipfiles.check returned False, so we inline by default)
* UserDefinedClassVariable
* UserDefinedObjectVariable (the ultimate default value)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111409
Approved by: https://github.com/jansel
2023-10-17 21:12:34 +00:00
41490119f2 Revert "[sparse] semi-structured sparse + torch.compile support (#111049)"
This reverts commit 408f210938176870133a3dde5e8fbc4926cafbc0.

Reverted https://github.com/pytorch/pytorch/pull/111049 on behalf of https://github.com/clee2000 due to Sorry I'm pretty sure this caused a memory leak 408f210938 https://github.com/pytorch/pytorch/actions/runs/6550388354/job/17790615103 `test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_mlp_contiguous_relu_compile_backend_cutlass_dense_input_shape_(1, 128)_cuda - RuntimeError: CUDA driver API confirmed a leak in __main__.TestSparseSemiStructuredCUDA.test_mlp_contiguous_relu_compile_backend_cutlass_dense_input_shape_(1, 128)_cuda! Caching allocator allocated memory was 235008 and is now reported as 352256 on device 0. CUDA driver allocated memory was 359333888 and is now 361431040.` ([comment](https://github.com/pytorch/pytorch/pull/111049#issuecomment-1767186569))
2023-10-17 21:11:09 +00:00
17002d25c5 [export] Remove call_spec argument from ExportedProgram ctor. (#111407)
Summary: call_spec arg is not used anymore.

Test Plan: CI

Reviewed By: SherlockNoMad, tugsbayasgalan

Differential Revision: D50335365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111407
Approved by: https://github.com/izaitsevfb
2023-10-17 21:01:37 +00:00
2bb1692334 fix dict size change during iteration (#111267)
Summary:
_wrapped_fns_to_patch points to f_globals which might change during iteration due to factors like lazy imports. This diff fixes potential runtime errors like:

```
RuntimeError: dictionary changed size during iteration
```
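
A minimal illustration of the failure mode and the usual snapshot fix (hypothetical, not the PR's exact diff):

```python
globals_dict = {"a": 1}

def lazy_import():
    globals_dict.setdefault("b", 2)  # e.g. a lazy import mutating f_globals

# for name in globals_dict:      # RuntimeError: dictionary changed size during iteration
#     lazy_import()

for name in list(globals_dict):  # iterate over a snapshot instead
    lazy_import()
```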

Test Plan: CI

Reviewed By: Kronuz

Differential Revision: D50283983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111267
Approved by: https://github.com/yanboliang
2023-10-17 20:36:13 +00:00
cc9b7bb85c [reland] [inductor] fix a max-autotune rng state related bug (#111381)
reland https://github.com/pytorch/pytorch/pull/109828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111381
Approved by: https://github.com/lezcano
2023-10-17 19:16:36 +00:00
1aad6d803a [Reland][Inductor] Disallow OpOverloadPacket in ir.FallbackKernel (#110567) (#111396)
This is a reland of #110567 with additional fbcode fixed.

Summary:
In ABI-compatible mode, we always need op_overload.schema for FallbackKernel.

Approved by: https://github.com/jansel

Test Plan: contbuild & OSS CI, see 37a0265992

Differential Revision: D50339346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111396
Approved by: https://github.com/chenyang78
2023-10-17 18:53:38 +00:00
6e8079e00f Fix timeout value for memory leak check job (#111386)
Fixes https://github.com/pytorch/pytorch/pull/110193 as it doesn't work as expected:

* I forgot the timeout on the test step
* Also MacOS test job wasn't covered

### Testing

The job timeout is set correctly to 600 https://github.com/pytorch/pytorch/actions/runs/6541825177/job/17764485473#step:14:7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111386
Approved by: https://github.com/clee2000
2023-10-17 18:25:02 +00:00
543a763cd8 [DCP] Add HSDP checkpoint unit tests (#111399)
Add two unit tests:

1. HSDP checkpoint unit test
2. HSDP FSDP checkpoint conversion unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111399
Approved by: https://github.com/wanchaol
2023-10-17 17:59:42 +00:00
2c313880fc [TD] Make test class correlation scores available to heuristics. (#111229)
https://github.com/pytorch/test-infra/pull/4617 generates `file_test_class_rating.json`. Now we ensure it's available for heuristics to use during the test step.

(Actual heuristics will come in a separate PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111229
Approved by: https://github.com/huydhn
2023-10-17 16:29:30 +00:00
973c87b320 raise instead of skip in test/test_meta.py (#110939)
Supersedes #109004.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110939
Approved by: https://github.com/lezcano, https://github.com/kurtamohler
2023-10-17 10:17:43 +00:00
71e1f34923 [aot_inductor] return a copy of any constant (#111356)
When the model returns a constant, we cannot "release" its handle,
because the constant doesn't have any handle at all. Instead,
we should allocate a new tensor and then return a copy of the constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111356
Approved by: https://github.com/hl475
2023-10-17 08:44:21 +00:00
7a740e2b85 Revert "direct runtime assertions (#111262)"
This reverts commit e6d9350d7f135b3e0f27a949853ae691021b51f6.

Reverted https://github.com/pytorch/pytorch/pull/111262 on behalf of https://github.com/jeanschmidt due to Breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/111262#issuecomment-1765881675))
2023-10-17 08:04:36 +00:00
29048be41c [Reland] Add int4mm kernel (#111403)
This is a reland for #110914, #111327 and #111390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111403
Approved by: https://github.com/Chillee
2023-10-17 06:33:18 +00:00
cyy
7b7f070ec5 [3/N] Apply clang-tidy to aten/src/ATen/core/ (#111301)
Applies clang-tidy to aten/src/ATen/core/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111301
Approved by: https://github.com/Skylion007
2023-10-17 05:52:20 +00:00
cyy
43b023694e [1/N] Apply clang-tidy to c10 cuda files (#111137)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111137
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2023-10-17 04:52:50 +00:00
46000bede6 Fix a typo in fake tensor test. (#111193)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111193
Approved by: https://github.com/janeyx99
2023-10-17 03:36:28 +00:00
013b51f8cc [state_dict][7/N] Add a fine tuning e2e test case for distributed.state_dict and DCP (#111111)
As title

Differential Revision: [D50209732](https://our.internmc.facebook.com/intern/diff/D50209732/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111111
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107, #111275, #111109, #111110, #111120
2023-10-17 03:09:12 +00:00
9ce0ae836d [inductor] Refactor and optimize allocation calls (#111117)
This splits out changes from
https://github.com/pytorch/pytorch/pull/102625 to make things easier to
review.

This diff creates a `make_allocation()` method that extracts the logic
from `make_buffer_allocation()` while allowing us to allocate non-buffer
objects. In particular, we will use this to allocate memory pools during
memory planning.

This diff also includes a small optimization -- if the desired
allocation is contiguous, then we emit a call to `empty()` instead of
`empty_strided()` with its superfluous stride argument.
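
For illustration (not generated Inductor output), the two calls produce the same layout when the allocation is contiguous:

```python
import torch

a = torch.empty((4, 8))
b = torch.empty_strided((4, 8), (8, 1))  # contiguous strides spelled out
assert a.stride() == b.stride() == (8, 1)
```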

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111117
Approved by: https://github.com/jansel
2023-10-17 03:06:52 +00:00
cyy
3e354ef3e3 Increase coverage of clang-tidy to CudaIPCTypes.cpp (#111371)
This PR uses clang-tidy in torch/csrc/CudaIPCTypes.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111371
Approved by: https://github.com/Skylion007
2023-10-17 02:08:10 +00:00
a0632389b7 [BE]: Update lintrunner mypy to 1.6.0 (#111375)
Follow up to #111305 that updates lintrunner's version too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111375
Approved by: https://github.com/malfet
2023-10-17 01:22:06 +00:00
c8a72db432 [BE]: Update ruff to 0.1.0 (#111391)
Updates RUFF to the latest and greatest version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111391
Approved by: https://github.com/albanD, https://github.com/malfet
2023-10-17 01:09:16 +00:00
19a6487ad4 [state_dict][6/N] Change API names to avoid conflict and simplify the API signatures (#111120)
`state_dict` is a very common variable name people use to represent a local
state_dict and `load_state_dict` conflicts with DCP's `load_state_dict`.

This PR changes `state_dict` to `get_state_dict`. `get_state_dict` is closer to what this API does -- users use the API to get the current state_dict for saving or for loading (passed to DCP for loading in-place).

This PR also changes `load_state_dict` to `set_state_dict`. `set_state_dict` is less ideal compared to `get_state_dict` but is symmetric. We can still change the API name before it goes to beta.

This PR also simplifies the API signatures. `model_only` is removed and `optim_only` only exists for `get_state_dict`.
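
A sketch of the renamed APIs (module path and keyword names are assumptions based on this description; shown on a plain module for brevity):

```python
import torch
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

model = torch.nn.Linear(4, 4)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

model_sd, optim_sd = get_state_dict(model, optim)   # fetch current state for saving
set_state_dict(model, optim,
               model_state_dict=model_sd, optim_state_dict=optim_sd)
```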

Differential Revision: [D50213931](https://our.internmc.facebook.com/intern/diff/D50213931/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111120
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107, #111275, #111109, #111110
2023-10-17 00:15:31 +00:00
7fb09b804b Reland "AOTAutograd: Go down inference path if no outputs require grad (#111011)" (#111347)
Re-land of https://github.com/pytorch/pytorch/pull/111011.

The original PR ended up having a bad interaction with code that tried to run `torch.compile` under `with torch.inference_mode`, which caused some internal tests to fail.

The issue was that:

(1) AOTInductor invokes the pattern matcher passes in inductor

(2) The pattern matcher registers some code with [training_graph](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/pad_mm.py#L461)

(3) The `training_graph` function expects to be able to set the global autograd state to `requires_grad`, and always get out a joint graph (assertion [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/pattern_matcher.py#L1196)).

(4) However, when inference_mode is activated, and you try to run AOTAutograd, AOTAutograd will witness that all outputs to the traced function will not require grad, and (now correctly) think that we are tracing an inference graph, which fails the above assert.

After talking to Bin, it sounds like these training-only patterns aren't necessary when we know we are compiling an inference graph (which should always be the case if you're running torch.compile under inference_mode). So I updated the pattern matcher to ignore any pattern matches using `training_graph` when inference_mode is enabled.
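
A hypothetical repro of the interaction (a sketch, not the internal test):

```python
import torch

@torch.compile
def f(x):
    return x.sin() + 1

with torch.inference_mode():
    f(torch.randn(8))  # no output requires grad, so AOTAutograd takes the inference path
```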

This reverts commit cf6b1cdf6ac74d375b0787bd8f9463cb3a53b0e5.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111347
Approved by: https://github.com/Chillee
2023-10-17 00:11:15 +00:00
f84755bcac Fix _CudaStreamBase type annotations (#111387)
Make it inherit from `Stream` as indeed it is, see 97a513ed07/torch/csrc/cuda/Stream.cpp (L208) and
```
python3 -c "import torch;print(torch._C._CudaStreamBase.__base__)"
<class 'torch.Stream'>
```

Fixes https://github.com/pytorch/pytorch/issues/111268

TODO (in separate PR): Revive `test_typing` and add regression test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111387
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007
2023-10-16 23:26:58 +00:00
9683a26c55 [state_dict][5/N] Add submodules save and load support (#111110)
It is not easy for users to save and load submodules (e.g., for fine-tuning) because FSDP requires access to the root module. This PR adds support for submodule save and load.

Differential Revision: [D50209727](https://our.internmc.facebook.com/intern/diff/D50209727/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111110
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107, #111275, #111109
2023-10-16 23:25:37 +00:00
bd9a2465e7 Back out "Add a workflow to release Android binaries (#110976)" (#111401)
Summary:
Original commit changeset: 96813f0fac68

Original Phabricator Diff: D50161780

This breaks the integration test on T166457344

Test Plan: Sandcastle.

Differential Revision: D50344243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111401
Approved by: https://github.com/izaitsevfb
2023-10-16 23:16:37 +00:00
408f210938 [sparse] semi-structured sparse + torch.compile support (#111049)
Summary:

This PR adds in torch.compile support for semi-structured sparsity,
using the subclass tracing @bdhirsh added.

Based on whether we are using cuSPARSELt or CUTLASS, we return a different representation of the inner tensors.

Test Plan:
```
python test/test_sparse_semi_structured.py -k compile
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111049
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110583
2023-10-16 23:07:26 +00:00
deb800ee81 Fix typo under test directory (#111304)
This PR fixes typo in comments under `test` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111304
Approved by: https://github.com/Skylion007
2023-10-16 23:06:06 +00:00
1e70f4d02c Revert "Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)"
This reverts commit bb1424d46e656dfcdd4c12efe58ada9f1720c4d8.

Reverted https://github.com/pytorch/pytorch/pull/111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111072#issuecomment-1765399829))
2023-10-16 23:03:26 +00:00
5a8a89360d Handle the .tolist method of np.arrays in dynamo (#111382)
Fixes part 1 of https://github.com/pytorch/pytorch/issues/111370#issuecomment-1764730773

While at it, add a test for the numpy ndarray `.size` attribute. This started as an attempt to remove the delegation of what looks like a `.size()` method (which does not exist in numpy) on the same line this patch adds `tolist` to.
But this is apparently needed for something else, and existing tests start failing. Thus, declare it _ain't broke, don't fix_ and only keep the test. The test can be removed if wanted, though.
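
A sketch of the pattern this enables (hypothetical, not the added test):

```python
import torch

@torch.compile
def f(x):
    a = x.numpy()
    return a.tolist(), a.size  # both are now understood by dynamo

f(torch.arange(4))
```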

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111382
Approved by: https://github.com/lezcano
2023-10-16 22:56:52 +00:00
afb4914c3d Align torch.library.impl with the new torch.library style (#111308)
We add a new overload to torch.library.impl that accepts an optional
Library arg. If provided, the lifetime of the registration is tied to
the Library arg; otherwise, it lives forever.

Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111308
Approved by: https://github.com/soulitzer
ghstack dependencies: #111307
2023-10-16 22:32:23 +00:00
9d9cc67592 Make torch.library.define consistent with the new APIs (#111307)
This PR introduces a new overload of torch.library.define. Like
impl_abstract, and in line with our plans for the rest of the
torch.library APIs, we allow it to accept an optional Library object
to tie the lifetime of the op definition to.
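
A hedged sketch of the two overloads working together (exact keyword names are assumptions):

```python
import torch
from torch.library import Library, define, impl

lib = Library("mylib", "FRAGMENT")  # registrations live and die with `lib`
define("mylib::twice", "(Tensor x) -> Tensor", lib=lib)

@impl("mylib::twice", "CPU", lib=lib)
def twice_cpu(x):
    return x * 2

print(torch.ops.mylib.twice(torch.ones(3)))  # tensor([2., 2., 2.])
```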

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111307
Approved by: https://github.com/soulitzer, https://github.com/ezyang
2023-10-16 22:32:23 +00:00
5c3955200c Add linear quantize function to custom ops (#111148)
Summary: Add linear quantize for vulkan to custom ops so it can be used from a model.

Test Plan:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1
//xplat/caffe2/fb/custom_ops/vulkan_quantized:pt_vulkan_quantized_test_binAppleMac\#macosx-arm64
[       OK ] VulkanAPITest.convert_qconv2d_context (135 ms)
[ RUN      ] VulkanAPITest.linear_2d
[       OK ] VulkanAPITest.linear_2d (4 ms)
[----------] 2 tests from VulkanAPITest (139 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (139 ms total)
[  PASSED  ] 2 tests.
##############################################################
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource
//xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
[       OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32 (11 ms)
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (4 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_custom
[       OK ] VulkanAPITest.linear_custom (0 ms)
[----------] 76 tests from VulkanAPITest (1811 ms total)
[----------] Global test environment tear-down
[==========] 76 tests from 1 test suite ran. (1811 ms total)
[  PASSED  ] 76 tests.
YOU HAVE 8 DISABLED TESTS
##############################################################
buck2 run --target-platforms ovr_configplatform/macos:arm64-fbsourcexplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
[----------] Global test environment tear-down
[==========] 346 tests from 1 test suite ran. (5648 ms total)
[  PASSED  ] 345 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS

Reviewed By: manuelcandales

Differential Revision: D49609985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111148
Approved by: https://github.com/yipjustin
2023-10-16 21:47:09 +00:00
408e991dfe Revert "Quant: add weight int4pack mm kernel (#110914)"
This reverts commit 9980876cab9dcedce7d7dd1c8a2e168b548eaa36.

Reverted https://github.com/pytorch/pytorch/pull/110914 on behalf of https://github.com/jeanschmidt due to Breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110914#issuecomment-1765302621))
2023-10-16 21:27:26 +00:00
5ff9b49063 Revert "update int4 tinygemm kernels (#111327)"
This reverts commit e0e15a4ac61648cc8f63f0ab102c32e8884fb5d1.

Reverted https://github.com/pytorch/pytorch/pull/111327 on behalf of https://github.com/jeanschmidt due to This PR is preventing the revert of https://github.com/pytorch/pytorch/pull/110914 ([comment](https://github.com/pytorch/pytorch/pull/111327#issuecomment-1765299310))
2023-10-16 21:24:54 +00:00
f29b957475 [cuda] vectorized implementation for layer_norm_grad_input_kernel (#111021)
Using vectorized loads/stores makes the `layer_norm_grad_input_kernel` generally faster. This PR accelerates medium and larger problem sizes.

```python
import torch

# NOTE: `l_inputs` (a list of ((batch_size, feature_size), iterations) pairs)
# is assumed to be defined earlier in the benchmark script.

def run_model_on_device(fs, X, gO, device_string, numeric_type):
    ln = torch.nn.LayerNorm((fs,), device=device_string, dtype=numeric_type)
    ln.reset_parameters()
    X.grad = None
    ln.zero_grad(set_to_none=True)
    out = ln(X)
    out.backward(gO)
    return (ln.weight.grad, ln.bias.grad)

def run_correctness_test(eps_weight, eps_bias):
    dtype = torch.float
    for val in l_inputs:
        bs = val[0][0]
        fs = val[0][1]

        mean_adjustment = torch.randn(fs, device="cpu", dtype=torch.float)
        X = mean_adjustment * torch.randn(
            bs, fs, device="cpu", dtype=torch.float, requires_grad=True
        )

        X = X.detach().requires_grad_()
        gO = torch.rand_like(X)
        X_gpu = X.to("cuda")
        X_gpu = X_gpu.detach().requires_grad_()
        gO_gpu = gO.to("cuda")
        gO_gpu = gO_gpu.detach().requires_grad_()

        grad_cpu_ref = run_model_on_device(fs, X, gO, "cpu", dtype)
        grad_gpu = run_model_on_device(fs, X_gpu, gO_gpu, "cuda", dtype)
        weight_grad_gpu_target = grad_gpu[0].detach().to("cpu")
        bias_grad_gpu_target = grad_gpu[1].detach().to("cpu")

        weight_delta = torch.abs(grad_cpu_ref[0] - weight_grad_gpu_target)
        weight_mismatches = (weight_delta >= eps_weight).nonzero()
        weight_mismatch_pct = len(weight_mismatches) / len(weight_delta) * 100

        bias_delta = torch.abs(grad_cpu_ref[1] - bias_grad_gpu_target)
        bias_mismatches = (bias_delta >= eps_bias).nonzero()
        bias_mismatch_pct = len(bias_mismatches) / len(bias_delta) * 100

        if weight_mismatch_pct > 0 or bias_mismatch_pct > 0:
            print(
                "Size ({} x {}) mismatch percentage: weight {:3.2f} bias {:3.2f}".format(
                    fs, bs, weight_mismatch_pct, bias_mismatch_pct
                )
            )

# Run the correctness tests
run_correctness_test(0.01, 0.01)
torch.cuda.synchronize()

# Allocate a tensor equal to L2 cache size on A100 GPUs
l2_cache_flusher = torch.empty(int(80 * (1024**2)), dtype=torch.float, device="cuda")

# Run the performance tests. We need to run this at global scope because otherwise
# the `ln` and `gO` objects are likely removed by the JIT compiler
results = []
for dtype in (torch.float, torch.half):
    for val in l_inputs:
        bs = val[0][0]
        fs = val[0][1]
        iterations = val[1]

        ln = torch.nn.LayerNorm((fs,), device="cuda", dtype=dtype)
        X = torch.randn(bs, fs, device="cuda", dtype=dtype, requires_grad=True)
        gO = torch.rand_like(X)

        # Try to measure FWD and BWD pass in the same loop
        # Use comprehensions, not list multiplication: `[Event()] * n` would
        # alias one Event object across all iterations and corrupt the timings.
        l_ev_start_fwd = [torch.cuda.Event(enable_timing=True) for _ in range(iterations)]
        l_ev_stop_fwd = [torch.cuda.Event(enable_timing=True) for _ in range(iterations)]
        l_ev_stop_bwd = [torch.cuda.Event(enable_timing=True) for _ in range(iterations)]

        l_fwd_times = []
        l_bwd_times = []
        torch.cuda.synchronize()
        for i in range(iterations):
            l2_cache_flusher.zero_()
            torch.cuda._sleep(1_000_000)

            X.grad = None
            ln.zero_grad(set_to_none=True)

            l_ev_start_fwd[i].record()
            out = ln(X)
            l_ev_stop_fwd[i].record()
            out.backward(gO)
            l_ev_stop_bwd[i].record()
        torch.cuda.synchronize()

        l_fwd_times = []
        l_bwd_times = []
        for i in range(iterations):
            l_fwd_times.append(l_ev_start_fwd[i].elapsed_time(l_ev_stop_fwd[i]))
            l_bwd_times.append(l_ev_stop_fwd[i].elapsed_time(l_ev_stop_bwd[i]))

        print(
            "({}, {}, {}, fwd_ms, bwd_ms)|{:.3f}|{:.3f}".format(
                dtype,
                bs,
                fs,
                sum(l_fwd_times) / iterations * 1000,
                sum(l_bwd_times) / iterations * 1000,
            )
        )
```

Results in the attached picture:

<img width="314" alt="Screenshot 2023-10-16 at 11 08 25 AM" src="https://github.com/pytorch/pytorch/assets/23515689/ce571fc5-c84e-47eb-95f6-9faa44042cc1">

I also isolated the previous implementation and the vectorized one into a native CUDA program and the speedup is confirmed. **Average speedup = 21.73%**

```
Size (2048, 2048); Mismatches: dX = 0 out of 4194304. Max missmatch idx = 0.                                                                                                                                                                                           [16/1529]
reference = 0.0560 (ms); optimized = 0.0435 (ms); bw_opt = 1437.54 GB/s; speedup = 28.78%
Size (4096, 512); Mismatches: dX = 0 out of 2097152. Max missmatch idx = 0.
reference = 0.0220 (ms); optimized = 0.0174 (ms); bw_opt = 1797.26 GB/s; speedup = 26.44%
Size (1024, 512); Mismatches: dX = 0 out of 524288. Max missmatch idx = 0.
reference = 0.0101 (ms); optimized = 0.0082 (ms); bw_opt = 953.49 GB/s; speedup = 22.97%
Size (1024, 256); Mismatches: dX = 1 out of 262144. Max missmatch idx = 22411.
reference = 0.0082 (ms); optimized = 0.0075 (ms); bw_opt = 521.14 GB/s; speedup = 9.21%
Size (1024, 1024); Mismatches: dX = 0 out of 1048576. Max missmatch idx = 0.
reference = 0.0137 (ms); optimized = 0.0108 (ms); bw_opt = 1447.42 GB/s; speedup = 26.93%
Size (2048, 512); Mismatches: dX = 0 out of 1048576. Max missmatch idx = 0.
reference = 0.0141 (ms); optimized = 0.0116 (ms); bw_opt = 1349.79 GB/s; speedup = 21.81%
Size (2048, 256); Mismatches: dX = 0 out of 524288. Max missmatch idx = 0.
reference = 0.0108 (ms); optimized = 0.0102 (ms); bw_opt = 768.90 GB/s; speedup = 6.09%
Size (1024, 128); Mismatches: dX = 1 out of 131072. Max missmatch idx = 9165.
reference = 0.0070 (ms); optimized = 0.0068 (ms); bw_opt = 288.56 GB/s; speedup = 2.81%
Size (1024, 2048); Mismatches: dX = 0 out of 2097152. Max missmatch idx = 0.
reference = 0.0223 (ms); optimized = 0.0164 (ms); bw_opt = 1905.58 GB/s; speedup = 35.90%
Size (1024, 768); Mismatches: dX = 3 out of 786432. Max missmatch idx = 507105.
reference = 0.0113 (ms); optimized = 0.0101 (ms); bw_opt = 1160.00 GB/s; speedup = 11.79%
Size (2048, 128); Mismatches: dX = 0 out of 262144. Max missmatch idx = 0.
reference = 0.0097 (ms); optimized = 0.0089 (ms); bw_opt = 440.97 GB/s; speedup = 9.12%
Size (2048, 1024); Mismatches: dX = 0 out of 2097152. Max missmatch idx = 0.
reference = 0.0204 (ms); optimized = 0.0166 (ms); bw_opt = 1881.43 GB/s; speedup = 22.81%
Size (4096, 256); Mismatches: dX = 1 out of 1048576. Max missmatch idx = 601965.
reference = 0.0156 (ms); optimized = 0.0154 (ms); bw_opt = 1016.47 GB/s; speedup = 1.24%
Size (4096, 1024); Mismatches: dX = 0 out of 4194304. Max missmatch idx = 0.
reference = 0.0411 (ms); optimized = 0.0417 (ms); bw_opt = 1499.55 GB/s; speedup = -1.43%
Size (4096, 4096); Mismatches: dX = 0 out of 16777216. Max missmatch idx = 0.
reference = 0.2323 (ms); optimized = 0.2077 (ms); bw_opt = 1203.75 GB/s; speedup = 11.83%
Size (1024, 4096); Mismatches: dX = 0 out of 4194304. Max missmatch idx = 0.
reference = 0.0659 (ms); optimized = 0.0570 (ms); bw_opt = 1096.51 GB/s; speedup = 15.60%
Size (1024, 3072); Mismatches: dX = 0 out of 3145728. Max missmatch idx = 0.
reference = 0.0425 (ms); optimized = 0.0299 (ms); bw_opt = 1568.10 GB/s; speedup = 42.11%
Size (1024, 2464); Mismatches: dX = 8 out of 2523136. Max missmatch idx = 2087476.
reference = 0.0292 (ms); optimized = 0.0230 (ms); bw_opt = 1636.18 GB/s; speedup = 27.07%
Size (1024, 800); Mismatches: dX = 1 out of 819200. Max missmatch idx = 652342.
reference = 0.0114 (ms); optimized = 0.0104 (ms); bw_opt = 1175.05 GB/s; speedup = 9.63%
Size (1024, 6144); Mismatches: dX = 0 out of 6291456. Max missmatch idx = 0.
reference = 0.0973 (ms); optimized = 0.0844 (ms); bw_opt = 1110.87 GB/s; speedup = 15.28%
Size (1024, 4904); Mismatches: dX = 6 out of 5021696. Max missmatch idx = 4670210.
reference = 0.0814 (ms); optimized = 0.0721 (ms); bw_opt = 1037.99 GB/s; speedup = 12.90%
Size (4096, 2048); Mismatches: dX = 0 out of 8388608. Max missmatch idx = 0.
reference = 0.0990 (ms); optimized = 0.0770 (ms); bw_opt = 1623.58 GB/s; speedup = 28.54%
Size (1024, 1860); Mismatches: dX = 0 out of 1904640. Max missmatch idx = 0.
reference = 0.0219 (ms); optimized = 0.0174 (ms); bw_opt = 1631.12 GB/s; speedup = 25.75%
Size (1024, 20160); Mismatches: dX = 23 out of 20643840. Max missmatch idx = 20274656.
reference = 0.3054 (ms); optimized = 0.2600 (ms); bw_opt = 1183.08 GB/s; speedup = 17.45%
Size (3072, 256); Mismatches: dX = 0 out of 786432. Max missmatch idx = 0.
reference = 0.0129 (ms); optimized = 0.0127 (ms); bw_opt = 925.71 GB/s; speedup = 1.69%
Size (4096, 128); Mismatches: dX = 3 out of 524288. Max missmatch idx = 451331.
reference = 0.0128 (ms); optimized = 0.0129 (ms); bw_opt = 608.06 GB/s; speedup = -0.74%
Size (512, 128); Mismatches: dX = 0 out of 65536. Max missmatch idx = 0.
reference = 0.0062 (ms); optimized = 0.0061 (ms); bw_opt = 161.25 GB/s; speedup = 2.35%
Size (2048, 64); Mismatches: dX = 0 out of 131072. Max missmatch idx = 0.
reference = 0.0084 (ms); optimized = 0.0086 (ms); bw_opt = 228.70 GB/s; speedup = -2.49%
Size (3072, 2048); Mismatches: dX = 0 out of 6291456. Max missmatch idx = 0.
reference = 0.0770 (ms); optimized = 0.0614 (ms); bw_opt = 1527.43 GB/s; speedup = 25.44%
Size (3200, 104); Mismatches: dX = 0 out of 332800. Max missmatch idx = 0.
reference = 0.0105 (ms); optimized = 0.0113 (ms); bw_opt = 440.93 GB/s; speedup = -6.96%
Size (1152, 384); Mismatches: dX = 0 out of 442368. Max missmatch idx = 0.
reference = 0.0102 (ms); optimized = 0.0084 (ms); bw_opt = 786.48 GB/s; speedup = 21.59%
Size (131072, 64); Mismatches: dX = 12 out of 8388608. Max missmatch idx = 7659094.
reference = 0.2054 (ms); optimized = 0.2873 (ms); bw_opt = 438.49 GB/s; speedup = -28.51%
Size (64, 131072); Mismatches: dX = 0 out of 8388608. Max missmatch idx = 0.
reference = 0.8372 (ms); optimized = 0.3295 (ms); bw_opt = 379.37 GB/s; speedup = 154.09%
Size (131072, 128); Mismatches: dX = 18 out of 16777216. Max missmatch idx = 16158071.
reference = 0.2296 (ms); optimized = 0.3116 (ms); bw_opt = 805.47 GB/s; speedup = -26.31%
Size (128, 131072); Mismatches: dX = 0 out of 16777216. Max missmatch idx = 0.
reference = 0.9297 (ms); optimized = 0.3785 (ms); bw_opt = 660.52 GB/s; speedup = 145.64%
Size (131072, 256); Mismatches: dX = 47 out of 33554432. Max missmatch idx = 33062426.
reference = 0.3003 (ms); optimized = 0.4231 (ms); bw_opt = 1184.07 GB/s; speedup = -29.02%
Size (256, 131072); Mismatches: dX = 0 out of 33554432. Max missmatch idx = 0.
reference = 1.0449 (ms); optimized = 0.4828 (ms); bw_opt = 1035.63 GB/s; speedup = 116.43%
Average speedup = 21.73%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111021
Approved by: https://github.com/malfet
2023-10-16 21:22:41 +00:00
8b46a106f2 [inductor] Move inductor ops to CompositeExplicitAutograd (#111274)
I suspect in practice this won't matter, but if we do end up tracing this it causes them not to get decomposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111274
Approved by: https://github.com/voznesenskym, https://github.com/desertfire
ghstack dependencies: #111271, #111273
2023-10-16 21:16:24 +00:00
cba0dd0fdc [Compiled Autograd] Error if tensor_post_acc_grad_hooks is set (#111273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111273
Approved by: https://github.com/voznesenskym
ghstack dependencies: #111271
2023-10-16 21:16:24 +00:00
04b04c0686 [Compiled Autograd] Turn accumulate_grad into an op (#111271)
Rather than baking the behavior of `AccumulateGrad` nodes into the generated graph (either as `+=` or as a return value of the graph), this creates a new `accumulate_grad_` dispatcher op that is included in the generated graph like:
```
def forward(self, inputs, sizes, hooks):
    getitem = inputs[0]
    getitem_1 = inputs[1]
    getitem_2 = inputs[2]
    getitem_3 = inputs[3]
    getitem_4 = inputs[4]
    getitem_5 = inputs[5]
    getitem_6 = inputs[6]
    getitem_7 = inputs[7]
    getitem_8 = inputs[8]
    getitem_9 = inputs[9];  inputs = None
    expand = torch.ops.aten.expand.default(getitem, [2, 4]);  getitem = None
    threshold_backward = torch.ops.aten.threshold_backward.default(expand, getitem_1, 0);  expand = getitem_1 = None
    t = torch.ops.aten.t.default(getitem_3);  getitem_3 = None
    mm = torch.ops.aten.mm.default(threshold_backward, t);  t = None
    t_1 = torch.ops.aten.t.default(threshold_backward)
    mm_1 = torch.ops.aten.mm.default(t_1, getitem_2);  t_1 = getitem_2 = None
    t_2 = torch.ops.aten.t.default(mm_1);  mm_1 = None
    sum_1 = torch.ops.aten.sum.dim_IntList(threshold_backward, [0], True);  threshold_backward = None
    view = torch.ops.aten.view.default(sum_1, [4]);  sum_1 = None
    t_3 = torch.ops.aten.t.default(t_2);  t_2 = None
    accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, t_3);  getitem_4 = t_3 = None
    threshold_backward_1 = torch.ops.aten.threshold_backward.default(mm, getitem_5, 0);  mm = getitem_5 = None
    t_4 = torch.ops.aten.t.default(threshold_backward_1)
    mm_2 = torch.ops.aten.mm.default(t_4, getitem_6);  t_4 = getitem_6 = None
    t_5 = torch.ops.aten.t.default(mm_2);  mm_2 = None
    sum_2 = torch.ops.aten.sum.dim_IntList(threshold_backward_1, [0], True);  threshold_backward_1 = None
    view_1 = torch.ops.aten.view.default(sum_2, [4]);  sum_2 = None
    t_6 = torch.ops.aten.t.default(t_5);  t_5 = None
    accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_7, t_6);  getitem_7 = t_6 = None
    accumulate_grad__2 = torch.ops.inductor.accumulate_grad_.default(getitem_8, view_1);  getitem_8 = view_1 = None
    accumulate_grad__3 = torch.ops.inductor.accumulate_grad_.default(getitem_9, view);  getitem_9 = view = None
    return []

```

The motivation here is that `AccumulateGrad` nodes are causing trouble in FSDP tracing, since FSDP resizes parameters and parameter storage in place inside hooks.  We will model this mutation in dynamo, but not during the initial compiled autograd capture.  This allows us to bypass failing shape checks in the initial capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111271
Approved by: https://github.com/voznesenskym
2023-10-16 21:16:17 +00:00
6f06832219 Fixed typo in activation.py (#111358)
liner -> linear
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111358
Approved by: https://github.com/mikaylagawarecki
2023-10-16 20:36:55 +00:00
97a513ed07 Revert "Add lazy_clone_storage to create COW storages (#110192)"
This reverts commit 1c308144177d6e1663e41aae32a89e1c49b8b3b4.

Reverted https://github.com/pytorch/pytorch/pull/110192 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, @ezyang please support the author providing further details ([comment](https://github.com/pytorch/pytorch/pull/110192#issuecomment-1765157285))
2023-10-16 19:43:20 +00:00
c271df9239 IPUHooksInterface: fix a typo, remove const & (#111372)
Return an `at::Generator` from `newIPUGenerator`, not a reference to one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111372
Approved by: https://github.com/Skylion007
2023-10-16 19:19:40 +00:00
07f0413b70 [c10d] add nccl version to c10d logger (#111215)
Summary: NCCL version is essential for debugging purpose and NCCL rollout monitoring. Log this info for easy access.

Test Plan:
run cmf10x on devgpu

https://pxl.cl/3B5gf

https://fburl.com/scuba/pytorch_c10d_logging/lybk2usq

Differential Revision: D50240853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111215
Approved by: https://github.com/Skylion007
2023-10-16 18:47:49 +00:00
ff432c048d [easy] Remove duplicate exprs in produce_guards (#111270)
Summary: We're checking the original guard.expr in the issued set instead of the simplified expr, leading to duplicate guards in cases where one expression simplifies to another.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111270
Approved by: https://github.com/Chillee, https://github.com/ezyang
2023-10-16 18:31:38 +00:00
b691d09010 fix: reset prefetch flag upon reshard (#111354)
The `prefetched` flag should be reset upon reshard. Otherwise, for zero2, next access to the corresponding parameter will skip "unshard" operation, and results in wrong parameter shape.

The need for unsharding is also mentioned [in the comment of `FlatParameterHandle.unshard`](https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_flat_param.py#L1241-L1242).

As [`FlatParameterHandle` already guarded it against unnecessary all gather](https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_flat_param.py#L1240), this shouldn't incur extra communication overhead.

_Personally I also find `_prefetched` a bit mis-named; it should really be `_unsharded`._
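
Illustrative pseudocode of the fix described above (hypothetical names; not the actual FSDP source):

```python
def _post_forward_reshard(state, handle):
    # `should_free` is a made-up stand-in for the real condition, which is
    # often False under SHARD_GRAD_OP, so the flag was never reset before.
    handle.reshard(handle.should_free)
    handle._prefetched = False  # reset so the next _pre_forward_unshard really unshards
```
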
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111354
Approved by: https://github.com/awgu
2023-10-16 18:31:33 +00:00
9ab6ac5bc1 [ONNX] Fix aten::new_zeros due to TorchScript behavior change on Pytorch 2.1 Fix #110935 (#110956)
Fixes #110597

Summary:

* Generic code: `torch._C.Value.node().mustBeNone()` is encapsulated into the high-level API `JitScalarType.from_value`; `_is_none` was also extended to allow either `None` or `torch._C.Value.node.mustBeNone()`, so users don't have to manually call into the TorchScript API when implementing operators
* Specific to `new_zeros` (and `*_like` and `new_*` ops): when checking `dtype`, we must always use `_is_none`, which follows the behavior change proposed by #110935
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110956
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2023-10-16 18:28:20 +00:00
9f562a3de3 [dynamo] make disable_cache_limit also disable accumulated cache limit (#111334)
Fixes #111329.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111334
Approved by: https://github.com/yanboliang
2023-10-16 17:59:04 +00:00
89f11c69a8 Revert "[inductor] Adding a way to force fusion of int_mm with mul (#111125)"
This reverts commit f4297576e63e4110f6bdf2522ae6a5fb4c7f3816.

Reverted https://github.com/pytorch/pytorch/pull/111125 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it fails on ROCm f4297576e6 ([comment](https://github.com/pytorch/pytorch/pull/111125#issuecomment-1764956174))
2023-10-16 17:37:13 +00:00
59281d5631 [tp] fix SP style regression (#111353)
[tp] fix SP style regression

Although we want to remove prepare_input/output, we should still keep
the old behavior for SequenceParallel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111353
Approved by: https://github.com/fduwjj
2023-10-16 17:18:17 +00:00
493618d745 Revert "[C10D] Introduce C++ side Collective Callbacks. (#110307)"
This reverts commit 359336e3e9a0f67974e53805b5207fbbbc149490.

Reverted https://github.com/pytorch/pytorch/pull/110307 on behalf of https://github.com/wconstab due to this sits on top of another PR https://github.com/pytorch/pytorch/pull/111072 that needs to be reverted due to internal release testing failure / multisect blame ([comment](https://github.com/pytorch/pytorch/pull/110307#issuecomment-1764910301))
2023-10-16 17:07:58 +00:00
6462d71c10 Fixes a typo in docstring: should be "elastic" (#111352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111352
Approved by: https://github.com/H-Huang
2023-10-16 16:54:52 +00:00
0d368f586a fix wrong meta for index_select.out (#111364)
fixes https://github.com/pytorch/pytorch/issues/110699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111364
Approved by: https://github.com/ezyang
ghstack dependencies: #111040
2023-10-16 15:18:20 +00:00
4cf23c6a61 FunctionalTensor: avoid spurious not_implemented logging during proxy tracing (#111040)
This is kind of hard to test, but I can try to add a test case if requested.

I noticed locally that we now end up logging to the ProxyTensorMode and FakeTensorMode `not_implemented` logs in very simple compile examples: https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/proxy_tensor.py#L269

It was because `_mirror_autograd_meta_to()` indirectly queries sizes, and since modes have higher priority than subclasses, `aten::sym_sizes()` was getting dispatched to our modes before going to `FunctionalTensor.__torch_dispatch__`.

This works out fine (they return NotImplemented and we eventually get to `FunctionalTensor`) but I figured we want to avoid cluttering up the logs. So I wrapped the calls with `FunctionalTensorMode`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111040
Approved by: https://github.com/ezyang
2023-10-16 15:18:20 +00:00
50b80185d6 fix bugs about traceback.walk_stack in python3.8.x (#110922)
Fixes #110769

as stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110922
Approved by: https://github.com/mikaylagawarecki
2023-10-16 14:29:07 +00:00
126d422cf0 Error if you try to run Dynamo compiled function under torch.jit.trace (#111321)
Fixes https://github.com/pytorch/pytorch/issues/111319

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111321
Approved by: https://github.com/Chillee
2023-10-16 13:52:29 +00:00
78909a6f0b [xla hash update] update the pinned xla hash (#111360)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111360
Approved by: https://github.com/pytorchbot
2023-10-16 11:53:00 +00:00
9af82fa2b8 Revert "[vision hash update] update the pinned vision hash (#111316)"
This reverts commit da364449909b02202e542952c271244a33412c4a.

Reverted https://github.com/pytorch/pytorch/pull/111316 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/111316#issuecomment-1763827734))
2023-10-16 06:43:09 +00:00
b4745d476c Revert "[sparse] semi-structured sparse + torch.compile support (#111049)"
This reverts commit ac02531babab028cb260d2225ff9e91e92df063b.

Reverted https://github.com/pytorch/pytorch/pull/111049 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/111049#issuecomment-1763795957))
2023-10-16 06:16:59 +00:00
bfcd86955e [TP] Fix TP doc format to show examples correctly (#111346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111346
Approved by: https://github.com/wanchaol
ghstack dependencies: #111160, #111166, #111176, #111177
2023-10-16 06:15:10 +00:00
e0e15a4ac6 update int4 tinygemm kernels (#111327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111327
Approved by: https://github.com/msaroufim
ghstack dependencies: #111314
2023-10-15 21:53:29 +00:00
882bc1708b [dtensor][11/n] adds some __str__ for ease of read (#111278)
This adds `__str__` to op schema and dtensor spec for ease of reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111278
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145, #110717, #111234
2023-10-15 16:00:31 +00:00
6b5d736bf7 [dtensor][10/n] switch pointwise op to use op strategy (#111234)
As titled, this also handles cases like [Shard(0), Shard(0)] correctly for
pointwise ops, which previously errored out
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111234
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145, #110717
2023-10-15 16:00:31 +00:00
f34f3b5421 [dtensor][9/n] matrix ops to generate strategy (#110717)
This PR switches matrix ops to generate sharding strategies; with the
cost-selection algorithm introduced in the previous PR, this and more ops
can leverage strategy-based sharding propagation.

This also fixes a bunch of corner cases that existing propagation does
not cover, resulting in full coverage for baddbmm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110717
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145
2023-10-15 16:00:16 +00:00
b4ab8ac515 [dtensor][8/N] Introduce cost model for sharding (#109145)
This PR adds some basic comm cost model for sharding prop
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109145
Approved by: https://github.com/fduwjj
2023-10-15 16:00:06 +00:00
25a2845d78 [TP] Enable embedding sharding in TP API (#111177)
We see use cases where embedding sharding is also needed in the TP API, so we enabled it, since DTensor already supports colwise embedding sharding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111177
Approved by: https://github.com/wanchaol
ghstack dependencies: #111160, #111166, #111176
2023-10-15 11:49:56 +00:00
e942fddb83 Fix get_estimated_runtime for symbolic shapes (#111314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111314
Approved by: https://github.com/lezcano
2023-10-15 05:40:03 +00:00
e6d9350d7f direct runtime assertions (#111262)
Previously we were generating a graph to add runtime assertions on inputs and then running that graph to check input constraints. This PR checks input constraints directly.

Differential Revision: D50289970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111262
Approved by: https://github.com/zhxchen17
2023-10-15 05:15:09 +00:00
7df287dc18 [state_dict][4/N] Support strict flag for model.load_state_dict (#111109)
As title

Differential Revision: [D50209723](https://our.internmc.facebook.com/intern/diff/D50209723/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111109
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107, #111275
2023-10-15 04:58:15 +00:00
da36444990 [vision hash update] update the pinned vision hash (#111316)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111316
Approved by: https://github.com/pytorchbot
2023-10-15 04:37:20 +00:00
4a388e70f2 Update mypy to 1.6.0 (#111305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111305
Approved by: https://github.com/janeyx99
2023-10-15 01:55:44 +00:00
48989bc820 trace frames with np.ndarray (#110512)
Fixes #109604

Resubmit gh-109715 + several skips and small fixes to make tests pass.

The main fix here is by @ysiraichi : previously, dynamo did not resume tracing numpy ndarrays after a graph break.
While at it, fix several small issues Yukio's fix uncovers:

- graph break gracefully on numpy dtypes which do not map to torch.dtypes (uint16 etc)
- recognize array scalars in dynamo, treat them as 0D ndarrays
- make sure that iterating over torch.ndarray generates arrays not bare tensors
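
As a minimal sketch of what this enables (illustrative only; `traced` is a made-up name, and the behavior assumes a build with this numpy-tracing support):

```python
import numpy as np
import torch

@torch.compile
def traced(x):
    # numpy ops on the ndarray argument are traced instead of breaking the graph
    return (np.sin(x) ** 2).sum()

out = traced(np.ones(4))
print(type(out))  # an ndarray/array scalar, not a bare torch.Tensor
```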

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110512
Approved by: https://github.com/lezcano
2023-10-15 00:56:10 +00:00
da662248fb [Dynamo] Fix autograd.Function tracing errors loudly involving saved tensors (#111277)
Fixes #104792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111277
Approved by: https://github.com/jansel, https://github.com/zou3519
2023-10-15 00:47:59 +00:00
ff3d773dd9 [TP] Add deprecation warnings in the documentations for Pairwise parallel, sequence parallel and other prepare input/output functions (#111176)
As part of the TP UX improvements, we want to keep our API simple (not easy) so that users get the flexibility to do what they want, and to avoid an overly generic API that tries to solve everything and gets too complicated. We are updating the doc accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111176
Approved by: https://github.com/wanchaol
ghstack dependencies: #111160, #111166
2023-10-15 00:39:24 +00:00
73d288fdf9 [aotinductor] Relax ExternKernel kwargs checking (#111167)
Summary: When a fallback kernel is called without specifying any kwargs, we still need to fill in default values for those kwargs when generating the cpp call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111167
Approved by: https://github.com/chenyang78, https://github.com/jgong5
2023-10-14 21:41:33 +00:00
5caf2e55d4 [FSDP] fix: fix for fsdp zero2 validation error (#110139)
# Problem
When sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch is turned on, validation after training sees an incorrect weight shape.
<img width="1508" alt="image" src="https://github.com/pytorch/pytorch/assets/41232043/57a9c3bb-cb5c-46df-ac26-922740686f9e">

# Analyze
When using `SHARD_GRAD_OP`, the `free_unsharded_flat_param` in `_post_forward_reshard` is often False, so it does not set the handle's `_prefetched` flag to False after the forward.

The normal train phase sets this flag to False in the `_post_backward_final_callback`, and the validation phase doesn't execute the hook, so after the first validation iter is done, the handle's `_prefetched` flag remains True.

This will cause the handle to skip the `_unshard` in the next `_pre_forward_unshard`, and the `_prefetch_handle` will not do a prefetch, which will result in an incorrect weight shape.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110139
Approved by: https://github.com/awgu
2023-10-14 20:59:28 +00:00
6dc54fe8d6 [BE] Compile FBGEMM with ASAN (#111266)
If `USE_ASAN` is set, compile FBGEMM with ASAN as well, by setting `USE_SANITIZER` to `address,undefined`

This fixes regression in sanitizer coverage introduced by https://github.com/pytorch/pytorch/pull/93147  that change effects of sanitizer from the entire project to just torch libraries, and finally allows one to reliably catch regression reported in https://github.com/pytorch/pytorch/issues/111189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111266
Approved by: https://github.com/huydhn
2023-10-14 20:35:04 +00:00
cff8bf47c3 update the dispatch of some operators which accept scalar (#110918)
The scalar overloads of some ops like `bitwise_xor.Scalar` were dispatched to `CompositeImplicitAutograd` by default. This violates the rule for `CompositeImplicitAutograd` kernels that all tensor operations (except reading metadata) must go through the ATen dispatcher rather than interacting with the Tensor directly.
So this PR updates the dispatch of these overloads to `CompositeExplicitAutograd`.
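
A sketch of why the distinction is observable from Python: under a `__torch_dispatch__` mode, a `CompositeExplicitAutograd` op shows up as itself instead of as its decomposition (`LogOps` is a made-up helper):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LogOps(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print(func)  # the scalar overload is now visible here, undecomposed
        return func(*args, **(kwargs or {}))

with LogOps():
    torch.tensor([1, 2]) ^ 1  # bitwise_xor with a Scalar operand
```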
Fixes #93224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110918
Approved by: https://github.com/peterbell10
2023-10-14 15:54:08 +00:00
8085e08a84 [TP] Add prepareInput and output for input/output DTensor layout annotation in the parent module in TP API (#111166)
In some use cases, we found that users might want to annotate the input/output DTensor layouts on the parent module rather than on the submodule whose parameters are to be distributed, so we add these two classes for annotating input/output DTensor layouts; they register pre-forward/forward hooks on the TP-lized module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111166
Approved by: https://github.com/wanchaol
ghstack dependencies: #111160
2023-10-14 15:37:52 +00:00
7c67139e7b [state_dict][3/N] Cleanup StateDictOptions, make it more readable (#111275)
This is a reland PR for https://github.com/pytorch/pytorch/pull/111108 with the proper docstring fix.

1. Rename DistributedStateDictOptions to StateDictOptions.
2. Remove cpu_offload as we have not yet required this option.
3. Rename save_frozen_parameters to ignore_frozen_params.

Differential Revision: [D50294352](https://our.internmc.facebook.com/intern/diff/D50294352/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111275
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107
2023-10-14 15:34:52 +00:00
3a8b10e2da [TP] Refactor Parallel Style to make it more usable (#111160)
One thing users find challenging is the concept of prepare_input and prepare_out: there are so many function names to select from that it is quite confusing, so we don't want to expose them. On the other hand, colwise and rowwise parallel always need the input and output to be in certain layouts, so we can simplify the logic here and make it more usable.

So we added three public attributes to `ParallelStyle`; the code logic looks like this:

```python
class ParallelStyle(ABC):
    """
    The parallel style the user wants the module or submodule to be parallelized with.
    We can add more in the future, but this seems sufficient for immediate needs.
    Users can extend this class to build their own parallel style with customized
    input/output preparations.
    """
    input_layouts: Union[Placement, Tuple[Placement]]
    output_layouts: Union[Placement, Tuple[Placement]]
    use_local: bool

class RowwiseParallel(ParallelStyle):
    """
    Partition the rows of a module. We assume the input is a sharded DTensor and the output is a replicated Tensor.
    """
    def __init__(self):
        super().__init__(input_layouts=Shard(-1), output_layouts=Replicate(), use_local=True)

class ColwiseParallel(ParallelStyle):
    """
    Partition the columns of a module. We assume the input is a replicated DTensor and the output is a sharded DTensor.
    """
    def __init__(self):
        super().__init__(input_layouts=Replicate(), output_layouts=Shard(-1), use_local=True)

# For the case of sequence parallel, users just set a different input layout,
# Shard(0) or Shard(1), instead of Replicate()

class PrepareModuleInput(ParallelStyle):
    """
    Only used to specify the input distribution spec for a module.
    """
    def __init__(self):
        super().__init__(input_layouts=Shard(0), output_layouts=Replicate(), use_local=False)

class PrepareModuleOutput(ParallelStyle):
    """
    Only used to specify the output distribution spec for a module.
    """
    def __init__(self):
        super().__init__(input_layouts=Replicate(), output_layouts=Shard(0), use_local=True)

parallelize_plan = {
    "embedding": ColwiseParallel(output_layouts=Replicate()),
    "attn": PrepareModuleInput(),
    "attn.w1": ColwiseParallel(),
    "attn.w2": ColwiseParallel(),
    "attn.w3": ColwiseParallel(),
    "attn.wo": RowwiseParallel(),
}

parallelize_module(
    module=block,  # this can be a submodule or module
    device_mesh=mesh["tp"],
    parallelize_plan=parallelize_plan,
)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111160
Approved by: https://github.com/wanchaol
2023-10-14 15:26:36 +00:00
b28cb43f5c Intra-graph reordering pass on Inductor scheduler IR (based on #100762) (#108091)
This PR implements intra-graph communication reordering pass on Inductor scheduler IR, based on Horace's previous PR #100762.

Main algorithm:
1. Greedily moves waits as late as possible (i.e. until we reach a use)
2. Greedily moves comms as early as possible (i.e. until we reach an input)
3. Move computes following simple heuristics to improve overlap.
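
A toy sketch of step 1 on a plain node list (illustrative only; the real pass operates on Inductor's scheduler IR, and `Node` here is a made-up stand-in):

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity hashing so nodes can live in `users` sets
class Node:
    name: str
    is_wait: bool = False
    users: set = field(default_factory=set)

def sink_waits(order):
    """Move each wait node as late as possible, i.e. until its first user."""
    out, pending = [], []
    for n in order:
        keep = []
        for w in pending:
            if n in w.users:   # reached a use: place the wait just before it
                out.append(w)
            else:
                keep.append(w)
        pending = keep
        (pending if n.is_wait else out).append(n)
    out.extend(pending)        # waits that were never used land at the end
    return out
```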

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108091
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2023-10-14 14:51:24 +00:00
cyy
8bd5eb8c96 [2/N] Apply clang-tidy to aten/src/ATen/core/ (#111006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111006
Approved by: https://github.com/Skylion007
2023-10-14 14:15:42 +00:00
d7317d8a11 Fix size_hint call sites failing on unbacked SymInts (#110520)
Summary: Unbacked SymInts can't get a `sizevars.size_hint` due to being data-dependent. #109893 has added a new `fallback` parameter to `sizevars.size_hint` to specify the fallback value in cases like unbacked SymInt. In this PR we add more of those.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110520
Approved by: https://github.com/jansel, https://github.com/ezyang
2023-10-14 08:10:09 +00:00
0013611c81 [inductor] Allow backend compiler to skip (#111153)
Summary:
Sometimes the backend compiler can encounter a transient failure (in
our case, a remote build service infrequently hits a hiccup).  We'd rather run
eager than fail the training job.

Test Plan:
Inject an exception in the RE path and run:
```
buck2 run @//mode/{opt,inplace} //caffe2/test/inductor:smoke
```

Differential Revision: D50234516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111153
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-10-14 02:44:15 +00:00
48e4d18388 [BE] Move ASAN from clang-12 to clang-15 (#111218)
Hopefully this aligns with the internal systems so they detect the heap-overflow access reported in https://github.com/pytorch/pytorch/issues/111189. Also, do not build Triton, protobuf, or DB dependencies (they are not needed for ASAN builds/tests)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111218
Approved by: https://github.com/Skylion007
2023-10-14 02:31:41 +00:00
581d97c19e Revert "[state_dict][3/N] Cleanup StateDictOptions, make it more readable (#111108)"
This reverts commit b1db9590853d2ac205bc57c906d81935874daf09.

Reverted https://github.com/pytorch/pytorch/pull/111108 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it is cleaner to reland this change ([comment](https://github.com/pytorch/pytorch/pull/111108#issuecomment-1762504496))
2023-10-14 02:22:19 +00:00
11ac4ace5f [export] Use meta val from the old nodes in run_decompositions(). (#111225)
Summary: fall back to the old nodes when meta val is missing.

Test Plan: buck2 run //executorch/examples/portable/scripts:export -- --model_name=emformer_predict

Differential Revision: D50278439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111225
Approved by: https://github.com/larryliu0820
2023-10-14 02:08:49 +00:00
ac02531bab [sparse] semi-structured sparse + torch.compile support (#111049)
Summary:

This PR adds in torch.compile support for semi-structured sparsity,
using the subclass tracing @bdhirsh added.

Based on whether we are using cuSPARSELt or CUTLASS, we return a
different representation of the inner tensors.

Test Plan:
```
python test/test_sparse_semi_structured.py -k compile
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111049
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110583
2023-10-14 01:13:01 +00:00
1c30814417 Add lazy_clone_storage to create COW storages (#110192)
This PR relands #110022 but accounts for the changes in #110191. Also, the function for creating COW storages is called `lazy_clone_storage` in this PR, instead of `try_ensure`

NOTE: COW storages do not actually copy on write yet, they just have the COW deleter and deleter context applied to them

Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110192
Approved by: https://github.com/ezyang
2023-10-14 00:53:21 +00:00
482782406a Revert "Add lazy_clone_storage to create COW storages (#110192)"
This reverts commit 33f151348684bd74fbc9939f00c39408ef92074d.

Reverted https://github.com/pytorch/pytorch/pull/110192 on behalf of https://github.com/kit1980 due to revert to work around some importing issues ([comment](https://github.com/pytorch/pytorch/pull/110192#issuecomment-1762430374))
2023-10-14 00:48:45 +00:00
4f4e2c1c08 Add constant node sizes to proto size calculation (#111097)
Fixes #110982

https://github.com/pytorch/pytorch/pull/62257 deprecated `torch.onnx.export(use_external_data_format: bool=...)`  argument, but it seems the introduced `EncoderBase::GetGraphProtoSize` has a bug and doesn't detect models > 2GB when onnx Constant nodes are large (and responsible for the size overflow)

This PR adds the constant node to the total size of the model, along with initializers.

In python, what we need to do is:

```python
import onnx

def compute_tensor_size(tensor):
    # Compute the size of the tensor based on its shape and data type
    size = tensor.size * tensor.itemsize
    return size

def sum_constant_and_initializer_sizes(model_path):
    # Load the ONNX model
    model = onnx.load(model_path)

    total_size = 0
    initializer_size = 0
    constant_size = 0

    # Compute the size of constant nodes
    for node in model.graph.node:
        if node.op_type == 'Constant':
            constant_value = node.attribute[0].t
            # Convert constant value to numpy array
            constant_array = onnx.numpy_helper.to_array(constant_value)
            # Compute the size of the constant tensor
            tensor_size = compute_tensor_size(constant_array)
            total_size += tensor_size
            constant_size += tensor_size

    # Compute the size of initializer nodes that are not graph inputs
    for initializer in model.graph.initializer:
        if initializer.name not in [input.name for input in model.graph.input]:
            # Convert the shape and data type information to calculate size
            # tensor = onnx.helper.tensor_value_info_to_tensor(input)
            tensor = onnx.numpy_helper.to_array(initializer)
            tensor_size = compute_tensor_size(tensor)
            total_size += tensor_size
            initializer_size += tensor_size

    return total_size, constant_size, initializer_size

model_path = '/path/to/model.onnx'
total_size, constant_size, initializer_size = sum_constant_and_initializer_sizes(model_path)

print("Total size of constant nodes in bytes:", constant_size)
print("Total size of initializer nodes (excluding graph inputs) in bytes:", initializer_size)
print("Total size of constant and initializer nodes (excluding graph inputs) in bytes:", total_size)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111097
Approved by: https://github.com/justinchuby, https://github.com/zhipenghan
2023-10-14 00:37:02 +00:00
3b08a4a6b2 [dynamo] collapse local and global guard builders (#111226)
[Wait for CI] [dynamo] collapse local and global guard builders

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111226
Approved by: https://github.com/ezyang
2023-10-14 00:16:59 +00:00
bb89a9e48c Skipped CUDA Flags if C++ Extension Name includes "arch" Substring (#111211)
The CUDA architecture flags from TORCH_CUDA_ARCH_LIST will be skipped if the TORCH_EXTENSION_NAME includes the substring "arch". A C++ Extension should be allowed to have any name. I just manually skip the TORCH_EXTENSION_NAME flag when checking if one of the flags is "arch". There is probably a better fix, but I'll leave this to experts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111211
Approved by: https://github.com/ezyang
2023-10-14 00:10:01 +00:00
f4297576e6 [inductor] Adding a way to force fusion of int_mm with mul (#111125)
Summary: When doing quantization, int_mm -> mul or int_mm -> mul -> to(dtype)
is an extremely common op pattern that is currently not handled well by
inductor. Ideally, since the output of int_mm has dtype int32, we'd prefer to
only realize a smaller dtype like bf16 or float16. Currently inductor doesn't
have a way to force this; in many cases the mul gets fused with a bunch of
subsequent pointwise ops from the dequant, increasing memory overhead and
causing a general slowdown compared to the fused version.
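
For reference, a sketch of the pattern in question, assuming the private `torch._int_mm` entry point (which has device and shape constraints):

```python
import torch

def int8_mm_scaled(a_int8, b_int8, scale):
    acc = torch._int_mm(a_int8, b_int8)      # int32 accumulator
    return (acc * scale).to(torch.bfloat16)  # the mul -> to(dtype) tail to fuse
```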

Theoretically with better control of/smarter inductor fusion, this could be something we get for free, at which point these changes can be removed.

Test Plan: python test/inductor/test_pattern_matcher.py -k
"int_mm_mul"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111125
Approved by: https://github.com/jansel, https://github.com/cpuhrsch
2023-10-13 23:37:14 +00:00
c151163333 Documentation Clarification on torch.compile Example (#110942)
Fixes #110917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110942
Approved by: https://github.com/msaroufim, https://github.com/malfet
2023-10-13 22:46:42 +00:00
00d962631c [BE] Enable Ruff's Flake8 PYI045 (#111184)
Enable [iter-method-return-iterable (PYI045)](https://docs.astral.sh/ruff/rules/iter-method-return-iterable/#iter-method-return-iterable-pyi045)

Link: #110950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111184
Approved by: https://github.com/Skylion007
2023-10-13 22:20:04 +00:00
ba7b9211ee [export] Update serialization schema to input/output specs. (#845) (#111204)
Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/845

Test Plan: CI

Differential Revision: D50191531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111204
Approved by: https://github.com/angelayi
2023-10-13 22:19:56 +00:00
a3e9b80082 Fix torch.diagonal for torch.onnx.export when dim1<0 or dim2<0 (#111130)
In many cases, torch.diagonal is called with (dim1=-2, dim2=-1), and ONNX export always fails in these cases.
This PR fixes the bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111130
Approved by: https://github.com/thiagocrepaldi
2023-10-13 22:05:53 +00:00
375e7bd003 Un-skip a bunch of UnaryUfuncInfo bfloat16 tests (#110799)
It appears they were disabled because the test suite didn't use to
support weaker tolerances for them. But it seems that has since
been addressed; e.g. we have relaxed tolerances specified in ded5ee75ac/test/test_unary_ufuncs.py (L208-L209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110799
Approved by: https://github.com/eellison
ghstack dependencies: #110798
2023-10-13 21:46:53 +00:00
d8de45d22c Update arg{min,max} tests and docs (#110845)
The `argmin` docs had been updated in
https://github.com/pytorch/pytorch/issues/78791 but left a minor typo.

`argmax` had a similar issue but was not noticed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110845
Approved by: https://github.com/eellison
2023-10-13 21:40:29 +00:00
d38472c176 Don't sympify reflection_pad2d ranges (#111212)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111212
Approved by: https://github.com/eellison
2023-10-13 21:36:30 +00:00
382327bd0e [BE] Enable Ruff's Flake8 PYI034 (#111105)
Enable [non-self-return-type (PYI034)](https://docs.astral.sh/ruff/rules/non-self-return-type/#non-self-return-type-pyi034)

Link: #110950

**EDIT**: to newly added reviewers, please ignore the request, it's due to a rebase error 😅

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111105
Approved by: https://github.com/Skylion007
2023-10-13 21:19:53 +00:00
2fd546aa5e Allow strided layout in torch.normal (#111205)
Fixes https://github.com/pytorch/pytorch/issues/111119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111205
Approved by: https://github.com/ezyang
2023-10-13 21:17:38 +00:00
b1db959085 [state_dict][3/N] Cleanup StateDictOptions, make it more readable (#111108)
1. Rename DistributedStateDictOptions to StateDictOptions.
2. Remove cpu_offload as we have not yet required this option.
3. Rename save_frozen_parameters to ignore_frozen_params.

Differential Revision: [D50209711](https://our.internmc.facebook.com/intern/diff/D50209711/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111108
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107
2023-10-13 21:03:51 +00:00
625a3b1a42 Remove some patterns from PrimTorch merge rules (#111230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111230
Approved by: https://github.com/ezyang
2023-10-13 20:37:15 +00:00
6aa91c8dad [dynamo] Register einops functions lazily (#110575)
Fixes #110549

We currently have a circular import between dynamo and einops as described in the issue.
This works around the issue by adding a mechanism to register initialization callbacks
that are called the first time an object is seen from that particular module.

This means that dynamo will only import `einops` after it's already fully initialized
and being called in a function being traced.
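
A hypothetical sketch of such a mechanism (all names invented for illustration; not dynamo's actual API):

```python
_init_callbacks = {}

def register_on_module_seen(root_module, callback):
    _init_callbacks.setdefault(root_module, []).append(callback)

def object_seen(obj):
    """Called by the tracer; fires callbacks the first time a module appears."""
    root = (getattr(obj, "__module__", None) or "").split(".")[0]
    for cb in _init_callbacks.pop(root, []):
        cb()  # e.g. register einops functions with dynamo only at this point

register_on_module_seen("einops", lambda: print("registering einops handlers"))
```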

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110575
Approved by: https://github.com/jansel
ghstack dependencies: #110990
2023-10-13 20:08:40 +00:00
8747e4c8c1 [dynamo] Add specialized variable tracker for sys.modules (#110990)
`sys.modules` is currently treated as a constant dictionary and any reference to
it will result in guards on the full contents of `sys.modules`. This instead
adds a specialized variable tracker which tries to guard only on the modules
referenced by the code. e.g.

```
sys.modules["operator"].add(x, x)
```

will generate the guard
```
___dict_contains('operator', G['sys'].modules)
```

It does this with special support for `__contains__` `__getitem__` and `.get`
which are probably the most commonly used with `sys.modules`. For anything else
we just fall back to building the dict tracker as normal.

While accessing `sys.modules` may seem unusual, it actually comes up when
inlining the `warnings.catch_warnings` context manager which internally accesses
`sys.modules["warnings"]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110990
Approved by: https://github.com/ezyang
2023-10-13 20:08:40 +00:00
058cb70ad9 [CI] Add auto label rule for torch/_export (#111181)
Summary: Auto label all torch/_export changes with ciflow/inductor to trigger AOTInductor tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111181
Approved by: https://github.com/angelayi
2023-10-13 20:07:47 +00:00
8db72a430d [sparse] Add padding for dense matrices in semi-structured sparse (#110583)
Summary:

Currently we have shape constraints in semi-structured sparsity for both
CUTLASS and cuSPARSELt

These shape constraints unfortunately apply to both the dense and sparse
matrices in sparse-dense matmul.

This PR adds in support for calling `F.pad` in order to pad dense
matrices to the right size with zeros and then pull out the
corresponding rows from the resulting matrix.
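
A sketch of the padding idea (the helper and the `multiple` value are illustrative assumptions; the real constraint depends on dtype and backend):

```python
import torch
import torch.nn.functional as F

def pad_dense_rows(dense: torch.Tensor, multiple: int = 8):
    rows = dense.shape[-2]
    pad = (-rows) % multiple
    if pad:
        dense = F.pad(dense, (0, 0, 0, pad))  # zero-pad the row dimension
    return dense, rows  # keep the original row count to slice the output later
```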

We also throw a warning in this case.
The tests have also been updated to take in a dense_input_shape
parameter.

Test Plan:
```
python test/test_sparse_semi_structured.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110583
Approved by: https://github.com/alexsamardzic, https://github.com/cpuhrsch
2023-10-13 20:04:23 +00:00
2b6f281e5c Revert "Remove dead code (#111207)"
This reverts commit c2ed714f54eb6564123bb53401c4c66aeba40625.

Reverted https://github.com/pytorch/pytorch/pull/111207 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it breaks lint c2ed714f54 ([comment](https://github.com/pytorch/pytorch/pull/111207#issuecomment-1762126366))
2023-10-13 19:56:11 +00:00
cf6b1cdf6a Revert "AOTAutograd: Go down inference path if no outputs require grad (#111011)"
This reverts commit ded5ee75ac51af1614cc79cd9c6f76524f10c3d8.

Reverted https://github.com/pytorch/pytorch/pull/111011 on behalf of https://github.com/kit1980 due to broke internal aotinductor tests with inference_mode ([comment](https://github.com/pytorch/pytorch/pull/111011#issuecomment-1762056233))
2023-10-13 19:11:26 +00:00
d84dcfb3e0 [Doc] Fix typo in cpp/installing when wheel is used (#111143)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 34a15a6</samp>

Updated the cmake command in `docs/cpp/source/installing.rst` to use python3 and fix a documentation error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111143
Approved by: https://github.com/clee2000, https://github.com/kit1980
2023-10-13 18:56:27 +00:00
c2ed714f54 Remove dead code (#111207)
This dictionary is not used anywhere. The _make_dupe_guard function does
not exist anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111207
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
2023-10-13 18:46:27 +00:00
e99abaae2f [state_dict][2/N] Let distributed.state_dict accepts single optimizer (#111107)
It's quite annoying that users have to create a tuple of optimizers even if there is only one optimizer. This PR makes most users' lives easier.

Differential Revision: [D50209704](https://our.internmc.facebook.com/intern/diff/D50209704/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111107
Approved by: https://github.com/wz337
ghstack dependencies: #111106
2023-10-13 18:40:57 +00:00
ac768333be [dynamo] fix prim lowering validation logic for dynamic shape args (#111208)
Fixes https://github.com/pytorch/pytorch/issues/111199

Fixes https://github.com/pytorch/pytorch/issues/111203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111208
Approved by: https://github.com/ezyang
2023-10-13 18:36:13 +00:00
247d5e16fc [DCP] Improve with_temp_dir robustness (#111106)
Call os.sync() to ensure the tempfile can be seen across ranks.

Differential Revision: [D50209697](https://our.internmc.facebook.com/intern/diff/D50209697/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111106
Approved by: https://github.com/Skylion007, https://github.com/wz337
2023-10-13 18:03:24 +00:00
eqy
5a2ab7dcb7 [CUDA][cuFFT] Initialize CUDA context for cuFFT before execute is called (#110326)
Potential fix for #109448

CC @Aidyn-A

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110326
Approved by: https://github.com/Aidyn-A, https://github.com/malfet
2023-10-13 18:02:25 +00:00
f68d6e8108 Revert "Move at::{Refcounted,}MapAllocator to c10 (#109881)"
This reverts commit 68a1219f74467a4d2124288f3ab6f8bc471fe4a1.

Reverted https://github.com/pytorch/pytorch/pull/109881 on behalf of https://github.com/kit1980 due to breaking internal builds, undefined symbol: _ZN3c1022RefcountedMapAllocator6decrefEv ([comment](https://github.com/pytorch/pytorch/pull/109881#issuecomment-1761950014))
2023-10-13 17:57:53 +00:00
8162f4170b Fix typo under c10 directory (#111155)
This PR fixes typo in comments and messages in files under `c10` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111155
Approved by: https://github.com/Skylion007
2023-10-13 16:52:51 +00:00
ac48c11ab7 Fix typo under torchgen directory (#111154)
This PR fixes typo in comments and messages in files under `torchgen` directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111154
Approved by: https://github.com/rajveer43, https://github.com/Skylion007
2023-10-13 16:43:46 +00:00
b460c30893 [BE] Enable Ruff's Flake8 PYI042 (#111114)
Enable [snake-case-type-alias (PYI042)](https://docs.astral.sh/ruff/rules/snake-case-type-alias/)

Link: #110950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111114
Approved by: https://github.com/albanD
2023-10-13 16:33:07 +00:00
5db9f911ac [pt][group_fusion] fix shape guarding in fusion candidate search (#111174)
Summary:
without the `all` in the fix
```
node.kwargs.get("beta", 1.0) == 1.0
node.kwargs.get("alpha", 1.0) == 1.0
and len(input_shape) == 2
and len(weight_shape) == 2
and all(x % 2 == 0 for x in input_shape + weight_shape)
and shape <= MAX_FUSE_TENSOR_SIZE_GROUP_LINEAR # <----- HERE
for shape in input_shape + weight_shape
```
without `all`, this statement evaluates to a bare generator object, which is always truthy, so the size check never fails. One issue is that the shapes could be odd, which forces gmm to load element-by-element rather than using vectorized loads. In the VDDv3 torchbench example (posted in the test plan), you can see a 37ms GMM call which swamps any gain from fusion. Overall this change makes the GMM fusion 24% faster
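
A quick illustration of why the missing `all` matters:

```python
shapes = [3, 5]                          # odd sizes should fail the check
print(bool(s % 2 == 0 for s in shapes))  # True: a generator is always truthy
print(all(s % 2 == 0 for s in shapes))   # False: the intended check
```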

Differential Revision: D48696572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111174
Approved by: https://github.com/davidberard98
2023-10-13 16:28:02 +00:00
84975339bd [PyTorch] AOTI: generate reused thread_locals when tensors provably have static shape (#110892)
If a Tensor can be reused and has static shape, we can just cache it across iterations.

This is meant as a quickly shippable overhead reduction for CPU overhead-bound use cases that we can ship without relying on memory planning.
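
A Python analogue of the idea (the generated code is C++ thread_locals; this sketch only shows the allocate-once-and-reuse shape):

```python
import torch

_buffers = {}

def reusable_buffer(key, shape, dtype):
    buf = _buffers.get(key)
    if buf is None:
        buf = torch.empty(shape, dtype=dtype)  # static shape => allocate once
        _buffers[key] = buf
    return buf  # reused across iterations instead of reallocating
```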

Differential Revision: [D50023678](https://our.internmc.facebook.com/intern/diff/D50023678/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110892
Approved by: https://github.com/bertmaher
ghstack dependencies: #110876, #110877, #110909
2023-10-13 16:07:05 +00:00
bf72a723ef [PyTorch] AOTI: Add aoti_torch_assign_tensors to ABI (#110909)
I need this to do a cheap and easy output copy in D50023678.

Differential Revision: [D50105080](https://our.internmc.facebook.com/intern/diff/D50105080/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110909
Approved by: https://github.com/jansel, https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #110876, #110877
2023-10-13 16:07:05 +00:00
cff71c47dd [dynamo] Forward fix a bunch of distributed collective allow fixes (#111171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111171
Approved by: https://github.com/yanboliang
2023-10-13 15:49:04 +00:00
33f1513486 Add lazy_clone_storage to create COW storages (#110192)
This PR relands #110022 but accounts for the changes in #110191. Also, the function for creating COW storages is called `lazy_clone_storage` in this PR, instead of `try_ensure`.

NOTE: COW storages do not actually copy on write yet; they just have the COW deleter and deleter context applied to them

Part of #109833

Differential Revision: [D50265134](https://our.internmc.facebook.com/intern/diff/D50265134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110192
Approved by: https://github.com/ezyang
2023-10-13 15:33:40 +00:00
35750bf9d1 [export] Fix issue with internal model (#111140)
Summary:
This error was encountered when running ExportPassBase on an exported model with lifted constant tensors:
```
  File "/data/users/angelayi/pytorch/torch/_subclasses/fake_tensor.py", line 1444, in dispatch
    len(kwargs) == 0 and len(args) == 1 and type(args[0]) is torch.Tensor
AssertionError: (FakeTensor(..., size=(s0,)),) {}

While executing %lift_fresh_copy_1 : [num_users=1] = call_function[target=torch.ops.aten.lift_fresh_copy.default](args = (%_lifted_tensor_constant99,), kwargs = {})
Original traceback:
  File "" in forward
    mean = torch.tensor([0.485, 0.456, 0.406]).reshape(3, 1, 1)
```

In ExportPassBase, we retrace using the fake tensors in the placeholder nodes, but when we run into these lift_fresh_copy operators, they cannot be called with the fake tensors.
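
A simplified repro sketch of the pattern (the module and export call here are illustrative, not the internal model):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # Constructing a constant tensor inside forward produces a
        # lift_fresh_copy node on the lifted constant when exported.
        mean = torch.tensor([0.485, 0.456, 0.406]).reshape(3, 1, 1)
        return x - mean

ep = torch.export.export(M(), (torch.randn(1, 3, 4, 4),))
print(ep.graph)
```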

Test Plan: CI

Reviewed By: chakriu

Differential Revision: D50211827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111140
Approved by: https://github.com/zhxchen17
2023-10-13 14:07:07 +00:00
359336e3e9 [C10D] Introduce C++ side Collective Callbacks. (#110307)
C++ side callbacks allow advanced users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment in which those
callbacks are invoked: from either the main thread or the watchdog thread,
and with a PTD lock held.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110307
Approved by: https://github.com/fduwjj
ghstack dependencies: #111061, #111072
2023-10-13 13:53:16 +00:00
d24539ee6a Improve reflection_pad2d lowering for dynamic shapes (#110988)
Fixes https://github.com/pytorch/pytorch/issues/110696

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110988
Approved by: https://github.com/jansel, https://github.com/lezcano
2023-10-13 13:38:46 +00:00
0dfa354570 [inductor] Implement Fx graph caching to improve warm compilation time. (#103453)
Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR.
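
Conceptually, the mechanism looks like the following toy sketch (not Inductor's actual implementation or key schema):

```python
import hashlib
import os
import pickle

CACHE_DIR = "/tmp/fx_graph_cache"  # illustrative location

def cache_key(graph_repr: str, config: dict) -> str:
    # Key on the graph plus any config that affects codegen.
    payload = graph_repr + repr(sorted(config.items()))
    return hashlib.sha256(payload.encode()).hexdigest()

def lookup(key: str):
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # cache hit: reuse the compiled artifact
    return None  # cache miss: compile and store

def store(key: str, artifact) -> None:
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(os.path.join(CACHE_DIR, key), "wb") as f:
        pickle.dump(artifact, f)
```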

Test Plan:
* New unit tests exercising saving and load from the cache.
* New unit tests to exercise the cache key calculations.
* Ran several benchmarks to see cache hit and resulting compilation times.

Differential Revision: [D50255289](https://our.internmc.facebook.com/intern/diff/D50255289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453
Approved by: https://github.com/eellison, https://github.com/Chillee
2023-10-13 13:33:56 +00:00
69dcbc02b0 [Dynamo]Expose bytecode hooks and add example usage for decompilation in docs (#110714)
Dynamo dynamically translates the bytecode of python functions, which is powerful but produces difficult-to-understand bytecode that most users cannot read. Although a general-purpose way to decompile python bytecode into source code is very difficult, I find that this work can be greatly simplified since Dynamo already cleans up the code: the bytecode generated by Dynamo is a reduced subset of well-behaved python bytecode.

I created a tiny decompiler for pytorch 2.0, named `depyf`: https://github.com/youkaichao/depyf .

There are several takeaways:

- **It supports python 3.7 - 3.11 (both inclusive), the same python versions supported by pytorch.** Since the main usage of this library is to understand pytorch 2.0, I plan to keep pace with pytorch. If pytorch supports a new python version, I can add support for that. (Actually, the core code is just about 1k lines. Adding support for new versions of python bytecode can be done in just several days.)
- **I have tested the correctness of the decompiled source code in torchbench.** I capture the modified bytecode generated by Dynamo, decompile it into source code, compile it back into new bytecode, and replace the Dynamo-generated bytecode with the new bytecode. **It passed all the accuracy tests for timm models.** For huggingface models, the situation is more complicated: all failed cases are caused by the compile step: some functions use `__class__` as a closure variable, but the decompiler can only get the code object, so it has no way to figure out the `__class__`, leading to a name error when compiling the decompiled code. That said, it passed the remaining tests without the `__class__` issue. Please see the log files https://cloud.tsinghua.edu.cn/f/685e4af8d930499baa7c/?dl=1 and https://cloud.tsinghua.edu.cn/f/cab89500e15e4b62890b/?dl=1 for details.

With the above efforts, I think it would be great to add an additional logging option in Dynamo: we can try to decompile the generated bytecode into source code, so that users can have a rough idea of what the modified bytecode does. It does not affect the workflow of Dynamo, but just adds more debug information.

An example code from the [doc](https://pytorch.org/docs/main/torch.compiler_deepdive.html):

```python
from typing import List
import torch
from torch import _dynamo as torchdynamo
def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    print("my_compiler() called with FX graph:")
    gm.graph.print_tabular()
    return gm.forward  # return a python callable

@torchdynamo.optimize(my_compiler)
def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:
        b = b * -1
    return x * b
for _ in range(100):
    toy_example(torch.randn(10), torch.randn(10))
```

Run with `export TORCH_LOGS="+dynamo,guards,bytecode"`.

Bytecode logging:

```
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] ORIGINAL BYTECODE toy_example /Users/youkaichao/DeepLearning/depyf/ykc_test.py line 8
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  10           0 LOAD_FAST                0 (a)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               2 LOAD_GLOBAL              0 (torch)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               4 LOAD_METHOD              1 (abs)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               6 LOAD_FAST                0 (a)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               8 CALL_METHOD              1
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              10 LOAD_CONST               1 (1)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              12 BINARY_ADD
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              14 BINARY_TRUE_DIVIDE
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              16 STORE_FAST               2 (x)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  11          18 LOAD_FAST                1 (b)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              20 LOAD_METHOD              2 (sum)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              22 CALL_METHOD              0
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              24 STORE_FAST               3 (__temp_2)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  12          26 LOAD_FAST                3 (__temp_2)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              28 LOAD_CONST               2 (0)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              30 COMPARE_OP               0 (<)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              32 POP_JUMP_IF_FALSE       21 (to 42)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  13          34 LOAD_FAST                1 (b)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              36 LOAD_CONST               3 (-1)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              38 BINARY_MULTIPLY
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              40 STORE_FAST               1 (b)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  14     >>   42 LOAD_FAST                2 (x)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              44 LOAD_FAST                1 (b)
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              46 BINARY_MULTIPLY
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              48 RETURN_VALUE
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 23:56:44,929] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] MODIFIED BYTECODE toy_example /Users/youkaichao/DeepLearning/depyf/ykc_test.py line 8
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]   8           0 LOAD_GLOBAL              3 (__compiled_fn_0)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               2 LOAD_FAST                0 (a)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               4 LOAD_FAST                1 (b)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               6 CALL_FUNCTION            2
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               8 UNPACK_SEQUENCE          2
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              10 STORE_FAST               2 (x)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              12 POP_JUMP_IF_FALSE       12 (to 24)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              14 LOAD_GLOBAL              4 (__resume_at_34_1)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              16 LOAD_FAST                1 (b)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              18 LOAD_FAST                2 (x)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              20 CALL_FUNCTION            2
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              22 RETURN_VALUE
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]         >>   24 LOAD_GLOBAL              5 (__resume_at_42_2)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              26 LOAD_FAST                1 (b)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              28 LOAD_FAST                2 (x)
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              30 CALL_FUNCTION            2
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              32 RETURN_VALUE
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 23:56:44,930] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
```

New output with this PR:

```
[2023-10-06 16:25:21,535] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] possible source code:
[2023-10-06 16:25:21,535] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] def toy_example(a, b):
[2023-10-06 16:25:21,535] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]     __temp_1 = __compiled_fn_0(a, b)
[2023-10-06 16:25:21,535] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]     x = __temp_1[0]
[2023-10-06 16:25:21,535] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]     if __temp_1[1]:
[2023-10-06 16:25:21,535] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]         return __resume_at_34_1(b, x)
[2023-10-06 16:25:21,535] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]     return __resume_at_42_2(b, x)
[2023-10-06 16:25:21,535] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,535] [0/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] If you find the decompiled code is wrong,please submit an issue at https://github.com/youkaichao/depyf/issues.
```

The remaining two logs (note the `possible source code:` output):

```
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] ORIGINAL BYTECODE <resume in toy_example> /workspace/youkaichao/code/pytorch/ykc.py line 12
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  12           0 JUMP_ABSOLUTE           22 (to 44)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               2 LOAD_FAST                2 (a)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               4 LOAD_GLOBAL              0 (torch)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               6 LOAD_ATTR                1 (abs)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               8 LOAD_FAST                2 (a)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              10 CALL_FUNCTION            1
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              12 LOAD_CONST               1 (1)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              14 BINARY_ADD
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              16 BINARY_TRUE_DIVIDE
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              18 STORE_FAST               1 (x)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              20 LOAD_FAST                0 (b)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              22 LOAD_ATTR                2 (sum)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              24 CALL_FUNCTION            0
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              26 STORE_FAST               3 (__temp_2)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              28 LOAD_FAST                3 (__temp_2)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              30 LOAD_CONST               2 (0)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              32 COMPARE_OP               0 (<)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              34 POP_JUMP_IF_FALSE       22 (to 44)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              36 LOAD_FAST                0 (b)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              38 LOAD_CONST               3 (-1)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              40 BINARY_MULTIPLY
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              42 STORE_FAST               0 (b)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  14     >>   44 LOAD_FAST                1 (x)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              46 LOAD_FAST                0 (b)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              48 BINARY_MULTIPLY
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              50 RETURN_VALUE
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] MODIFIED BYTECODE <resume in toy_example> /workspace/youkaichao/code/pytorch/ykc.py line 12
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  12           0 LOAD_GLOBAL              3 (__compiled_fn_3)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               2 LOAD_FAST                0 (b)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               4 LOAD_FAST                1 (x)
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               6 CALL_FUNCTION            2
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               8 UNPACK_SEQUENCE          1
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              10 RETURN_VALUE
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,566] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,567] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] possible source code:
[2023-10-06 16:25:21,567] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] def <resume in toy_example>(b, x):
[2023-10-06 16:25:21,567] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]     return __compiled_fn_3(b, x)[0]
[2023-10-06 16:25:21,567] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,567] [1/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] If you find the decompiled code is wrong,please submit an issue at https://github.com/youkaichao/depyf/issues.
```

```
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] ORIGINAL BYTECODE <resume in toy_example> /workspace/youkaichao/code/pytorch/ykc.py line 12
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  12           0 JUMP_ABSOLUTE           18 (to 36)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               2 LOAD_FAST                2 (a)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               4 LOAD_GLOBAL              0 (torch)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               6 LOAD_ATTR                1 (abs)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               8 LOAD_FAST                2 (a)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              10 CALL_FUNCTION            1
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              12 LOAD_CONST               1 (1)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              14 BINARY_ADD
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              16 BINARY_TRUE_DIVIDE
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              18 STORE_FAST               1 (x)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              20 LOAD_FAST                0 (b)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              22 LOAD_ATTR                2 (sum)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              24 CALL_FUNCTION            0
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              26 STORE_FAST               3 (__temp_2)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              28 LOAD_FAST                3 (__temp_2)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              30 LOAD_CONST               2 (0)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              32 COMPARE_OP               0 (<)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              34 POP_JUMP_IF_FALSE       22 (to 44)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  13     >>   36 LOAD_FAST                0 (b)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              38 LOAD_CONST               3 (-1)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              40 BINARY_MULTIPLY
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              42 STORE_FAST               0 (b)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  14     >>   44 LOAD_FAST                1 (x)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              46 LOAD_FAST                0 (b)
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              48 BINARY_MULTIPLY
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              50 RETURN_VALUE
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,579] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] MODIFIED BYTECODE <resume in toy_example> /workspace/youkaichao/code/pytorch/ykc.py line 12
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]  12           0 LOAD_GLOBAL              3 (__compiled_fn_4)
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               2 LOAD_FAST                0 (b)
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               4 LOAD_FAST                1 (x)
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               6 CALL_FUNCTION            2
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]               8 UNPACK_SEQUENCE          1
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]              10 RETURN_VALUE
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] possible source code:
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] def <resume in toy_example>(b, x):
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]     return __compiled_fn_4(b, x)[0]
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG]
[2023-10-06 16:25:21,580] [2/0] torch._dynamo.convert_frame.__bytecode: [DEBUG] If you find the decompiled code is wrong,please submit an issue at https://github.com/youkaichao/depyf/issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110714
Approved by: https://github.com/jansel
2023-10-13 12:36:00 +00:00
cdc8d709cb Fix mkldnn_matmul error on AArch64 (#110150)
Fixes #110149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110150
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2023-10-13 12:35:46 +00:00
24bd3301d9 Fixed description of run_on input for linux-binary-test workflow (#111191)
As part of migrating the linux arm64 runners to the autoscaling group, fixed the description of the run_on input for the linux-binary-test workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111191
Approved by: https://github.com/jeanschmidt
2023-10-13 09:57:40 +00:00
af05fbb84a Linter to avoid csv merge conflicts (#111163)
This PR addresses the persistent issue of merge conflicts in the benchmarks/dynamo/ci_expected_accuracy/ directory, specifically those arising from frequently updated CSV files. Based on @malfet's suggestion, the solution implemented adds three spaces between each line in the CSV files. This approach has proven effective in preventing merge conflicts, as evidenced in [D50239634](https://www.internalfb.com/intern/diff/D50239634/). Regardless of these changes, the extra newlines should still allow the CSVs to be ingested as normal.

If you have access to the diff:
Normally, modifying a line that is later altered in the stack results in a merge conflict during restacking. With this new spacing strategy, lines that are not modified further down the stack will not trigger merge conflicts, achieving our intended outcome.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111163
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-10-13 09:35:34 +00:00
8209bbbd06 [AOTInductor] Improve validation for C++ wrapper codegen (#111102)
This is a reimplementation of #111089.

1. When using fake inputs, make sure they are on the same device as the original inputs.
2. Don't change the value of self.cpp_wrapper from True to False if we can't generate a C++ wrapper; instead, check and fail early to avoid producing Python code for the C++ compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111102
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w
2023-10-13 08:46:17 +00:00
898482f1bf [logging] log exceptions when provided (#111164)
This PR will cause logging.exception() to also dump the exception and stacktrace. Copied from 74723e1110/Lib/logging/__init__.py (L707-L711)

repro:

<details>

```python
import torch
import torch._inductor.config

torch._inductor.config.triton.inject_relu_bug_TESTING_ONLY = "runtime_error"

def fn(x, y):
    return (x @ y).relu()

x, y = [torch.rand((16, 16), device='cuda') for _ in range(2)]
torch.compile(fn)(x, y)
```
run with TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=4

</details>

before:
```
...
[2023-10-12 14:18:52,902] torch._dynamo.debug_utils: [ERROR] While minifying the program in accuracy minification mode, ran into a runtime exception which is likely an unrelated issue. Skipping this graph.
```

now:
```
...
[2023-10-12 14:18:52,902] torch._dynamo.debug_utils: [ERROR] While minifying the program in accuracy minification mode, ran into a runtime exception which is likely an unrelated issue. Skipping this graph.
Traceback (most recent call last):
  File "/data/users/dberard/scripts/relu_accuracy_issue.py", line 10, in <module>
    torch.compile(fn)(x, y)
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111164
Approved by: https://github.com/eellison
2023-10-13 03:52:26 +00:00
4c01686027 Public API for constructing NT with jagged layout from tensor list (#111078)
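A usage sketch of the new constructor (argument names assumed from the nested-tensor factory API):

```python
import torch

ts = [torch.randn(2, 5), torch.randn(3, 5), torch.randn(4, 5)]
# Build a nested tensor with the jagged layout from a list of tensors.
nt = torch.nested.nested_tensor(ts, layout=torch.jagged)
print(nt.is_nested)  # True
```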
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111078
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #109123
2023-10-13 03:27:41 +00:00
a2c17a2b00 [PyTorch] AOTI: add CPU fast path in aoti_torch_empty_strided (#110877)
This seems to reduce benchmark time by 15-20%. Supersedes D49835545.

Differential Revision: [D49974460](https://our.internmc.facebook.com/intern/diff/D49974460/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110877
Approved by: https://github.com/chenyang78, https://github.com/jansel, https://github.com/desertfire
ghstack dependencies: #110876
2023-10-13 02:16:11 +00:00
b85f848233 [PyTorch] -DNDEBUG in inductor codecache builds (#110876)
Things like TORCH_INTERNAL_ASSERT_DEBUG_ONLY care about this!

Differential Revision: [D49972742](https://our.internmc.facebook.com/intern/diff/D49972742/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110876
Approved by: https://github.com/chenyang78, https://github.com/jansel, https://github.com/desertfire
2023-10-13 02:16:11 +00:00
168bad5f23 [export] Reland "Fix graph signature data model to list of specs." (#111136)
Summary: reland D49876258

Test Plan: CI

Differential Revision: D50224384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111136
Approved by: https://github.com/angelayi
2023-10-13 02:04:29 +00:00
9980876cab Quant: add weight int4pack mm kernel (#110914)
Adding the weight int4pack mm CUDA kernel. The kernel comes from the tinygemm project, which was developed by Jeff Johnson.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110914
Approved by: https://github.com/Chillee
2023-10-13 01:21:18 +00:00
8713a1a363 add Half support for bernoulli on CPU (#104176)
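In other words, a minimal sketch of what this enables:

```python
import torch

p = torch.full((4,), 0.5, dtype=torch.half)  # float16 probabilities on CPU
print(torch.bernoulli(p))                    # Half on CPU is now supported
```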
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2023-10-13 01:18:55 +00:00
74b1f4f71a Update sdp_utils functions to accept const& params (#111144)
# Summary
None of our filter functions should mutate the passed-in params; this both makes the intent clearer and allows the compiler to possibly produce more optimal code.

### Note
I used East-const style because I think it is clearer:
https://mariusbancila.ro/blog/2018/11/23/join-the-east-const-revolution/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111144
Approved by: https://github.com/cpuhrsch, https://github.com/Skylion007
2023-10-13 00:48:08 +00:00
21dc1d2547 [Vulkan] Add the 2D case to Layernorm operator (#110796)
Summary:
We add a 2D implementation to the op [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)

The current implementation of layer_norm D37407311 supports
- input of 3D and normalized_shape also of 3D, or
- input of 4D with batch dim equal to 1 and normalized_shape of 3D

Since a 2D tensor of [H, W] can be represented as [1, H, W] in the shader, we make a straightforward generalization to the case where both input and normalized_shape are 2D.
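
In eager terms, the newly supported case is simply the following (CPU shown for illustration; the PR enables the same shapes on the Vulkan backend):

```python
import torch

x = torch.randn(8, 16)  # 2D input [H, W]
y = torch.nn.functional.layer_norm(x, normalized_shape=[8, 16])  # 2D normalized_shape
print(y.shape)  # torch.Size([8, 16])
```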

Test Plan:
## Before
```
[luwei@devbig984.prn1 ~/fbsource (e09fe4ae4|remote/fbsource/stable...)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm_2d*"
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.

If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa

  Targets matching .buckconfig buck2.supported_projects:
  {'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}

  To suppress this warning: touch ~/.config/.dont_hint_buck2

clang-12: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]

Downloaded 2/4 artifacts, 125.45 Kbytes, 33.3% cache miss (for updated rules)
Building: finished in 4.9 sec (100%) 2637/2637 jobs, 3/2637 updated
  Total time: 4.9 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm_2d*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.layer_norm_2d_small
unknown file: Failure
C++ exception with description "Vulkan layernorm expects 3-dim or 4-dim input!
Exception raised from layer_norm at xplat/caffe2/aten/src/ATen/native/vulkan/ops/Layernorm.cpp:66 (most recent call first):
(no backtrace available)" thrown in the test body.
[  FAILED  ] VulkanAPITest.layer_norm_2d_small (56 ms)
[ RUN      ] VulkanAPITest.layer_norm_2d_medium
unknown file: Failure
C++ exception with description "Vulkan layernorm expects 3-dim or 4-dim input!
Exception raised from layer_norm at xplat/caffe2/aten/src/ATen/native/vulkan/ops/Layernorm.cpp:66 (most recent call first):
(no backtrace available)" thrown in the test body.
[  FAILED  ] VulkanAPITest.layer_norm_2d_medium (0 ms)
[ RUN      ] VulkanAPITest.layer_norm_2d_large
unknown file: Failure
C++ exception with description "Vulkan layernorm expects 3-dim or 4-dim input!
Exception raised from layer_norm at xplat/caffe2/aten/src/ATen/native/vulkan/ops/Layernorm.cpp:66 (most recent call first):
(no backtrace available)" thrown in the test body.
[  FAILED  ] VulkanAPITest.layer_norm_2d_large (27 ms)
[----------] 3 tests from VulkanAPITest (84 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (84 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] VulkanAPITest.layer_norm_2d_small
[  FAILED  ] VulkanAPITest.layer_norm_2d_medium
[  FAILED  ] VulkanAPITest.layer_norm_2d_large

 3 FAILED TESTS
```

## After
```
[luwei@devbig984.prn1 ~/fbsource (e09fe4ae4|remote/fbsource/stable...)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm_2d*"
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.

If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa

  Targets matching .buckconfig buck2.supported_projects:
  {'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}

  To suppress this warning: touch ~/.config/.dont_hint_buck2

clang-12: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]

Downloaded 1/3 artifacts, 1.40 Mbytes, 50.0% cache miss (for updated rules)
Building: finished in 5.0 sec (100%) 2637/2637 jobs, 2/2637 updated
  Total time: 5.0 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm_2d*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.layer_norm_2d_small
[       OK ] VulkanAPITest.layer_norm_2d_small (282 ms)
[ RUN      ] VulkanAPITest.layer_norm_2d_medium
[       OK ] VulkanAPITest.layer_norm_2d_medium (0 ms)
[ RUN      ] VulkanAPITest.layer_norm_2d_large
[       OK ] VulkanAPITest.layer_norm_2d_large (214 ms)
[----------] 3 tests from VulkanAPITest (497 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (497 ms total)
[  PASSED  ] 3 tests.
```
full test result: P848167714

Differential Revision: D50048054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110796
Approved by: https://github.com/yipjustin
2023-10-13 00:37:11 +00:00
9c7f464eef [inductor]: Better debugging of can_fuse decisions with TORCH_LOGS=fusion (#110415)
Fixes https://github.com/pytorch/pytorch/issues/110393

Example logs (for adagrad on main). In this case, it clearly identifies device mismatch as a potential red flag, which is indeed the obstacle to adagrad's successful fusion. (see: https://github.com/pytorch/pytorch/pull/110339)
```
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== attempting fusion (1/10): 18 nodes =====
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] 13 possible fusions:
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf8'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf10'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf12'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf14'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf9'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf11'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf13'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf15'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf25'), SchedulerNode(name='buf33'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf43'), SchedulerNode(name='buf51'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf34'), SchedulerNode(name='buf42'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf16'), SchedulerNode(name='buf24'))
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] completed fusion round (1/10): fused 18 nodes into 5 nodes
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG]
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== attempting fusion (2/10): 5 nodes =====
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] 0 possible fusions:
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] completed fusion round (2/10): fused 5 nodes into 5 nodes
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG]
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== fusion complete (2 iterations) =====

```

CC @jansel @ngimel @mlazos @shunting314 @peterbell10  as code owners

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110415
Approved by: https://github.com/mlazos
2023-10-13 00:36:45 +00:00
1208a44799 [docs] export full aten opset (#111161)
Differential Revision: [D50240459](https://our.internmc.facebook.com/intern/diff/D50240459/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111161
Approved by: https://github.com/tugsbayasgalan
2023-10-13 00:28:35 +00:00
ad4472833c define public API for torch.nn.utils (#111026)
Adding the modules imported here and the following functions to the `__all__` (a short usage sketch follows the list):
* [clip_grad_norm_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html)
* [clip_grad_value_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_value_.html)
* [remove_weight_norm](https://pytorch.org/docs/stable/generated/torch.nn.utils.remove_weight_norm.html)
* [parameters_to_vector](https://pytorch.org/docs/stable/generated/torch.nn.utils.parameters_to_vector.html)
* [vector_to_parameters](https://pytorch.org/docs/stable/generated/torch.nn.utils.vector_to_parameters.html)
* [remove_spectral_norm](https://pytorch.org/docs/stable/generated/torch.nn.utils.remove_spectral_norm.html)
* [skip_init](https://pytorch.org/docs/stable/generated/torch.nn.utils.skip_init.html)
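
A brief usage sketch of two of the now-public names:

```python
import torch
from torch.nn.utils import clip_grad_norm_, parameters_to_vector

model = torch.nn.Linear(4, 4)
model(torch.randn(2, 4)).sum().backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradients in place
vec = parameters_to_vector(model.parameters())     # flatten all params
print(vec.shape)  # torch.Size([20])  (16 weights + 4 biases)
```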
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111026
Approved by: https://github.com/mikaylagawarecki
2023-10-12 23:05:23 +00:00
8f90be4478 Expand NT subclass to support SAM (#109123)
This PR contains the changes needed to support using the NT jagged subclass within SAM. Note that a NT with multiple ragged dims is still required at the extremes for inputs / outputs, but the internal computation generally involves a single ragged dim, making the jagged layout usable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109123
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-10-12 20:33:22 +00:00
e0eaa95e99 [DCP] Remove _shard_tensor() call in load_sharded_optimizer_state_dict in optimizer.py (#111096)
`_shard_tensor()` calls into `dist.all_gather_object()` and this is causing optimizer state dict loading to be super slow. Workaround: call `FSDP._shard_utils._create_chunk_sharded_tensor()` to construct ShardedTensor without any communication.

Thanks to @fegin for suggesting the fix!
Thanks @mvpatel2000 for reporting the issue and providing profiling details to help us isolate the problematic source code quickly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111096
Approved by: https://github.com/fegin
2023-10-12 20:27:06 +00:00
6748a14a71 [aot_inductor] add a test with AOTInductor + TorchScript (#111124)
This test may serve as a reference for using AOTInductor with TorchScript.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111124
Approved by: https://github.com/jansel
2023-10-12 19:29:07 +00:00
397deaa825 Fix typo in mixed dtypes linear operator implementation. (#111127)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111127
Approved by: https://github.com/Skylion007
2023-10-12 19:06:04 +00:00
7fbfa4e020 Revert "[inductor] Implement Fx graph caching to improve warm compilation time. (#103453)"
This reverts commit fc1105b2827ee2febc85a3c353470edfd70a66ed.

Reverted https://github.com/pytorch/pytorch/pull/103453 on behalf of https://github.com/kit1980 due to Same issue unfortunately, the newly added test fails on internal builds ([comment](https://github.com/pytorch/pytorch/pull/103453#issuecomment-1760202365))
2023-10-12 18:54:51 +00:00
f9053877b4 Add pypi required metadata to all wheels except linux (#111042)
Will fix the packaging issue after publishing: https://github.com/pytorch/pytorch/issues/100974
Poetry install requires all wheels on PyPI to have the same metadata, hence we include the linux dependencies in all non-linux wheels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111042
Approved by: https://github.com/malfet
2023-10-12 17:40:13 +00:00
94c9dbff22 Disable cutlass_template on ROCm (#111132)
Fixes #111066 #111065 #111064

Currently `use_cutlass_template` returns True on ROCm, but the feature is not supported there; fix it to return False on ROCm. I considered adding this change to `try_import_cutlass` instead, but the comments hinted that that function would be removed at some point.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111132
Approved by: https://github.com/jansel
2023-10-12 17:14:07 +00:00
bb1424d46e Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)
This reverts commit 314a502eb04c6382e2cc9af0573533efba54109d.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collective failures both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and is called inline from the
PG creation code. This is fine since this happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.

Queue notification is oddly done using a pipe; this is needed so python can abort the thread on shutdown
and keep it as a background thread, which is not possible with more reasonable choices like a condvar.
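
A hypothetical registration sketch (the single-argument callback shape is an assumption; only the module path and method names come from this description):

```python
import torch.distributed._hooks as _hooks  # module name after reland 1's rename

def on_collective_start(info):
    # Fired from the background dispatch thread, not the main thread.
    print("collective started:", info)

def on_collective_end(info):
    print("collective finished:", info)

def on_process_group(info):
    # Called inline during PG creation on the member ranks.
    print("process group created:", info)

_hooks.register_collective_start_hook(on_collective_start)
_hooks.register_collective_end_hook(on_collective_end)
_hooks.register_process_group_hook(on_process_group)
```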
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
2023-10-12 16:59:23 +00:00
dede1e96e2 [BE] Enable Ruff's Flake8 PYI018 (#111101)
Enable [unused-private-type-var (PYI018)](https://docs.astral.sh/ruff/rules/unused-private-type-var/#unused-private-type-var-pyi018)

Link: #110950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111101
Approved by: https://github.com/albanD
2023-10-12 16:26:21 +00:00
53a9ac534c Added decorator skipRocmIfTorchInductor and skipped failing tests (#107760)
This PR adds a skip decorator which will disable tests in CI for the ROCm inductor workflow. This new workflow will come in via https://github.com/pytorch/pytorch/pull/110544
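
A hypothetical usage sketch (the decorator name comes from the title; its import location and message argument are assumptions):

```python
from torch.testing._internal.common_utils import TestCase, skipRocmIfTorchInductor

class MyOpTests(TestCase):
    @skipRocmIfTorchInductor("disabled on the ROCm inductor workflow")
    def test_my_op(self):
        self.assertTrue(True)
```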

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107760
Approved by: https://github.com/jataylo, https://github.com/pruthvistony, https://github.com/atalman
2023-10-12 16:00:35 +00:00
918054f422 [Inductor] support channel last for xpu conv in inductor layout opt path (#111018)
# Motivation
Support XPU channels-last in the inductor layout optimization path.
Currently, `_conv_determine_backend_memory_format` always returns torch.contiguous_format for XPU conv.

# Solution
Add an XPU channels-last detection strategy in `determine_backend_memory_format`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111018
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/EikanWang
2023-10-12 15:13:50 +00:00
5ace912263 fix: do not reshard parameters twice (#110948)
This PR fixes potential double resharding of parameters that both:

1. require no gradient, and
2. were used more than once during forward pass.

[`_register_post_backward_hook`](https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_runtime_utils.py#L1415) handles the case correctly, this PR does the same for `_register_post_backward_reshard_only_hook`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110948
Approved by: https://github.com/awgu
2023-10-12 15:09:33 +00:00
aec0f98e70 Move cuda driver exit handling from helpers to threads (#111061)
The pattern here is that main may exit and kill cuda driver before
c10d watchdog related threads have cleanly exited.  If this happens,
c10d threads may still make CUDA api calls and raise an exception about
the cuda driver being dead.

In the past we've patched a few helper functions that call into cuda
to specifically handle this driver exiting message.  Instead, we know
that this problem applies only to codepaths in our background threads,
so we should catch at that scope and not worry about fine-grained
catching at the helper granularity. (and if a helper is used from the main
thread, we should NOT catch this exception; it's the application's fault)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111061
Approved by: https://github.com/malfet, https://github.com/fduwjj
2023-10-12 13:47:04 +00:00
2f53085f3f [BE] Enable Ruff's Flake8 PYI030 (#111103)
Enable [unnecessary-literal-union (PYI030)](https://docs.astral.sh/ruff/rules/unnecessary-literal-union/)

Link: #110950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111103
Approved by: https://github.com/albanD
2023-10-12 13:31:59 +00:00
68a1219f74 Move at::{Refcounted,}MapAllocator to c10 (#109881)
`libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`,
 so it makes sense to move it to c10 along with the other memory allocators.

This means `libshm.so` only depends on `c10` and we don't need to relink
`libshm.so` for every ATen change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881
Approved by: https://github.com/albanD
2023-10-12 10:51:13 +00:00
42b89aea4b Revert "[export] Fix graph signature data model to list of specs. (#111017)"
This reverts commit 33b69509d3665f82bf91cee96f9beeef0d8e0b72.

Reverted https://github.com/pytorch/pytorch/pull/111017 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111017#issuecomment-1759292161))
2023-10-12 09:52:33 +00:00
395d0eaea0 Dynamo - config gated torch.distributed allow, exclusion for special leaf funcs (#110894)
`is_allowed` is a tricky bit of functionality - it sits early up in builder and is used to drive the creation of TorchVariable (more notes here, meta only https://fb.workplace.com/groups/pytorch.dev/permalink/1393563781222098/)

If we are tracing distributed in full, we want to route certain calls in distributed to NOT pass is_allowed (confusingly, this does not mean that they are not allowed, but rather that we don't want them to become TorchVariable); others we are fine with preserving.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110894
Approved by: https://github.com/ezyang
2023-10-12 09:25:51 +00:00
499146354e Use CUDA image for lintrunner (#110502)
We switch to pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter for lintrunner for checking CUDA cpp source. Meanwhile, there is a Dockerfile change due to missing libiomp installation and some other clang-tidy fixes triggered by the switch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110502
Approved by: https://github.com/malfet
2023-10-12 09:16:36 +00:00
d8ad0ba5c1 [Dist][ez][nit] Formatted nccl version string in startup (#111076)
Formats the string using the existing getNCCLversion

Differential Revision: [D50193558](https://our.internmc.facebook.com/intern/diff/D50193558/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111076
Approved by: https://github.com/Skylion007
2023-10-12 08:54:32 +00:00
b35279dfac [DDP] Make _ReplicateState inherit from _State and make replicate eagerly initialized (#109647)
Following how fully_shard stores the _FSDPState, this PR makes _ReplicateState inherit from _State.  This PR also makes replicate eagerly initialize the internal DDP instance so that users can access the required methods/functions before the first forward().

Differential Revision: [D49428291](https://our.internmc.facebook.com/intern/diff/D49428291/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109647
Approved by: https://github.com/wz337, https://github.com/rohan-varma
ghstack dependencies: #110688
2023-10-12 07:58:39 +00:00
5614023f5e Move export.constrain_as_* to torch._constrain_as_* (#110757)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110757
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #109859
2023-10-12 05:37:44 +00:00
6ce3a38050 Revert "Move export.constrain_as_* to torch._constrain_as_* (#110757)"
This reverts commit 5aee22e0e033dbd2346b533fb2651ee30ca5ed86.

Reverted https://github.com/pytorch/pytorch/pull/110757 on behalf of https://github.com/kit1980 due to Depends on https://github.com/pytorch/pytorch/pull/109859 that needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/110757#issuecomment-1758908371))
2023-10-12 04:53:29 +00:00
f0e7a91030 [vision hash update] update the pinned vision hash (#111098)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111098
Approved by: https://github.com/pytorchbot
2023-10-12 04:30:55 +00:00
5e8be63e99 Allow specifiying inputs as GradientEdge in autograd APIs (#110867)
This can be useful for advanced users (like AOTAutograd) who don't want to keep the corresponding Tensor alive (for memory reasons, for example) or when an inplace op will change the Tensor's grad_fn (but gradients w.r.t. the original value are needed).

I went with a minimal API change but am open to suggestions.
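
A usage sketch (assuming the `get_gradient_edge` helper in `torch.autograd.graph`; the exact call pattern may differ from the final API):

```python
import torch
from torch.autograd.graph import get_gradient_edge

x = torch.randn(3, requires_grad=True)
y = x * 2
edge = get_gradient_edge(y)          # refer to y's position in the graph
out = (y ** 3).sum()
(g,) = torch.autograd.grad(out, inputs=(edge,))
print(g)                             # equals 3 * y**2
```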

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110867
Approved by: https://github.com/soulitzer
2023-10-12 04:08:44 +00:00
33b69509d3 [export] Fix graph signature data model to list of specs. (#111017)
Summary:
Previously we designed the GraphSignature format as a bunch of input and output node names. After a discussion in the design meeting we decided to change the format to make the signature more self-contained. Now the signature format looks like the following:
```
[
InputSpec(
   kind=InputKind.USER_INPUT,
   arg=TensorArgument(name="arg0_1"),
   target=None,
),
...
]
```

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D49876258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111017
Approved by: https://github.com/angelayi
2023-10-12 03:39:04 +00:00
097defb160 [device mesh] only check when world size > num_devices per host (#111091)
as titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111091
Approved by: https://github.com/awgu, https://github.com/wz337
ghstack dependencies: #110898, #110900
2023-10-12 03:37:18 +00:00
9316c8b4bc Use torch._check for cat error checking (#111035)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111035
Approved by: https://github.com/Skylion007
2023-10-12 03:28:27 +00:00
6dca81c054 Revert 107846 and 109695 (#111099)
https://github.com/pytorch/pytorch/pull/107846 caused Meta-internal S369412
https://github.com/pytorch/pytorch/pull/109695 depends on 107846 so also needs to be reverted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111099
Approved by: https://github.com/malfet
2023-10-12 02:30:30 +00:00
07f0f383fa update tensor-like to check instance for torch function impl (#111087)
Tensor-like should check the instance for a torch function impl, not the type.
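
A hedged, pure-Python illustration of the distinction (no torch required): an object can carry `__torch_function__` on the instance without its type defining it.

```python
class Duck:
    pass

d = Duck()
d.__torch_function__ = lambda *args, **kwargs: None  # instance-level impl only

print(hasattr(type(d), "__torch_function__"))  # False -- a type-level check misses it
print(hasattr(d, "__torch_function__"))        # True  -- the instance-level check finds it
```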
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111087
Approved by: https://github.com/ezyang
2023-10-12 02:14:38 +00:00
8e32e62f67 [TP] Validate TP mesh dim for 2D composition (#111001)
Currently, we only support intranode TP when composing TP with other parallelisms. This PR adds an additional check to validate the TP mesh dim during TP initialization when a parent mesh exists.

cc. @fegin, @fduwjj
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111001
Approved by: https://github.com/fduwjj
2023-10-12 02:11:44 +00:00
80ea8784f3 Bump xla_base version tag to v1.1 (#109757)
Update to a new base image for xla workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109757
Approved by: https://github.com/malfet
2023-10-12 01:45:26 +00:00
e0ddc3ff9c [quant][pt2e][be] Move xnnpack quantizer tests to separate file (#111004)
Summary:
att

Test Plan:
python test/test_quantization.py TestXNNPACKQuantizer
python test/test_quantization.py TestXNNPACKQuantizerModels

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111004
Approved by: https://github.com/andrewor14
2023-10-12 01:16:05 +00:00
8f8d8a0b50 Linear Quantize (#110581)
Summary: Adding the quantized linear operator to the Vulkan backend

Test Plan:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1

//xplat/caffe2/fb/custom_ops/vulkan_quantized:pt_vulkan_quantized_test_binAppleMac\#macosx-arm64

[       OK ] VulkanAPITest.convert_qconv2d_context (135 ms)
[ RUN      ] VulkanAPITest.linear_2d
[       OK ] VulkanAPITest.linear_2d (4 ms)
[----------] 2 tests from VulkanAPITest (139 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (139 ms total)
[  PASSED  ] 2 tests.

##############################################################

buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource

//xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
[       OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32 (11 ms)
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (4 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_custom
[       OK ] VulkanAPITest.linear_custom (0 ms)
[----------] 76 tests from VulkanAPITest (1811 ms total)

[----------] Global test environment tear-down
[==========] 76 tests from 1 test suite ran. (1811 ms total)
[  PASSED  ] 76 tests.

  YOU HAVE 8 DISABLED TESTS

##############################################################

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

[----------] Global test environment tear-down
[==========] 346 tests from 1 test suite ran. (5648 ms total)
[  PASSED  ] 345 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: manuelcandales

Differential Revision: D48812642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110581
Approved by: https://github.com/yipjustin
2023-10-12 01:04:06 +00:00
5292a92e03 Add torch.unravel_index (#110580)
Fixes #35674
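
A quick usage example of the new API:

```python
import torch

flat = torch.tensor([1, 6, 7])
rows, cols = torch.unravel_index(flat, (2, 4))  # flat indices -> per-dim coordinates
# rows: tensor([0, 1, 1]); cols: tensor([1, 2, 3])
```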

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110580
Approved by: https://github.com/lezcano, https://github.com/kulinseth
2023-10-12 00:55:51 +00:00
577e3dff88 [aotinductor] Fail models temporarily (#111100)
Temporarily mark these models as failing. The failures are due to https://github.com/pytorch/pytorch/pull/111030, which is needed for ExecuTorch's release and so can't be reverted. Will forward-fix the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111100
Approved by: https://github.com/desertfire
2023-10-12 00:48:44 +00:00
986ad3bfa6 [2/N] Dynamo supports skip by function & removes skipfiles circular import (#110835)
Several improvements for skipfiles:
* Add ```FUNC_INLINELIST``` to support function-level skip/inline checks (see the sketch after this list).
  * Use ```fn.__code__``` to match functions, since sometimes we can't get the function object.
* Use python module string names for ```FILE_INLINELIST``` and ```SUBMODULE_INLINELIST```.
  * Use filenames to match files and python modules, which fundamentally resolves the circular import issues introduced by skipfiles.
  * Use ```TYPE_CHECKING``` to ensure the python module string names are correct.
* Add unit tests.
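
A hedged sketch of the first bullet's idea: match functions by their code objects when the function object itself may be unavailable. The set name mirrors the PR; its contents here are illustrative.

```python
def helper(x):
    return x + 1

FUNC_INLINELIST = {helper.__code__}  # keyed by code object, not function object

def should_inline(fn):
    return getattr(fn, "__code__", None) in FUNC_INLINELIST

assert should_inline(helper)
```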

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110835
Approved by: https://github.com/ezyang
2023-10-12 00:44:41 +00:00
cyy
a6b452dfdc [2/N] Enable Wunused-result, Wunused-variable and Wmissing-braces in torch targets (#110836)
This PR enables Wunused-result, Wunused-variable and Wmissing-braces, now that our code base is clean of such warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110836
Approved by: https://github.com/Skylion007
2023-10-11 23:49:15 +00:00
6d7744ca46 Fix typo under torch/_functorch directory (#111067)
This PR fixes typos in comments and exception messages in files under `torch/_functorch`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111067
Approved by: https://github.com/Skylion007
2023-10-11 23:09:36 +00:00
4d29b40299 torch.compile DTensor E2E (#105236)
This PR updates DTensor to support torch.compile

Cool stuff: there are some new tests in `test_dtensor.py` that show both the forward and backward graphs that we can send to inductor, when running a matmul with DTensor's. In particular, for this user code:
```
        def fn(x, y):
            dt = DTensor.from_local(x.reshape(2, 4), mesh, [Shard(0)], run_check=False)
            dt2 = DTensor.from_local(y.reshape(4, 2), mesh, [Shard(1)], run_check=False)
            dt_out = torch.matmul(dt, dt2)
            dt_out_redistribute = dt_out.redistribute(mesh, [Replicate()])
            return dt_out.to_local()
```

We generate the following fw and backward graphs.

Forward graph:
```
def forward(self, primals_1, primals_2):
    view = torch.ops.aten.view.default(primals_1, [2, 4]);  primals_1 = None
    _to_copy = torch.ops.aten._to_copy.default(view, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0));  view = None
    detach = torch.ops.aten.detach.default(_to_copy);  _to_copy = None
    detach_1 = torch.ops.aten.detach.default(detach);  detach = None
    view_1 = torch.ops.aten.view.default(primals_2, [4, 2]);  primals_2 = None
    _to_copy_1 = torch.ops.aten._to_copy.default(view_1, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0));  view_1 = None
    detach_2 = torch.ops.aten.detach.default(_to_copy_1);  _to_copy_1 = None
    detach_3 = torch.ops.aten.detach.default(detach_2);  detach_2 = None
    detach_4 = torch.ops.aten.detach.default(detach_1)
    all_gather_into_tensor = torch.ops.c10d_functional.all_gather_into_tensor.default(detach_3, 'ptd:0', [0, 1], 2)
    wait_tensor = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor);  all_gather_into_tensor = None
    split = torch.ops.aten.split.Tensor(wait_tensor, 4);  wait_tensor = None
    getitem = split[0]
    getitem_1 = split[1];  split = None
    cat = torch.ops.aten.cat.default([getitem, getitem_1], 1);  getitem = getitem_1 = None
    detach_5 = torch.ops.aten.detach.default(cat);  cat = None
    mm = torch.ops.aten.mm.default(detach_4, detach_5);  detach_4 = detach_5 = None
    detach_6 = torch.ops.aten.detach.default(mm);  mm = None
    detach_9 = torch.ops.aten.detach.default(detach_6);  detach_6 = None
    detach_10 = torch.ops.aten.detach.default(detach_9);  detach_9 = None
    t = torch.ops.aten.t.default(detach_1);  detach_1 = None
    detach_13 = torch.ops.aten.detach.default(t);  t = None
    t_1 = torch.ops.aten.t.default(detach_3);  detach_3 = None
    detach_15 = torch.ops.aten.detach.default(t_1);  t_1 = None
    clone = torch.ops.aten.clone.default(detach_15, memory_format = torch.contiguous_format);  detach_15 = None
    return [detach_10, detach_13, clone]
```

Backward graph:
```
def forward(self, detach_13, clone, tangents_1):
    detach_11 = torch.ops.aten.detach.default(tangents_1);  tangents_1 = None
    detach_12 = torch.ops.aten.detach.default(detach_11);  detach_11 = None
    mm_1 = torch.ops.aten.mm.default(detach_13, detach_12);  detach_13 = None
    detach_14 = torch.ops.aten.detach.default(mm_1);  mm_1 = None
    detach_16 = torch.ops.aten.detach.default(detach_12);  detach_12 = None
    all_gather_into_tensor_2 = torch.ops.c10d_functional.all_gather_into_tensor.default(clone, 'ptd:0', [0, 1], 2);  clone = None
    wait_tensor_2 = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor_2);
    detach_17 = torch.ops.aten.detach.default(wait_tensor_2);  wait_tensor_2 = None
    mm_2 = torch.ops.aten.mm.default(detach_16, detach_17);  detach_16 = detach_17 = None
    detach_18 = torch.ops.aten.detach.default(mm_2);  mm_2 = None
    split_1 = torch.ops.aten.split.Tensor(detach_14, 2, 1);  detach_14 = None
    getitem_2 = split_1[0]
    getitem_3 = split_1[1];  split_1 = None
    cat_1 = torch.ops.aten.cat.default([getitem_2, getitem_3]);  getitem_2 = getitem_3 = None
    reduce_scatter_tensor = torch.ops.c10d_functional.reduce_scatter_tensor.default(cat_1, 'SUM', 'ptd:0', [0, 1], 2);  cat_1 = None
    wait_tensor_3 = torch.ops.c10d_functional.wait_tensor.default(reduce_scatter_tensor);  reduce_scatter_tensor = None
    detach_19 = torch.ops.aten.detach.default(wait_tensor_3);  wait_tensor_3 = None
    detach_20 = torch.ops.aten.detach.default(detach_19);  detach_19 = None
    detach_21 = torch.ops.aten.detach.default(detach_20);  detach_20 = None
    detach_22 = torch.ops.aten.detach.default(detach_21);  detach_21 = None
    _to_copy_2 = torch.ops.aten._to_copy.default(detach_22, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'));  detach_22 = None
    view_2 = torch.ops.aten.view.default(_to_copy_2, [8]);  _to_copy_2 = None
    detach_23 = torch.ops.aten.detach.default(detach_18);  detach_18 = None
    detach_24 = torch.ops.aten.detach.default(detach_23);  detach_23 = None
    _to_copy_3 = torch.ops.aten._to_copy.default(detach_24, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'));  detach_24 = None
    view_3 = torch.ops.aten.view.default(_to_copy_3, [8]);  _to_copy_3 = None
    return [view_3, view_2]
```

Some of the stuff in this graph looks kinda of silly though (e.g. an unnecessary split() + cat(), and all the extra detach() calls).

Stuff that's broken:
- functionalization is pretty horribly broken. In particular, the original strategy I used in this stack was to have functionalization run **above** subclass desugaring. But that doesn't play well with the way we want to compile DTensor. DTensor has a few APIs, like `.redistribute()`, `.to_local()`, and the `DTensor()` constructor, that we want to put directly into the graph so that we can compile them (e.g. redistribute() will desugar into collective ops). Doing this requires functionalization to run **underneath** the subclass, though. I hacked around this for now by forcing these functions to run functionalization first if they need to.
- the backward test that I have is... wrong. The backward graph that we trace out looks kind of reasonable, but it gives incorrect gradients on one of the two inputs. This needs further debugging (presumably we should be able to stare at the graph and identify which part of it is wrong?).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105236
Approved by: https://github.com/wanchaol
2023-10-11 21:55:27 +00:00
3553eb9b89 Add CUTLASS-based support for mixed dtypes matrix multiplication (#110981)
Resubmission without ghstack to make it easier to import https://github.com/pytorch/pytorch/pull/110934/commits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110981
Approved by: https://github.com/drisspg
2023-10-11 21:47:52 +00:00
0f924cdee3 Fix functional::smooth_l1_loss signatures to not override beta (#109798)
This splits `nn::functional::smooth_l1_loss` into two different signatures in order to keep backward compatibility for calling the function like `smooth_l1_loss(input, target, /*reduction=*/..., /*beta=*/...)`

Fixes #70163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109798
Approved by: https://github.com/mikaylagawarecki
2023-10-11 21:37:37 +00:00
73f4c1a406 [reland2] Update custom Function preserve torch function when inputs … (#110895)
…returned as-is

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110895
Approved by: https://github.com/albanD
2023-10-11 21:37:19 +00:00
52e76a3056 fix ShardedTensor.gather when shard is empty (#110962)
Summary:
The current ShardedTensor.gather does not work as expected when the shard is empty on any rank.

The root cause: when a sharded tensor has no placement on a specific rank, the metadata doesn't include that rank's placement, which introduces a KeyError in `shard_offset = shard_placement[shard.metadata][1]`.

It's fixed by adding an empty tensor check.
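
A hypothetical, runnable illustration of the failure mode and the fix; the names here are illustrative, not the exact internals of ShardedTensor.gather.

```python
# rank 1 owns no shard, so the placement map has no entry for it
shard_placement = {"rank0_meta": (0, 0)}               # rank1_meta absent -> KeyError before the fix
local_shards = [("rank0_meta", 4), ("rank1_meta", 0)]  # (metadata, numel) pairs

for metadata, numel in local_shards:
    if numel == 0:
        continue  # the fix: skip empty shards instead of looking up their placement
    shard_offset = shard_placement[metadata][1]
```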

Test Plan:
before change:

after change:

Differential Revision: D50114085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110962
Approved by: https://github.com/wz337
2023-10-11 21:26:41 +00:00
ded5ee75ac AOTAutograd: Go down inference path if no outputs require grad (#111011)
Fixes https://github.com/pytorch/pytorch/issues/110666

Slight update to original PR here: https://github.com/pytorch/pytorch/pull/111005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111011
Approved by: https://github.com/Chillee
2023-10-11 20:59:47 +00:00
6d8e0c4b5a [export] Get export APIs ready for PTC (reland) (#111030)
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130
Changes:
* `torch.export` will return a functional ATen graph but not lowered to core aten decompositions (CompositeImplicitAutograd decomps still run)
* `exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.

Calling convention for Executorch stays the same:
```
pre_autograd_graph = capture_pre_autograd_graph(f, args, ...)
aten_graph_no_decomps = torch.export.export(pre_autograd_graph, args, ...)
# Within to_edge we decompose to core aten and then convert to edge
edge_graph = exir.to_edge(aten_graph_no_decomps)
```

Test Plan: CI

Differential Revision: D50172210

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111030
Approved by: https://github.com/ydwu4
2023-10-11 20:48:24 +00:00
20a7366147 Fix Android publish step with lite interpreter (#111071)
This file needs to be added to the list like the others. The publish command `BUILD_LITE_INTERPRETER=1 android/gradlew -p android publish` finishes successfully with this, and the files are available on Nexus:

![Screenshot 2023-10-11 at 11 56 53](https://github.com/pytorch/pytorch/assets/475357/849d4aa7-79f6-47fa-a471-d452d7c1bdf6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111071
Approved by: https://github.com/atalman
2023-10-11 20:28:12 +00:00
6c7013a3dc [Doc] Add weight dtype in torch.nn.CrossEntropyLoss (#110998)
Fixes #101213
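
A small, hedged example of the documented parameter; based on the PR title, the note added here concerns the weight's dtype, which should be a floating tensor matching the input.

```python
import torch

weight = torch.tensor([1.0, 2.0, 0.5])            # one weight per class, float like the input
loss_fn = torch.nn.CrossEntropyLoss(weight=weight)
logits = torch.randn(4, 3)                        # float32 logits
target = torch.randint(0, 3, (4,))
loss = loss_fn(logits, target)
```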

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110998
Approved by: https://github.com/albanD
2023-10-11 19:52:13 +00:00
d589106bcd [quant][pt2e] Disable remove_qconfig (#111000)
Summary:
This is a hacky flag that we had before in the fx flow, and we don't want it in the new flow

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111000
Approved by: https://github.com/andrewor14
2023-10-11 19:43:46 +00:00
cf1da9bd17 enable index add test (#111016)
Dynamo is swallowing a user exception when suppress_errors is set to True. There's an issue filed for that: https://github.com/pytorch/pytorch/issues/108798. In the meantime, we still want the functionality covered by this test, which works with the non-default setting (don't suppress errors), to not regress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111016
Approved by: https://github.com/yanboliang
2023-10-11 19:41:35 +00:00
e151307db0 Clean-up composite implicit ops for aten::isfinite, isreal and log_sigmoid (#110896)
Functions:
* aten::isfinite
* aten::log_sigmoid
* aten::isreal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110896
Approved by: https://github.com/Skylion007, https://github.com/kshitij12345
2023-10-11 19:28:10 +00:00
d3205f8377 Revert "[2/N] Dynamo supports skip by function & removes skipfiles circular import (#110835)"
This reverts commit 0bd4ce728b9af2a14cfbda89e8faa9c9cfd61a5b.

Reverted https://github.com/pytorch/pytorch/pull/110835 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/110835#issuecomment-1758279590))
2023-10-11 18:39:36 +00:00
80dfc974dd [2D] Enable 2D FSDP+TP model.load_state_dict() (#110925)
This PR adds an all_gather_dtensor() method to fsdp/_fsdp_extensions.py and the actual implementation in tensor/parallel/fsdp.py. This enables FSDP to load a 2D DTensor state_dict into the model when calling `model.load_state_dict()`.

cc. @fegin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110925
Approved by: https://github.com/fegin
ghstack dependencies: #110831, #110846
2023-10-11 18:22:20 +00:00
fd4ba806f6 Implement tensor slice in inductor to stop falling back for aten.index (#111015)
Fixes #110711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111015
Approved by: https://github.com/Chillee
2023-10-11 17:53:24 +00:00
6c136c3302 [2D] Enable 2D DTensor state_dict for FSDP + TP (#110846)
This PR adds a `chunk_dtensor()` method to fsdp/_fsdp_extensions.py and the actual implementation of `chunk_dtensor()` in tensor/parallel/fsdp.py. This enables FSDP to return 2D DTensor state_dict when composing FSDP with TP.

cc. @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110846
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #110831
2023-10-11 17:40:39 +00:00
0bd4ce728b [2/N] Dynamo supports skip by function & removes skipfiles circular import (#110835)
Several improvements for skipfiles:
* Add ```FUNC_INLINELIST``` to support function-level skip/inline checks.
  * Use ```fn.__code__``` to match functions, since sometimes we can't get the function object.
* Use python module string names for ```FILE_INLINELIST``` and ```SUBMODULE_INLINELIST```.
  * Use filenames to match files and python modules, which fundamentally resolves the circular import issues introduced by skipfiles.
  * Use ```TYPE_CHECKING``` to ensure the python module string names are correct.
* Add unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110835
Approved by: https://github.com/ezyang
2023-10-11 17:24:56 +00:00
de1ca4a081 [dtensor] small change to refactor random ops (#110900)
Make random ops a set instead of a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110900
Approved by: https://github.com/fduwjj
ghstack dependencies: #110898
2023-10-11 17:03:08 +00:00
657e8f2cad [dtensor] make replicate -> partial do division instead (#110898)
This PR switches replicate -> partial to do division instead of zeroing out the other ranks. It preserves the same numerics, but avoids the per-rank behavior difference and is friendly to torch.compile.
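
A runnable illustration of the two strategies (no distributed setup needed): both "partial" forms sum back to the replicated value under the implied all_reduce(SUM), but division gives every rank an identical local computation.

```python
world_size = 4
replicated = 8.0
zeroed = [replicated if r == 0 else 0.0 for r in range(world_size)]  # old: per-rank branch
divided = [replicated / world_size for _ in range(world_size)]       # new: uniform on all ranks
assert sum(zeroed) == sum(divided) == replicated                     # the implied SUM agrees
```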
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110898
Approved by: https://github.com/fduwjj
2023-10-11 17:03:08 +00:00
204f338f71 Reland [Profiler] Improve the docstring for export_memory_timeline (#110983)
Summary: Add more details about the export_memory_timeline API, as we've landed new representations of the memory timeline data.

Test Plan: CI, should be no functional change, as we only changed comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110983
Approved by: https://github.com/DanilBaibak
2023-10-11 16:42:05 +00:00
cae3a2e6eb Revert "[sparse] Add i8i8->i32 support for cuSPARSELt (#110499)"
This reverts commit 33da6c89516d9d9067f7181826826224a4cf5afe.

Reverted https://github.com/pytorch/pytorch/pull/110499 on behalf of https://github.com/jcaip due to cslt v0.5.0 requires a newer linker and we will be using v0.4.0 as the base version ([comment](https://github.com/pytorch/pytorch/pull/110499#issuecomment-1758039953))
2023-10-11 16:14:59 +00:00
86619c9c9d [aotinductor] Add both cpu and cuda tests for the AOTInductor cpp test (#110920)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110920
Approved by: https://github.com/chenyang78
ghstack dependencies: #110652, #110891
2023-10-11 15:58:28 +00:00
3058700f7f [aotinductor] Add AOTIModelRunner as a utility class (#110891)
Summary: Introduce a utility class AOTIModelRunner to take care of running an AOTInductor compiled model. It does things like dlopen a model, initialize the model container, setup inputs and outputs, and destroy the model container.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110891
Approved by: https://github.com/chenyang78
ghstack dependencies: #110652
2023-10-11 15:58:28 +00:00
b17c247eb1 [aotindutor] Update the cpp test example (#110652)
Summary: store inputs and outputs in Python, and load them back to run the compiled model in C++ and compare the outputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110652
Approved by: https://github.com/chenyang78
2023-10-11 15:58:28 +00:00
3062e267b1 [cond] Add more tests for valid inputs of cond (#110727)
This PR adds a parametrized test for cond. It tests that cond can be traced with valid inputs (a minimal call is sketched after the first list below). Specifically, valid inputs are a combination of:
- pred (python boolean, boolean tensor, int tensor, scalar tensor)
- true_fn/false_fn (func, obj, nn_module)
- Operands (0 or more tensor inputs), tested with 0 and 2
- closures (0 or more tensor closures), tested with 0 and 2
- nested_level (no nesting or level-2 nested cond)
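
A minimal sketch of one valid-input combination from the list above (boolean-tensor pred, plain functions, two tensor operands), assuming the experimental control-flow import path in use at the time.

```python
import torch
from functorch.experimental.control_flow import cond

def true_fn(x, y):
    return x + y

def false_fn(x, y):
    return x - y

x, y = torch.randn(3), torch.randn(3)
out = cond(torch.tensor(True), true_fn, false_fn, (x, y))  # boolean tensor predicate
```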

What this test doesn't cover:
- pred: symbolic boolean expression as predicate
- true_fn/false_fn: functions that mutate intermediate tensors
- operands: non-tensor operands such as float, int
- closures: nn_module attribute closures, python constant closures
- nested_level: 3+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110727
Approved by: https://github.com/zou3519
2023-10-11 15:56:13 +00:00
ef19824db8 Suppress warnings in tensorpipe.h (#111012)
To fix distributed compilation with clang-15

Fixes https://github.com/pytorch/pytorch/issues/110974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111012
Approved by: https://github.com/huydhn, https://github.com/drisspg, https://github.com/Skylion007
2023-10-11 15:41:30 +00:00
f2d476843e [MPS][BE] Avoid redispatch in sign_out (#110955)
By calling `at::mps::sign_outf` rather than `at::sign_out`, which calls the dispatcher again.
Also, do not copy output unnecessarily.

### <samp>🤖 Generated by Copilot at f942e74</samp>

> _Metal tensors rise from the ashes_
> _`sign` and `sgn` unleash their flashes_
> _MPSFunctions reign supreme_
> _In the header of the metal dream_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110955
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-10-11 15:10:21 +00:00
fc1105b282 [inductor] Implement Fx graph caching to improve warm compilation time. (#103453)
Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR.
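
A generic illustration of the warm-start idea, not Inductor's actual implementation: hash the graph plus config into a key, then reuse the compiled artifact from disk on a hit.

```python
import hashlib
import os
import pickle

def cache_key(graph_repr: str, config: dict) -> str:
    payload = graph_repr + repr(sorted(config.items()))
    return hashlib.sha256(payload.encode()).hexdigest()

def load_or_compile(graph_repr, config, compile_fn, cache_dir="/tmp/fx_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key(graph_repr, config))
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)       # cache hit: skip compilation entirely
    artifact = compile_fn(graph_repr)   # cache miss: compile and persist
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    return artifact
```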

Test Plan:
* New unit tests exercising saving and load from the cache.
* New unit tests to exercise the cache key calculations.
* Ran several benchmarks to see cache hit and resulting compilation times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453
Approved by: https://github.com/eellison, https://github.com/Chillee
2023-10-11 14:39:14 +00:00
2cf9782912 [generate_opcheck_tests] Add some reasonable defaults (#110977)
Summary:
Make it easier to add `generate_opcheck_tests` by adding defaults for
the failures_dict location, the additional decorators, and the test
utils.

Test Plan:
Existing tests

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110977
Approved by: https://github.com/williamwen42
ghstack dependencies: #110951
2023-10-11 14:28:05 +00:00
4abfa22812 [aotinductor] Add a perf smoke test for AOTInductor (#110972)
Summary: To prevent perf regression like the one caused by #110510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110972
Approved by: https://github.com/chenyang78
2023-10-11 13:30:05 +00:00
98c329b19e Revert "[core ATen IR] Add decompositions for max, min, var_mean (#110906)"
This reverts commit 9606cda64e97210cfcca07110ef4872cedc5a1d9.

Reverted https://github.com/pytorch/pytorch/pull/110906 on behalf of https://github.com/SS-JIA due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/110906#issuecomment-1757490740))
2023-10-11 11:41:21 +00:00
95ff51d8ed [MPS] Add support for Softshrink to MPS Backend (#110814)
Adds the softshrink activation function to the mps backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110814
Approved by: https://github.com/kulinseth
2023-10-11 07:55:39 +00:00
de370eb313 [Distributed] Small nits to apply_optimizer_in_backward (#110903)
Clarify a few things around the documentation

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110903
Approved by: https://github.com/janeyx99
2023-10-11 07:45:45 +00:00
0821868110 Revert "[export] Get export APIs ready for PTC (#110410)"
This reverts commit b96ea9f361f2ed872c4a7d662427cadec345b702.

Reverted https://github.com/pytorch/pytorch/pull/110410 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/110410#issuecomment-1757017249))
2023-10-11 07:31:51 +00:00
74029fae9d Fix broken period workflow after #110976 (#111013)
Fixes my typo mistake from #110976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111013
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-10-11 06:40:18 +00:00
056d6247c7 [MPS] Use Metal Events to synchronize buffers in MPSAllocator (Part 1) (#106938)
- This PR is the first part of a bigger change to use `MPSEvent` to synchronize shared-buffers between CPU/GPU.
- Add APIs to record and wait for `MPSEvents` in `MPSAllocator`.
- Use a container list for Buffer Pools to simplify iterating over them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106938
Approved by: https://github.com/kulinseth
2023-10-11 06:13:05 +00:00
b96ea9f361 [export] Get export APIs ready for PTC (#110410)
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130
Changes:
* `torch.export` will return a functional ATen graph w/o decompositions
* `exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.

Calling convention for Executorch stays the same:
```
pre_autograd_graph = capture_pre_autograd_graph(f, args, ...)
aten_graph_no_decomps = torch.export.export(pre_autograd_graph, args, ...)
# Within to_edge we decompose to core aten and then convert to edge
edge_graph = exir.to_edge(aten_graph_no_decomps)
```

Test Plan: CI

Differential Revision: D49742989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110410
Approved by: https://github.com/ydwu4
2023-10-11 06:10:07 +00:00
1e7947b3e0 Revert "Reland 3rd try [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#109323)" + Forward fixes + test (#110964)
This reverts commit f786fbdebdd24d3a6807e3b9fbf055836db4ad60.

Forward fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110964
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2023-10-11 05:16:47 +00:00
e49ea87162 Fix socket.cpp compilation using gcc-9.4 (#111002)
Otherwise the following error is thrown when attempting to compile with WERROR enabled:
```
In file included from /home/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:24: warning: redundant redeclaration of ‘constexpr’ static data member ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’ [-Wdeprecated]
  340 | constexpr const size_t codecvt_result<CodeUnit>::max_size;
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:335:33: note: previous declaration of ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’
  335 |   static constexpr const size_t max_size = 32;
      |                                 ^~~~~~~~
```
or the following when using clang as the host compiler
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:50: warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Wdeprecated]
constexpr const size_t codecvt_result<CodeUnit>::max_size;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111002
Approved by: https://github.com/drisspg
2023-10-11 05:16:00 +00:00
a614281ea9 Add current_device() to torch.cpu (#110987)
To better support device-agnostic code, add a "cpu" return value for `current_device()` in torch.cpu so that we won't run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
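
A hedged sketch of the device-agnostic pattern this enables; per this PR, `torch.cpu.current_device()` simply returns "cpu".

```python
import torch

mod = torch.cuda if torch.cuda.is_available() else torch.cpu
device = mod.current_device()  # a device index on CUDA, "cpu" on CPU
```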

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
2023-10-11 05:13:10 +00:00
110382bacf Make NestedTensor compilable with eager backend (#109171)
In this PR:
- Adds support for strides for jagged tensor (design doc for this coming soon)
- NestedTensor skips automatic dynamic
- Make use of @bdhirsh's subclass fakification logic by adding the __tensor_{un,}flatten__ functions.
- Additional logic for fakification, since the existing subclass fakification logic does not handle the case where the outer tensor has an additional dimension. We insert one-off logic to (1) insert an extra SingletonSymInt onto the fakified NestedTensor and (2) make sure we call track_symint on the sizes of both the inner and outer tensor during guard creation.

Remaining things that are weird:
- Still need to skip some logic in meta utils for some reason (I was going to write this up more, but decided not to since we're not able to do this anyway for an immediate reason: we cannot arbitrarily compare singleton ints. For now I'm just following Brian's advice from [here](https://github.com/pytorch/pytorch/pull/109171#discussion_r1328137070) )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109171
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-10-11 04:47:10 +00:00
e0dbaa04d2 Fix the meta func for mem_eff_backward (#110893)
Fixes #110832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110893
Approved by: https://github.com/eellison
2023-10-11 02:58:54 +00:00
0e551bbcd7 [quant][pt2] Preserve source_fn_stack after QAT fusion (#110899)
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_preserve_source_fn_stack

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar

Differential Revision: [D50101253](https://our.internmc.facebook.com/intern/diff/D50101253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110899
Approved by: https://github.com/jerryzh168
2023-10-11 02:55:52 +00:00
5aee22e0e0 Move export.constrain_as_* to torch._constrain_as_* (#110757)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110757
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #109859
2023-10-11 02:37:55 +00:00
c9eb8d8d90 Add set_checkpoint_debug_enabled that overrides local setting (#110728)
People access activation checkpointing through many layers of config, and it is not always guaranteed that all the layers of wrapping around checkpoint properly propagate all the kwargs, e.g. debug mode. This context manager offers an alternative way to enable debug mode that bypasses the need for every layer to propagate kwargs.
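
A minimal sketch, assuming the `set_checkpoint_debug_enabled` context manager this PR adds under `torch.utils.checkpoint`:

```python
import torch
from torch.utils.checkpoint import checkpoint, set_checkpoint_debug_enabled

def fn(x):
    return x.sin().cos()

x = torch.randn(4, requires_grad=True)
with set_checkpoint_debug_enabled(True):  # overrides debug= at every call site underneath
    out = checkpoint(fn, x, use_reentrant=False)
out.sum().backward()
```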
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110728
Approved by: https://github.com/albanD
ghstack dependencies: #110673, #110674, #110675, #110676
2023-10-11 02:12:31 +00:00
02f6a8126e Support a simple subset of functions as backward hooks on intermediate tensors (#109537)
The main thrust of the initial effort here was to capture `register_hook` calls on tensors in compile regions. The first part of this was done in https://github.com/pytorch/pytorch/pull/108903 wherein we added support for register_hook input tensors.

The distinction between input and intermediary is due to implementation differences.

There are 2 kinds of hooks:

1) Hooks on objects with sources (inputs, params)
2) Hooks on objects w/o sources (intermediaries, and outputs).

Note: As outputs can be made simple by how dynamo handles residuals, they could actually be handled as if they were inputs, but, for the sake of this PR, we will refer to hooks as either hooks on inputs (sourced), or hooks on intermediaries (not sourced).

**The plan:**

For tensors w/ a source: (The PR above)
We record registered hooks, store them as a global, and associate them with the tensor in residuals. This means that when dynamo goes to create the frame, where we produce bytecode to stitch together our PT2-modified bytecode with the original eager code, we call register_hook. This registration of hooks in residuals is sound because (a) it happens right after a PT2 frame region ends and (b) we know that the tensor is alive in f_locals, f_globals, or a module in the user's invoking frame. This means we can soundly know it will be around to invoke register_hook on. As long as we guard on the identity of the lifted function, this is sound to do.

For tensors w/o a source: (This PR)

Ostensibly, the most correct and complete solution would be to smuggle hooks into a runtime wrapper in aot_autograd, where all the items the hooks close over are lifted to inputs as necessary and passed alongside the user provided function. This is necessary so that we can properly trace out and capture all the mutations within the user defined hook at backwards time.

This is too complicated - so, we limited the scope of this initial PR to a simple subset of hooks:

- Hooks must have a source (be known to us already, not a lambda or intermediary defined function)
- We must be tracing under compiled autograd

**The flow**:

We use the HOP added in https://github.com/pytorch/pytorch/pull/109690/files, referred to as the HOP below.

1) We intercept register_hook calls and wrap the user defined fn in the HOP
2) We write a `_register_hook_trampoline` to the graph that is a local no-arg function that is invoked as a call_function in the dynamo graph
3) aot_autograd inlines through it during its trace, and sees the HOP
4) the HOP preserves itself in the graph - it does not get traced into
5) During backwards, compiled_autograd installs the HOP under a hook call
6) When compiled_autograd enters compilation over its generated graph, dynamo traces the contents of the hook

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109537
Approved by: https://github.com/ezyang
2023-10-11 01:35:37 +00:00
79212430df feat(inductor): fx graph debug should display device (#110346)
Device mismatch issues are the root cause of https://github.com/pytorch/pytorch/issues/107006; hence, make device-related scheduling issues easier to diagnose.
Also format single-kwarg graphs to be more concise.

Example rendering:
![image](https://github.com/pytorch/pytorch/assets/9093549/1b59a994-f2df-45c9-8cb7-37eb3ba12654)

CC code owners: @ngimel @jansel @shunting314 @mlazos @peterbell10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110346
Approved by: https://github.com/eellison
2023-10-11 00:34:55 +00:00
24bf9aeb6b Fix arange with dynamic end argument. (#110979)
Fixes https://github.com/pytorch/pytorch/issues/93468

There are a few extra tests that are sort of unrelated, but I ended up writing them while working on the fix and decided to keep them. The big idea here is to split the `_check` so that `expect_true` works; I could have probably also improved the symbolic reasoning, but I'm lazy. One small logging fix too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110979
Approved by: https://github.com/Skylion007
2023-10-11 00:32:34 +00:00
a11d4a8378 [Reland] [Inductor] Break the loop fusion when node2 depends on node1 mutations (#110677)
Reland PR https://github.com/pytorch/pytorch/pull/109172 which has been reverted in https://github.com/pytorch/pytorch/pull/110622

Differential Revision: [D50097373](https://our.internmc.facebook.com/intern/diff/D50097373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110677
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-10-11 00:26:45 +00:00
314a502eb0 Revert "Reland "[C10] PG observability hooks. (#108815)" (#110907)"
This reverts commit 7678cd22af46c9df4fb47a409d3e8ad71a6127ea.

Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this 7678cd22af ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))
2023-10-11 00:23:42 +00:00
2edc75a669 Add a workflow to release Android binaries (#110976)
This adds 2 jobs to build PyTorch Android with and without lite interpreter:

* Keep the list of currently supported ABIs: armeabi-v7a, arm64-v8a, x86, x86_64
* Pass all the tests on the emulator
* Ran the test app on the emulator and on my Android phone (`arm64-v8a`) without any issue
![Screenshot_20231010-114453](https://github.com/pytorch/pytorch/assets/475357/57e12188-1675-44d2-a259-9f9577578590)
* Run on AWS https://us-west-2.console.aws.amazon.com/devicefarm/home#/mobile/projects/b531574a-fb82-40ae-b687-8f0b81341ae0/runs/5fce6818-628a-4099-9aab-23e91a212076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110976
Approved by: https://github.com/atalman
2023-10-11 00:19:33 +00:00
5aa96fd336 [dynamo] list index: add more list types to testing, support namedtuple, improve error handling (#110919)
Follow up: #110817

Minor improvements as discussed in prev PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110919
Approved by: https://github.com/ezyang
2023-10-11 00:16:39 +00:00
9606cda64e [core ATen IR] Add decompositions for max, min, var_mean (#110906)
## Context

Add decompositions for `aten.max`, `aten.min`, and `aten.var_mean`. These operators follow a pattern of returning a tuple of outputs from two component operators:

```
aten.max(x) -> return aten.amax(x), aten.argmax(x)
aten.min(x) -> return aten.amin(x), aten.argmin(x)
aten.var_mean(x) -> return aten.var(x), aten.mean(x)
```

For `var_mean`, the `refs` implementation was doing something similar, so I changed it to call `torch.` ops instead, as was done for other `refs` implementations previously. cc: @peterbell10 @lezcano

Note that Inductor lowers all these directly, so they are excluded from the Inductor decomp table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110906
Approved by: https://github.com/manuelcandales
2023-10-11 00:06:24 +00:00
3100d3e661 Revert "[inductor] Implement Fx graph caching to improve warm compilation time. (#103453)"
This reverts commit 8a8668e1aeac8d1726ac746372f5a93262994f62.

Reverted https://github.com/pytorch/pytorch/pull/103453 on behalf of https://github.com/kit1980 due to The newly added test fails on internal builds ([comment](https://github.com/pytorch/pytorch/pull/103453#issuecomment-1756449919))
2023-10-10 23:21:59 +00:00
cyy
f98d6ad8b3 [1/N] Apply clang-tidy to aten/src/ATen/core/ (#110861)
It is time to clang-tidy aten.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110861
Approved by: https://github.com/Skylion007
2023-10-10 23:20:58 +00:00
ca03f36233 Change ProcessGroupNCCL default timeout to 10 min (#110947)
Avoid changing the default for other backends, as the CPU backend (GLOO) may need
longer timeouts.

Motivated by trying to save cluster time when encountering collective
hangs. Generally collectives should time out within seconds, and 30
minutes (or 10 minutes) should provide ample headroom for edge cases.
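
A short sketch of overriding the new default explicitly at process-group creation (requires the usual distributed launch environment):

```python
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),  # override the 10-minute default
)
```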
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110947
Approved by: https://github.com/xw285cornell, https://github.com/fduwjj
2023-10-10 22:28:39 +00:00
cd275dc24f Remove RangeConstraints in favor of ValueRanges (#109859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109859
Approved by: https://github.com/avikchaudhuri
2023-10-10 22:22:05 +00:00
7a69e3d30b [fx][subgraph_matcher] Add a matcher that supports name to node map (#110743)
Summary:
We want the matcher to return a name -> node map in the target graph
so that we can refer to nodes by name; this is useful for downstream applications like
quantization.

Also, we can use the torch API as the source of truth instead of matching the aten API directly.

Test Plan:
python test/fx/test_matcher_utils.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110743
Approved by: https://github.com/SherlockNoMad
2023-10-10 22:21:24 +00:00
91eeb77260 StackDataset batched sampling (#110694)
Optimization of minibatch loading via batched sampling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110694
Approved by: https://github.com/ejguan
2023-10-10 22:05:51 +00:00
ac01304e22 pin_memory support for NT (#110404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110404
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
ghstack dependencies: #110292
2023-10-10 21:58:19 +00:00
43ea782af3 Multiprocessing support for NT (#110292)
Fixes #110161

Allows NTs to be used in DataLoaders with `num_workers > 1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110292
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-10-10 21:58:19 +00:00
7f2d25c547 [ONNX] bump onnx submodule to rel-1.15.0 (#110663)
- onnx==1.15.0rc1
- onnxscript==0.1.0.dev20231006
- ort-nightly==1.17.0.dev20231005001
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110663
Approved by: https://github.com/ezyang, https://github.com/thiagocrepaldi
2023-10-10 21:44:09 +00:00
3a29cdc5e6 [optests] Add dontGenerateOpCheckTests and is_inside_opcheck_mode (#110951)
This PR adds the following helper functions for generated opcheck tests:
- dontGenerateOpCheckTests is a decorator that skips generation of the
  opcheck tests for the generated function
- is_inside_opcheck_mode lets us query if we are in a generated test.
  Useful for fast debugging out-of-tree without needing to update
  PyTorch.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110951
Approved by: https://github.com/williamwen42
2023-10-10 21:43:43 +00:00
d9eb5a57aa [FSDP] Change _create_chunk_dtensor in fsdp/_shard_utils.py to use public API from DTensor (#110831)
This PR:
1) updates _create_chunk_dtensor() in _shard_utils.py to use public APIs from DTensor. This will avoid the global_size calculation error from using DTensor.from_local() for uneven-sharded parameters, as described in https://github.com/pytorch/pytorch/issues/110762
2) updates test/distributed/fsdp/test_fsdp_dtensor_state_dict.py to include a unit test for a model with uneven sharding.

cc. @wanchaol, @fegin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110831
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-10-10 21:04:27 +00:00
6e770c0dda [dynamo] Add itertools.repeat via polyfill (#110953)
Fixes https://github.com/pytorch/pytorch/issues/110286
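
A hedged sketch of a pure-Python polyfill for `itertools.repeat`; Dynamo can inline a generator like this where it cannot trace the C-implemented original.

```python
def repeat(obj, times=None):
    if times is None:
        while True:          # infinite form
            yield obj
    else:
        for _ in range(times):
            yield obj

assert list(repeat(3, 4)) == [3, 3, 3, 3]
```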

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110953
Approved by: https://github.com/ezyang
2023-10-10 20:40:33 +00:00
02a02a23ee Revert "Move at::{Refcounted,}MapAllocator to c10 (#109881)"
This reverts commit 0341deb1c720d8c908ed40e853eaacfc8ac37181.

Reverted https://github.com/pytorch/pytorch/pull/109881 on behalf of https://github.com/albanD due to It does break buck build ([comment](https://github.com/pytorch/pytorch/pull/109881#issuecomment-1756195823))
2023-10-10 20:39:12 +00:00
495f77be7a [cpu] explicitly vectorize digamma (#110217)
### Benchmarking results
```python
[-------------- torch.digamma(x) Benchmark -------------]
                                        |  implicitly vectorized |     explicitly  vectorized
1 threads: -----------------------------------------------------------------------
      dtype torch.float16 - n : 100     |        3.8      |      3.5
      dtype torch.float16 - n : 200     |        5.8      |      5.3
      dtype torch.float16 - n : 500     |       11.8      |     10.7
      dtype torch.float16 - n : 1000    |       22.0      |     19.6
      dtype torch.float16 - n : 10000   |      203.6      |    179.7
      dtype torch.float32 - n : 100     |        3.8      |      3.6
      dtype torch.float32 - n : 200     |        5.7      |      5.5
      dtype torch.float32 - n : 500     |       11.1      |     11.1
      dtype torch.float32 - n : 1000    |       20.6      |     20.5
      dtype torch.float32 - n : 10000   |      191.7      |    189.6
      dtype torch.float64 - n : 100     |        3.8      |      3.7
      dtype torch.float64 - n : 200     |        5.9      |      5.7
      dtype torch.float64 - n : 500     |       11.9      |     11.7
      dtype torch.float64 - n : 1000    |       22.1      |     21.7
      dtype torch.float64 - n : 10000   |      203.6      |    199.7
      dtype torch.bfloat16 - n : 100    |        3.7      |      3.5
      dtype torch.bfloat16 - n : 200    |        5.6      |      5.3
      dtype torch.bfloat16 - n : 500    |       11.2      |     10.6
      dtype torch.bfloat16 - n : 1000   |       20.8      |     19.5
      dtype torch.bfloat16 - n : 10000  |      190.0      |    179.7

Times are in microseconds (us).
```

### Benchmarking config
Machine: Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz
<p>

```python
>>> import torch
>>> print(f"Torch config: {torch.__config__.show()}")
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/usr/local/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.2.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,
```

</p>

Script -
```
import torch
import pickle
from torch.utils import benchmark
from itertools import product

device = 'cpu'
dtypes = (torch.float16, torch.float32, torch.float64, torch.bfloat16)
n = (100, 200, 500, 1000, 10000)

result = []

for dtype, num in product(dtypes, n):
    x = torch.rand(num, dtype=dtype, device='cpu')
    torch.digamma(x)
    stmt = 'torch.digamma(x)'
    measurement = benchmark.Timer(
        stmt=stmt,
        globals={'x': x},
        label=stmt + " Benchmark",
        sub_label=f"dtype {dtype} - n : {num}",
        description="vectorized",
    ).blocked_autorange(min_run_time=10)

    result.append(measurement)

fname_prefix = "benchmark_digamma_"

benchmark.Compare(result).print()
with open(fname_prefix+"vectorized", "wb") as f:
    pickle.dump(result, f)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110217
Approved by: https://github.com/sanchitintel, https://github.com/vfdev-5, https://github.com/ezyang
2023-10-10 20:31:25 +00:00
7678cd22af Reland "[C10] PG observability hooks. (#108815)" (#110907)
This reverts commit ff0358b0384d6a3a5b8ceeae625c93221612ba8e.

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collective failures both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks; the hooks are called inline from the
PG creation code. This is fine since this happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
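
A hypothetical sketch of registering one of the hooks named above, per the `torch.distributed.hooks` module this PR describes (the callback signature is an assumption):

```python
import torch.distributed.hooks as dist_hooks

def on_collective_start(info):
    print("collective started:", info)

dist_hooks.register_collective_start_hook(on_collective_start)
```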

Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown
and keep it as a background thread. This is not possible with more reasonable choices like a condvar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
2023-10-10 20:09:40 +00:00
84ad3ed7b2 [dynamo] add config for displaying all guard failures (#110927)
Fixes https://github.com/pytorch/pytorch/issues/110879

Example output:
```
('Recompiling function fn in /home/jonch/Desktop/Programming/mlsys/pytorch/test/dynamo/test_misc.py:4578', 'triggered by the following guard failures: ["___check_type_id(L[\'obj\'], 94834370481168)", "L[\'obj\'].x == -0.5"]')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110927
Approved by: https://github.com/lezcano
2023-10-10 19:57:44 +00:00
8cf1a02e80 Revert [Profiler] Improve the docstring for export_memory_timeline (#110978)
Revert [Profiler] Improve the docstring for export_memory_timeline
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110978
Approved by: https://github.com/huydhn, https://github.com/aaronenyeshi
2023-10-10 19:57:25 +00:00
bc49b1e50b [reland] Use is_symbolic instead of testing isinstance in some place (#110676)
reland of https://github.com/pytorch/pytorch/pull/110372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110676
Approved by: https://github.com/ezyang
ghstack dependencies: #110673, #110674, #110675
2023-10-10 19:37:17 +00:00
df9a6bcaef [reland] Symintify guards.cpp (#110675)
reland of https://github.com/pytorch/pytorch/pull/110371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110675
Approved by: https://github.com/ezyang
ghstack dependencies: #110673, #110674
2023-10-10 19:37:17 +00:00
3842b175d2 [reland] Add symbolic singleton int (#110674)
reland of https://github.com/pytorch/pytorch/pull/110370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110674
Approved by: https://github.com/ezyang
ghstack dependencies: #110673
2023-10-10 19:37:17 +00:00
fda0a965c7 [reland] Support SingletonSymNode mul with coefficient (#110673)
reland of https://github.com/pytorch/pytorch/pull/110369
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110673
Approved by: https://github.com/ezyang
2023-10-10 19:37:17 +00:00
fb4b9e9c8e Re-enable a couple of fixed tests (#110770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110770
Approved by: https://github.com/yanboliang, https://github.com/int3, https://github.com/Skylion007
ghstack dependencies: #110651
2023-10-10 19:13:14 +00:00
5183760ca5 Adding Backward Support for NestedTensors and FlashAttention (#97485)
# Summary
### <samp>🤖 Generated by Copilot at 318764f</samp>

This pull request implements the CUDA backend of the SDPA kernel for nested tensors, which enables efficient transformer models with variable-length sequences. It adds a new dispatch key, a backward function, a unit test, and some helper functions for the kernel. It modifies `test/test_transformers.py`, `aten/src/ATen/native/native_functions.yaml`, `aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctionsBackward.cpp`, and `aten/src/ATen/native/nested/cuda/NestedTensorTransformerUtils.h`.

### <samp>🤖 Generated by Copilot at ed4a773</samp>

> _Fused kernels of doom, unleash the flash attention_
> _Nested tensors on fire, reshape and pad with caution_
> _Backward pass of power, dispatch the CUDA key_
> _Test the gradients of hell, warn the user if they disagree_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97485
Approved by: https://github.com/jbschlosser
2023-10-10 18:08:17 +00:00
77e5f5d8a4 Updates to patch version release plans (#110952)
1. Updates to the patch release process
2. Added a release cadence section
3. Changed the description for "Modify release matrix" to reflect the current process
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110952
Approved by: https://github.com/malfet
2023-10-10 17:59:29 +00:00
52b1470935 [Profiler] Improve the docstring for export_memory_timeline (#110949)
Summary: Add more details about the export_memory_timeline API, as we've landed new representations of the memory timeline data.

Test Plan: CI, should be no functional change, as we only changed comments.

Differential Revision: D50123450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110949
Approved by: https://github.com/davidberard98
2023-10-10 17:53:56 +00:00
31611b40b9 cmake: allow to build pytorch as a CMake subproject (#110373)
This is a re-attempt of fixing https://github.com/pytorch/pytorch/issues/53980, first submitted in https://github.com/pytorch/pytorch/pull/54978.

Quoting @SpaceIm
```
Fixes https://github.com/pytorch/pytorch/issues/53980

Maybe it would be nice to find why some files are generated in CMAKE_BINARY_DIR instead of CMAKE_CURRENT_BINARY_DIR or Torch_BINARY_DIR or PROJECT_BINARY_DIR, but there is a lot of indirection in the logic of pytorch build files, so I was not able to find where it comes from.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110373
Approved by: https://github.com/malfet
2023-10-10 17:47:35 +00:00
57f6368b8e [collective] Add a torch.compile + functional_collectives test (#110688)
Add a test to ensure functional_collectives + torch.compile always works.
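Roughly, the pattern the test exercises looks like this sketch (assumes an initialized default process group; not the test's actual code):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

@torch.compile
def step(x):
    # functional collectives return new tensors instead of mutating in place,
    # which is what lets them compose with torch.compile tracing
    return funcol.all_reduce(x, "sum", dist.group.WORLD) / dist.get_world_size()
```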

Differential Revision: [D50001491](https://our.internmc.facebook.com/intern/diff/D50001491/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110688
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-10-10 17:14:50 +00:00
c5f06b9753 Re-enable test_copy_transpose_math_view, neg_view/dce fix (#110651)
- neg view can just be lowered to neg() post functionalization
- we were treating all fallback kernels as not having side effects. we shouldn't dce mutating fallback kernels - either mutations induced by the reinplacing pass or clone_ with unsupported arguments (complex)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110651
Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/malfet, https://github.com/Skylion007
2023-10-10 16:34:01 +00:00
ba86dfcd83 AOTDispatch subclass (#104483)
This is a PoC of AOTDispatch support. This PR actually works on basic examples, and I'm working on testing it out on `DTensor` (with @wanchaol), `SemiStructuredSparsityTensor` (with @jcaip), and `FP8Tensor`.

There are some design decisions baked into the PR that I think we need consensus on though - so I'm planning on writing a larger design doc to go over the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104483
Approved by: https://github.com/ezyang
2023-10-10 16:13:16 +00:00
8bc04f46fe [inductor cpp] use c10::bit_cast to avoid violating strict-aliasing (#110809)
Fix https://github.com/pytorch/pytorch/issues/110807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110809
Approved by: https://github.com/jansel
2023-10-10 11:16:31 +00:00
7b25c2b90e [FSDP][optim_state_dict] Move local optimizer state to FSDP compute_device (#110929)
This will ensure all the tensors are on FSDP compute_device.

Differential Revision: [D50059492](https://our.internmc.facebook.com/intern/diff/D50059492/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110929
Approved by: https://github.com/wz337
2023-10-10 10:34:31 +00:00
fb68aa0a92 [Easy] Remove unused return type from utils (#110887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110887
Approved by: https://github.com/ezyang
2023-10-10 09:02:11 +00:00
a425307589 [ATen core IR] De-register full_like and empty_like as core (#110924)
## Context

Following up from @peterbell10 comments on https://github.com/pytorch/pytorch/pull/110882.

* `empty_like` was erroneously classified as `core`. It can be decomposed using `empty_permuted` (see the sketch after this list) and in fact is currently decomposed this way in the core decomposition table.
* `full_like` can be similarly decomposed to `full_permuted` once https://github.com/pytorch/pytorch/pull/110234 lands. The current decomposition into `empty_like` and `fill` doesn't work because `fill` decomposes to `full_like`, resulting in a recursive loop.
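A rough sketch of the kind of `empty_like` decomposition being referred to (illustrative only; the registered decomposition handles stride ties and edge cases more carefully):

```python
import torch

def empty_like_sketch(x: torch.Tensor) -> torch.Tensor:
    # order dimensions from largest to smallest stride so the result
    # preserves x's memory format without resorting to as_strided
    physical_layout = sorted(range(x.dim()), key=lambda d: x.stride(d), reverse=True)
    return torch.empty_permuted(x.size(), physical_layout, dtype=x.dtype, device=x.device)

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
assert empty_like_sketch(x).stride() == x.stride()
```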
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110924
Approved by: https://github.com/kirklandsign
2023-10-10 09:02:05 +00:00
37567fdf31 Nvfuser cpp api deprecation attempt 2 (#110881)
attempting to re-try #110318 deprecating nvfuser c++ API

warning has been updated to TORCH_WARN_ONCE;
Warning thrown inside torch::jit::fuser::cuda::isEnabled() is turned off and will be deprecated when we pulled out TorchScript integration in the follow up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110881
Approved by: https://github.com/davidberard98, https://github.com/NicolasHug
2023-10-10 08:07:03 +00:00
8820dda943 Revise def of contiguity in bmm (#110811)
Fixes #108754.

`hf_T5_generate` would encounter a regression when calling `extern_kernels.bmm` if one input is `reinterpret_tensor(buf2, (8, 1, 64), (64, 0, 1))` rather than `reinterpret_tensor(buf2, (8, 1, 64), (64, 512, 1), 0)`. As @jgong5 mentioned in a comment, the two tensors are in fact equivalent: the stride doesn't matter when the corresponding size is 1.
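The equivalence is easy to check directly (a self-contained illustration, not code from the PR):

```python
import torch

buf = torch.randn(8 * 64)
a = buf.as_strided((8, 1, 64), (64, 0, 1))
b = buf.as_strided((8, 1, 64), (64, 512, 1))
# a size-1 dimension is never stepped over, so its stride is irrelevant
# for addressing: both views read exactly the same memory
assert torch.equal(a, b)
```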

We revise the definition of contiguity in `bmm` to treat the above situation as a contiguous case. Thus, when a stride equals 0, `extern_kernels.bmm` can still use MKL's `gemm` for performance.

Speedup of `hf_T5_generate` is **1.343x** now and **1.138x** before, with script `bash inductor_single_test.sh multiple inference performance torchbench hf_T5_generate float32 first dynamic default 0`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110811
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/Chillee
2023-10-10 06:48:08 +00:00
35e48e262c [custom op] Use canonical API to constrain unbacked values (#108372)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108372
Approved by: https://github.com/angelayi, https://github.com/ezyang
2023-10-10 05:14:28 +00:00
33403336fa Revert "[user errors] compulsory case names, allow multiple (#110878)"
This reverts commit 2ae71c45982109065e19a2c05473fbe7237215ab.

Reverted https://github.com/pytorch/pytorch/pull/110878 on behalf of https://github.com/kit1980 due to export/test_export.py::TestExport::test_multiple_definitions_same_name_dim - TypeError: UserError.init() missing 1 required positional argument: 'case_names' ([comment](https://github.com/pytorch/pytorch/pull/110878#issuecomment-1754360051))
2023-10-10 04:44:40 +00:00
8891da40d7 [vision hash update] update the pinned vision hash (#110915)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110915
Approved by: https://github.com/pytorchbot
2023-10-10 04:31:10 +00:00
19ecb5d0d5 Revert "[Inductor] Disallow OpOverloadPacket in ir.FallbackKernel (#110567)"
This reverts commit 37a02659921490d85b2b0712ad52b924e0c431cd.

Reverted https://github.com/pytorch/pytorch/pull/110567 on behalf of https://github.com/kit1980 due to breaking internal builds, see D50091340 ([comment](https://github.com/pytorch/pytorch/pull/110567#issuecomment-1754308982))
2023-10-10 03:49:20 +00:00
2ae71c4598 [user errors] compulsory case names, allow multiple (#110878)
We want to get to a point where most UserErrors link to exportdb examples. This PR makes passing case names non-optional to make this intent clearer and encourage developers who raise UserErrors to make or point to examples that make fixing such errors more obvious for users.

In addition, sometimes there are multiple examples that are relevant to an error. Thus this PR also enables passing multiple case names.

Retry of #110733 which was reverted due to a landrace.

Differential Revision: [D50087148](https://our.internmc.facebook.com/intern/diff/D50087148/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110878
Approved by: https://github.com/gmagogsfm, https://github.com/tugsbayasgalan
2023-10-10 03:48:07 +00:00
f10aab03c4 [sparse] Fix semi-structured sparse shape mismatch bug (#110420)
Summary:

Currently, PyTorch incorrectly calculates the size of the returned
matrix when we pass a non-contiguous batched (>2d) input to the
semi-structured sparse subclass.

This is most common in MLP layers, where we have 2 linear layers back to back.

This will lead to an error like the following:
```
RuntimeError: shape '[20, 64, 64, 3072]' is invalid for input of size
62914560

```
Where the size of the sparse matmul result is off because we infer the
output shape with the wrong tensor shape.

This happens because of a bug where we did not update the subclass
tensor shape when doing transpose.
For semi-structured sparsity, transposing is a no-op where we just set
the boolean flag, but we forgot to also update the tensor shape.

Note that this error goes away in inference mode, since we avoid
decomposing the aten.linear op and handle shape folding ourselves,
which changes the execution path.

An alternative way to fix this issue is to set
TORCH_FLATTEN_LINEAR_3D=True, which will also fix this error.

Test Plan:
```
python test/test_sparse_semi_structured.py -k test_mlp

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110420
Approved by: https://github.com/alexsamardzic, https://github.com/cpuhrsch
2023-10-10 03:07:31 +00:00
468a73f0e3 Support Numpy ints in the torch.nn.functional.interpolate dtype check (#110778)
In https://github.com/pytorch/pytorch/pull/99243, a check was added to ensure the `size` only contained integers.

This PR updates the check to also include numpy integers based on this comment (cc @kit1980): https://github.com/pytorch/pytorch/pull/99243#issuecomment-1646736646. Similar to the other commenter, I also ran into issues where existing software broke due to this after upgrading to PT2.1:

```
                if not torch.jit.is_scripting():
                    if not all(_is_integer(x) for x in size):
>                       raise TypeError(
                            "expected size to be one of int or Tuple[int] or Tuple[int, int] or "
                            f"Tuple[int, int, int], but got size with types {[type(x) for x in size]}"
                        )
E                       TypeError: expected size to be one of int or Tuple[int] or Tuple[int, int] or Tuple[int, int, int], but got size with types [<class 'numpy.int64'>, <class 'numpy.int64'>]

/conda-env/lib/python3.8/site-packages/torch/nn/functional.py:3924: TypeError
```
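After this fix, code like the following sketch is accepted again:

```python
import numpy as np
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
size = (np.int64(16), np.int64(16))  # e.g. produced by numpy-based shape math
out = F.interpolate(x, size=size)    # raised TypeError before this fix
print(out.shape)                     # torch.Size([1, 3, 16, 16])
```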
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110778
Approved by: https://github.com/mikaylagawarecki
2023-10-10 01:46:33 +00:00
de3ae93e9b Include rank of default PG in C++ log messages (#110623)
I tested by adding some warning logs in C++, run a distributed program and show that they now had `[rank0]:` in the messages. There is no existing test infra for C++ logging so I couldn't easily add a unit test.

The implementation strategy is to setup a global variable in C++, and then poke it when we initialize a process group. This was the simplest thing I could think of that would work.

This PR only works for non-glog logging. Probably need to come up with some other strategy for glog, e.g., a custom prefix, but need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, will leave as follow up work.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-10 00:26:52 +00:00
0341deb1c7 Move at::{Refcounted,}MapAllocator to c10 (#109881)
`libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`,
 so it makes sense to move it to c10 along with the other memory allocators.

This means `libshm.so` only depends on `c10` and we don't need to relink
`libshm.so` for every ATen change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881
Approved by: https://github.com/albanD
2023-10-09 23:53:47 +00:00
3704bf4ee8 [export] Update custom ops docs (#110492)
Updating the doc links in the custom ops documentation in export
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110492
Approved by: https://github.com/avikchaudhuri
2023-10-09 23:40:40 +00:00
28d7d7fc42 device agnostic: torch.cpu.set_device (#110716)
To support device-agnostic code, add a dummy placeholder in torch.cpu.
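A minimal sketch of the device-agnostic pattern this placeholder enables:

```python
import torch

# device-agnostic code can now call set_device on either backend module
backend = torch.cuda if torch.cuda.is_available() else torch.cpu
backend.set_device(0)  # a no-op placeholder on CPU
```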

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110716
Approved by: https://github.com/albanD
2023-10-09 23:00:15 +00:00
2aa0ba38a4 Make is_sparse a property of MaskedTensor (#110725)
Fixes #104574

Seeing that MaskedTensor is a prototype, the BC breaking nature of this change seems okay?

Locally tested (screenshot omitted).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110725
Approved by: https://github.com/cpuhrsch
2023-10-09 22:35:38 +00:00
6c8096ec31 [ATen core IR] Register additional ATen operators as core (#110882)
## Context

For more context, please refer to [this PyTorch forums post](https://dev-discuss.pytorch.org/t/defining-the-core-aten-opset/1464).

This PR registers some additional ATen operators as `core`, based on feedback from the forums post as well as the experiences from adding other core ATen decompositions.

The ATen operators registered as core in this diff, with the associated reasoning, are:

ATen op | reasoning
--|--
aten::atan2 | This operator often maps to a hardware intrinsic.
aten::diagonal | There is no straightforward decomposition for this operator.
aten::empty_like | Decomposition for this operator would require `as_strided` to retain the strides of the input tensor, which should be avoided.
aten::expm1 | This operator often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output.
aten::full_like | Decomposition for this operator would require `as_strided` to retain the strides of the input tensor, which should be avoided.
aten::log10 | This operator often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output.
aten::log1p | This operator often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output.
aten::log2 | This operator often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output.
aten::pow.Scalar_Tensor | This is a Scalar variant of pow.Tensor_Tensor, which is a part of core.
aten::resize | There is no valid decomposition for this operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110882
Approved by: https://github.com/lezcano
2023-10-09 22:27:00 +00:00
733368a822 Change default NCCL_ASYNC_ERROR_HANDLING to 3:SkipCleanUp (#110723)
Summary

Currently, when detecting a timeout/exception in the watchdog
workCleanupLoop, we call nccl APIs to abort all the active communicators
before finally re-raising the exception and killing the process.  The
nccl APIs may hang, causing additional problems. Instead, just re-raise.

@kumpera proposed that changing this default should save us from a lot of commonly observed errors.

Note: there are other cuda/nccl api calls in our watchdog, which also could hang. This change is not a substitute for a deeper refactor.

Detail

The current default (NCCL_ASYNC_ERROR_HANDLING=1:TearDown) meant the following:

SHOULD_TEAR_DOWN() evaluates to true
  - This affects 'ProcessGroupNCCL::WorkNCCL::handleException`
  - handleException is called from two places:
     - work.wait() -> synchronizeInternal() -> handleException()
     - workCleanupLoop() -> handleException()
  - when true, the exception is logged and rethrown

SHOULD_CLEAN_UP() evaluates to true
  - This only impacts the workCleanupLoop()
  - When true, it means all communicators will be aborted (ncclCommAbort())
    upon work exception or timeout

The proposed new default is NCCL_ASYNC_ERROR_HANDLING=3:SkipCleanUp.

This only changes SHOULD_CLEAN_UP() to false, impacting workCleanupLoop() behavior.
Communicators will no longer be aborted, which should avoid a class of bugs where the watchdog hangs due to calling nccl APIs which may block/hang.
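For users who want the old TearDown behavior back, something like this sketch (set before any process group is created) should work:

```python
import os

# restore the previous default: abort NCCL communicators on watchdog timeout
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # 1 == TearDown, 3 == SkipCleanUp
```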
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110723
Approved by: https://github.com/fduwjj, https://github.com/xw285cornell
2023-10-09 21:38:32 +00:00
0a580da582 Add batch decomposition for torch.linalg.eigh (#110640)
Closes https://github.com/pytorch/pytorch/issues/108481
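Assuming the linked issue is about vmap support, usage would look roughly like this sketch:

```python
import torch

A = torch.randn(4, 3, 3)
A = A + A.mT  # make each matrix in the batch symmetric
eigvals, eigvecs = torch.vmap(torch.linalg.eigh)(A)
print(eigvals.shape)  # torch.Size([4, 3])
```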

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110640
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-10-09 21:36:49 +00:00
201d02ef77 stop non-differentiable values from being materialized in aotautograd (#110721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110721
Approved by: https://github.com/bdhirsh
ghstack dependencies: #110720
2023-10-09 20:18:19 +00:00
c596db762f refactor aotautograd to set requires_grad on info rather than a separate array (#110720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110720
Approved by: https://github.com/bdhirsh
2023-10-09 20:18:19 +00:00
db760527e0 fix(dynamo): list index via polyfill (#110817)
Fixes https://github.com/pytorch/pytorch/issues/109031
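For context, a polyfill here means reimplementing the C-level builtin in plain Python so dynamo can trace through it instead of graph-breaking; an illustrative sketch (not the exact code added):

```python
def list_index(seq, target):
    # traceable stand-in for list.index
    for i, item in enumerate(seq):
        if item == target:
            return i
    raise ValueError(f"{target} is not in list")
```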

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110817
Approved by: https://github.com/ezyang
2023-10-09 19:48:39 +00:00
2a76c7f018 [dtensor] skip move to device when device_type match (#110774)
Skip tensor.to in from_local and distribute_tensor when the device_type of the device mesh matches the tensor's device type. Since from_local is on the critical path of TP, this might also reduce some overhead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110774
Approved by: https://github.com/fduwjj
2023-10-09 19:39:11 +00:00
50bd252863 Fix typo the the (#110869)
This PR fixes the typo `the the` in comments and exception messages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110869
Approved by: https://github.com/soulitzer
2023-10-09 19:32:45 +00:00
b5f9696d81 Fix typo under torch directory (#110824)
This PR fixes the typo `the the` in comments and exception messages in files under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110824
Approved by: https://github.com/H-Huang
2023-10-09 19:16:43 +00:00
d1c157c598 Revert "[reland] Update custom Function preserve torch function when inputs r… (#110679)"
This reverts commit 563728f61c39379070661af3a431aa49eaf5c8ac.

Reverted https://github.com/pytorch/pytorch/pull/110679 on behalf of https://github.com/kit1980 due to The diff has Meta-internal changes, please land from Phabricator ([comment](https://github.com/pytorch/pytorch/pull/110679#issuecomment-1753523182))
2023-10-09 19:09:01 +00:00
8ae623db9d Don't pass tuple to with statement (#110864)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110864
Approved by: https://github.com/Skylion007, https://github.com/awgu
2023-10-09 19:00:34 +00:00
4b881b0da3 [MPS] add support for sgn to MPS backend (#110829)
Fixes #86805

Adds support for sgn to MPS backend.

Notes:

1. @malfet self-assigned this when he was working on implementing polar, but from what I can tell, he didn't end up needing to implement it.

2. @Berzeg implemented this last year, before view_as_complex was supported. Because of @malfet's recent contributions, however, @Berzeg's implementation now works. I've removed the part of his implementation that dealt with non-complex dtypes (since these can just be passed to at::sign), matched the more recent pattern we've been using in UnaryOps.mm, and thrown in a simple implementation of _efficientzerotensor for MPS so that the backward function works.
3. @Berzeg deserves a good bit of credit for this, so let me know if there's a way to assign him some without jamming up the pr (he seems to be AWOL since last working on this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110829
Approved by: https://github.com/malfet
2023-10-09 16:53:25 +00:00
144cda7f06 [BE]: Enable ruff's flake8-PYI rules (#110830)
Enable Flake8-PYI rules codebase wide. Most of the rules already match our codebase style, the remaining ones that were not autofixed I have added to the pyproject.toml to be enabled in a later PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110830
Approved by: https://github.com/albanD
2023-10-09 16:37:26 +00:00
306b2284f2 Add meta kernel for ctc_loss.intList (#107949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107949
Approved by: https://github.com/zou3519
2023-10-09 16:35:14 +00:00
bbdc8c7b05 Revert "deprecating nvfuser c++ API (#110318)"
This reverts commit bf0866fc164b1eab10a5174a57e21eb3321bef89.

Reverted https://github.com/pytorch/pytorch/pull/110318 on behalf of https://github.com/davidberard98 due to too many warnings being thrown in torchvision https://github.com/pytorch/pytorch/issues/110857 ([comment](https://github.com/pytorch/pytorch/pull/110318#issuecomment-1753245449))
2023-10-09 15:41:50 +00:00
2e57b1e847 [BE]: Update NCCL submodule to v2.19.3 (#110827)
Updates NCCL submodule to v2.19.3 Mostly contains some more performance fixes for H100s as well as a couple new performance features and some new plugin support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110827
Approved by: https://github.com/malfet
2023-10-09 13:37:26 +00:00
a18b98f8a2 [xla hash update] update the pinned xla hash (#110852)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110852
Approved by: https://github.com/pytorchbot
2023-10-09 12:00:17 +00:00
cyy
3a70a02a81 Enable Wrange-loop-analysis (#110837)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110837
Approved by: https://github.com/Skylion007
2023-10-09 11:19:03 +00:00
d2a2a67fa4 Added new test sample to interpolate op in OpInfo (#104181)
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier, https://github.com/lezcano
2023-10-09 10:55:56 +00:00
ddb0c26511 [inductor] Re-enable more fixed tests (#110798)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110798
Approved by: https://github.com/Skylion007
2023-10-09 04:36:51 +00:00
92fea5ae3f [GHF] Re-enable test_internal_changes (#110834)
As Jon fixed the internal change status reporting after the issue is closed
Fixes https://github.com/pytorch/pytorch/issues/110218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110834
Approved by: https://github.com/janeyx99
2023-10-09 03:23:07 +00:00
cyy
3ec33957eb [1/N] Enable Wunused-result and Wunused-variable in torch targets (#110722)
They are useful for checking results of function calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110722
Approved by: https://github.com/Skylion007
2023-10-08 23:43:45 +00:00
e1f0f9c64e [dynamo][easy] Move code from GetAttrVariable to a suitable place (#110535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110535
Approved by: https://github.com/jansel
2023-10-08 22:37:34 +00:00
ad24965f6c typo: add space after cudnn error messages (#110806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110806
Approved by: https://github.com/Skylion007
2023-10-08 20:58:40 +00:00
a603dcc307 Fix typo under test directory (#110826)
This PR fixes the typo `the the` in comments in files under the `test` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110826
Approved by: https://github.com/Skylion007
2023-10-08 20:52:38 +00:00
afed0314a8 Fix typo under aten directory (#110822)
This PR fixes the typo `the the` in comments in files under the `aten` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110822
Approved by: https://github.com/Skylion007
2023-10-08 20:52:22 +00:00
105f3b5f91 Fix typo under caffe2 directory (#110825)
This PR fixes the typo `the the` in comments in files under the `caffe2` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110825
Approved by: https://github.com/Skylion007
2023-10-08 20:48:12 +00:00
fde28fdc8c Fix typo under torch/_decomp directory (#110821)
This PR fixes typos in comments in files under the `torch/_decomp` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110821
Approved by: https://github.com/Skylion007
2023-10-08 20:33:49 +00:00
8a8668e1ae [inductor] Implement Fx graph caching to improve warm compilation time. (#103453)
Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR.
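For reference, opting into the cache looks roughly like this (the flag and environment-variable names are assumptions based on how the feature later shipped):

```python
import torch._inductor.config as inductor_config

# enable the on-disk FX graph cache; equivalently, set the
# TORCHINDUCTOR_FX_GRAPH_CACHE=1 environment variable
inductor_config.fx_graph_cache = True
```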

Test Plan:
* New unit tests exercising saving and load from the cache.
* New unit tests to exercise the cache key calculations.
* Ran several benchmarks to see cache hit and resulting compilation times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453
Approved by: https://github.com/eellison
2023-10-08 20:32:15 +00:00
5ef490f736 Update AOTInductor compile logic for CPU backend for Meta internal env (#110729)
Reviewed By: muchulee8, chenyang78

Differential Revision: D49944410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110729
Approved by: https://github.com/chenyang78
2023-10-08 19:48:12 +00:00
36e6b0cfa2 Fix cpuinfo related crash on ppc64 (#110708)
The "import  torch" crashes with following cpuinfo error on powerpc64.
==============================================================
>>> import torch
Error in cpuinfo: processor architecture is not supported in cpuinfo
Fatal error in cpuinfo: cpuinfo_get_processors_count called before cpuinfo is initialized
Aborted (core dumped)
==================================================================
The patch fixes this by excluding powerpc from using cpuinfo as it is not supported for ppc64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110708
Approved by: https://github.com/ezyang
2023-10-08 13:31:54 +00:00
bff28ec568 Fix typo under torch/_export directory (#110808)
This PR fixes typos in comments and messages in files under the `torch/_export` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110808
Approved by: https://github.com/gmagogsfm
2023-10-08 11:47:51 +00:00
844ea6408b feat(dynamo): handle accumulate kwargs ("func", "initial") (#110686)
Follow up to: https://github.com/pytorch/pytorch/pull/110683
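A sketch of the now-supported pattern (illustrative; not the test code):

```python
import operator
from itertools import accumulate

import torch

@torch.compile
def running_prod(x):
    # both keyword arguments are now handled by dynamo instead of graph-breaking
    parts = accumulate(x.unbind(), func=operator.mul, initial=torch.ones_like(x[0]))
    return torch.stack(list(parts))

print(running_prod(torch.arange(1.0, 5.0)))  # tensor([ 1.,  1.,  2.,  6., 24.])
```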

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110686
Approved by: https://github.com/ezyang
2023-10-08 07:06:52 +00:00
fa8e4ea212 Add support for hasattr on ListVariable (#110438)
Fixes #109502
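A sketch of the newly supported pattern (illustrative only):

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    xs = [x, x + 1]
    # hasattr on a traced list no longer forces a graph break
    return sum(xs) if hasattr(xs, "append") else x

print(f(torch.ones(2)))  # tensor([3., 3.])
```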

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110438
Approved by: https://github.com/jansel
2023-10-08 05:34:00 +00:00
58637c4b43 [dynamo] Remove SuperSource (#110475)
The motivation for removing this is already present in the pre-PR comments. Copying it

~~~
# NB - SuperSource is a weird one.
# it is our only source with 2 bases, so we use the object
# as the base, rather than the type, since an invocation
# like super(Foo, foo) is represented here, the source object base is more spiritually
# aligned with the instance, rather than the type.
# This whole construction is questionable tho, and we should probably find a way to
# avoid this exception to our otherwise nice source parentage invariant.
~~~

Instead of using super(a, b), we can use `type(b).__mro__[index]`.
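A quick illustration of the equivalence (plain Python, nothing dynamo-specific):

```python
class A:
    def f(self):
        return "A"

class B(A):
    def f(self):
        return "B"

b = B()
# two equivalent ways to reach A.f from an instance of B
assert super(B, b).f() == "A"
assert type(b).__mro__[1].f(b) == "A"  # __mro__ is (B, A, object)
```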

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110475
Approved by: https://github.com/jansel
2023-10-08 04:45:06 +00:00
6b4c686b9a [aotindutor] Forward fix a performance regression (#110800)
Summary: Forward fix a performance regression caused by https://github.com/pytorch/pytorch/pull/110510. When a model is run once, all those kernel pointers are initialized, and removing the if-nullptr check will cause those loadKernel calls to be unnecessarily executed again when we rerun the forward function. Another way to do this is to codegen loadKernel in the initializer, which I may do in a later PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110800
Approved by: https://github.com/jansel
2023-10-08 04:06:44 +00:00
1824ea3c0f Add a test to make sure all modules in the codebase are importable (#110598)
As per title, running import on any of these files led to a crash.
I'm very curious how the code in them is used!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110598
Approved by: https://github.com/janeyx99, https://github.com/malfet
2023-10-08 03:52:30 +00:00
cyy
230a124a7a [5/N] Move remaining c10::variant calls to std::variant (#110423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110423
Approved by: https://github.com/colesbury
2023-10-08 03:52:02 +00:00
459cef8649 switch dtensor and functional collective to use optree (#110670)
optree recently landed and provides quite good perf, so we conditionally import optree if it is installed.

Some numbers testing mlp layer with TP + func collective:
before this PR: 10.390ms
after this PR: 9.189ms

so roughly a 10% end-to-end CPU overhead reduction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110670
Approved by: https://github.com/fegin
2023-10-08 03:05:39 +00:00
defa0d3a2d Add a side table for triton kernels to avoid using itertools.partial (#110633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110633
Approved by: https://github.com/jansel
2023-10-08 02:01:59 +00:00
57cc886639 Fix public binding check to check all submodules (#110601)
Fix https://github.com/pytorch/pytorch/issues/86619

The test to make sure modules are importable is being added at https://github.com/pytorch/pytorch/pull/110598
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110601
Approved by: https://github.com/zou3519
2023-10-08 00:36:31 +00:00
8edb561631 Fix use after free in tensor creation (#106707)
Fix https://github.com/pytorch/pytorch/issues/106534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106707
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2023-10-07 22:41:21 +00:00
0a5f0b5db3 Support tracing HuggingFace models (#110748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110748
Approved by: https://github.com/avikchaudhuri
2023-10-07 22:37:28 +00:00
d84bcb9c8c [HigherOrderOp] expose torch.cond (#110293)
This PR exposes torch._higher_order_ops.cond as torch.cond.

1. We need to add #noqa: F811 to the _check calls in torch/__init__.py to address a confusing linter error ("Redefinition of unused 'cond'"): only one cond is imported, and the lines that trigger this error don't define cond but just use it as an argument.
2. We also add cond to the list of functions allowed to be traced through, so that dynamo triggers the CondHigherOrder logic instead of creating a TorchVariable.
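A minimal usage sketch of the newly exposed API:

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x):
    # previously only reachable as torch._higher_order_ops.cond
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
```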

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110293
Approved by: https://github.com/zou3519
2023-10-07 20:39:52 +00:00
0a5bb1c2eb Feature/stft no window warn (#110695)
Fixes #88919

@mruberry @peterbell10

This PR adds a warning to the .cpp STFT and ISTFT functions if a window is not provided.
It also describes the warning in the documentation on `functional.py`.
Finally, it adds unit tests to check if the warning is being produced.
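A sketch of what the warning steers users toward:

```python
import torch

x = torch.randn(4096)
# without an explicit window, stft now warns and uses a rectangular
# window, which often leaks spectral energy
spec = torch.stft(x, n_fft=512, return_complex=True)
# preferred: pass a window explicitly
spec = torch.stft(x, n_fft=512, window=torch.hann_window(512), return_complex=True)
```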

I have audited for internal calls of `stft` and `istft` on Pytorch and haven't found any.

Thank you for the opportunity to contribute!

Eric
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110695
Approved by: https://github.com/ezyang
2023-10-07 20:24:36 +00:00
cyy
c3e4e4f6d2 [4/N] Add -Wdeprecated and related fixes (#110204)
This PR enables Wdeprecated on torch_cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110204
Approved by: https://github.com/ezyang
2023-10-07 19:46:08 +00:00
096b14eae8 Fix numel test to be > 2 (#110731)
This makes it consistent with the comment.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110731
Approved by: https://github.com/angelayi
2023-10-07 19:18:59 +00:00
2dc5e166a5 [TP][Inference] Enable DTensor TP inference (#110751)
In https://github.com/pytorch/pytorch/pull/109977, we observed that in inference mode, aten.linear does not get decomposed. So instead of enabling sharding propagation for the linear op, we use func.decompose so that it gets decomposed into matmul and mm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110751
Approved by: https://github.com/bdhirsh, https://github.com/wanchaol
2023-10-07 18:57:27 +00:00
19ce68a45c Fix typo under torch/_numpy directory (#110782)
This PR fixes typos in comments in files under the torch/_numpy directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110782
Approved by: https://github.com/Skylion007
2023-10-07 17:42:35 +00:00
a119efe9c7 [AOTInductor][ez] Fix FallbackKernel.codegen() (#110777)
Summary: ProxyExecutor should only be used in fbcode for cpp codegen.

Test Plan: Existing CIs

Differential Revision: D50048488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110777
Approved by: https://github.com/chenyang78
2023-10-07 15:29:09 +00:00
cyy
12f97bb2e9 [Reland][3/N] Add -Wdeprecated and related fixes (#110518)
Fixes the string_view errors and reland the work. The previous changes in torch/csrc/utils/invalid_arguments.cpp were too aggressive and not tested thoroughly. They are discarded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110518
Approved by: https://github.com/ezyang
2023-10-07 08:38:40 +00:00
98b79e9488 [inductor] Add AOTI ABI shim function for torch.nonzero (#110766)
Summary: `torch.nonzero` doesn't have inductor lowering (yet). To invoke the operator in AOT Inductor's ABI compatibility mode we need a dedicated shim function.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
...
----------------------------------------------------------------------
Ran 4 tests in 78.650s

OK
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/110766
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713, #110745, #110764
2023-10-07 08:32:27 +00:00
13a2f42635 [inductor] Add size, stride, storage_offset to RAIIAtenTensorHandle (#110764)
Summary: For unbacked SymInts, the C++ wrapper codegen can generate expressions like `buf123.size()` or `.stride()` or `.storage_offset()`:

7cc0020a80/torch/_inductor/ir.py (L2504-L2520)

Here we add corresponding methods to the `RAIIAtenTensorHandle` class so that the above codegen works in the ABI compatibility mode.

Test Plan: CI + the following PR


Pull Request resolved: https://github.com/pytorch/pytorch/pull/110764
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713, #110745
2023-10-07 08:26:42 +00:00
abb00f66d8 [inductor] Add AOTI ABI shim function for repeat_interleave.Tensor (#110745)
Summary: `repeat_interleave.Tensor` doesn't have inductor lowering. To invoke the operator in AOT Inductor's ABI compatibility mode we need a dedicated shim function.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_repeat_interleave
...
----------------------------------------------------------------------
Ran 4 tests in 70.526s

OK
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/110745
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713
2023-10-07 08:18:01 +00:00
432df71820 [inductor] added a config to always add tensor constants (#110491)
Summary:
In some scenarios, we want to update constants at runtime.
In such cases, we have to keep the original constants in
the generated code without applying any constant-inlining
optimizations.

This PR adds a config to force us to add tensor constants.

Differential Revision: D49895154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110491
Approved by: https://github.com/mikekgfb
2023-10-07 07:51:54 +00:00
840e68301c [AOTInductor] Change UpdateConstants to UpdateConstantsMap (#110576)
Summary: Change name of UpdateConstants to UpdateConstantsMap


Differential Revision: D49937744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110576
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2023-10-07 07:36:57 +00:00
18f0d3af72 Revert "[user errors] compulsory case names, allow multiple (#110733)" (#110783)
This reverts commit 983f6f36dbaf0210360926547b05deb1e4f798a4.  I have no idea how to revert https://github.com/pytorch/pytorch/pull/110733 with the bot.  So reverting it manually for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110783
Approved by: https://github.com/ZainRizvi, https://github.com/kit1980
2023-10-07 07:32:39 +00:00
d54e20f457 [FSDP][state_dict] Add a unittest for local_state_dict resharding (#110625)
This PR adds a unittest to demonstrate the ability for LOCAL_STATE_DICT to do resharding.

Differential Revision: [D44260141](https://our.internmc.facebook.com/intern/diff/D44260141/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110625
Approved by: https://github.com/wz337
2023-10-07 07:22:41 +00:00
1b34238d67 fix get device index if has _utils._get_device_index in privateuse1 (#108123)
**Get the device index via torch.privateuse1._utils._get_device_index, if that method exists.**

Reason:
Before this change, only device index 0 could be obtained when the ```location``` was a bare backend string such as 'privateuse1'.
With _get_device_index, the accurate device index can be obtained in this scenario.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108123
Approved by: https://github.com/albanD
2023-10-07 06:18:59 +00:00
c2e7a0d689 [core IR] Add decomps for aten.sum and aten.squeeze variants (#110645)
Summary:
## Context

Both `aten.sum` and `aten.squeeze` have a "most generic" variant in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims` respectively. Add decompositions for other non generic variants of these operators to express them using the most generic variant.
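Illustrative sketches (not the exact registered decomps) of expressing the non-generic variants through the generic overloads:

```python
import torch

def sum_default_decomp(x):
    # aten.sum.default == reduce over every dimension
    return torch.ops.aten.sum.dim_IntList(x, list(range(x.dim())))

def squeeze_default_decomp(x):
    # aten.squeeze.default == squeeze every size-1 dimension
    dims = [d for d in range(x.dim()) if x.size(d) == 1]
    return torch.ops.aten.squeeze.dims(x, dims)

x = torch.randn(2, 1, 3)
assert torch.allclose(sum_default_decomp(x), x.sum())
assert squeeze_default_decomp(x).shape == (2, 3)
```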

Note that to register these decomps, the reference implementation under `_refs` had to be removed as registered decompositions. cc: @lezcano @peterbell10

Test Plan: Github CI + Meta Internal CI

Differential Revision: D49965952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
2023-10-07 04:21:51 +00:00
c77dd684c9 Enable typechecking in _inductor/ir.py (#110112)
I used a bunch of ignore-type comments, mostly due to
https://github.com/pytorch/pytorch/issues/109963.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110112
Approved by: https://github.com/peterbell10
2023-10-07 04:19:38 +00:00
e8ef8bfdce [Inductor] Allow matmul to have flexible layout when we are not autotuning (#110726)
Fixes #102804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110726
Approved by: https://github.com/Chillee
2023-10-07 04:08:37 +00:00
5cc1a38370 [release_notes] Some updates after 2.1 release (#110771)
Summary:
1. aligned topic with labels
2. added some more descriptions in release note worksheet template

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110771
Approved by: https://github.com/drisspg
2023-10-07 03:10:46 +00:00
bf0866fc16 deprecating nvfuser c++ API (#110318)
deprecating nvfuser c++ API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110318
Approved by: https://github.com/davidberard98
2023-10-07 02:25:21 +00:00
983f6f36db [user errors] compulsory case names, allow multiple (#110733)
We want to get to a point where most `UserError`s link to `exportdb` examples. This PR makes passing case names non-optional to make this intent clearer and encourage developers who raise `UserError`s to make or point to examples that make fixing such errors more obvious for users.

In addition, sometimes there are multiple examples that are relevant to an error. Thus this PR also enables passing multiple case names.

Differential Revision: [D50020465](https://our.internmc.facebook.com/intern/diff/D50020465/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110733
Approved by: https://github.com/zhxchen17
2023-10-07 01:25:12 +00:00
90bf6e3938 [FSDP][optim_state_dict] Enable cpu_offload config for optimzer state_dict (#108434)
We had the cpu_offload option but never used it, as optimizer state_dict offloads the tensors to CPU by default. This is usually what most users want, since the tensors have to be moved to CPU eventually. However, we may want to disable offloading to CPU in some cases, especially for debugging purposes. This PR lets optimizer state_dict read the flag.

Differential Revision: [D48913340](https://our.internmc.facebook.com/intern/diff/D48913340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108434
Approved by: https://github.com/wz337
2023-10-07 01:14:49 +00:00
563728f61c [reland] Update custom Function preserve torch function when inputs returned as-is (#110679)

reland of https://github.com/pytorch/pytorch/pull/109825#issuecomment-1749803837

Opening this without ghstack to do codev. In our PR, we changed the signature of `_wrap_outputs`. There is some internal code that calls `_wrap_outputs` directly, so we also need to update that callsite.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110679
Approved by: https://github.com/albanD
2023-10-07 00:27:45 +00:00
1c97808f81 [dtensor] support lt/gt op (#110585)
This PR enables lt/gt aten op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110585
Approved by: https://github.com/fduwjj
ghstack dependencies: #110584
2023-10-07 00:06:36 +00:00
9378a2ceda [dtensor] support aten.where and enable implicit scalar promotion (#110584)
This PR adds support for aten.where and for implicit scalar promotion: when we encounter scalar tensors in the dispatching logic, we implicitly convert them to replicated DTensors.

The latter also enables a bunch of ops in the op db to pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110584
Approved by: https://github.com/fduwjj
2023-10-07 00:06:36 +00:00
e3bf5000a7 Hide the contiguous requirement for user input mesh when initializing DeviceMesh (#110628)
Summary:
As title, this diff hides the contiguous requirement for user input mesh when initializing DeviceMesh.

In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp"),
)
```

Test Plan:
**Unit Test**:
```
buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:device_mesh -- test_validate_device_mesh

Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649876878399
Network: Up: 0B  Down: 0B
Jobs completed: 6. Time elapsed: 1:58.7s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

**Test with MP**
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
    "cuda",
    mesh.contiguous(),
    mesh_dim_names=("dp", "mp"),
)
```
Without the change: exception.
After this change: initialized successfully.

Differential Revision: D49942839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110628
Approved by: https://github.com/wanchaol, https://github.com/xw285cornell, https://github.com/fduwjj
2023-10-06 23:54:13 +00:00
a0bbd075b2 Add the Mode section in the extending doc (#110073)
Covers the basic principles of Modes, with an example of how to use them and an explanation of their behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110073
Approved by: https://github.com/janeyx99
2023-10-06 23:50:55 +00:00
6b1007b2a7 Fix error in div lowering with integers (#102809)
Fixes https://github.com/pytorch/pytorch/issues/101016
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102809
Approved by: https://github.com/ngimel
ghstack dependencies: #110501, #110504, #110591, #110668, #110687
2023-10-06 23:21:40 +00:00
d35e3dbd06 Fix concurrency limits for Create Release (#110759)
Also, don't run it on tags, but run on release branch and on `release` event.
Tweak linter to accept different concurrency limits for `create_release.yml`

Fixes https://github.com/pytorch/pytorch/issues/110569 as all the invocations of workflow in the past were cancelled by concurrently limit due to the tag push and release happening at roughly the same time, see https://github.com/pytorch/pytorch/actions/workflows/create_release.yml?query=event%3Arelease

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110759
Approved by: https://github.com/atalman
2023-10-06 23:14:12 +00:00
9b55194f81 fix(dynamo): Incorrect accumulate implementation, bad tests (#110683)
Root cause of: https://github.com/pytorch/pytorch/issues/110287

Fixed many tests that didn't actually test due to unreliability of `CompileCounter.frame_count` in detecting graph breaks: https://github.com/pytorch/pytorch/issues/110730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110683
Approved by: https://github.com/voznesenskym
2023-10-06 23:07:56 +00:00
4342b0849f [vision hash update] update the pinned vision hash (#110667)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110667
Approved by: https://github.com/pytorchbot
2023-10-06 23:01:11 +00:00
f952551963 Handle invalid cancellation signals in trymerge (#110690)
This change is needed after https://github.com/pytorch/test-infra/pull/4579 and https://github.com/pytorch/test-infra/pull/4610.  All invalid cancelled signals have been removed from Dr.CI and HUD.  So trymerge should ignore them accordingly for a consistent experience.

### Testing

https://github.com/pytorch/pytorch/pull/110367#issuecomment-1750099960 is the PR where a bunch of invalid cancelled signals showed up and blocked merges

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110690
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-10-06 22:43:33 +00:00
2aa3064364 [inductor] Add aoti_torch_dtype_bool to AOTI ABI shim (#110713)
Summary: ATT

Test Plan: CI


Pull Request resolved: https://github.com/pytorch/pytorch/pull/110713
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-10-06 22:16:39 +00:00
65d40a72c4 Delete rogue print from test_quantize_pt2e.py (#110732)
Introduced by https://github.com/pytorch/pytorch/pull/110308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110732
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/jerryzh168
2023-10-06 22:16:10 +00:00
59592ce9f2 [CUDA Host Allocator][ROCm] fixes (#110715)
Follow up to #110123, removing the CUDA_VERSION check for ROCm because HIP already has hipMallocAsync() and doesn't need the version check there.

Follow up to #108488, fixing the failing unit tests by accepting either a "cuda" or "hip" attribute for the caching allocator options.  This is aligned to the masquerading strategy for ROCm/HIP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110715
Approved by: https://github.com/ezyang
2023-10-06 21:42:24 +00:00
3d87c52cef Remove stuff for Python before 3.8 from install_conda.sh (#110671)
As we only support Python 3.8+ now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110671
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman, https://github.com/malfet, https://github.com/ZainRizvi
2023-10-06 21:40:28 +00:00
f4796df914 Add support for generators on the IPU device (#110704)
This change adds hooks similar to those used on other device types, to allow Torch to create and use generators provided by the IPU backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110704
Approved by: https://github.com/ezyang
2023-10-06 21:36:14 +00:00
44d34fe65c different bounds for same Dim name (#110638)
Previously, `Dim` definitions that shared the same name but had different ranges were allowed to appear in the `dynamic_shapes` argument of an `export` call. They would correspond to the *same* dynamic dimension (identified by the shared name), whose effective range would be the *intersection* of the different ranges.

However, this behavior can be confusing, because having different definitions with the same name is more likely than not unintentional. Therefore, this PR makes it a user error.

We still allow different definitions with the same name to exist at the same time (no global uniqueness) as long as they are not confused in the same `export` call. Redefinitions with the same bounds are also allowed, in case they are accidentally created by executing the same code multiple times.
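A sketch of what now errors versus what remains allowed (names are illustrative):

```python
from torch.export import Dim

batch_a = Dim("batch", min=2, max=64)
batch_b = Dim("batch", min=4, max=32)  # same name, different bounds
# using batch_a and batch_b together in a single export(...) call is now a
# UserError instead of a silent intersection of the two ranges;
# re-executing Dim("batch", min=2, max=64) elsewhere remains fine
```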

Differential Revision: [D49965944](https://our.internmc.facebook.com/intern/diff/D49965944/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110638
Approved by: https://github.com/zhxchen17
2023-10-06 21:22:52 +00:00
0d4a360fa2 remove replaced symbols from range_constraints (#110644)
While the `range_constraints` that is initially derived by processing of constraints only contains symbols that appear in the graph module, eventually the `range_constraints` that are in the exported program seem to contain more symbols than those that appear in the graph module. Clearly this is a regression, because the example of "Expressing Dynamism" in our public docs (https://pytorch.org/docs/stable/export.html#expressing-dynamism) does not show the extra symbols in `range_constraints`, but running the example does.

The problem seems to arise when we are running `_transform` passes, where we regenerate the `range_constraints` from the `shape_env`. However, as a rule, symbols that have `replacements` are actually replaced (by other expressions, including constants or other symbols), so they should never appear in the graph module. Thus we can filter such symbols out from `range_constraints` as well.

Differential Revision: [D49969620](https://our.internmc.facebook.com/intern/diff/D49969620/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110644
Approved by: https://github.com/zhxchen17
2023-10-06 21:13:55 +00:00
f74937741e Remove runtime assertions between export and AOT compilation (#110710)
Summary: The runtime assertions inserted in the `torch._export.export` by the `_AddRuntimeAssertionsForInlineConstraintsPass` lead to errors in AOT Inductor like #109884. In `torch._export.aot_compile` export and AOT compilation are run consecutively which would lead to the above issue if any assertions are inserted.

In this PR, we're adding a new parameter / flag to `torch._export.aot_compile`, `remove_runtime_assertions`, to remove the assertions inserted during export before AOT compilation. The flag is set to `False` for BC.

Additionally, we remove the flag `add_runtime_assertions_for_inline_constraints` recently added to `torch._dynamo.config`, as it can lead to undesirable `torch._export` behavior and is no longer required for AOT Inductor testing purposes.

Test Plan: CI


Pull Request resolved: https://github.com/pytorch/pytorch/pull/110710
Approved by: https://github.com/zhxchen17, https://github.com/chenyang78
2023-10-06 21:09:35 +00:00
7cc0020a80 [decomp] Fix different return type in threshold_backward vs. eager (#110689)
due to type promotion with floating point scalar in decompositions.py

Fixes part of #100838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110689
Approved by: https://github.com/ezyang
2023-10-06 20:59:58 +00:00
756b4e9e08 [export] Add codeowners. (#110718)
Summary: So that we can catch all changes under export/

Test Plan: CI

Differential Revision: D50017157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110718
Approved by: https://github.com/tugsbayasgalan
2023-10-06 20:57:51 +00:00
b8a3998c23 add batch rule for missing inplace ops (#110692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110692
Approved by: https://github.com/ezyang
2023-10-06 20:53:28 +00:00
1b1bc08557 [Dynamo] SizeVariable can be indexed by symint (#110349)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110349
Approved by: https://github.com/williamwen42
2023-10-06 20:48:07 +00:00
ff0358b038 Revert "[C10] PG observability hooks. (#108815)"
This reverts commit 0c7a877745f98b8fce8868291408945c0dd817d6.

Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))
2023-10-06 19:49:49 +00:00
37a0265992 [Inductor] Disallow OpOverloadPacket in ir.FallbackKernel (#110567)
In ABI compatible mode, We always need op_overload.schema for FallbackKernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110567
Approved by: https://github.com/jansel
2023-10-06 19:20:50 +00:00
0c7a877745 [C10] PG observability hooks. (#108815)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and is called inline from the
PG creation code. This is fine since it happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches them.

Queue notification is oddly done using a pipe; this is needed so Python can abort the thread on shutdown
and keep it as a background thread. This is not possible with more reasonable choices like a condvar.
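A sketch of the registration API described above (callback payloads are assumed; note this PR was reverted shortly after, so treat the names as historical):

```python
import torch.distributed.hooks as dist_hooks

def on_start(info):
    print("collective started:", info)

def on_end(info):
    print("collective finished:", info)

dist_hooks.register_collective_start_hook(on_start)
dist_hooks.register_collective_end_hook(on_end)
dist_hooks.register_process_group_hook(lambda pg: print("pg created:", pg))
```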
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-10-06 18:52:46 +00:00
17348b0f51 Implement split_with_sizes backward for NT (#110647)
Needed internally. Note that `split_with_sizes()` for NT is currently supported only on `dim=-1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110647
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #110646
2023-10-06 18:44:22 +00:00
48240ec62e Make unbind() overrideable for NT subclass (#110646)
Reland of #109122. Fixed the memory leak by not saving the outputs of `unbind()` for backward. Rather, the NT sizes are saved so undefined grads can be replaced with zeros of the correct size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110646
Approved by: https://github.com/soulitzer, https://github.com/cpuhrsch
2023-10-06 18:44:22 +00:00
33da6c8951 [sparse] Add i8i8->i32 support for cuSPARSELt (#110499)
Summary:

With the release of cuSPARSELt v0.5.0, we now have support for
int8 int8 -> int32 matmul.

This PR adds support for this via out_dtype.

Test Plan:
```
python test/test_sparse_semi_structured.py -k int32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110499
Approved by: https://github.com/cpuhrsch
2023-10-06 18:32:47 +00:00
f7ce19d40a Fix typo under torch/onnx directory (#110697)
This PR fixes typos in comments in files under the `torch/onnx` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110697
Approved by: https://github.com/ezyang
2023-10-06 18:21:00 +00:00
69ea214cc2 [reland] Update singleton int to error when inequality relation is undefined (#110672)
reland of https://github.com/pytorch/pytorch/pull/110044
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110672
Approved by: https://github.com/ezyang
2023-10-06 17:50:25 +00:00
576b80d23e Revert "[HigherOrderOp] expose torch.cond (#110293)"
This reverts commit 601f872831649bccf1069ac59b2ecfd0895a88e3.

Reverted https://github.com/pytorch/pytorch/pull/110293 on behalf of https://github.com/ydwu4 due to Sorry, didn't check the error carefully on the PR. A doc error is related to this pr ([comment](https://github.com/pytorch/pytorch/pull/110293#issuecomment-1751176719))
2023-10-06 17:44:17 +00:00
cyy
e75f2e2ea1 Fix clang-tidy warnings in CUDAPluggableAllocator (#110678)
This PR fixes clang-tidy warnings in CUDAPluggableAllocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110678
Approved by: https://github.com/Skylion007
2023-10-06 17:33:08 +00:00
601f872831 [HigherOrderOp] expose torch.cond (#110293)
This PR exposes torch._higher_order_ops.cond as torch.cond.

1. Need to add #noqa: F811 to the _check calls in torch/__init__.py to address a confusing linter error ("Redefinition of unused 'cond'"): only one cond is imported, and the lines that trigger this error don't define cond but merely use it as an argument.
2. Also add cond to the list of functions allowed to be traced through, so that dynamo triggers the CondHigherOrder logic instead of creating a TorchVariable. A usage sketch follows the list.
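
A minimal usage sketch (not taken from the PR; the predicate and branch functions are illustrative):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x):
    # branch on a data-dependent predicate without a graph break
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

print(f(torch.randn(4)))
```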

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110293
Approved by: https://github.com/zou3519
2023-10-06 17:04:31 +00:00
e8f1f4ed66 [quant][pt2][ROCm] follow-up PR 109908 for miopen_batch_norm (#110653)
Fixes recent broken unit tests caused by PR #109908 because cudnn and miopen have separate batch norm functions.

```
2023-10-05T09:35:01.6606614Z _______________ TestQuantizePT2EQAT.test_qat_conv_bn_fusion_cuda _______________
2023-10-05T09:35:01.6606948Z Traceback (most recent call last):
2023-10-05T09:35:01.6607362Z   File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 323, in test_qat_conv_bn_fusion_cuda
2023-10-05T09:35:01.6607767Z     self._verify_symmetric_xnnpack_qat_graph(
2023-10-05T09:35:01.6608217Z   File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 130, in _verify_symmetric_xnnpack_qat_graph
2023-10-05T09:35:01.6608658Z     self._verify_symmetric_xnnpack_qat_graph_helper(
2023-10-05T09:35:01.6609105Z   File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 173, in _verify_symmetric_xnnpack_qat_graph_helper
2023-10-05T09:35:01.6609623Z     m = prepare_qat_pt2e(m, quantizer)
2023-10-05T09:35:01.6610171Z   File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize_pt2e.py", line 178, in prepare_qat_pt2e
2023-10-05T09:35:01.6610561Z     _fuse_conv_bn_qat(model)
2023-10-05T09:35:01.6611072Z   File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 501, in _fuse_conv_bn_qat
2023-10-05T09:35:01.6611497Z     m = _fuse_conv_bn_qat_helper(m, is_cuda=True)
2023-10-05T09:35:01.6612065Z   File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 575, in _fuse_conv_bn_qat_helper
2023-10-05T09:35:01.6612492Z     _get_conv_bn_getitem_nodes(r.replacements)
2023-10-05T09:35:01.6613058Z   File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 383, in _get_conv_bn_getitem_nodes
2023-10-05T09:35:01.6613465Z     assert bn_node is not None
2023-10-05T09:35:01.6613716Z AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110653
Approved by: https://github.com/jerryzh168, https://github.com/pruthvistony
2023-10-06 15:30:55 +00:00
c4db607607 Doc test non packages (#110568)
Add non-package python modules to the public API checks.
The original change is to remove the `ispkg` check in this line
https://github.com/pytorch/pytorch/blob/main/docs/source/conf.py#L518

Everything else is to add the appropriate modules to the rst files, make sure every module we provide can be imported (fixed by either making optional dependencies truly optional or deleting files that have been un-importable for 3 years), make APIs that are both modules and functions (like torch.autograd.gradcheck) render properly on the docs website without confusion, and add every non-documented API to the allow list (~3k of them).

Next steps will be to try and fix these missing docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110568
Approved by: https://github.com/zou3519
2023-10-06 14:16:01 +00:00
a3e5ec453a Move Docker official builds to Cuda 12.1.1 (#110703)
Since our PyPI release uses CUDA 12.1.1, move the Docker builds to 12.1.1. Related to: https://github.com/pytorch/pytorch/issues/110643
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110703
Approved by: https://github.com/DanilBaibak
2023-10-06 13:56:45 +00:00
261cae793a [cpu] remove vec code for ops that do not support complex no (#110280)
Removes dead code pertaining to ATen ops for which complex dtype is unsupported.

Reference: https://github.com/pytorch/pytorch/pull/110217#discussion_r1340599702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110280
Approved by: https://github.com/vfdev-5
2023-10-06 12:10:18 +00:00
ceb773b68d Fix #110680 (requires_grad typo in decomp) (#110687)
Fixes https://github.com/pytorch/pytorch/issues/110680
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110687
Approved by: https://github.com/voznesenskym, https://github.com/lezcano
ghstack dependencies: #110501, #110504, #110591, #110668
2023-10-06 10:36:01 +00:00
d776dd04ac perf(optim/dynamo): shortcut is_sparse iteration in SGD multi_tensor (#110648)
Originated: https://github.com/pytorch/pytorch/pull/110353#discussion_r1347806922

Significantly speeds up the non-sparse path (the majority use case).

Benchmarks: https://github.com/pytorch/pytorch/issues/110506#issuecomment-1747732478

CC: @janeyx99
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110648
Approved by: https://github.com/janeyx99
2023-10-06 08:56:18 +00:00
96f616a054 Revert tl.int1 casting change for ROCm to avoid hangs (#110531)
Seeing hangs on ROCm seemingly after this PR https://github.com/pytorch/pytorch/pull/110388
https://ossci-raw-job-status.s3.amazonaws.com/log/17381916785
`inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_exp2_cuda_bool Command took >30min, returning 124`

Conditionalising out of this while we investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110531
Approved by: https://github.com/peterbell10
2023-10-06 08:53:45 +00:00
6b92c367c5 Add test_jit_cuda_fuser to ROCM_BLOCKLIST (#110440)
Adds the nvfuser-related unit test suite to ROCM_BLOCKLIST, as it should not be run on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110440
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/lezcano
2023-10-06 08:47:15 +00:00
65afa760a6 Add a script to run iOS test app on AWS Device Farm (#110202)
This adds a script to test PyTorch on actual iOS devices on AWS Device Farm. The test can take quite a long time waiting for devices to become available, so the steps are done manually and documented in `ios/TestApp/README.md`.

### Testing

1. TestApp itself runs fine on my local iPhone 13 and on [device farm](https://us-west-2.console.aws.amazon.com/devicefarm/home#/mobile/projects/b531574a-fb82-40ae-b687-8f0b81341ae0/runs/d2653ca8-8ee2-44dd-b15e-0402f9ab0aca).  I can see the benchmark results output in the console log.
```
BUILD_LITE_INTERPRETER=1 USE_PYTORCH_METAL=1 USE_COREML_DELEGATE=1 IOS_PLATFORM=OS IOS_ARCH=arm64 ./scripts/build_ios.sh

pushd ios/TestApp/benchmark
ruby setup.rb --lite 1 -t 9HKVT38N77 --benchmark
popd

ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "OS"
```

2. Trying to run TestAppTests https://github.com/pytorch/pytorch/blob/main/ios/TestApp/TestAppTests/TestLiteInterpreter.mm on my local iPhone ends up with this error `Logic Testing Unavailable. Logic Testing on iOS devices is not supported. You can run logic tests on the Simulator`.  I updated the Xcode project to reuse TestApp as the host application.
```
ruby setup.rb --lite 1 -t 9HKVT38N77
```

3. Trying [another round of testing on device farm](https://us-west-2.console.aws.amazon.com/devicefarm/home#/mobile/projects/b531574a-fb82-40ae-b687-8f0b81341ae0/runs/18dbd69d-8608-46d8-a868-bd05b69375db)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110202
Approved by: https://github.com/kit1980
2023-10-06 08:23:16 +00:00
7d98549ca9 retain_graph=True in compiled_autograd (#110367)
Adds support for retain_graph=True (known as keep_graph_ internally in the autograd engine).
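
A hypothetical sketch of what this enables (assumes the internal `torch._dynamo.compiled_autograd.enable` context manager, which takes a compiler callable; not a public API):

```python
import torch
import torch._dynamo.compiled_autograd

x = torch.randn(4, requires_grad=True)

with torch._dynamo.compiled_autograd.enable(torch.compile):
    loss = (x ** 2).sum()
    loss.backward(retain_graph=True)  # the graph is kept alive...
    loss.backward()                   # ...so a second backward now works
```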

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110367
Approved by: https://github.com/jansel
2023-10-06 08:22:10 +00:00
63fe5de89b feat(optim): add SGD sparse multitensor to testing path (#110562)
Follow up to: https://github.com/pytorch/pytorch/pull/110454, which defines the infra for sparse multi tensor optimizer testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110562
Approved by: https://github.com/janeyx99
2023-10-06 07:48:25 +00:00
371d8ba599 vmap: decompose real and imag instead of registering batch rule (#110508)
Clean-up

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110508
Approved by: https://github.com/zou3519
2023-10-06 06:01:12 +00:00
e8605f6f22 Correct outdated Doxygen link (#110654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110654
Approved by: https://github.com/huydhn
2023-10-06 05:23:27 +00:00
6d23193aab Added strict=True to zip in aot_autograd (#110668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110668
Approved by: https://github.com/ezyang
ghstack dependencies: #110501, #110504, #110591
2023-10-06 05:12:05 +00:00
d279979102 perf(inductor): improve Adam compile times by shortcutting for loops (via has_complex) (#110607)
Adam part of: https://github.com/pytorch/pytorch/issues/110506

TODO:
- If this approach is validated as a good one, it can also be applied to all other optimizers which convert `complex` via list comprehensions

### Results:
`NUM_PARAMS=200, foreach=True`
- main: dynamo: 43s, inductor: 31s, total: 74s
- this PR: dynamo: 3.5s, inductor: 30s, total: 34s (dynamo speedup: 12.3x; overall: 74s → 34s, a 2.1x speedup)

`NUM_PARAMS=1000, foreach=True, has_complex shortcut`:

```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function                              Runtimes (s)
------------------------------------  -------------------------------
_compile.<locals>.compile_inner       0.0329, 50.0806, 0.0041
OutputGraph.call_user_compiler        44.9924
```

`NUM_PARAMS=1000, foreach=True`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function                              Runtimes (s)
------------------------------------  -------------------------------
_compile.<locals>.compile_inner       0.0389, 58.6069, 0.0043
OutputGraph.call_user_compiler        44.1425
```

### Discussion
- The `has_complex` shortcut provides an additional 2x dynamo speedup, though it is not required to achieve a significant overall speedup.

CC: @janeyx99 @mlazos

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110607
Approved by: https://github.com/janeyx99, https://github.com/lezcano
2023-10-06 05:08:49 +00:00
26bfb0fc21 Check for both workflow and job names from Dr.CI (#110661)
In https://github.com/pytorch/pytorch/pull/110362, the failure was flaky but merge bot treated it as an actual failure. This is a regression after https://github.com/pytorch/test-infra/pull/4604 where the name returned by Dr.CI now includes workflow name.  For example, the name is `trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)` in the JSON response:

```
{"FAILED": [], "FLAKY": [{"workflowId": 6372581477, "id": 17297638807, "name": "trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)", "jobName": "macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)", "conclusion": "failure", "completed_at": "2023-10-01T22:18:28Z", "html_url": "https://github.com/pytorch/pytorch/actions/runs/6372581477/job/17297638807", "head_branch": "ciflow/trunk/110362", "pr_number": 110362, "head_sha": "03f51e36dedf234931006d1db61677b229c9a119", "failure_captures": ["Failure: There is only 4671284KB free space left in /, which is less than the minimum requirement of"], "failure_line": "Failure: There is only 4671284KB free space left in /, which is less than the minimum requirement of 6291456KB for macOS", "time": "2023-10-01T22:17:53.847751Z"}], "BROKEN_TRUNK": [], "UNSTABLE": []}
```

I updated the merge bot to handle this better by considering the workflow name, the job name, and the combined full name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110661
Approved by: https://github.com/clee2000
2023-10-06 04:36:52 +00:00
64583c4d04 [CUDA Host Allocator] Add support of CudaHostRegister (#108488)
Summary: This diff adds another option to create cuda pinned memory using cudaHostRegister.

Differential Revision: D45843715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108488
Approved by: https://github.com/zdevito
2023-10-06 04:13:02 +00:00
57e9969021 feat(optim): Add adadelta multi_tensor support for complex, with has_complex shortcut (#110631)
Partial fix: https://github.com/pytorch/pytorch/issues/110606

More on `has_complex` shortcut: https://github.com/pytorch/pytorch/pull/110613#issuecomment-1749314805

CC: @janeyx99, @mlazos, @lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110631
Approved by: https://github.com/lezcano
2023-10-06 03:34:41 +00:00
11047be10e feat(optim): Add NAdam support for complex, with has_complex shortcut (#110634)
Partial fix: https://github.com/pytorch/pytorch/issues/110606

More on `has_complex` shortcut: https://github.com/pytorch/pytorch/pull/110613#issuecomment-1749314805

CC: @janeyx99 @mlazos @lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110634
Approved by: https://github.com/lezcano
2023-10-06 03:31:48 +00:00
347ea3fe0d feat(optim): Add RAdam support for complex, with has_complex shortcut (#110635)
Partial fix: https://github.com/pytorch/pytorch/issues/110606

More on `has_complex` shortcut: https://github.com/pytorch/pytorch/pull/110613#issuecomment-1749314805

CC: @janeyx99 @mlazos @lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110635
Approved by: https://github.com/lezcano
2023-10-06 03:29:26 +00:00
be5dc3a00d [export] Update ArgumentSpec definition. (#110612)
Summary: Changing ArgumentSpec into a true union type in Python without changing serialization format.

Test Plan: CI

Differential Revision: D49871088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110612
Approved by: https://github.com/angelayi
2023-10-06 03:14:45 +00:00
83061ee177 [aotinductor] Fix benchmarks with self.autocast (#110490)
Fixes https://github.com/pytorch/pytorch/issues/108173

The original error was that there was a type mismatch between the output of eager mode (float16) and from aot_compile (float32). This is because when we run the model eagerly in the benchmarks, we call [self.model_iter_fn](https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L2072-L2076) to run the model, rather than directly calling the model. In the case of timm models, it calls the model with [self.autocast()](https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/timm_models.py#L321-L323), causing the eager model to return a float16 value. However, the model we export with aot_compile does not have the self.autocast context, so it returns float32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110490
Approved by: https://github.com/desertfire
2023-10-06 02:13:47 +00:00
8a09fe4a05 [ez] Remove print in heuristics aggregation (#110621)
Move the print to the beginning instead, because putting it at the end means you have to scroll through it when debugging, and nothing in that function indicates that it should be printing anything.

Also moved the line that prints disabled issues out of the for loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110621
Approved by: https://github.com/huydhn
2023-10-06 02:04:53 +00:00
dac895c10a Revert "Multiprocessing support for NT (#110292)"
This reverts commit f17fe89e14ef7c29690d989c857ae011b8589b80.

Reverted https://github.com/pytorch/pytorch/pull/110292 on behalf of https://github.com/kit1980 due to Causes CUDA memory leaks ([comment](https://github.com/pytorch/pytorch/pull/110292#issuecomment-1749852095))
2023-10-06 01:07:40 +00:00
555c83d097 Added a UserWarning when using torch.{std,var,std_mean,var_mean} with dof<=0 (#109824)
Fixes #109696.

This PR adds a `UserWarning` when calling
- `torch.var`
- `torch.var_mean`
- `torch.std`
- `torch.std_mean`

with an effective `dof<=0`. Until now, only `torch.cov` warned about this. The code also handles edge cases, such as `torch.empty`
```
>>> import torch; torch.std_mean(torch.empty(0), correction=0)
<stdin>:1: UserWarning: std_mean(): degrees of freedom is <= 0 (Triggered internally at /app/aten/src/ATen/native/ReduceOps.cpp:1671.)
(tensor(nan), tensor(nan))
```

multi-dim reductions

```
>>> import torch; torch.std_mean(torch.empty(10, 30, 20, 50), correction=600, dim=(1, 2))
<stdin>:1: UserWarning: std_mean(): degrees of freedom is <= 0 (Triggered internally at /app/aten/src/ATen/native/ReduceOps.cpp:1671.)
[... snip ...]
```

and a negative `correction`.

```
>>> import torch; torch.std_mean(torch.randn(0), correction=-5)
(tensor(nan), tensor(nan))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109824
Approved by: https://github.com/soulitzer
2023-10-06 01:03:47 +00:00
81ce5d5725 Revert "pin_memory support for NT (#110404)"
This reverts commit 3597325bc7f07d97ded1c94c47bb59c98e080a0f.

Reverted https://github.com/pytorch/pytorch/pull/110404 on behalf of https://github.com/kit1980 due to Previous PR in the stack caused CUDA memory leaks ([comment](https://github.com/pytorch/pytorch/pull/110404#issuecomment-1749850211))
2023-10-06 01:03:17 +00:00
cyy
11b3210a11 [Reland2] Remove calls of c10::either (#110487)
This PR is reland of #109707 with fixes of MSVC failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110487
Approved by: https://github.com/soulitzer
2023-10-06 00:25:15 +00:00
330db8278b Revert "Update singleton int to error when inequality relation is undefined (#110044)"
This reverts commit 07331c65e6b47f41475fc0d81ba03917f39b55dd.

Reverted https://github.com/pytorch/pytorch/pull/110044 on behalf of https://github.com/PaliC due to bottom diff is causing a plethora of internal failures ([comment](https://github.com/pytorch/pytorch/pull/110044#issuecomment-1749805209))
2023-10-05 23:55:37 +00:00
1c3fae46ee Revert "Support SingletonSymNode mul with coefficient (#110369)"
This reverts commit eb8feb8ff8610d53d92773c2d7dce05c2196d672.

Reverted https://github.com/pytorch/pytorch/pull/110369 on behalf of https://github.com/PaliC due to bottom diff is causing a plethora of internal failures ([comment](https://github.com/pytorch/pytorch/pull/110369#issuecomment-1749802899))
2023-10-05 23:51:28 +00:00
236afe73a2 Revert "Update custom Function preserve torch function when inputs returned as-is (#109825)"
This reverts commit 4e73eee93f411596fcabb32cc8e7686890d1c7fb.

Reverted https://github.com/pytorch/pytorch/pull/109825 on behalf of https://github.com/PaliC due to causing a plethora of internal failures ([comment](https://github.com/pytorch/pytorch/pull/109825#issuecomment-1749802739))
2023-10-05 23:49:41 +00:00
fdf6055ea7 Revert "Add symbolic singleton int (#110370)"
This reverts commit a7145cb3a42e925209c7f34c0b8b169dc72ff4c6.

Reverted https://github.com/pytorch/pytorch/pull/110370 on behalf of https://github.com/PaliC due to bottom diff is causing a plethora of internal failures ([comment](https://github.com/pytorch/pytorch/pull/110370#issuecomment-1749801188))
2023-10-05 23:47:09 +00:00
585e2bd818 Revert "Symintify guards.cpp (#110371)"
This reverts commit e1cfcdfa06d476fb7c6dc9be1b677b23569d4ed6.

Reverted https://github.com/pytorch/pytorch/pull/110371 on behalf of https://github.com/PaliC due to bottom diff is causing a plethora of internal failures ([comment](https://github.com/pytorch/pytorch/pull/110371#issuecomment-1749798063))
2023-10-05 23:42:35 +00:00
bcd44dac60 Revert "Use is_symbolic instead of testing isinstance in some place (#110372)"
This reverts commit 8672d64fed2d76062f14a74075d560fe6fc38b1a.

Reverted https://github.com/pytorch/pytorch/pull/110372 on behalf of https://github.com/PaliC due to bottom diff is causing a plethora of internal failures ([comment](https://github.com/pytorch/pytorch/pull/110372#issuecomment-1749795074))
2023-10-05 23:37:37 +00:00
5d963474aa Replace enforce_dtype with dtype in ShardedTensor.gather (#110561)
Summary:
Sometimes local_shards are empty on some ranks while out.dtype is float16, which will cause an error if enforce_dtype is True, because `data` will be float32.

Callers know best what dtype they want, so we can just let callers decide.

Temporarily keep enforce_dtype for backward compatibility

Test Plan: Run local and MAST job

Reviewed By: uciyc123

Differential Revision: D46886551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110561
Approved by: https://github.com/wanchaol, https://github.com/malfet
2023-10-05 23:16:23 +00:00
f274c7b32c Add functional collective all_to_all_single and support it in Inductor (#110195)
Copy of https://github.com/pytorch/pytorch/pull/106655 from yf225
rebased on top of item() support changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110195
Approved by: https://github.com/Skylion007
2023-10-05 23:11:51 +00:00
df7d01aed5 perf(inductor): use for loop with shortcut in Optimizers to speedup against list comprehensions (e.g. complex conversion) (#110613)
Fully fixes: https://github.com/pytorch/pytorch/issues/110506

Depends: https://github.com/pytorch/pytorch/pull/110607
Potential merge conflicts:
- https://github.com/pytorch/pytorch/pull/110339
- https://github.com/pytorch/pytorch/pull/110345
- https://github.com/pytorch/pytorch/pull/110454

Related:
- https://github.com/pytorch/pytorch/issues/110606 (we can apply the improvements here orthogonally to the complex support)

### Results

Benchmark: 100 params.

Breakdowns (float32, dynamo):
```
Adagrad: this PR: 4.4s, main: 8.8s
Adam: this PR: 2.1s, main: 9.8s
AdamW: this PR: 2.5s, main: 8.2s
ASGD: this PR: 3.1s, main: 8.5s
RMSProp: this PR: 1.3s, main: 4.2s
RProp: this PR: 6.7s, main: 14.9s
```

Notes:
1. Adagrad is still slow due to `_get_value` list comprehension. Can be fixed in https://github.com/pytorch/pytorch/pull/110339/files by utilizing capturable path
2. Adamax is not actually compiled (it is currently disabled).
3. Inductor compile time is quite variable. We calculate dynamo time by subtracting the `call_user_compiler` time from the `compile_inner` timing.

<details>

This PR:
```
Adagrad (torch.float32): 28.47496461868286s
Adagrad (torch.complex64): 29.379547357559204s
Adam (torch.float32): 17.334211587905884s
Adam (torch.complex64): 29.637500524520874s
Adamax (torch.float32): 2.4749321937561035s
Adamax (torch.complex64): 3.1997995376586914s
AdamW (torch.float32): 18.06532859802246s
AdamW (torch.complex64): 28.25661015510559s
ASGD (torch.float32): 23.70255398750305s
ASGD (torch.complex64): 25.33756995201111s
RMSprop (torch.float32): 7.964028596878052s
RMSprop (torch.complex64): 12.909599781036377s
Rprop (torch.float32): 30.512362003326416s
Rprop (torch.complex64): 44.74405765533447s
```

Main
```
Adagrad (torch.float32): 26.919506072998047s
Adagrad (torch.complex64): 35.190622091293335s
Adam (torch.float32): 25.715000867843628s
Adam (torch.complex64): 24.17716670036316s
Adamax (torch.float32): 2.4404726028442383s
Adamax (torch.complex64): 3.3538928031921387s
AdamW (torch.float32): 25.2022807598114s
AdamW (torch.complex64): 28.915700912475586s
ASGD (torch.float32): 24.108731985092163s
ASGD (torch.complex64): 26.589075088500977s
RMSprop (torch.float32): 10.781344175338745s
RMSprop (torch.complex64): 15.136352777481079s
Rprop (torch.float32): 42.46482181549072s
Rprop (torch.complex64): 48.28277635574341s
```

Seems that it doesn't help the complex case much (but that's not the majority case). torch.float32 results are generally positive; when a case does not show drastic improvement or regresses, it is due to inductor variance (verified by manually inspecting the logs).

</details>

### Benchmark Script
```python
import torch
import time
from torch.optim import Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop

OPTIMS = [Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop]
DTYPES = [torch.float, torch.cfloat]

NUM_PARAMS = 100
kwargs = { "lr": 0.01, "foreach": True }
summary = []

for optim_cls in OPTIMS:
    for dtype in DTYPES:
        torch._dynamo.reset()
        # torch._inductor.metrics.reset()
        input = torch.ones([10, 10], dtype=dtype, device="cuda:0")
        model = torch.nn.Sequential(
            *[torch.nn.Linear(10, 10, dtype=dtype, device="cuda:0") for _ in range(NUM_PARAMS)]
        )

        model(input).sum().abs().backward()
        opt_compiled = optim_cls(model.parameters(), **kwargs)
        compiled_step = torch.compile(opt_compiled.step)

        with torch.set_grad_enabled(False):
            start_time = time.time()
            compiled_step()
            summary.append(f"{optim_cls.__name__} ({dtype}): {time.time() - start_time}s")

        print(optim_cls, kwargs, dtype, torch._dynamo.utils.compile_times())

for s in summary:
    print(s)
```

CC: @janeyx99 @mlazos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110613
Approved by: https://github.com/janeyx99
2023-10-05 23:10:52 +00:00
7b6042111f [quant][pt2e] Refactor conv related annotation for XNNPACKQuantizer (#110308)
Summary:
Since we changed the IR that we are working with to pre-autograd aten IR, it is now easier
to use plain pattern matching instead of relying on source_matcher_utils, so this
PR refactors the annotation for conv to use aten ops directly.

Also fixed reentrant test after this change.

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110308
Approved by: https://github.com/kimishpatel
2023-10-05 22:36:18 +00:00
be02103786 [BE] Get rid of code duplication (#110619)
Replace `dispatch_to_CDouble`, `dispatch_to_CLong` and `dispatch_to_CComplexDouble` with `dispatch_to<T>` template

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at c3d9d01</samp>

> _Sing, O Muse, of the clever coder who devised_
> _A wondrous template function, `dispatch_to<T>`, that could_
> _Handle with ease the various scalar types that vexed_
> _The previous code, which was verbose and dull as wood._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110619
Approved by: https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #110618
2023-10-05 22:05:57 +00:00
82e353fffc [BE] Use nested namespaces in autograd/templates (#110618)
As PyTorch can now use C++17 language features
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110618
Approved by: https://github.com/soulitzer
2023-10-05 22:05:57 +00:00
cae537126f Set _diffThreshold on our TestCase (#110603)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110603
Approved by: https://github.com/albanD
2023-10-05 21:49:28 +00:00
668eb55488 [BE]: Enable some basic pytest style rules (#110362)
Adds some basic flake8-pytest-style rules from ruff with their autofixes. I just picked a couple of uncontroversial changes about having a consistent pytest style that we were already following. We should consider enabling some more in the future, but this is a good start. I also upgraded ruff to the latest version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110362
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/kit1980
2023-10-05 21:40:43 +00:00
c95cf4b4c9 [dtensor] add grad placements kwarg to to_local API (#110629)
When we convert to a local tensor, DTensor can no longer track autograd or the
gradient layout of the local tensor. If the user does something unexpected, there
needs to be a way for the user to hint at the gradient layout of the local
tensor; a sketch follows.
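
A minimal sketch under assumptions (two or more ranks launched via torchrun; the mesh, sizes, and placements are illustrative):

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group("nccl")
mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))
dt = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])

# hint that the gradient flowing back into the local tensor is sharded on dim 0
local = dt.to_local(grad_placements=[Shard(0)])
```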
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629
Approved by: https://github.com/zdevito
2023-10-05 21:34:01 +00:00
ada65508d2 Add option to flop counter formula registration to get raw values (#110591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110591
Approved by: https://github.com/awgu
ghstack dependencies: #110501, #110504
2023-10-05 21:14:41 +00:00
9e72c9cccd [torch] easy missing move in aoti_runtime/model.h (#110469)
Just an extra shared_ptr copy, nothing fancy.

Differential Revision: [D49792510](https://our.internmc.facebook.com/intern/diff/D49792510/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110469
Approved by: https://github.com/Skylion007
2023-10-05 20:56:06 +00:00
71beca4899 [dynamo, logging] Report name of defining class along side function name in Dynamo logs (#110190)
Implement https://github.com/pytorch/pytorch/issues/109236

Sample code:
```python
import torch

class AAA:
    class DUMMY:
        class DUMMY2:
            pass
    def dummy(self):
        def dummy2():
            pass
    class BBB:
        @staticmethod
        def CCC():
            class DDD:
                if True:
                    @staticmethod
                    def EEE():
                        x = [torch.ones(3, 3) for _ in range(5)]
                        return x
            return DDD

def fn():
    return AAA.BBB.CCC().EEE()

opt_fn = torch.compile(fn, backend="eager")

opt_fn()
```

Logs:
```bash
$TORCH_LOGS="trace_source" python playground2.py
[2023-09-27 17:38:35,641] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:21 in fn (fn)
[2023-09-27 17:38:35,641] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]     def fn():
[2023-09-27 17:38:35,642] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:22 in fn (fn)
[2023-09-27 17:38:35,642] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]         return AAA.BBB.CCC().EEE()
[2023-09-27 17:38:35,661] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:11 in CCC (AAA.BBB) (inline depth: 1)
[2023-09-27 17:38:35,661] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]             @staticmethod
[2023-09-27 17:38:35,661] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:13 in CCC (AAA.BBB.CCC.DDD) (inline depth: 1)
[2023-09-27 17:38:35,661] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]                 class DDD:
[2023-09-27 17:38:35,723] [1/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:17 in <listcomp> (AAA.BBB.CCC.DDD.EEE)
[2023-09-27 17:38:35,723] [1/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]                             x = [torch.ones(3, 3) for _ in range(5)]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110190
Approved by: https://github.com/ezyang, https://github.com/mlazos
2023-10-05 20:41:38 +00:00
c99de9f37c fix(optim): adagrad sparse multitensor incorrect early exit (#110454)
Fixes https://github.com/pytorch/pytorch/issues/110444#issuecomment-1745181530

This PR: passes.

Main:
```
test/optim/test_optim.py::TestOptim::test_adagrad_sparse FAILED [0.0058s]

==================================================================================================================================== FAILURES =====================================================================================================================================
__________________________________________________________________________________________________________________________ TestOptim.test_adagrad_sparse __________________________________________________________________________________________________________________________
Traceback (most recent call last):
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/test/optim/test_optim.py", line 1448, in test_adagrad_sparse
    self._test_rosenbrock_sparse(
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/test/optim/test_optim.py", line 128, in _test_rosenbrock_sparse
    self.assertEqual(params, params_c, atol=1e-6, rtol=1e-6)
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/torch/testing/_internal/common_utils.py", line 3309, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 0.09999999999993325 at index (1,) (up to 1e-06 allowed)
Greatest relative difference: 0.06249999999996089 at index (1,) (up to 1e-06 allowed)

```

CC: @janeyx99
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110454
Approved by: https://github.com/janeyx99
2023-10-05 20:37:57 +00:00
ecdd1bcf03 Back out "[Inductor] Break the loop fusion when node2 depends on node1 mutations (#109172)" (#110622)
Summary:
Original commit changeset: 03980fb054d5

Original Phabricator Diff: D49519512

Bisecting shows that this diff is the cause of S369683. Since this affects Ads production, need to back out this diff immediately.

Test Plan: See S369683

Reviewed By: ezyang

Differential Revision: D49958638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110622
Approved by: https://github.com/yanboliang
2023-10-05 20:09:09 +00:00
88616349d7 [state_dict][1/N] Implement the basic functions of distributed.checkpoint._state_dict (#105902)
This PR implements the basic functions of distributed.checkpoint._state_dict. This PR currently contains the flattening of optimizer state_dict which makes the PR too large. A later version may split it into 2 for a better code review.

Differential Revision: [D47647719](https://our.internmc.facebook.com/intern/diff/D47647719/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D47647719/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105902
Approved by: https://github.com/wz337
2023-10-05 20:04:15 +00:00
298f01d9a2 [aotinductor] Avoid generating redundant kernel loading code (#110510)
Summary: 1) Stop forcing triton.unique_kernel_names to True for AOTInductor, because the unique kernel name can be read from metadata; 2) Only generate load_kernel once for each kernel since we don't have control flow in our generated code.  This solves https://github.com/pytorch/pytorch/issues/105553.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110510
Approved by: https://github.com/chenyang78, https://github.com/jansel
2023-10-05 19:59:38 +00:00
f1b94461aa [AOTInductor] ProxyExecutor support Dynamic Shape (#110526)
Summary:
Extend ProxyExecutor to support dynamic shape.

Example of ProxyExecutor invocation with symints.
```
    int64_t* arg0_1_size;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg0_1, &arg0_1_size));
    auto s0 = arg0_1_size[0];
    auto s1 = arg0_1_size[1];
    int64_t* arg1_1_size;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg1_1, &arg1_1_size));
    auto s2 = arg1_1_size[0];
    auto s3 = arg1_1_size[1];
    ...
    aoti_torch_proxy_executor_call_function(proxy_executor, 0, 15, std::vector<int64_t>{42, 16, 17, s0 + s1, s0 + s1, s2*s3, 45, 67, 16, 17, s2*s3, s2*s3, s0 + s1, 89, 910}.data(), 7, std::vector<AtenTensorHandle>{arg0_1, arg0_1, arg1_1, buf2, arg0_1, arg1_1, buf4}.data());
```

Example of serialized SymInt(s) arguments:
```
          {
            "name": "symint",
            "arg": {
              "asSymInt": {
                "asName": "s0 + s1"
              }
            }
          },
          {
            "name": "symints",
            "arg": {
              "asSymInts": [
                {
                  "asName": "s0 + s1"
                },
                {
                  "asName": "s2*s3"
                }
              ]
            }
          },
          ...
          {
            "name": "o_symint",
            "arg": {
              "asSymInt": {
                "asName": "s2*s3"
              }
            }
          },
          {
            "name": "o_symints",
            "arg": {
              "asSymInts": [
                {
                  "asName": "s2*s3"
                },
                {
                  "asName": "s0 + s1"
                }
              ]
            }
          },
```

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Differential Revision: D49887555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110526
Approved by: https://github.com/chenyang78
2023-10-05 19:05:20 +00:00
a0cea517e7 Add 9.0a to cpp_extension supported compute archs (#110587)
There's an extended compute capability 9.0a for Hopper that was introduced in CUDA 12.0: https://docs.nvidia.com/cuda/archive/12.0.0/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

E.g. Cutlass leverages it: 5f13dcad78/python/cutlass/emit/pytorch.py (L684)

This adds it to the list of permitted architectures to use in `cpp_extension` directly.
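
A hypothetical way to exercise this (the extension name and source file below are made up; `TORCH_CUDA_ARCH_LIST` is how the arch list typically reaches `cpp_extension`):

```python
import os
from torch.utils.cpp_extension import load

os.environ["TORCH_CUDA_ARCH_LIST"] = "9.0a"  # previously rejected, now permitted
ext = load(name="my_hopper_ext", sources=["my_hopper_ext.cu"])  # hypothetical files
```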
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110587
Approved by: https://github.com/ezyang
2023-10-05 17:41:06 +00:00
c89d35adfe Bump pillow from 9.5.0 to 10.0.1 in /.ci/docker (#110494)
Bumps [pillow](https://github.com/python-pillow/Pillow) from 9.5.0 to 10.0.1.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/9.5.0...10.0.1)

---
updated-dependencies:
- dependency-name: pillow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-05 10:37:26 -07:00
efdf155383 Add requirement for input to AllGatherIntoTensor to be contiguous (#109561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109561
Approved by: https://github.com/Chillee
2023-10-05 17:04:48 +00:00
f21c322e20 Fix typo in BatchLinearAlgebraLibBlas.cpp (#110608)
accomodate -> accommodate

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110608
Approved by: https://github.com/malfet
2023-10-05 16:48:53 +00:00
d6e5898e8d Quieter logs in CI (#110033)
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest.  Zip the log into an artifact.  The line listing all the test names is really long, but if you view the source of the raw logs, it will not wrap, so it will only be one line.  The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [  9%]`
* for failures/reruns, print logs.  Do not zip.

Also
* change log artifact name

Examples of various logs:
a074db0f7f failures
1b439e24c4 failures

possibly controversial haha
should i include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
2023-10-05 16:40:37 +00:00
3597325bc7 pin_memory support for NT (#110404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110404
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
ghstack dependencies: #110292
2023-10-05 16:33:22 +00:00
cc1de49340 [HigherOrderOp] fallthrough some keys by default. (#110478)
Fixes #109253

Test Plan:
Added a new test that shows default fallthrough keys can be overridden.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110478
Approved by: https://github.com/ezyang
2023-10-05 16:25:42 +00:00
26f634eefb Enable aarch64 for fixing undefined symbol error. (#110542)
Summary: ARM can be safely supported

Reviewed By: andrewjcg, aaronenyeshi

Differential Revision: D49921679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110542
Approved by: https://github.com/aaronenyeshi
2023-10-05 16:16:06 +00:00
a94b6f39d1 [ROCm] conditionally enable hipsparse const descriptors for version >= 2.4.0 (#110317)
This is in preparation for upcoming backwards-incompatible hipsparse changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110317
Approved by: https://github.com/malfet
2023-10-05 16:07:51 +00:00
f767a6c57a Made pattern-matcher diagnostics lazily reported + added TORCH_COMPILE_CPROFILE (#110504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110504
Approved by: https://github.com/mlazos, https://github.com/eellison
ghstack dependencies: #110501
2023-10-05 15:47:30 +00:00
1e4c0641ce Revert "Made pattern-matcher diagnostics lazily reported + added TORCH_COMPILE_CPROFILE (#110504)"
This reverts commit 9648df1a6af8509ba2f5455a8465e0c67d0dd0c2.

Reverted https://github.com/pytorch/pytorch/pull/110504 on behalf of https://github.com/PaliC due to temporarily will revert as it's causing problems with difftrain import ([comment](https://github.com/pytorch/pytorch/pull/110504#issuecomment-1749132253))
2023-10-05 15:28:23 +00:00
1a729618ef [FSDP][optim_state_dict] Make the new optimizer allgather fusion work with fine-tuning models (#110540)
With use_orig_params=True, it is possible that some parameters within the same FlatParameter are in the optimizer while other parameters are frozen. This PR makes the allgather fusion logic support that case.

Differential Revision: [D49922028](https://our.internmc.facebook.com/intern/diff/D49922028/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110540
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2023-10-05 15:17:10 +00:00
f17fe89e14 Multiprocessing support for NT (#110292)
Fixes #110161

Allows NTs to be used in DataLoaders with `num_workers > 1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110292
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-10-05 15:04:48 +00:00
7c72238e4b Back out "Enable pickling model prepared with QAT qconfig" (#110392)
Summary:
D49187352 caused our model conversion and loading of the QAT checkpoint to get stuck with a thrift timeout.

We are actively checking in the final code and model for the static-quant HTP prod model, and encountered this breakage at head on Thursday.

A thrift timeout is not a hard failure signal, and because of that, it's hard to bisect and find the culprit. It is also hard to set up a unit test, because the job simply times out. Better tests are needed to guard downstream model conversion against upstream changes.

Our suspicion of why this diff broke us is that we create a lot of modules with QAT (in a recursive manner), but our model is not a QAT-traceable module (it is a graph with many QAT modules and floating-point modules). With functools.partial as in the original diff, we end up caching modules in memory, causing the machine's memory to be taken up completely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110392
Approved by: https://github.com/junesg, https://github.com/jerryzh168
2023-10-05 14:41:00 +00:00
cf1b494afd [AOTInductor] Store loaded kernels in the model (#110554)
Defining kernels as static vars is problematic for subsequent model loading on non-default CUDA devices.

If those kernels were loaded in the context of device #0, they are no longer nullptr, and therefore the kernels won't work on devices other than device #0.

This change makes loaded kernels remembered at the model level in AOT mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-10-05 10:17:05 +00:00
c36b31d530 torch::nn::AdaptiveLogSoftmaxWithLoss: check length of cutoffs (#106777)
Fixes #106698

Also added a check in the Python API, because the current error message
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sehoon/pytorch-latest/torch/nn/modules/adaptive.py", line 128, in __init__
    or (min(cutoffs) <= 0) \
ValueError: min() arg is an empty sequence
```
is not very comprehensible.
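
A minimal repro sketch of the case the new check guards against (argument values are illustrative):

```python
import torch.nn as nn

# previously this crashed with "ValueError: min() arg is an empty sequence";
# with the added length check it now raises a clearer ValueError up front
nn.AdaptiveLogSoftmaxWithLoss(in_features=16, n_classes=10, cutoffs=[])
```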

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106777
Approved by: https://github.com/albanD
2023-10-05 05:35:47 +00:00
00b9afa429 [vision hash update] update the pinned vision hash (#110571)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110571
Approved by: https://github.com/pytorchbot
2023-10-05 05:14:04 +00:00
416eca9736 export db links for user errors (#110555)
Ideally all `_dynamo.exc.UserError`s should have "case names", i.e., link to examples in `exportdb`.

This PR adds case names to several instances of `_dynamo.exc.UserError`. In particular, looking at coverage based on `UserErrorType`:
* `DYNAMIC_CONTROL_FLOW`, `ANTI_PATTERN`, and `STANDARD_LIBRARY` are fully covered.
* `CONSTRAINT_VIOLATION` and `DYNAMIC_DIM` have no coverage. We don't seem to have any dedicated examples of specifying dynamic shapes in `exportdb` (although they are used in some other examples without explanation, to avoid some specialization that would make such examples moot).
* `INVALID_INPUT` is only partly covered. Frankly this is tedious to cover via examples.

Differential Revision: [D49928518](https://our.internmc.facebook.com/intern/diff/D49928518/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110555
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2023-10-05 05:03:04 +00:00
21019620ee Revert "[Dynamo] SizeVariable can be indexed by symint (#110349)"
This reverts commit 510ec7e3c539dfed49df587d09e8a0a87e187201.

Reverted https://github.com/pytorch/pytorch/pull/110349 on behalf of https://github.com/PaliC due to breaking internal tests (check diff) ([comment](https://github.com/pytorch/pytorch/pull/110349#issuecomment-1748021641))
2023-10-05 04:42:33 +00:00
62cad5b5b0 [quant][pt2] Support cudnn_batch_norm in QAT fusion (#109908)
Summary: Today, we get different batch norm ops depending on
the device the model is placed on at export time. Exporting
`model.cpu()` gives `_native_batch_norm_legit`, while exporting
`model.cuda()` gives `cudnn_batch_norm`. QAT fusion currently
only supports the former and silently ignores the latter. This
commit fixes this by additionally matching on the latter op
during QAT fusion.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_fusion
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_relu_fusion

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar

Differential Revision: [D49615145](https://our.internmc.facebook.com/intern/diff/D49615145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109908
Approved by: https://github.com/jerryzh168
2023-10-05 04:08:44 +00:00
4b1e138162 [dynamo] [easy]Remove InstructionTranslator from within Set (#110521)
I believe this was a left over from the before times. See if CI agrees.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110521
Approved by: https://github.com/ezyang
2023-10-05 04:01:18 +00:00
a93337ed55 [export] Add ir spec (#110394)
Summary: Copied IR spec over from Executorch

Test Plan: _docs_

Differential Revision: D49829187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110394
Approved by: https://github.com/ydwu4, https://github.com/gmagogsfm
2023-10-05 03:06:30 +00:00
a8653f35de One more small Perf Tweak to fill_ (#110294)
# Summary
Perf win by checking which device tensors are on

## Before this PR:
``` Shell
CPU | CPU: 1.3328152848407626
GPU | GPU: 6.614773320034146
CPU | GPU: 29.027153505012393
GPU | CPU: 17.22372299991548
```
## After this PR
``` Shell
CPU | CPU: 1.4241038949694484
GPU | GPU: 7.060713530518115
CPU | GPU: 15.149936103262007
GPU | CPU: 5.774620908778161
```

#### Repro Script
``` Python
    a = torch.tensor([0.2, 0.5], device="cpu")
    amax = torch.tensor(0.5, device="cpu")
    print(f"CPU | CPU: {benchmark_torch_function_in_microseconds(torch.fill_, a, amax)}")

    a = torch.tensor([0.2, 0.5], device="cuda")
    amax = torch.tensor(0.5, device="cuda")
    print(f"GPU | GPU: {benchmark_torch_function_in_microseconds(torch.fill_, a, amax)}")

    a = torch.tensor([0.2, 0.5], device="cpu")
    amax = torch.tensor(0.5, device="cuda")
    print(f"CPU | GPU: {benchmark_torch_function_in_microseconds(torch.fill_, a, amax)}")

    a = torch.tensor([0.2, 0.5], device="cuda")
    amax = torch.tensor(0.5, device="cpu")
    print(f"GPU | CPU: {benchmark_torch_function_in_microseconds(torch.fill_, a, amax)}")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110294
Approved by: https://github.com/mikaylagawarecki
2023-10-05 02:42:57 +00:00
434a996c42 Fix typo under torch/_inductor directory (#110530)
This PR fixes typos in comments and messages in files under the `torch/_dynamo` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530
Approved by: https://github.com/kit1980
2023-10-05 02:17:20 +00:00
9648df1a6a Made pattern-matcher diagnostics lazily reported + added TORCH_COMPILE_CPROFILE (#110504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110504
Approved by: https://github.com/mlazos, https://github.com/eellison
ghstack dependencies: #110501
2023-10-05 01:34:57 +00:00
e686341f64 Consider that ops can be fused into cat in the min-cut partitioner (#110501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110501
Approved by: https://github.com/eellison
2023-10-05 01:34:57 +00:00
d24e7be243 Include onnx and onnxscript information in collect_env.py (#110560)
`onnx` and `onnxscript` are used in torch.onnx.dynamo_export since 2.0. It would be helpful to collect version information in user issue reports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110560
Approved by: https://github.com/albanD
2023-10-05 01:29:04 +00:00
653f966df0 Fix type promotion of float8_e5m2 and float8_e4m3fn (#110279)
There is an issue with float8 type promotion, because _promoteTypesLookup doesn't contain records for a few types between bfloat16 and float8.
I have simply moved the float8 types to just after bfloat16; however, I'm not sure whether this breaks serialization.

Please decide whether it can stay like this, or whether I should instead insert the missing records filled with "ud" into _promoteTypesLookup rather than moving the types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110279
Approved by: https://github.com/albanD
2023-10-05 01:28:48 +00:00
c121f957c2 [aotinductor] Enable test_non_default_cuda_device on CI (#110509)
Summary: test_non_default_cuda_device needs to run on a multi-gpu CI instance

Differential Revision: [D49937115](https://our.internmc.facebook.com/intern/diff/D49937115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110509
Approved by: https://github.com/angelayi, https://github.com/khabinov, https://github.com/chenyang78
2023-10-05 01:25:50 +00:00
9f40ffeec6 [optim] disable large_tensor tests for ROCm (#110559)
Closes #105825 #105820 #105754 by replacing them with an in-code skip.

Fixes #105825, fixes #105820, fixes #105754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110559
Approved by: https://github.com/albanD
2023-10-05 01:21:21 +00:00
6a974bec5d Change flash attention outputs to be SymInt instead of int (#110533)
Fixes https://github.com/pytorch/pytorch/issues/110322

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110533
Approved by: https://github.com/albanD
2023-10-05 01:00:07 +00:00
f1d81134ef Print output type if assert fires (#110534)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110534
Approved by: https://github.com/albanD
2023-10-05 00:59:17 +00:00
f3aba45049 [ONNX] Create onnxscript-torchlib specific xfails/skips for fx tests (#110536)
Creates xfail_onnxscript/skip_onnxscript so that it is clear torchlib needs to support it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110536
Approved by: https://github.com/BowenBao
2023-10-05 00:39:05 +00:00
95c59b30b8 Update fully_sharded_data_parallel to fix typing (#110545)
Fixes typing so that the linter does not complain when using CustomPolicy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110545
Approved by: https://github.com/awgu, https://github.com/Skylion007
2023-10-05 00:00:10 +00:00
0daa7d4815 [test][docs] Fix doctest warnings for syntax errors (#110517)
Fixes some syntax errors in doctests found in CI tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110517
Approved by: https://github.com/albanD
2023-10-05 00:00:06 +00:00
053367b1ed fix: flake8-bugbear code B024 (#107265)
See #106571 item B024

This fix concerns the addition of `abstractmethod` to methods declared inside abstract classes.
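
A minimal sketch of the pattern the rule targets (class and method names are illustrative):

```python
from abc import ABC, abstractmethod

class Worker(ABC):       # B024 flags an ABC that defines no abstract methods
    @abstractmethod      # declaring the method abstract resolves the warning
    def run(self) -> None:
        ...
```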

Should I also include PEP8-compliant reformatting of the files I had to modify?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107265
Approved by: https://github.com/kit1980
2023-10-04 23:52:52 +00:00
449271f3f1 [pytree] Extract reusable generic tests for pytree (#110395)
Part of #109684

- #109684

Changes:

- Add new functions `tree_structure`, `tree_leaves`, `tree_map_` and `tree_map_only_` to Python pytree (see the sketch after this list).
- Extract reusable tests for pytree to `TestGenericPytree`.
- Change `treespec_dumps` and `treespec_loads` in C++ pytree to call Python pytree and use JSON string as serialization type.
- Rename `torch.utils.pytree` -> `torch.utils._cxx_pytree`.
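
For the new Python pytree functions, a small illustrative sketch (the data values are made up):

```python
import torch.utils._pytree as pytree

data = {"a": [1, 2], "b": (3, {"c": 4})}
print(pytree.tree_leaves(data))      # [1, 2, 3, 4]
spec = pytree.tree_structure(data)   # the container shape, minus the leaves
doubled = pytree.tree_map(lambda x: 2 * x, data)
print(pytree.tree_unflatten([0, 0, 0, 0], spec))
```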

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110395
Approved by: https://github.com/zou3519
2023-10-04 23:40:50 +00:00
37afa0c349 fix(inductor): Increase coverage of Inductor ATen lowering (#110473)
Add sqrt to decomp testing path and fix missing `minimum`, `clamp_min`, `clamp_max` lowerings and/or registrations.

Follow up to: https://github.com/pytorch/pytorch/pull/110468#issuecomment-1745718602 (requires upstream to merge to avoid merge conflict)

CC: @janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110473
Approved by: https://github.com/janeyx99
2023-10-04 23:40:46 +00:00
2e31fae5c5 Cleanup the code in the dynamo userbenchmark (#110519)
Summary:
Skip importing the modules that are only available in the pytorch source code, not in the pytorch nightly release.

Make the dynamo benchmark work in both OSS and internal builds.

X-link: https://github.com/pytorch/benchmark/pull/1960

Test Plan:
```
$ python run_benchmark.py dynamo --only alexnet --training --performance --inductor
loading model: 0it [00:05, ?it/s]
cuda train alexnet
running benchmark: 100%|█████████████████| 30/30 [00:00<00:00, 41.46it/s]
1.129x
```

```
$ buck2 run mode/opt //pytorch/benchmark:run_benchmark -- dynamo --only alexnet --training --inductor --performance --output-directory $HOME
loading model: 0it [00:16, ?it/s]
running benchmark: 100%|█████████████████| 30/30 [00:00<00:00, 37.94it/s]
cuda train alexnet
1.120x
```

Differential Revision: D49912006

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110519
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-04 23:26:30 +00:00
0949d97c16 fix batch_isend_irecv example incorrect usage (#110408)
Mismatched dtypes silently lead to wrong outputs in nccl.

```
1:recv_tensor=tensor([0., 0.], device='cuda:1')
0:recv_tensor=tensor([2.8026e-45, 0.0000e+00], device='cuda:0')
```
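
A corrected sketch under assumptions (two ranks launched via torchrun; the key point is that the send and recv dtypes must match):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
peer = (rank + 1) % 2

send = torch.full((2,), float(rank), dtype=torch.float32, device=f"cuda:{rank}")
recv = torch.zeros(2, dtype=torch.float32, device=f"cuda:{rank}")  # same dtype as sender

ops = [dist.P2POp(dist.isend, send, peer), dist.P2POp(dist.irecv, recv, peer)]
for req in dist.batch_isend_irecv(ops):
    req.wait()
print(f"{rank}:recv_tensor={recv}")
```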

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110408
Approved by: https://github.com/awgu, https://github.com/Neilblaze
2023-10-04 22:57:03 +00:00
8672d64fed Use is_symbolic instead of testing isinstance in some place (#110372)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110372
Approved by: https://github.com/ezyang
ghstack dependencies: #110044, #110369, #110370, #110371
2023-10-04 22:56:42 +00:00
e1cfcdfa06 Symintify guards.cpp (#110371)
Separating this out so we can check perf more easily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110371
Approved by: https://github.com/ezyang
ghstack dependencies: #110044, #110369, #110370
2023-10-04 22:56:42 +00:00
a7145cb3a4 Add symbolic singleton int (#110370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110370
Approved by: https://github.com/ezyang
ghstack dependencies: #110044, #110369
2023-10-04 22:56:26 +00:00
eb8feb8ff8 Support SingletonSymNode mul with coefficient (#110369)
We want to be able to use SingletonSymNode to represent strides for Jagged layout tensor. The following is for 3D, but easily generalizable to higher dimensions.

Constraints:
- [B, x, D] (where x represents the "variable-length dim") can be strided in two ways: [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations we need the strides of output tensors to be expressable in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides is [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because if I'm tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; I get it in several ways: (1) create a constant, (2) unbacked symint, (3) create a new input using a source, (4) output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.

Design:

Given the two constraints, the most straightforward way to implement this is actually to update SingletonSymNode to include some scalar factor, i.e., morally, SingletonSymNode represents `factor * [s_0, s_1, …, s_n]`. This enables us to symbolically compute strides from sizes.
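
A toy Python sketch of the factor idea (illustrative only; the real implementation lives in the C++ SymNode):

```python
class ToySingleton:
    """Morally represents factor * [s_0, s_1, ..., s_n]."""
    def __init__(self, factor: int = 1):
        self.factor = factor

    def __mul__(self, coeff: int) -> "ToySingleton":
        # Multiplying by a constant just scales the factor.
        return ToySingleton(self.factor * coeff)

x = ToySingleton()                   # the variable dim of a [B, x, D] jagged tensor
D_out = 7                            # output feature dim of [B, x, D] @ [D, D_out]
out_strides = [x * D_out, D_out, 1]  # dim-0 stride stays symbolic: a factor-7 singleton
```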
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
2023-10-04 22:56:15 +00:00
07331c65e6 Update singleton int to error when inequality relation is undefined (#110044)
Previously, something like j0 >= 3 would return False. In sympy, however, it is not possible to make it so that both j0 >= 3 and j0 < 3 return False: you only get to dispatch on Ge, and the remaining relations are derived, e.g. defining Ge(j0, 3) to be False would force Lt(j0, 3) to be True, which is not what we want.

In this PR, we make it so that both j0 >= 3 and j0 < 3 error, so that in a future PR, when we create the symbolic counterpart of this singleton, the behaviors can be the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110044
Approved by: https://github.com/ezyang
2023-10-04 22:55:53 +00:00
4e73eee93f Update custom Function preserve torch function when inputs returned as-is (#109825)
Fixes https://github.com/pytorch/pytorch/issues/109805
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109825
Approved by: https://github.com/albanD
2023-10-04 22:45:11 +00:00
21d77bcf80 added path to correct directory containing headers (#110063)
After `make install`, the headers are placed in the `include/openblas/` folder instead of the `include/` folder. Updated FindOpenBLAS.cmake to make that change clear.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110063
Approved by: https://github.com/Blackhex, https://github.com/kit1980
2023-10-04 21:56:36 +00:00
6fc09aee36 constant output errors (#110472)
When mapping between the original signature of a program and the graph-captured signature of its exported program, we emit errors when we see unexpected original or graph-captured inputs or outputs.

These errors can arise because of various reasons, e.g.:
1. some input or output has been lifted because of mutation
2. some type is not pytree-registered for flattening / unflattening
3. some type cannot be realized with graph operations

(This is probably not an exhaustive list.)

Previously we used to emit errors based on a vanilla id-based membership check between the two sides, mostly anticipating (1) as the reason for errors. But this does not do justice to errors because of (2) or (3).

This PR emits a different error when it finds (3) to be a probable cause. Specifically, it considers only Tensor and Sym* types to be "supported": no other type seems to be realizable by graph operations.

When (2) is a probable cause, we sometimes also hit the same error because we would expect the supported types to show through upon registration. But this kind of error may need some more work in the future.

Differential Revision: [D49885828](https://our.internmc.facebook.com/intern/diff/D49885828/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110472
Approved by: https://github.com/ydwu4
2023-10-04 21:56:20 +00:00
a9df9e5187 [inductor] get_system shouldn't error if CUDA is not installed (#110282)
Using inductor on a CPU-only device should be OK.

Differential Revision: [D49749912](https://our.internmc.facebook.com/intern/diff/D49749912/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110282
Approved by: https://github.com/desertfire
2023-10-04 21:28:55 +00:00
6db3853eeb Add doc for torch.cond (#108691)
We add a doc for torch.cond. This PR is a replacement of https://github.com/pytorch/pytorch/pull/107977.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108691
Approved by: https://github.com/zou3519
2023-10-04 21:24:14 +00:00
901aa85b58 fix TEST_ROCM definition to disable test_jit_cudnn_extension on rocm (#110385)
Define TEST_ROCM before modifying TEST_CUDA. Otherwise TEST_ROCM will always be false and will not disable test_jit_cudnn_extension for ROCm.
Fixes https://github.com/pytorch/pytorch/issues/107182
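
A sketch of the ordering issue, assuming flag definitions along these lines:

```python
import torch

TEST_CUDA = torch.cuda.is_available()

# TEST_ROCM must be derived *before* TEST_CUDA is narrowed further below;
# otherwise it ends up False even on ROCm machines.
TEST_ROCM = TEST_CUDA and torch.version.hip is not None

TEST_CUDA = TEST_CUDA and torch.version.cuda is not None  # later modification
```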

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110385
Approved by: https://github.com/jithunnair-amd, https://github.com/kit1980
2023-10-04 20:02:02 +00:00
46a5558cd5 [AOTInductor] Simplified AOTInductor interface and model class (#110411)
Summary:
This PR removed several APIs from the AOTInductor interface,
which are not used by the client.

It also simplified AOTInductor's model class by removing
the dim info for input/output tensors. We included dim info
before to return max output shapes, which was used by the client
to allocate memory for output tensors. Now, we allocate output
tensor memory from the .so so that we don't need to maintain
such information any more. The deletion of dim info from
the model class also simplified the codegen quite a bit.

Test Plan: ci

Reviewed By: khabinov

Differential Revision: D49835430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel
2023-10-04 18:35:24 +00:00
baa9af155e Add more tests for native triton kernels (#110486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110486
Approved by: https://github.com/jansel
ghstack dependencies: #110403
2023-10-04 18:26:45 +00:00
f04b1a0d27 [AOTInductor] Implement autograd eager backend for native triton kernels (#110403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110403
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2023-10-04 17:56:56 +00:00
c0c2e052a4 [aotinductor] Clean up fallback kernel cpp name generation (#110267)
Summary: Unify the way to generate cpp kernel name when the kernel is from OpOverload

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110267
Approved by: https://github.com/zou3519
ghstack dependencies: #110233
2023-10-04 17:18:02 +00:00
539367f0bc [aotindutor] Refactor optional value codegen (#110233)
Summary: Simplify the codegen for optional values by using c10::nullopt, and we don't need placeholders like OptionalScalar because we can simply use None for that purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110233
Approved by: https://github.com/jansel
2023-10-04 17:18:02 +00:00
247c574313 [jit] make register parameter/buffer thread safe in torch::jit::Module (#110488)
Summary: Registering a param/buffer writes into a vector inside Object, so we need to maintain thread safety when some threads read from the vector while others write to it.

Test Plan: CI

Differential Revision: D49882601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110488
Approved by: https://github.com/davidberard98
2023-10-04 17:04:23 +00:00
2c1b009e39 Fix typo under torch/_dynamo directory (#110459)
This PR fixes typo of comments in files under `torch/_dynamo` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110459
Approved by: https://github.com/colesbury
2023-10-04 16:05:05 +00:00
4c3d3b7176 [inductor] Lower small gemvs on CPU (#110456)
If the gemv fits in registers, like [1,16]*[16,16], MKL isn't going to
do much better than compiling a simple for-loop, and we end up paying
allocation overhead and ATen overhead.

A very small internal inference model drops from 7->5 us with this change.

Differential Revision: [D49875991](https://our.internmc.facebook.com/intern/diff/D49875991/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110456
Approved by: https://github.com/chenyang78, https://github.com/jgong5
2023-10-04 15:16:38 +00:00
30c4c6ff9b [PyTorch CCA] Refactor caching allocator config code (#110123)
Summary: This diff refactors the code by moving CUDAAllocatorConfig into the header file. This config refactoring is done so that we can use the same config code for CUDA pinned memory as well.

Test Plan: sandcastle

Differential Revision: D49653265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
2023-10-04 14:58:23 +00:00
156aefa89b Revert "[3/N] Add -Wdeprecated and related fixes (#109698)"
This reverts commit c31fcdaa4f79e83c82ec4f5ff3cf96e2cb99eecd.

Reverted https://github.com/pytorch/pytorch/pull/109698 on behalf of https://github.com/PaliC due to breaking quantization tests ( quantization/test_quantize_per_channel_sub_byte and  quantization/test_quantize_per_channel_float_qparams) internally ([comment](https://github.com/pytorch/pytorch/pull/109698#issuecomment-1746999806))
2023-10-04 14:33:47 +00:00
cyy
5220d0dfaf Increase header coverage of clang-tidy (#110443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110443
Approved by: https://github.com/Skylion007
2023-10-04 13:52:06 +00:00
0e55cc4986 [HigherOrderOp] Flatten outputs of wrap. (#109433)
Fix: #109247

This PR flattens `wrap` outputs by inlining the `pytree.tree_flatten` function after calling
the inner function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109433
Approved by: https://github.com/zou3519
ghstack dependencies: #110290
2023-10-04 13:43:55 +00:00
f68f49c462 Refactor expect tests on test_higher_order_ops.py. (#110290)
This PR inlines the expected strings onto the `assertExpectedInline` calls, so that, when a
change is needed, we may do it by using the `expecttest` machinery: setting the
environment variable `EXPECTTEST_ACCEPT=1`.
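
A minimal sketch of the workflow (the test body is illustrative):

```python
from torch.testing._internal.common_utils import TestCase, run_tests

class ExpectTests(TestCase):
    def test_repr(self):
        # Running the test with EXPECTTEST_ACCEPT=1 rewrites the inline
        # expected string below in place instead of failing on a mismatch.
        self.assertExpectedInline(str([1, 2, 3]), """[1, 2, 3]""")

if __name__ == "__main__":
    run_tests()
```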

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110290
Approved by: https://github.com/zou3519
2023-10-04 13:43:55 +00:00
9f0601df6d Fix a typo in cholesky_inverse documentation (#110364)
Very small PR to fix a typo in the [cholesky_inverse](https://pytorch.org/docs/stable/generated/torch.cholesky_inverse.html) doc.

According to the current doc, the function expects $A$, the symmetric positive-definite matrix, as input. But the examples given (and, more importantly, the code) use $u$, the Cholesky factor of this matrix (as in cholesky_solve).

The PR also provides a correct example of batched usage of this function.
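
A sketch of the corrected usage, passing the Cholesky factor rather than the matrix itself:

```python
import torch

A = torch.randn(3, 3)
A = A @ A.mT + 3 * torch.eye(3)    # make A symmetric positive-definite
u = torch.linalg.cholesky(A)       # lower-triangular factor, as for cholesky_solve

A_inv = torch.cholesky_inverse(u)  # expects the factor u, not A
torch.testing.assert_close(A_inv, torch.inverse(A))
```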

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110364
Approved by: https://github.com/lezcano
2023-10-04 12:30:11 +00:00
31d635803b [Dynamo] Fx proxy for builtin all with list iterators (#109972)
Fixes https://github.com/pytorch/pytorch/issues/109057.
Fixes https://github.com/pytorch/pytorch/issues/103620.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109972
Approved by: https://github.com/ezyang
2023-10-04 07:59:26 +00:00
2bf3ca1be7 [torchdynamo] preserve deterministic_algorithms_warn_only in convert_context (#110457)
Summary: Preserve deterministic_algorithms_warn_only in the dynamo context.

Test Plan: modified unit tests to test warn_only

Differential Revision: D49872622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110457
Approved by: https://github.com/jansel
2023-10-04 07:12:32 +00:00
dddf581da7 [dynamo] Add graph break on requires_grad_() (#110053)
Fixes #107861.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110053
Approved by: https://github.com/eellison
2023-10-04 06:22:16 +00:00
562c68e56f [nccl] denoise warning msg (#110433)
Summary: This is too noisy for anything set with TORCH_NCCL_USE_COMM_NONBLOCKING. Just warn once.

Test Plan: GH CI

Differential Revision: D49846339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110433
Approved by: https://github.com/awgu
2023-10-04 06:21:53 +00:00
a0e321d5ad [vision hash update] update the pinned vision hash (#110489)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110489
Approved by: https://github.com/pytorchbot
2023-10-04 06:16:41 +00:00
3fd938369f add foreach_abs meta registration and inductor decomp (#110468)
Fixes https://github.com/pytorch/pytorch/issues/110458

Somehow it is on the allowlist but not on the testing path.

CC @janeyx99
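
A minimal sketch exercising the new decomposition under compilation:

```python
import torch

@torch.compile
def f(tensors):
    # Hits the new meta registration / inductor decomposition for _foreach_abs.
    return torch._foreach_abs(tensors)

f([torch.randn(3), torch.randn(5)])
```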

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110468
Approved by: https://github.com/janeyx99
2023-10-04 06:09:37 +00:00
08c7dcda65 [pt2e][xnnpack_quantizer] quantize "mul" (#110428)
Adding "mul" to list of partitions that are supported by the quantizer. This shows up in EDSR, where we still want to quantize the mul op

Differential Revision: [D49850151](https://our.internmc.facebook.com/intern/diff/D49850151/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110428
Approved by: https://github.com/jerryzh168
ghstack dependencies: #110427
2023-10-04 05:11:53 +00:00
66202ed29c [pt2e][xnnpack_quantizer] add util function to convert scalars to attrs (#110427)
Jerry provided a notebook solution for converting scalars to attrs so that they may be properly quantized:

https://fburl.com/anp/kzz7tfn1

Adding this pass as a util function in xnnpack_quantizer_utils.py

Differential Revision: [D49850150](https://our.internmc.facebook.com/intern/diff/D49850150/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110427
Approved by: https://github.com/jerryzh168
2023-10-04 05:11:53 +00:00
64416a1fc7 [quant][docs] Fix formatting (#110460)
Summary: as titled.

Test Plan:
check generated docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110460
Approved by: https://github.com/andrewor14
2023-10-04 04:54:10 +00:00
005e8ddcb9 cache the hash construction on Guard (#110464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110464
Approved by: https://github.com/zou3519, https://github.com/voznesenskym
2023-10-04 04:49:18 +00:00
3fe3439242 Use LLVMSymbolizer directly for unwind inside fbcode (#108800)
Using LLVMSymbolizer directly avoids having to call fork which has caused timeouts in some circumstances.

Differential Revision: [D49070589](https://our.internmc.facebook.com/intern/diff/D49070589/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108800
Approved by: https://github.com/aaronenyeshi
2023-10-04 04:04:08 +00:00
510ec7e3c5 [Dynamo] SizeVariable can be indexed by symint (#110349)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110349
Approved by: https://github.com/williamwen42
2023-10-04 03:20:18 +00:00
50054b1a62 [AOTInductor] ProxyExecutor support ReinterpretView inputs (#110451)
Summary:
See wrapper.codegen_reinterpret_view(): it returns a temporary tensor handle, which has the following problem.
```
            # NB, the return handle here represents a temporary tensor, which will be automatically
            # released.
            # Here's a sample usage in the cpp wrapper code:
            # ```
            # aoti_torch_addmm_out(
            #     buf1,
            #     arg1_1,
            #     RAIIAtenTensorHandle(tmp_tensor_handle_0),
            #     buf0,
            #     1L,
            #     1L));
            # ```
            # RAIIAtenTensorHandle(tmp_tensor_handle_0) will be released after the call to addmm_out.
            # This could be problematic when it's used in a different pattern, for example:
            # ````
            # AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
            # aoti_torch_proxy_executor_call_function(..., tensor_args);
            # ````
            # RAIIAtenTensorHandle(tmp_tensor_handle_2) will be invalid when it's used in the latter
            # kernel call.
            return f"RAIIAtenTensorHandle({tmp_name})"
```

As a result, ProxyExecutor would generate the following code, which causes invalid memory access.

Before:

```
    // Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
    AtenTensorHandle tmp_tensor_handle_2;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
    ...
    AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
    int64_t int_args[] = {1};
    aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args, 3, tensor_args);
    buf3.reset();
```

With the fix in this diff, ProxyExecutor generates the following code.

After:

```
    // Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
    AtenTensorHandle tmp_tensor_handle_2;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
    ...
    aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, std::vector<int64_t>{1}.data(), 3, std::vector<AtenTensorHandle>{RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}.data());
    buf3.reset();
```

I am not exactly a big fan of such `std::vector{...}.data()` for creating a temp array, but I can't think of another fix.

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Reviewed By: desertfire

Differential Revision: D49758764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110451
Approved by: https://github.com/desertfire
2023-10-04 02:20:31 +00:00
dd95eaaf1a turn back on constant folding in fbcode (#108604)
Differential Revision: [D49020794](https://our.internmc.facebook.com/intern/diff/D49020794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108604
Approved by: https://github.com/davidberard98, https://github.com/mlazos
2023-10-04 02:13:03 +00:00
efb73fe8e4 Fix send()/recv() to adhere to timeout (#109611)
Summary: Point-to-point ops don't enqueue their work to the `workMetaList_`, which means that the NCCL watchdog does not watch over them; hence they do not respect the collective timeouts.

Test Plan:
While trying to add a test, I found we don't have tests which validate the NCCL watchdog. It looks like this is because we don't have a good way to detect when the NCCL watchdog has thrown an error (the exception is thrown in a side thread) in our testing framework / `MultiprocessTestCase`.

I manually tested this change with the script in https://github.com/pytorch/pytorch/issues/109401, but we need to look more closely at how to automate a test for the NCCL watchdog.

Differential Revision: D49418976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109611
Approved by: https://github.com/wconstab
2023-10-03 23:27:45 +00:00
a0bffe7ed7 [S366352] Print nccl version during initialization (#110305)
Summary: Print the NCCL version during initialization.

Differential Revision: D49603220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110305
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/rohan-varma
2023-10-03 23:09:48 +00:00
cyy
c31fcdaa4f [3/N] Add -Wdeprecated and related fixes (#109698)
This PR follows #108626. Hopefully we can enable the warning in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109698
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2023-10-03 22:50:53 +00:00
836ba6430a [AOTInductor] Initial functionality for Inf and NaN checker (#109526)
Summary:
Add initial functionality for Inf and NaN checker for AOTInductor.

Test Plan:
Included in commit. Skipped for CI as SIGABRT can't be captured by pytest.

Differential Revision: [D49379751](https://our.internmc.facebook.com/intern/diff/D49379751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109526
Approved by: https://github.com/chenyang78
2023-10-03 22:39:42 +00:00
06e88d2cfc [aotinductor] Remove output_spec from AOTInductorModelCache (#110462)
Summary: No need to store output_spec as the returned exported.call_spec already contains that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110462
Approved by: https://github.com/angelayi
2023-10-03 22:29:36 +00:00
98c8550158 Fix Triplet Margin Loss Opinfo (#110302)
Triplet Margin Loss takes in a Callable `distance_function` parameter, which is not supported as an argument on the fx graph. See the previous error:

> File "/scratch/eellison/work/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/scratch/eellison/work/pytorch/torch/_dynamo/variables/torch.py", line 723, in call_function
*proxy_args_kwargs(args, kwargs),
File "/scratch/eellison/work/pytorch/torch/_dynamo/utils.py", line 504, in proxy_args_kwargs
f"call_function args: {typestr(*args)} {typestr(*list(kwargs.values()))}"
File "/scratch/eellison/work/pytorch/torch/_dynamo/exc.py", line 143, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: call_function args: TensorVariable() TensorVariable() TensorVariable() ConstantVariable(float) NNModuleVariable()

This is fixable by just inlining into `triplet_margin_loss` and continuing to compile it. This required support for `has_torch_function_variadic`.
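
A sketch of the now-compilable pattern, using the functional variant that accepts a callable distance function (the lambda is illustrative):

```python
import torch
import torch.nn.functional as F

@torch.compile
def loss_fn(anchor, positive, negative):
    # The Callable is inlined by dynamo rather than passed as a graph argument.
    return F.triplet_margin_with_distance_loss(
        anchor, positive, negative,
        distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b),
    )

loss_fn(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16))
```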

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110302
Approved by: https://github.com/mlazos
2023-10-03 20:26:13 +00:00
a8a31bc165 [dynamo][BE] test_misc.py shouldn't change the default dtype globally (#110412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110412
Approved by: https://github.com/jansel, https://github.com/lezcano, https://github.com/Fidget-Spinner
ghstack dependencies: #110398
2023-10-03 19:25:37 +00:00
dc794ec32c [dynamo] Trace through builtin abs (#110398)
In Python, `abs(x)` does nothing but delegate to `x.__abs__()`, so we should do
the same in dynamo. This also adds `SymNode.__abs__` so we can trace through
indexing expressions involving `abs`.
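
A sketch of what now traces without a graph break:

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    return abs(x) + 1   # builtin abs delegates to x.__abs__()

f(torch.randn(4))
```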

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110398
Approved by: https://github.com/jansel, https://github.com/lezcano
2023-10-03 19:25:37 +00:00
a389181f2e [MPS] add support for aten::nextafter (#109685)
Fixes https://github.com/pytorch/pytorch/issues/77764#issuecomment-1722515591

Adds support for aten::nextafter to the MPS backend. Supports float and half types.

Notes:
- I've added nextafter to the output_grad_check XFAILLIST since neither this nor the CPU implementation has a grad function.
- Metal Shading Language 3.1 seems to have a native nextafter() function, so once that's available, this kernel can just call that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109685
Approved by: https://github.com/kulinseth
2023-10-03 19:20:22 +00:00
9ce2e02fd6 Revert "[ROCm] Remove PYTORCH_MIOPEN_SUGGEST_NHWC flag (#90725)" (#110319)
This reverts commit 66bfcd32fd7f41154f1fd520e14012d3f717db4d.

NHWC has a perf regression on MIOpen, so reverting until the performance issue is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110319
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/kit1980
2023-10-03 19:14:47 +00:00
b457e3f79a Reland attempt 2 of "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)" (#110079)
The first reland broke internal (failing diff: D49617462).

The major error looks like it's because there's an internal-only higher order op that needs a new functionalization rule. I'm going to land an internal diff for that and confirm tests pass before relanding this PR.

Also confirmed that the issue from https://github.com/pytorch/pytorch/issues/110121 is fixed, and added a test.

This reverts commit 1b90f07f5a9fcb9187fee94f769fc117490c1e39.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110079
Approved by: https://github.com/ezyang
2023-10-03 18:50:25 +00:00
b5c3a17c2c [fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-buffer-overflow-far-from-bounds (size 4) in c10::IValue::IValue() (#110441)
Summary: This diff fixes a heap underflow found by fuzzing in torch/csrc/jit/runtime/vararg_functions.cpp

Test Plan:
CI and
```
arc lionhead crash reproduce 1753074381791061
```
doesn't crash anymore.

Differential Revision: D49537535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110441
Approved by: https://github.com/Skylion007
2023-10-03 18:48:12 +00:00
da63c7f2c3 [AOTInductor] remove CUDA dependency for cpp backend (#110409)
Summary:
Previously, we linked against CUDA libs even for the pure cpp backend.
This caused issues for cases where the inference platform does not
have GPUs. This diff removes the CUDA dependency for the cpp backend.

Reviewed By: bertmaher, muchulee8, mikekgfb

Differential Revision: D49800712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409
Approved by: https://github.com/bertmaher, https://github.com/desertfire
2023-10-03 18:36:00 +00:00
df3ab70dde Revert "Added new test sample to interpolate op in OpInfo (#104181)"
This reverts commit 87f8bc65f8cbc3202d645cdfa80a206b564276ac.

Reverted https://github.com/pytorch/pytorch/pull/104181 on behalf of https://github.com/peterbell10 due to Causing OOM in slow-gradcheck ([comment](https://github.com/pytorch/pytorch/pull/104181#issuecomment-1745472323))
2023-10-03 18:07:02 +00:00
40be6b72e1 [ez] Type function in distributed_c10d (#110435)
This function returns a `torch.device`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110435
Approved by: https://github.com/awgu
2023-10-03 17:54:04 +00:00
5977d17953 Update common_methods_invocations.py (#110383)
Description:
- Fixed misleading test sample case

Context: the sample input is composed of an input tensor `(N, C, iH, iW)` and a grid tensor `(N, oH, oW, 2)`; however, the grid was defined as `(N, C, oW, 2)`.
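
For reference, a sketch with the corrected shapes:

```python
import torch
import torch.nn.functional as F

N, C, iH, iW, oH, oW = 2, 3, 8, 8, 4, 4
inp = torch.randn(N, C, iH, iW)
grid = torch.rand(N, oH, oW, 2) * 2 - 1   # (N, oH, oW, 2), not (N, C, oW, 2)
out = F.grid_sample(inp, grid, align_corners=False)
assert out.shape == (N, C, oH, oW)
```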

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110383
Approved by: https://github.com/peterbell10
2023-10-03 17:53:39 +00:00
aecfe5d168 [aoti] Remove pessimizing move (#110446)
"`std::move` of a temporary prevents copy elision" says the compiler,
and I am pretty sure it is right.  Since AtenTensorHandle* implicitly converts
to RAIIAtenTensorHandle, I simply called emplace_back; happy to put an explicit
ctor if that makes folks happier.

Differential Revision: [D49842542](https://our.internmc.facebook.com/intern/diff/D49842542/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110446
Approved by: https://github.com/desertfire, https://github.com/Skylion007
ghstack dependencies: #110445
2023-10-03 17:44:58 +00:00
174e46b853 [inductor][easy] Free functions in headers should be declared inline (#110445)
If multiple files include model.h, you end up with duplicate symbol errors.

Differential Revision: [D49842167](https://our.internmc.facebook.com/intern/diff/D49842167/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110445
Approved by: https://github.com/desertfire, https://github.com/Skylion007
2023-10-03 17:44:49 +00:00
cd0e7d133b Migrate MacOs wheel binary builds to ephemeral M1 runners (#110432)
Surprisingly, there is no speed difference between running the cross-compilation on `macos12-xl` (x86_64, 12-core machine) and `macos-13-xlarge` (M1, 6-core machine).

Most of the changes are on the https://github.com/pytorch/builder side:
- 50a6e91f97 skips installing mkl on M1 machines
- bbb29b0467 same for llvm-9
- 8bcc83dbb1 bumps minimal numpy version to 1.19 (as 1.17 is not available for m1)
- cc4f1f9055 skips building tests/distributed for M1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110432
Approved by: https://github.com/kit1980
2023-10-03 17:31:28 +00:00
7f0a659ccc Script to compare measured (trace) runtimes with estimated runtimes (#108037) (#109076)
Summary:

X-link: https://github.com/pytorch/benchmark/pull/1856

Reviewed By: xmfan, xuzhao9

Differential Revision: D48523883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109076
Approved by: https://github.com/xw285cornell
2023-10-03 17:05:35 +00:00
f2a1b93549 Back out "[quant] Support integer implementations for adaptive_avg_pool2d (#104226)" (#110316)
Summary:
Original commit changeset: acdb5b34e3aa

Original Phabricator Diff: D47321689

Test Plan: opinfo tests in CI

Differential Revision: D49789403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110316
Approved by: https://github.com/kimishpatel
2023-10-03 16:59:23 +00:00
9bc5e10899 [New][1/N] Dynamo skipfiles refactor (#110330)
This is the replacement for #109567. It preserves all existing semantics, focusing only on API (for developers) and code structure changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110330
Approved by: https://github.com/ezyang
2023-10-03 16:50:33 +00:00
aa3629ee3e Fix typo under docs directory (#110359)
This PR fixes typos in `.rst` files under the docs directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110359
Approved by: https://github.com/kit1980
2023-10-03 16:36:05 +00:00
4069d1de59 [distributed] Remove recordStream for callback that ends a profiler event (#109933)
**Background**: recordStream calls can result in memory spikes, so we don't want them to appear in FSDP (https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486). @awgu is working on fixing this, but it turns out the profiler was causing recordStream to get called when it is enabled.

Why profiler was causing recordStream to get called: NCCL calls add profiler events manually; they register a callback to be executed when the future for the collective is completed; this indicates the end of the CPU-side profiler event for the callback:

c2c7c4035f/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1822-L1824)

In order to guarantee safety, ivalue::Future::invokeCallback calls `recordStream` on the future's storage buffers; this marks the fact that other streams (e.g. the one that the callback runs on) may need to use the storage.

c2c7c4035f/aten/src/ATen/core/ivalue_inl.h (L1171-L1173)

**Change**: The end-profiler-event callback doesn't actually use the future, so we don't need to recordStream on it. This PR introduces an optional parameter `uses_future` for adding callbacks; a user can set this variable to "false" to unsafely skip the recordStream, if the user knows that the future will not be used in the lambda.

**Tests**: (a) unit tests; (b) added an assert in recordStream: c2c7c4035f/c10/cuda/CUDACachingAllocator.cpp (L3260) and verified that it doesn't get triggered when running basic distributed tests w/ profiler enabled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109933
Approved by: https://github.com/wconstab
2023-10-03 14:40:43 +00:00
ff96f6d04f [core IR][reland] Add split.Tensor and unbind decompositions to core ATen decomp table (#110323)
Summary:
This is a reland of [github PR #110102]( https://github.com/pytorch/pytorch/pull/110102).

The original PR had to be unlanded due to internal CI failures. This diff applies some small fixes to the failing tests to adjust to the new decompositions.

Note that `lift_fresh` will not be decomposed for now, since it was found that [constant propagation looks specifically for `lift_fresh`](13af952f94/torch/fx/experimental/proxy_tensor.py (L381-L386)). Therefore decomposing `lift_fresh` would interfere with constant propagation during export.

Test Plan: Github CI and internal CI

Differential Revision: D49761321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110323
Approved by: https://github.com/jansel
2023-10-03 14:35:04 +00:00
4cdc52a2d4 Bump urllib3 from 2.0.2 to 2.0.6 in /tools/build/bazel (#110421)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.0.2 to 2.0.6.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.0.2...2.0.6)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-03 07:13:28 -07:00
2cbfcc740f use torch.xpu.manual_seed_all in torch.seed (#110376)
# Motivation
Use manual_seed_all instead of manual_seed, because multi-device is supported in the XPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110376
Approved by: https://github.com/ezyang
2023-10-03 13:41:55 +00:00
428cbd7513 [ao] fixing multihead attention convert size (#110407)
Summary: After converting nn.MultiheadAttention, we weren't deleting the
old in_proj_weight and in_proj_bias despite not (really) using them.

Test Plan: python test/test_quantization.py -k
"test_custom_module_multi_head_attention"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110407
Approved by: https://github.com/jerryzh168
2023-10-03 08:49:12 +00:00
f76e5c846d Speed-up casts to FP8 (#110251)
Unlike half/bfloat16 casts, where the entire model is cast to half-precision floats, only parts of the network can be in float8, and therefore the performance of the casts is important.

Speed up casts by implementing non-dynamically-castable variants using the new refactored `gpu_kernel_nocast` template.

Measure performance using the following script:
```python
import torch

def run_cast_bench(size=(10000, 10000), src_dtype=torch.float16, dtype=torch.float8_e5m2):
    x=torch.rand(size, device="cuda", requires_grad=False, dtype=src_dtype)
    z=torch.empty(size, device="cuda", dtype=dtype, requires_grad=False)
    with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
        z.copy_(x)
    rc=prof.key_averages()
    print(f"Running bench for src_dtype={src_dtype} dst_dtype={dtype} cuda_time={rc[1].cuda_time}")

if __name__ == "__main__":
    for dtype in [torch.float8_e5m2, torch.float8_e4m3fn]:
        run_cast_bench(src_dtype=torch.half, dtype=dtype)
        run_cast_bench(src_dtype=torch.float, dtype=dtype)
        run_cast_bench(src_dtype=torch.bfloat16, dtype=dtype)
```

Below are before and after results:
|  Cast type | After | Before |
| ---------- | ------ | ----- |
| fp32->e5m2 | 228 us | 336 us|
| fp16->e5m2 | 150 us | 323 us|
| bf16->e5m2 | 150 us | 322 us|
| fp32->e4m3 | 227 us | 331 us|
| fp16->e4m3 | 148 us | 318 us|
| bf16->e4m3 | 149 us | 318 us|

Skip the optimizations on the ROCm platform.
TODO:
 - Investigate why `__nv_cvt` intrinsics defined in https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__FP8__MISC.html end up being slower

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110251
Approved by: https://github.com/drisspg
2023-10-03 08:16:47 +00:00
f4c0ef95bc [AMD] Fix broken build from nested transformer utils (#110245)
Summary: D49374910 breaks the internal AMD build because we didn't hipify the header file in nested/cuda. Maybe it's just easier to move it outside.

Reviewed By: nmacchioni

Differential Revision: D49743234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110245
Approved by: https://github.com/drisspg
2023-10-03 08:05:10 +00:00
d9fe1713c3 Enabled batch rule decompositions for upsample*.vec ops (#110333)
Follow-up PR to https://github.com/pytorch/pytorch/pull/110172
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110333
Approved by: https://github.com/zou3519
2023-10-03 06:58:18 +00:00
15219f53d1 [AOTInductor] Fix ProxyExecutor's handling on multiple outputs (#110374)
Summary: Fix ProxyExecutor after D49780781

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Differential Revision:
D49816044

Privacy Context Container: 368960445142440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110374
Approved by: https://github.com/chenyang78
2023-10-03 06:42:22 +00:00
03f28dbce3 [HigherOrderOp] Better testing strategy for wrap that checks guards and recompiles (#110343)
Fixes #109251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110343
Approved by: https://github.com/zou3519
2023-10-03 05:57:38 +00:00
ce50132748 [vision hash update] update the pinned vision hash (#110424)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110424
Approved by: https://github.com/pytorchbot
2023-10-03 05:20:27 +00:00
d15d7a6485 [DTensorTestbase] Add "cpu:gloo,cuda:nccl" backend to DTensorTestbase (#110397)
This PR makes backend a property of DTensorTestbase and adds "cpu:gloo,cuda:nccl" support so that we can use the `cpu:gloo,cuda:nccl` backend for checkpoint unit tests.

cc. @wanchaol, @fduwjj, @XilunWu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110397
Approved by: https://github.com/wanchaol
2023-10-03 04:54:02 +00:00
f7909cb947 Build and test iOS on GitHub M1 runners (#110406)
They are here https://github.blog/2023-10-02-introducing-the-new-apple-silicon-powered-m1-macos-larger-runner-for-github-actions

I have been able to run iOS simulator tests on my M1 laptop without issues.  Some numbers:

* iOS build takes ~1h with x86 runners
* The new M1 runners take ~20m https://github.com/pytorch/pytorch/actions/runs/6386171957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110406
Approved by: https://github.com/malfet, https://github.com/seemethere
2023-10-03 03:17:10 +00:00
3fe94e46c2 Skip test_retracibility under ASAN (#110414)
See https://github.com/pytorch/pytorch/issues/110416

Skipping this under ASAN to make CI green.
This probably needs to be moved to slow tests eventually.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110414
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-10-03 02:05:35 +00:00
3bd229b53c Add quantized tensor function to get scale and zero point (#110095)
Summary: See summary

Test Plan:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1
//xplat/caffe2/fb/custom_ops/vulkan_quantized:pt_vulkan_quantized_test_binAppleMac\#macosx-arm64
[       OK ] VulkanAPITest.convert_qconv2d_context (135 ms)
[ RUN      ] VulkanAPITest.linear_2d
[       OK ] VulkanAPITest.linear_2d (4 ms)
[----------] 2 tests from VulkanAPITest (139 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (139 ms total)
[  PASSED  ] 2 tests.
##############################################################
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource
//xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
[       OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32 (11 ms)
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (4 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_custom
[       OK ] VulkanAPITest.linear_custom (0 ms)
[----------] 76 tests from VulkanAPITest (1811 ms total)
[----------] Global test environment tear-down
[==========] 76 tests from 1 test suite ran. (1811 ms total)
[  PASSED  ] 76 tests.
YOU HAVE 8 DISABLED TESTS
##############################################################
buck2 run --target-platforms ovr_configplatform/macos:arm64-fbsourcexplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
[----------] Global test environment tear-down
[==========] 346 tests from 1 test suite ran. (5648 ms total)
[  PASSED  ] 345 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS

Differential Revision: D49609986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110095
Approved by: https://github.com/yipjustin
2023-10-03 01:48:31 +00:00
f69e9c8c91 run_tests.py minor logging changes (#110188)
Minor logging changes that just kind of annoyed me:
* prevent constant printing of `No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'` by moving the import within the function (idk if this is ok)
* prevent constant printing of `Ignoring disabled issues:  ['']` (no idea why it was not gated behind a function or main)
* change all prints in run_tests.py to go through stderr so there's no weird interleaving (although if everything goes through stderr, might as well just print everything through stdout...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110188
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
2023-10-03 01:22:47 +00:00
e55d6f923c minor tf32 fixes for unit tests on H100 and L40 (#110201)
fixes the following tests which were failing in the NVIDIA internal CI on H100 and L40:

test/test_nn.py:
* test_TransformerEncoderLayer_gelu_activation_cuda_tf32
* test_Transformer_multilayer_coder_cuda_tf32

test/inductor/test_torchinductor.py:
* test_batch_norm_2d_2_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110201
Approved by: https://github.com/mikaylagawarecki, https://github.com/jansel, https://github.com/Skylion007
2023-10-03 00:10:37 +00:00
3812f2e40c Preserve layout on like constructors (#110242)
Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the fx graph that it is passed. That graph is generated by running aot_autograd with decompositions. If the decompositions give incorrect strides, so will inductor.

This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix (a) arbitrary permutations or (b) striding of non-dense outputs. Both of these are lower priority compared to preserving channels-last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout for arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in so that this composes with the `replace_random` pass, and because the two pointwise ops will get fused anyway.
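
A sketch of the behavior this change preserves under compilation (shapes are illustrative):

```python
import torch

x = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)

@torch.compile
def f(t):
    return torch.rand_like(t)   # the _like op should keep the channels-last layout

assert f(x).is_contiguous(memory_format=torch.channels_last)
```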

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242
Approved by: https://github.com/int3
2023-10-02 23:53:55 +00:00
cyy
d58a91b2a6 [4/N] Move remaining c10::variant calls to std::variant (#110382)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110382
Approved by: https://github.com/Skylion007
2023-10-02 23:52:04 +00:00
01b2f25ebd [inductor] Cast loads from boolean tensors to tl.int1 (#110388)
Triton currently loads pointers to `tl.int1` as `tl.int8`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110388
Approved by: https://github.com/lezcano, https://github.com/Skylion007
2023-10-02 22:52:08 +00:00
28b3ff7974 [quant][pt2e][docs] Update main quant doc with pt2 export quantization information (#110260)
Summary: as titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110260
Approved by: https://github.com/kimishpatel
2023-10-02 21:29:38 +00:00
cba3f407b1 Revert "[HigherOrderOp] Flatten outputs of wrap. (#109433)"
This reverts commit 651b198cdfac09d506a586040b16a88db1f54d85.

Reverted https://github.com/pytorch/pytorch/pull/109433 on behalf of https://github.com/kit1980 due to Depends on reverted https://github.com/pytorch/pytorch/pull/110290 ([comment](https://github.com/pytorch/pytorch/pull/109433#issuecomment-1743766271))
2023-10-02 21:09:19 +00:00
859733512f Revert "Refactor expect tests on test_higher_order_ops.py. (#110290)"
This reverts commit d9aecaefbe477256022ae0c0eae3a77a71bcb320.

Reverted https://github.com/pytorch/pytorch/pull/110290 on behalf of https://github.com/kit1980 due to Broke multiple tests and also lint https://github.com/pytorch/pytorch/actions/runs/6384854768/job/17329068768 ([comment](https://github.com/pytorch/pytorch/pull/110290#issuecomment-1743764686))
2023-10-02 21:07:19 +00:00
cdde899a73 [FSDP][optim_state_dict] Fuse allgather for optim_state_dict when use_orig_params is True (#108298)
The original implementation of `_gather_orig_param_state` is naive. It performs one allgather_object and two allgathers (if the optimizer is Adam) per FQN. This can be slow and can make `_optim_state_dict` a bottleneck.

This PR rewrites the implementation and fuses all the `allgather_object`s into one. As for `allgather`, it is fused based on the information of the FlatParameters, so there will be 2N allgathers, where N is the number of FlatParameters and 2 is due to Adam having 2 states per FQN.

One experiment on 8 A100 GPUs shows that the execution of the gathering improves from 3 seconds to 0.3 seconds.

Differential Revision: [D48835138](https://our.internmc.facebook.com/intern/diff/D48835138/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108298
Approved by: https://github.com/awgu
2023-10-02 20:57:08 +00:00
15dfe7b8e3 Actually enable typechecking for _inductor/index_propagation.py (#110110)
It was supposed to be enabled in #105622 but that PR neglected to update
.lintrunner.toml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110110
Approved by: https://github.com/Skylion007
2023-10-02 20:57:03 +00:00
80b6f072e3 [ATen] Remove ATen.h includes from transformers (#110199)
The kernel files here in particular are quite slow to compile and don't use anything from `ATen.h`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110199
Approved by: https://github.com/malfet
2023-10-02 20:43:23 +00:00
c28bb46445 Fix test_mem_efficient_attention_vs_math_ref_grads tolerance from test_transformers.py (#108094)
The tolerance is currently too low, triggering test failures via numerical mismatch in NVIDIA internal testing for certain H100, A16, and A40 configs. cc: @ptrblck @eqy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108094
Approved by: https://github.com/eqy, https://github.com/msaroufim
2023-10-02 20:42:57 +00:00
6b2c52278e Benchmark flag to include slowdowns when computing gmean of speedups over eager (#108375)
`clip(1)` excludes slowdowns by treating them as 1x.
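
A sketch of the difference, assuming per-model speedups measured as eager time over compiled time:

```python
import numpy as np

speedups = np.array([1.8, 1.3, 0.7])   # 0.7x is a slowdown

gmean = lambda a: float(np.exp(np.log(a).mean()))
print(gmean(speedups.clip(1)))  # old behavior: the slowdown is counted as 1x
print(gmean(speedups))          # with the new flag: the slowdown is included
```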

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108375
Approved by: https://github.com/jansel
2023-10-02 20:35:08 +00:00
b5268456f9 Fix optimize_for_inference to support modules that don't have a forward method (#110013)
Fixes #108662

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110013
Approved by: https://github.com/davidberard98
2023-10-02 20:13:44 +00:00
651b198cdf [HigherOrderOp] Flatten outputs of wrap. (#109433)
Fix: #109247

This PR flattens `wrap` outputs by inlining `pytree.tree_flatten` function after calling
the inner function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109433
Approved by: https://github.com/zou3519
ghstack dependencies: #110290
2023-10-02 19:58:30 +00:00
d9aecaefbe Refactor expect tests on test_higher_order_ops.py. (#110290)
This PR inlines the expected strings onto the `assertExpectedInline` calls, so that, when a
change is needed, we may do it by using the `expecttest` machinery: setting the
environment variable `EXPECTTEST_ACCEPT=1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110290
Approved by: https://github.com/zou3519
2023-10-02 19:58:30 +00:00
92242f599a [PyTorch] Add Expanded call stack to nodes [Take 2] (#110229)
Summary:
Adding back D46578700 / PR https://github.com/pytorch/pytorch/pull/108426

Note: The changes were originally reverted due to a memory regression; these changes put the code behind a gflag so it is only used by binaries that require the expanded stack for BPF profiling.

Original Diff comment:
To get a Node's call stack, we currently loop over the InlinedCallStack graph and follow the "callee" chain. Since the node's inlined stack does not change, we can optimize this by expanding the node's inlined stack once and reusing it. This is particularly useful when reading the node's stack from another process (e.g., BPF), as it simplifies the memory traversal process.
The new data structure (NodeSourceInfo) only holds pointers to the function name and file name variables, and assumes these objects will be alive throughout the lifetime of the process.
Each Node has an extended attribute that holds an index into a vector of stack frames, `expanded_node_stacks_`.
`node_stack_attr_symbol_` is only needed to make accessing the stack vector index attribute easier from BPF.

Test Plan:
- Verified using BPF Program in subsequent diffs
- Perf testing for loading large model: P822455246

Differential Revision: D49565461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110229
Approved by: https://github.com/zdevito
2023-10-02 19:52:41 +00:00
16e3f158b9 Add function to port FX minified graph to HLO via StableHLO (#109084)
If the `XLA_HLO_DEBUG` flag is enabled, generate a minified HLO graph when using the minifier. This function enables HLO minification support by porting the minified FX graph to StableHLO via the `save_torch_model_as_stablehlo` function.

This allows users to port the minified graph to compilers that are not compatible with TorchDynamo/Inductor workflow and use XLA instead. The purpose of this PR is to help XLA users debug accuracy and compilation errors. It will also be helpful for existing TorchDynamo/XLA workflow on `torchxla_trace_once` backend as well.

Fixes [#5461](https://github.com/pytorch/xla/issues/5461) in Torch XLA repo. CC @GleasonK @qihqi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109084
Approved by: https://github.com/anijain2305
2023-10-02 19:36:04 +00:00
7e6cf04a84 Revert "Multiprocessing support for NT (#110292)"
This reverts commit 881e7304d6315c17953fa5b9bc1dfe07dcb7d166.

Reverted https://github.com/pytorch/pytorch/pull/110292 on behalf of https://github.com/jbschlosser due to Address review comments ([comment](https://github.com/pytorch/pytorch/pull/110292#issuecomment-1743524901))
2023-10-02 18:27:13 +00:00
881e7304d6 Multiprocessing support for NT (#110292)
Fixes #110161

Allows NTs to be used in DataLoaders with `num_workers > 1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110292
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110219
2023-10-02 18:14:34 +00:00
7827ae2864 Increase job timeout limit when running with memory leak check (#110193)
This fixes the daily timeout of ROCm jobs when running with memory leak check turned on. I want to use something like `inputs.timeout-minutes * 2`, but that syntax, unfortunately, isn't supported in GitHub Actions YAML. So I decided to just double the current timeout value of 300 minutes to make it 600 minutes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110193
Approved by: https://github.com/clee2000
2023-10-02 18:01:49 +00:00
8d6479725a Revert "Adding Backward Support for NestedTensors and FlashAttention (#97485)"
This reverts commit 28d69d52569c8d140e83a2411e6066c903b94b29.

Reverted https://github.com/pytorch/pytorch/pull/97485 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but one of the tests test_fused_kernels_nested_broadcasting_requires_grad_failure_cuda is failing on Windows CUDA f7ba3e85e2 ([comment](https://github.com/pytorch/pytorch/pull/97485#issuecomment-1743474468))
2023-10-02 17:48:57 +00:00
26900d21c2 [dtensor] skip pytree when not necessary (#110132)
pytree is a great tool, but it is sometimes considered harmful for
tensor subclasses: it's useful for implementing a subclass quickly, but it:
* exposes non-trivial CPU overhead
* is unnecessary for many ops; only the ones with list/dict arguments need it
* has semantic issues for inplace/out ops when blindly used to re-wrap results

This PR avoids using pytree for most ops during torch_dispatch and only
enables it for certain ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110132
Approved by: https://github.com/fduwjj
2023-10-02 17:44:34 +00:00
cyy
fd6c993eea Add missing CUDA error check (#110368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110368
Approved by: https://github.com/Skylion007
2023-10-02 17:34:31 +00:00
46d1f9b385 fix(lint): Fix lint issues on main (#110389)
Lint issue was introduced in https://github.com/pytorch/pytorch/pull/110186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110389
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-10-02 17:04:01 +00:00
a3c1e3c95c Generalize toAccumulateType() (#108248)
Trying to address this comment: https://github.com/pytorch/pytorch/pull/106666#discussion_r1297397554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108248
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-10-02 16:34:36 +00:00
cyy
7853f8f6da Fix override warnings in nvfuser (#110350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110350
Approved by: https://github.com/Skylion007
2023-10-02 16:29:29 +00:00
e47e946bbf [aotinductor] Use dynamic_shape instead of constraints (#110360)
Summary:
Previously we used export's constraints to specify all batch-size dimensions being dynamic. This was done by creating one constraint, `dynamic_dim(inp[0][0], lower, upper)`, followed by `dynamic_dim(inp[0][0]) == dynamic_dim(inp[i][0])` for every input `i`.

Through the new `dynamic_shapes` API, we can use `Dims("batch_size")` on every dimension to specify which dimensions are dynamic and equal to each other, and `None` otherwise: `{i: [Dims("batch_size", lower, upper), None] for every input i}`

Note: `dynamic_shapes` and `constraints` utilize the same "constraints" backend so this diff should be idempotent.
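
A hedged sketch of the idea using the public `torch.export.Dim` spelling; the diff's internal helper is spelled `Dims(...)`, but the shape-spec structure is the same:

```python
import torch
from torch.export import Dim, export

batch = Dim("batch_size", min=1, max=1024)

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

ep = export(
    M(),
    (torch.randn(4, 8), torch.randn(4, 8)),
    # dim 0 of both inputs is dynamic and constrained to be equal
    dynamic_shapes={"x": {0: batch}, "y": {0: batch}},
)
```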

Test Plan: `buck2 run @//mode/dev-nosan //caffe2/torch/fb/model_transform/experimental/benchmark/test/aotinductor:test_aot_inductor_benchmark`

Reviewed By: chenyang78, aakhundov

Differential Revision: D49784351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110360
Approved by: https://github.com/desertfire
2023-10-02 16:09:37 +00:00
87f8bc65f8 Added new test sample to interpolate op in OpInfo (#104181)
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier, https://github.com/lezcano
2023-10-02 15:35:48 +00:00
175b626216 Enable torch.promote_types in Dynamo tracing (#110358)
Fixes #109508

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110358
Approved by: https://github.com/Skylion007
2023-10-02 15:20:36 +00:00
e0348ceceb Avoid undefined behavior in JIT-generated conversion code (#110212)
The inductor/dynamo JIT generator creates C++ code using `static_cast` for type conversions.
This can be undefined behavior for e.g. `static_cast<uint8_t>(floatVal)` where `floatVal` is a negative value.

To avoid this in the "regular" C++ code, `c10::convert` is used. So use it in the JIT-generated code too.

Fixes #110077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110212
Approved by: https://github.com/ezyang, https://github.com/jgong5, https://github.com/desertfire
2023-10-02 12:56:41 +00:00
f7812cdbd9 [inductor][Optimus]Improve logging for Optimus (#110186)
Summary: It is based on the diff D49340843. We add more logs for better debugging and logging.

Test Plan:
```
[2023-09-27 20:35:53,844] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Before group_batch fusion in pre grads pass. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GEoA8xb22jibUNEEAPYecF9_RVM1br0LAAAz
[2023-09-27 20:35:55,001] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Apply fusion BatchLinearFusion. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPMR9BYffjwToEQCAFS7rgixMi0pbr0LAAAz
[2023-09-27 20:35:57,419] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Apply fusion BatchLinearLHSFusion. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GKiA8hNycGpBdAIDAOn0c1Hpef4sbr0LAAAz
[2023-09-27 20:35:57,585] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] BatchLayernormFusion: key = ('batch_layernorm', 'torch.Size([2048, 128])', 'torch.Size([128])', 'torch.Size([128])', '(128,)', '1e-05'); subset size = 7
[2023-09-27 20:35:58,493] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Apply fusion BatchLayernormFusion. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GKpftRa9Glxm-MYDAOZb_D80JHsYbr0LAAAz
[2023-09-27 20:35:59,754] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Apply fusion BatchTanhFusion. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPgh9BZQl4EKGckAAES094iV3Atrbr0LAAAz
I0927 20:36:00.532000 3750607 pre_grad.py:71] After group_batch_fusion_pre_grad_passes: https://www.internalfb.com/intern/everpaste/?color=0&handle=GBPb8xYxfrbXuCMDAI5d_a4YyhFBbr0LAAAz
```

Differential Revision: D49710166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110186
Approved by: https://github.com/jackiexu1992, https://github.com/yanboliang
2023-10-02 07:29:25 +00:00
06464a3477 Change compiled_autograd tests to xfail instead of skip (#110348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110348
Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/Skylion007
2023-10-01 23:03:36 +00:00
a588648759 [DCP] Fix 'torch.cpu' has no attribute 'current_device' in checkpoint/optimizer.py (#110299)
When running on the "gloo" and "cpu:gloo,cuda:nccl" backends, it runs into the following error.

```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
    optim_state = load_sharded_optimizer_state_dict(
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
    _alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
    device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```

This PR fixes the error in optimizer.py. Will follow up to add "cpu:gloo,cuda:nccl" support in DTensorBase so we can update the unit test to include this backend.
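A minimal sketch of the guard this implies (hypothetical helper; the actual change in optimizer.py may differ):

```python
import torch

def _current_device(device_type: str) -> torch.device:
    mod = getattr(torch, device_type)
    if hasattr(mod, "current_device"):
        return torch.device(device_type, mod.current_device())
    # torch.cpu has no current_device() here, so fall back to a plain device
    return torch.device(device_type)
```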
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
2023-10-01 21:54:13 +00:00
13af952f94 [export] Add run_decomposition() function to ExportedProgram (#110236)
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130

`exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.
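A sketch of the intended usage (module and inputs are illustrative):

```python
import torch
from torch.export import export

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.silu(x)

ep = export(M(), (torch.randn(2, 3),))
core_ep = ep.run_decompositions()  # defaults to the Core ATen decomp table
print(core_ep.graph)
```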

Splitting up this diff with the following one (D49742989) to make migrating Executorch easier:
1. Land this diff
2. Wait for a pytorch nightly to include this diff
3. Update executorch's pytorch nightly
4. Land the following diff to have export() return no decomps

Test Plan: Tested in following diff

Differential Revision: D49743208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110236
Approved by: https://github.com/gmagogsfm
2023-10-01 18:18:27 +00:00
13681382d5 Add heuristic for when evict_first should be set (and some other minor things) (#108841)
Example of when the `evict_first` heuristic helps.
```
@torch.compile
def f(a, b):
    return (a * b).sum(dim=-1)

N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
from torch._inductor.utils import do_bench
print(do_bench(lambda: f(*inps)))
```

This generates code like this: http://ix.io/4HFs

```
Original: 3.8 ms
This PR: 3.54 ms
Always evict_first: 5.4 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-10-01 17:06:12 +00:00
e4414716d5 [onnx] support attn_mask fp16 type (#110306)
When users define a customized `attention_mask` using `dtype=torch.float16`, e.g.

```
from torch.nn import functional as F

float_min = torch.finfo(torch.float16).min

attention_mask_fp16 = (attention_mask * 1.0).masked_fill(attention_mask, float_min).to(torch.float16)

attn_output = F.scaled_dot_product_attention(
                 query_layer_, key_layer_, value_layer_, attention_mask_fp16, 0.0, is_causal=False
 )
```

the ONNX graph cannot be exported.

When q, k, v have the fp16 type, we can support an fp16 `attn_mask` as well, by changing the check to
```
elif (
    _type_utils.JitScalarType.from_value(attn_mask)
    in (_type_utils.JitScalarType.FLOAT, _type_utils.JitScalarType.HALF)
):
```
With this change, the `.onnx` graph can be exported.

Fixes #109336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110306
Approved by: https://github.com/titaiwangms
2023-10-01 14:50:58 +00:00
cyy
55905c4a1a [2/N] Enable clang-tidy to c10/test/*cpp (#110270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110270
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-10-01 07:36:23 +00:00
cyy
ef5ff79019 [2/N] Clean up CMake target linking (#109986)
This PR cleans up more CMake target linking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109986
Approved by: https://github.com/malfet
2023-10-01 05:36:08 +00:00
669faab0ad [AOTInductor] Add non-default device test (#110024)
Differential Revision: D49604597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110024
Approved by: https://github.com/chenyang78
2023-10-01 05:08:23 +00:00
2bcae75513 [vision hash update] update the pinned vision hash (#110344)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110344
Approved by: https://github.com/pytorchbot
2023-10-01 04:20:06 +00:00
e8c0364f36 [AOTInductor] Add model runner to avoid using torch_extension (#110263)
Differential Revision: D49609669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110263
Approved by: https://github.com/chenyang78
2023-10-01 00:52:17 +00:00
898656e9d1 [AOTInductor] ProxyExecutor supports Tuple of Tensor and List[Tensor] in returns (#110187)
Summary:
ProxyExecutor supports custom ops that return a tuple mixing Tensor and List[Tensor]
e.g. `"fn_with_mix_outputs(Tensor t, Tensor[] tensors) -> (Tensor, Tensor[])"`

Example:
`out7, [out8, out9] = torch.ops.fb.fn_with_mix_outputs(out5, [out6, out4])`
got compiled into
```
    AtenTensorHandle buf11_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf11_handle));
    RAIIAtenTensorHandle buf11(buf11_handle);
    AtenTensorHandle buf12_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf12_handle));
    RAIIAtenTensorHandle buf12(buf12_handle);
    AtenTensorHandle buf13_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf13_handle));
    RAIIAtenTensorHandle buf13(buf13_handle);
    AtenTensorHandle tensor_args_var_7[] = {buf8.get(), buf9.get(), buf6.get(), buf11.get(), buf12.get(), buf13.get()};
    int64_t int_args_var_8[] = {};
    aoti_torch_proxy_executor_call_function(proxy_executor, 3, 0, int_args_var_8, 6, tensor_args_var_7);
```

Serialized extern node
```
    {
      "name": "buf10",
      "node": {
        "target": "fb::fn_with_mix_outputs",
        "inputs": [
          {
            "name": "t",
            "arg": {
              "asTensor": {
                "name": "buf8"
              }
            }
          },
          {
            "name": "tensors",
            "arg": {
              "asTensors": [
                {
                  "name": "buf9"
                },
                {
                  "name": "buf6"
                }
              ]
            }
          }
        ],
        "outputs": [
          {
            "asTensor": {
              "name": "buf11"
            }
          },
          {
            "asTensors": [
              {
                "name": "buf12"
              },
              {
                "name": "buf13"
              }
            ]
          }
        ],
        "metadata": {}
      }
    }
```

Test Plan: Test

Differential Revision: D49710320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110187
Approved by: https://github.com/chenyang78
2023-09-30 19:47:01 +00:00
6bb448a2d3 [inductor][fbcode] Add -D C10_DISABLE_TENSORIMPL_EXTENSIBILITY to cpp_compile_command (#110122)
Summary:
## Why?

The .so and .h files are compiled separately with different flags. The .so is compiled by AOTInductor and the .h files (e.g. c10/core/TensorImpl.h) are compiled by buck2.

Let's make sure the .so is also compiled with this macro in fbcode.

Differential Revision: D49664078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110122
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2023-09-30 16:34:59 +00:00
cyy
d0ad848aa5 Enable misc clang-tidy checks (#110283)
This PR enables the misc-XX checks in clang-tidy. Meanwhile, I excluded some of them that require a lot of code changes and have no immediate benefits. Some additional fixes and suppression were also given.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110283
Approved by: https://github.com/albanD
2023-09-30 10:39:52 +00:00
2ead6c2f6e Skip launching kernels with zero grid in AOT Inductor (#110312)
Summary: with the grid computed in terms of unbacked `SymInt`s, it can happen that the grid is zero-sized. This causes a CUDA error on `cuLaunchKernel` in the AOT Inductor codegen.

In this PR, when the grid contains unbacked `SymInt`s, a check is added around the `launchKernel` in the AOT Inductor's C++ wrapper codegen to make sure that the grid is not zero-size.
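In Python pseudocode (a sketch of the effect, not the actual C++ wrapper codegen), the guard amounts to:

```python
def maybe_launch(kernel, grid, *args):
    # grid may contain unbacked symints that resolve to zero at runtime
    if all(dim > 0 for dim in grid):
        kernel[grid](*args)  # Triton-style launch
```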

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110312
Approved by: https://github.com/chenyang78
2023-09-30 09:12:56 +00:00
81a74457ca [BE] Clean up trymerge code handling flaky failures (#110133)
This is the 2nd part of https://github.com/pytorch/pytorch/pull/110054.  The flaky classification has been done on Dr.CI.  There is no need to download flaky rule files and do the check anymore.  Some tests are also updated with new examples because we mocked the list of flaky rules there.  Similar tests have been done on Dr.CI.

* [x] https://github.com/pytorch/pytorch/pull/110054
* [x] Clean up the flaky rules logic because it has already been implemented on Dr. CI
* [ ] Clean up the broken trunk logic for the same reason

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110133
Approved by: https://github.com/clee2000
2023-09-30 08:01:00 +00:00
f7ba3e85e2 [Dynamo] Add functional triton kernel wrapper (#110185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110185
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #109623
2023-09-30 04:20:20 +00:00
eqy
6b84658433 [CUDA][cudaMallocAsync] Improve PYTORCH_CUDA_ALLOC_CONF error message (#104891)
Tiny fix to improve user-facing errors for issues like #104801

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104891
Approved by: https://github.com/kit1980
2023-09-30 02:59:02 +00:00
ad8aef0f98 [BE] [3/N] Use nested namespaces (#110314)
Mostly in torch/csrc/jit/runtime and in `ATen/cuda/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110314
Approved by: https://github.com/seemethere
2023-09-30 02:23:48 +00:00
8745d2d4f2 Small optimization to how we call flash-attention (#110324)
# Summary
Logging Mode is great, and helped me identify that we are doing an unnecessary slice sometimes.

### Numbers
For small sizes, e.g. (16, 16, 32, 32):
This brings the timing from:

`flash_time: 29.344002110883594 micro seconds`

to

`flash_time: 26.971791498363018 micro seconds`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110324
Approved by: https://github.com/cpuhrsch
2023-09-30 02:15:07 +00:00
7eeb392eb3 [Inductor] Enable the item() and nonzero() codegen test on CPU (#110262)
**Summary**
Follow-up to https://github.com/pytorch/pytorch/pull/109893, which had an issue supporting CPU as reported in https://github.com/pytorch/pytorch/issues/109897. This fix mainly includes 2 changes:

- The current implementation of `rename_indexing`
10c646295d/torch/_inductor/codegen/common.py (L1023) only adds symbol names starting with `s` or `ps` into `kernel.args.sizevars`. However, an unbacked symint's name starts with `i`, so we extend the implementation of `rename_indexing` to support symbols starting with `i`.
- Currently, the internal loop index name also starts with `i`. Since `i` has been taken by unbacked symints, change the name to start with `x`, which aligns with Triton.

**Test Plan**
```
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-09-30 00:13:20 +00:00
e0be9ebc18 Simplify the conditionals used for learning rate calculation for ConstantLR learning rate scheduler (#109785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109785
Approved by: https://github.com/janeyx99, https://github.com/kit1980
2023-09-29 23:11:23 +00:00
993eea0edd [aotinductor] Fix a missing schema issue for repeat_interleave (#110105)
Differential Revision: [D49686812](https://our.internmc.facebook.com/intern/diff/D49686812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110105
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/aakhundov
2023-09-29 23:01:37 +00:00
ee0bff209c [LTC] correct AdaptiveAvgPool3d channel dim index for shape inference (#109822)
Fixes #109821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109822
Approved by: https://github.com/mikaylagawarecki, https://github.com/alanwaketan
2023-09-29 22:54:12 +00:00
5a87477e3f [BE] Use std::make_unique (#110298)
Since C++14, `std::unique_ptr<type_t[]> x(new type_t[NUM])` is identical to `auto x = std::make_unique<type_t[]>(NUM);`

Leave two `std::unique_ptr<float[]> arr(new float[NUM]());` as that statement not just allocates, but initializes the array as well, see below:
d04b35e7e3/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L700-L701)

On the other hand, from https://github.com/pytorch/pytorch/pull/60371 it's not at all clear if it needs to be initialized to zero at that point...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110298
Approved by: https://github.com/kit1980
2023-09-29 22:46:30 +00:00
b083058e45 Revert "Make unbind() overrideable for NT subclass (#109122)"
This reverts commit f5a23ca78d13c5e536f5062325c815c50be5f4c2.

Reverted https://github.com/pytorch/pytorch/pull/109122 on behalf of https://github.com/PaliC due to breaking slow tests ([comment](https://github.com/pytorch/pytorch/pull/109122#issuecomment-1741555305))
2023-09-29 22:41:56 +00:00
1e95a1ae8c MAINT: pytorchify torch._numpy tests: core/ and fft/ (#109815)
1. Inherit from TestCase
2. Use pytorch parametrization
3. Use unittest.expectedFailure to mark xfails, also unittest skips

All this to make pytest-less invocation work:

$ python test/torch_np/test_basic.py

cross-ref #109593, #109718, #109775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109815
Approved by: https://github.com/lezcano
2023-09-29 22:36:13 +00:00
9c7071b0e3 [fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-use-after-free (size 8) in std::_Function_base::_M_empty() (#110289)
Summary: This diff fixes a heap UAF found by fuzzing in torch/csrc/jit/mobile/interpreter.cpp

Test Plan:
CI and
```
arc lionhead crash reproduce 1009060456885023
```
doesn't crash anymore.

Reviewed By: malfet

Differential Revision: D49538326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110289
Approved by: https://github.com/malfet
2023-09-29 22:32:38 +00:00
f2d7faf4ba Revert "MAINT: pytorchify torch._numpy tests: core/ and fft/ (#109815)"
This reverts commit 132a138a01806b45bb050cbcacbaa782fcf2e2ae.

Reverted https://github.com/pytorch/pytorch/pull/109815 on behalf of https://github.com/PaliC due to causing various slow tests to fail ([comment](https://github.com/pytorch/pytorch/pull/109815#issuecomment-1741525574))
2023-09-29 21:53:36 +00:00
28d69d5256 Adding Backward Support for NestedTensors and FlashAttention (#97485)
# Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 318764f</samp>

This pull request implements the CUDA backend of the SDPA kernel for nested tensors, which enables efficient transformer models with variable-length sequences. It adds a new dispatch key, a backward function, a unit test, and some helper functions for the kernel. It modifies `test/test_transformers.py`, `aten/src/ATen/native/native_functions.yaml`, `aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctionsBackward.cpp`, and `aten/src/ATen/native/nested/cuda/NestedTensorTransformerUtils.h`.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ed4a773</samp>

> _Fused kernels of doom, unleash the flash attention_
> _Nested tensors on fire, reshape and pad with caution_
> _Backward pass of power, dispatch the CUDA key_
> _Test the gradients of hell, warn the user if they disagree_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97485
Approved by: https://github.com/jbschlosser
2023-09-29 21:34:47 +00:00
359c2a53f5 dynamic_shapes + retrace exported program (#110276)
An `ExportedProgram`'s `__call__` signature is different from the original module, so `dynamic_shapes` that follow the original signature would fail when applied to re-export an `ExportedProgram`.

This PR fixes this issue; in other words, the original `dynamic_shapes` should now work when re-exporting.

Differential Revision: D49764011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110276
Approved by: https://github.com/tugsbayasgalan
2023-09-29 21:06:46 +00:00
c2c7c4035f Revert "Simplify the conditionals used for learning rate calculation for ConstantLR learning rate scheduler (#109785)"
This reverts commit 83283b4f0dc2032a31f9a80c7aa40e3e552ec944.

Reverted https://github.com/pytorch/pytorch/pull/109785 on behalf of https://github.com/PaliC due to causing macos errors as per 83283b4f0d ([comment](https://github.com/pytorch/pytorch/pull/109785#issuecomment-1741471142))
2023-09-29 20:49:28 +00:00
b253fc9c93 Revert "[1/N] Dynamo skipfiles refactor (#109567)" (#110296)
This reverts commit 84c5435b296bf7361f0f3043f7e68b7ba13ffd70.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110296
Approved by: https://github.com/yanboliang
2023-09-29 20:35:46 +00:00
bc047ec906 [inductor] Make sure unfuse_addmm and addmm patterns don't overlap (#110235)
Inductor has two opposing patterns,
```
addmm -> add + mm
add + mm -> addmm
```

This uses the `extra_check` to disable the addmm fusion pattern when the
heuristic to unfuse add is met, for consistency.
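For reference, the two rewrites are equivalent in the sense sketched below (plain PyTorch, not Inductor's pattern code):

```python
import torch

def unfused(b, x, y):
    return torch.mm(x, y) + b    # add + mm form

def fused(b, x, y):
    return torch.addmm(b, x, y)  # fused form

b, x, y = torch.randn(4), torch.randn(2, 3), torch.randn(3, 4)
torch.testing.assert_close(unfused(b, x, y), fused(b, x, y))
```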

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110235
Approved by: https://github.com/lezcano, https://github.com/eellison
ghstack dependencies: #110232
2023-09-29 19:35:29 +00:00
d04b35e7e3 [inductor] Fix bug in input mutation (#107614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107614
Approved by: https://github.com/jansel
2023-09-29 18:27:06 +00:00
d7de26804e [AOTInductor] ProxyExecutor supports List[Tensor] return type (#110182)
Summary:
Support custom ops returns List[Tensor] type, like `"fn_with_list_output(Tensor[] tensors, int i) -> Tensor[]"`

As an example
`out5, out6 = torch.ops.fb.fn_with_list_output([out3, out4], 1)`

got compiled into

```
    AtenTensorHandle buf8_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf8_handle));
    RAIIAtenTensorHandle buf8(buf8_handle);
    AtenTensorHandle buf9_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf9_handle));
    RAIIAtenTensorHandle buf9(buf9_handle);
    AtenTensorHandle tensor_args_var_5[] = {buf5.get(), buf6.get(), buf8.get(), buf9.get()};
    int64_t int_args_var_6[] = {1};
    aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, int_args_var_6, 4, tensor_args_var_5);
```

Test Plan: Test

Differential Revision: D49694691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110182
Approved by: https://github.com/chenyang78
2023-09-29 18:21:48 +00:00
d6d3f6cfe5 Add weight update for DSOModel. (#110273)
Summary: Add weight update for DSOModel and AOTInductorModel

Test Plan: buck2 test accelerators/workloads/models/slimdsnn:slimdsnn_dso_test - SlimDSNN.DSO_Update_Constants

Reviewed By: mikekgfb

Differential Revision: D49748685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110273
Approved by: https://github.com/hl475
2023-09-29 18:14:01 +00:00
6e2c14a0e8 [Codemod][[codemod] Replace third-party mock with unittest.mock] caffe2/caffe2 (#106541)
Reviewed By: thechrisu

Differential Revision: D47909974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106541
Approved by: https://github.com/thechrisu
2023-09-29 18:09:49 +00:00
88ef126a93 rename nanogpt_generate to nanogpt to also support train (#109746)
Differential Revision: [D49522940](https://our.internmc.facebook.com/intern/diff/D49522940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109746
Approved by: https://github.com/msaroufim, https://github.com/malfet, https://github.com/xuzhao9
2023-09-29 17:36:48 +00:00
30759848fa [inductor] handle non-list/tuple outputs for FallbackKernel (#110145)
generate_output may return non-list/tuple outputs. Let's force
those to be lists, because we will enumerate kernel.outputs
later in the codegen.

Also fixed a minor issue in an assertion message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110145
Approved by: https://github.com/aakhundov
2023-09-29 17:13:26 +00:00
defb364adf Clean up test_external_module_register (#110254)
caused by #109866

The test registers a new device module; the above PR checks for XPU, sees that it got registered, and uses it, but it's a dummy module.

This causes any test after it to fail, so I "clean up" the registered module.

Another possible solution would be to run this test last.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110254
Approved by: https://github.com/huydhn
2023-09-29 17:02:13 +00:00
0ff1155d3a [aotinductor] Refactor test_aot_inductor to take different devices (#110216)
Summary: Replace the hardcoded device with self.device, to make it easier to test both CPU and CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110216
Approved by: https://github.com/chenyang78, https://github.com/bertmaher
ghstack dependencies: #110215
2023-09-29 16:30:19 +00:00
ce6d09a775 [aotinductor] Refactor test_aot_inductor (#110215)
Summary: Remove the usage of output tensors in the test script, since AOTInductor now returns output tensors instead of taking in pre-allocated output tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110215
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2023-09-29 16:30:19 +00:00
28f52f2f80 Fix aminmax on CUDA when input shape contains 0 (#107564)
The CUDA kernel asserts numel() > 0; the CPU kernel doesn't, and returns empty values (as expected)
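A quick sketch of the behavior being aligned (CPU already handled this case):

```python
import torch

t = torch.zeros(4, 0)              # input shape contains 0
mn, mx = torch.aminmax(t, dim=0)   # now also works on CUDA
print(mn.shape, mx.shape)          # torch.Size([0]) torch.Size([0])
```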

Fixes #95349 and #85439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107564
Approved by: https://github.com/lezcano
2023-09-29 16:18:08 +00:00
2d50a30d77 [Dynamo] Add native support for Triton Kernels to Dynamo (#109623)
This PR adds native support to Dynamo to detect Triton kernels and
create an FX graph node out of them. AOT eager and inductor modes will
be supported in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109623
Approved by: https://github.com/jansel
2023-09-29 15:49:18 +00:00
3693777a86 Pickle support for NT (#110219)
Fixes #104198
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110219
Approved by: https://github.com/cpuhrsch
2023-09-29 15:30:06 +00:00
c9511e8ac9 [foreach][BE] cleaning up MultiTensorApply.cuh (#110228)
Followup edits to #109402 as suggested by @r-barnes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110228
Approved by: https://github.com/drisspg
2023-09-29 14:44:48 +00:00
92f4a7b663 [inductor] Add fbcode include path for cuda (#110240)
We missed the cuda include, leading to failures in cases where CUDA
was not installed locally but only provided via third-party/GVFS.

Differential Revision: [D49745585](https://our.internmc.facebook.com/intern/diff/D49745585/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110240
Approved by: https://github.com/hl475
2023-09-29 13:39:40 +00:00
758735b739 [dynamo] Convert dtype arguments as well as inputs in cast_to_fp64 (#110232)
Generating reference outputs sometimes fails because of type mismatches in the graph,
an issue which was noticed previously for `prims.convert_element_type` and fixed in #92036
but the same issue happens with other functions such as tensor constructors.

This expands the fix from #92036 to all dtype keyword arguments.
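A minimal sketch of the generalized fix (hypothetical helper name; the real pass walks FX graph nodes):

```python
import torch

def cast_dtype_kwargs_to_fp64(kwargs):
    # rewrite floating-point dtype keyword arguments, not just tensor inputs
    return {
        k: torch.float64
        if isinstance(v, torch.dtype) and v.is_floating_point
        else v
        for k, v in kwargs.items()
    }

print(cast_dtype_kwargs_to_fp64({"dtype": torch.float32, "device": "cpu"}))
```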

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110232
Approved by: https://github.com/ezyang
2023-09-29 12:42:14 +00:00
24e5d61af8 Log usage of optimizer in backward (#110206)
This will allow us to inspect and aggregate jobs that use optimizer in
backward

Differential Revision: [D48674740](https://our.internmc.facebook.com/intern/diff/D48674740/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110206
Approved by: https://github.com/awgu
2023-09-29 11:00:07 +00:00
acac92f806 [vision hash update] update the pinned vision hash (#110258)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110258
Approved by: https://github.com/pytorchbot
2023-09-29 04:17:27 +00:00
d615f0078c Updating documentation for PolynomialLR (#110151)
The docstring says the power parameter is `int`, when it should be `float`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110151
Approved by: https://github.com/janeyx99
2023-09-29 03:50:11 +00:00
07ec95b17c TD: Fix sorting bug for historical correlations heuristic (#110257)
Fix a bug where the historical correlations heuristic sorted test files in the opposite order, ranking the least relevant tests most highly

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 70333d1</samp>

> _`test_files` sorted_
> _by ratings, high to low_
> _a faster spring test_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110257
Approved by: https://github.com/clee2000
2023-09-29 03:29:08 +00:00
cyy
3dc479e70b [1/N] Apply clang-tidy to c10/test/*cpp (#109278)
This series of PR enables clang-tidy checks in c10/test. We aim to finally add the path to lintrunner.toml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109278
Approved by: https://github.com/kit1980
2023-09-29 02:20:57 +00:00
e6b5e0ecc6 removing the functionality of nvfuser python APIs (#110124)
Removing the functionality from the nvfuser Python APIs.

Since the use of nvfuser was deprecated before the last release cut, we are removing TorchScript support.

The next PR will actually remove the code base.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110124
Approved by: https://github.com/davidberard98
2023-09-29 01:45:00 +00:00
88de391692 [torch.library] Fix some docstrings (#110214)
Removed some erroneous colons

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110214
Approved by: https://github.com/ezyang
2023-09-29 01:44:49 +00:00
83283b4f0d Simplify the conditionals used for learning rate calculation for ConstantLR learning rate scheduler (#109785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109785
Approved by: https://github.com/janeyx99, https://github.com/kit1980
2023-09-29 01:19:05 +00:00
c9b8e06060 [quant] Enable quantization for wav2letter (#109830)
Summary:
Also added annotation support for conv1d_relu and conv1d in XNNPACKQuantizer; the quantized results still
match the fx quant path (which didn't quantize conv1d), so tests are not disabled

Test Plan: with-proxy buck2 run executorch/examples/quantization:example -- -m=w2l --verify

Differential Revision: D49479546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109830
Approved by: https://github.com/kimishpatel
2023-09-29 00:47:34 +00:00
ce8b4f56d8 [dynamo] Dont put nn module guards on torch inbuilt nn modules (#110230)
This is one way to fix https://github.com/pytorch/pytorch/issues/110048

Looking for feedback.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110230
Approved by: https://github.com/ezyang
2023-09-29 00:43:16 +00:00
20dabea35d Inductor cpp wrapper: support MkldnnRnnLayer (#107858)
1. Directly use the `codegen` function of the parent class which already supported both python and cpp wrapper.
2. The output of the `at::mkldnn_rnn_layer` OP is actually a `std::tuple` 1491bae277/aten/src/ATen/native/mkldnn/RNN.cpp (L218) Fix the type when calling `MultiOutput`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107858
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-09-29 00:22:42 +00:00
d1a13129bb Add support for item() and nonzero() codegen in Inductor (#109893)
This is another version of
https://github.com/pytorch/pytorch/pull/109262 that I think is more
harmonious with inductor design.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109893
Approved by: https://github.com/jansel
2023-09-28 23:37:31 +00:00
3de42995e4 [quant][pt2e] Add quant API re-entrant test (#110125)
Summary:
Add the test to make sure we can call the quantize API multiple times

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_reentrant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110125
Approved by: https://github.com/kimishpatel
ghstack dependencies: #110097
2023-09-28 22:41:59 +00:00
bbb95878e9 [LLVM] Update apis incompatible with llvm versions in codegen (#110200)
Opaque pointer support is disabled in LLVM 14 and enabled by default from LLVM 15 and above.
The setOpaquePointers API is deprecated from LLVM 16, so its usage was removed.

Updated the CreateMalloc and CreateFree APIs for the latest LLVM release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110200
Approved by: https://github.com/Skylion007
2023-09-28 21:49:30 +00:00
ae546db562 [GHF] Update meregbot tests (#110221)
One should never edit `gql_mocks.json` by hand, as otherwise it does not validate mergebot behavior using the actual GitHub data, but rather snapshot of this data frozen in time.

Unfortunately, GitHub started to delete checkrun statuses against older
PRs, so some tests need to be updated.

For example, https://github.com/pytorch/pytorch/pull/77700/checks, committed on May 19th 2022, has no checks at the time of writing (Sep 28th 2023)

Deleted `test_checksuites_pagination` as its checks are gone and it tests the same functionality as `test_get_checkruns_many_runs`, which was updated to use a more recent PR.

Deleted `test_get_classifications_pending_unstable`, because what it wants to test is inherently unreliable and therefore it must be rewritten using some different mechanisms.

Disabled `test_internal_changes` as the mechanism is broken at the moment, see https://github.com/pytorch/pytorch/issues/110218

Updated `test_pr_dependencies_ghstack` and `test_pr_dependencies` to generate `msg` using `pr.get_body()` rather than hardcoding the text (which was updated after the test was committed).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110221
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-09-28 21:29:17 +00:00
be3b16daad [decomp] Fix baddbmm decomposition (#109714)
The decomposition is currently registered without the pw_cast_for_opmath
decorator, due to the ordering of decorators being meaningful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109714
Approved by: https://github.com/lezcano
2023-09-28 21:23:44 +00:00
41d6c29b19 [BE] Fix pointless comparison warning (#110227)
Indeed, `uint32_t(x) >= 0` is always true

Warning typically looks as follows:
```
[337/1379] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/EmbeddingBag.cu.o
../aten/src/ATen/core/ivalue.h(1283): warning #186-D: pointless comparison of unsigned integer with zero

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110227
Approved by: https://github.com/atalman, https://github.com/albanD
2023-09-28 20:21:26 +00:00
f82a29e32b [inductor] Add CI jobs to test AOTInductor (#108419)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108419
Approved by: https://github.com/angelayi, https://github.com/jansel
2023-09-28 20:19:25 +00:00
81da6db74a fix a missing keyword virtual (#110220)
# Motivation
Fix a missing keyword `virtual`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110220
Approved by: https://github.com/ezyang
2023-09-28 19:45:34 +00:00
e0b035c220 Revert "[core IR] Add lift_fresh, split.Tensor, and unbind decompositions to core ATen decomp table (#110102)"
This reverts commit 22e706f76894a898036329256a3f2f58e79aee92.

Reverted https://github.com/pytorch/pytorch/pull/110102 on behalf of https://github.com/atalman due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/110102#issuecomment-1739856671))
2023-09-28 19:03:25 +00:00
aaaa3c1586 Fixed minor issues for bmm/mm decompositon (#109836)
Summary:
* Fixed minor issues for bmm/mm decomposition
* Enabled addmm for inductor

Test Plan: ci

Reviewed By: mikekgfb

Differential Revision: D49522332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109836
Approved by: https://github.com/jansel, https://github.com/mikekgfb
2023-09-28 18:45:01 +00:00
cyy
168f516fae [3/N] Move c10::variant to std::variant (#110141)
This PR moves more c10::variant calls to std::variant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110141
Approved by: https://github.com/Skylion007
2023-09-28 18:43:55 +00:00
84c5435b29 [1/N] Dynamo skipfiles refactor (#109567)
This is 1/N of the dynamo skipfiles/allowed_functions refactor, the major change in this PR includes:
* Refactor & define the [skipfiles rules](https://github.com/pytorch/pytorch/pull/109567/files#diff-5aa3ce9db729bf0901ea97a5d3cc51924cc8575d9c516c1c8f572a35de92544aR56) and interface
* For every ```skipfiles.check```, we return both the check result and the skip/inline reason and log them for debugging.
* We found several latent issues/bugs and incorrect implementations in the codebase, but I'm planning to fix them in follow-up PRs to keep the refactor decoupled from bug fixes.
* More details in the inline comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109567
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/anijain2305
2023-09-28 18:36:46 +00:00
e3eb1d92d8 [quant][docs] Add documentation for prepare_pt2e, prepare_qat_pt2e and convert_pt2e (#110097)
Summary:
att

Test Plan:
.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110097
Approved by: https://github.com/kimishpatel
2023-09-28 18:24:58 +00:00
3603f646eb BUG: fix torch._numpy.arange(5, dtype="float32") (#110005)
Make `np.arange` respect an explicitly provided dtype.

Also remove duplicated tests:
- torch_np/test_function_base.py::TestArange is a dupe of
- torch_np/numpy_tests/core/test_multiarray.py::TestArange

Fixes #109975
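A quick check of the fixed behavior (a sketch, assuming numpy-compatible dtype comparison in `torch._numpy`):

```python
import torch._numpy as np

a = np.arange(5, dtype="float32")
assert a.dtype == np.float32  # the explicit dtype is now respected
```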

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110005
Approved by: https://github.com/lezcano
2023-09-28 18:21:18 +00:00
5f7eff0adb Replace node.meta source_fn with source_fn_stack (#108595)
A resubmit of https://github.com/pytorch/pytorch/pull/108447. Copy over the descriptions:

This is a follow-up of the discussion in https://github.com/pytorch/pytorch/pull/108356, where we want to replace source_fn with source_fn_stack

Before this PR, for the following example:
```python
backend = EagerAndRecordGraphs()

@torch.compile(backend=backend, fullgraph=True)
def cond_f(pred, pred2, x, y):
    def true_fn(pred2, x, y):
        return x + y

    def false_fn(pred2, x, y):
        def true_fn2(x, y):
            return x.sin() - y.cos()

        def false_fn2(x, y):
            return x.cos() - y.sin()

        return control_flow.cond(pred2, true_fn2, false_fn2, (x, y))

    return control_flow.cond(pred, true_fn, false_fn, (pred2, x, y))
```
The graph captured is shown below:
```python
class GraphModule(torch.nn.Module):
    def forward(self, L_pred_ : torch.Tensor, L_pred2_ : torch.Tensor, L_x_ : torch.Tensor, L_y_ : torch.Tensor):
        l_pred_ = L_pred_
        l_pred2_ = L_pred2_
        l_x_ = L_x_
        l_y_ = L_y_

        cond_true_1 = self.cond_true_1
        cond_false_1 = self.cond_false_1
        cond = torch.ops.higher_order.cond(l_pred_, cond_true_1, cond_false_1, [l_pred2_, l_x_, l_y_]);  l_pred_ = cond_true_1 = cond_false_1 = l_pred2_ = l_x_ = l_y_ = None
        return (cond,)

    class GraphModule(torch.nn.Module):
        def forward(self, l_pred2_, l_x_, l_y_):
            add = l_x_ + l_y_;  l_x_ = l_y_ = None
            return add

    class GraphModule(torch.nn.Module):
        def forward(self, l_pred2_, l_x_, l_y_):
            cond_true_0 = self.cond_true_0
            cond_false_0 = self.cond_false_0
            cond = torch.ops.higher_order.cond(l_pred2_, cond_true_0, cond_false_0, [l_x_, l_y_]);  l_pred2_ = cond_true_0 = cond_false_0 = l_x_ = l_y_ = None
            return cond

        class GraphModule(torch.nn.Module):
            def forward(self, l_x_, l_y_):
                sin = l_x_.sin();  l_x_ = None
                cos = l_y_.cos();  l_y_ = None
                sub = sin - cos;  sin = cos = None
                return sub

        class GraphModule(torch.nn.Module):
            def forward(self, l_x_, l_y_):
                cos = l_x_.cos();  l_x_ = None
                sin = l_y_.sin();  l_y_ = None
                sub = cos - sin;  cos = sin = None
                return sub
```
the source_fn for inner cond, sin, cos will be a (name, target) tuple:
```
('cond', <torch._ops.HigherOrderOperator object at xxx>)
('sin', 'sin')
('cos', 'cos')
('sub'. <built-in function sub>)
```

After this PR, the source_fn_stack will be a list of (name, target) tuples. The bottom of the stack is the end of the list.
```
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>)],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sin', 'sin')],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cos', 'cos')]
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sub', <built-in function sub>)]
```

Test Plan:
See added tests in test_higher_order_ops.py and modify existing test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108595
Approved by: https://github.com/angelayi, https://github.com/zou3519
2023-09-28 18:18:36 +00:00
1d0a8eed5d [generate_opcheck_tests] Enable using same failures_dict for multiple testclasses (#110164)
This PR allows us to use the same failures_dict for multiple test
classes. This is helpful if you have a bunch of small TestCase(es) and
want to centralize all the failure dicts into one big one.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110164
Approved by: https://github.com/williamwen42
2023-09-28 17:56:45 +00:00
f2c360e3e5 Reorganize and rename COW files and APIs (#110191)
This PR does the following:
* Combine `cow/context.<h/cpp>` and `cow/deleter.<h/cpp>` into `cow/COWDeleter.<h/cpp>`
* Rename `Context` to `COWDeleterContext`
* Rename `delete_context` to `cow_deleter`
* Remove the separate `impl_cow_context` bazel library, combining it with the base c10 core library
* Rename `context_test.cpp` to `cow_test.cpp`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110191
Approved by: https://github.com/ezyang
2023-09-28 17:50:44 +00:00
c62be12061 Added batch rules for _upsample_bi*2d_aa and _upsample_bi*2d_aa_backward (#110172)
Description:
- Added batch rules for `_upsample_bi*2d_aa` and `_upsample_bi*2d_aa_backward`
- Added few more test cases into `sample_inputs_upsample_aten`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110172
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-09-28 17:42:48 +00:00
2a246c5259 update type() calling to not use unneeded device (#110163)
The previous code path did an unnecessary CUDA init and caused an unnecessary "device" to appear in the JIT trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110163
Approved by: https://github.com/henryhu6, https://github.com/albanD
2023-09-28 17:34:46 +00:00
cyy
7f5fd92372 Reland use std::make_unique after internal changes (#109742)
check internal
follow up of #109780
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109742
Approved by: https://github.com/ezyang
2023-09-28 17:24:08 +00:00
7f5737392d [FSDP] fix: fix for fsdp exec order pre fwd record (#110138)
When sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch=True, self.is_first_iter will always be True during a direct validation run (because training=False, iter+1 is not executed). Additionally, the _pre_forward_order_index of the first handle entering the record_pre_forward function is 0. This causes the handle to get a False result in the if condition at line 166 when it enters the record_pre_forward function again (the expected value is True, because _pre_forward_order_index has actually been assigned a value). As a result, the first handle is repeatedly added to handles_pre_forward_order, leading to an incorrect prefetching order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110138
Approved by: https://github.com/awgu
2023-09-28 15:45:05 +00:00
6f48d872d0 Re-land: Break graph on manual_seed. (#109109)
Re-landing: #108647 (old #107594)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109109
Approved by: https://github.com/lezcano
2023-09-28 15:28:40 +00:00
5f417fd710 [aot_inductor] Lightweight model runner (#110158)
It's useful to have a simple, lightweight way to run a model that adds
essentially no overhead to calling the model's generated `run_impl` method.
This C API is a super thin wrapper around AOTInductorModel: Create, Run, and
Delete are provided, and do very little work beyond dispatch to the appropriate
helpers.

Note the Create function also provides additional functionality beyond the
Container API; it allows the user to pass in a weight map defined in userland,
which is a requirement for several serving use cases.

Differential Revision: [D49670711](https://our.internmc.facebook.com/intern/diff/D49670711/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110158
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-09-28 14:59:41 +00:00
ad0ba5e187 [torchbench] Consistent accuracy results with dynamobench (#110189)
Summary:
Use the upstream `torch._dynamo.same` function in accuracy checking and remove the self-hosted version in torchbench.

Now cmf_10x and ads_dhen_5x can run in deterministic mode, so enable deepcopy and deterministic mode.

Test Plan:
```
$ buck2 run mode/opt //pytorch/benchmark:run -- cmf_10x -d cuda -t train --accuracy
Running train method from cmf_10x on cuda in eager mode with input batch size 4 and precision tf32.
Accuracy:                            pass
```

```
$ buck2 run mode/opt //pytorch/benchmark:run -- cmf_10x -d cuda -t train --torchdynamo inductor --torchinductor_enable_batch_fusion --torchinductor_enable_split_cat_fx_pass --accuracy
Running train method from cmf_10x on cuda in dynamo inductor mode with input batch size 4 and precision tf32.
Accuracy:                            pass
```

Without this PR, it will print:

```
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_dynamo/utils.py", line 190, in time_wrapper
    r = func(*args, **kwargs)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/graph.py", line 464, in run
    return super().run(*args)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/fx/interpreter.py", line 138, in run
    self.env[node] = self.run_node(node)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/graph.py", line 826, in run_node
    result.realize_hint()
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5273, in realize_hint
    and self.is_pointwise_non_scalar_tensor_num_reads_larger_than_one()
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/utils.py", line 343, in wrapper
    setattr(self, key, fn(self))
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5332, in is_pointwise_non_scalar_tensor_num_reads_larger_than_one
    (sum(read.index != 0 for read in self.data.get_reads()) > 1)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5332, in <genexpr>
    (sum(read.index != 0 for read in self.data.get_reads()) > 1)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/dependencies.py", line 74, in index
    raise NotImplementedError("StarDep does not have an index")
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: StarDep does not have an index
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

Reviewed By: jackiexu1992, mengluy0125

Differential Revision: D49639733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110189
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-09-28 14:50:57 +00:00
8e14e76c34 [inductor] Enhance an input type assertion msg (#110176)
Summary: to address https://github.com/pytorch/pytorch/issues/110089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110176
Approved by: https://github.com/angelayi
2023-09-28 13:35:11 +00:00
248a1b7011 Revert "Enable function declaration check in Vulkan and Metal backends (#106762)"
This reverts commit bf8617c37d6b32a1aaf7e5d63e4f558637f8d84d.

Reverted https://github.com/pytorch/pytorch/pull/106762 on behalf of https://github.com/atalman due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/106762#issuecomment-1739184482))
2023-09-28 13:32:10 +00:00
eb082ef604 [inductor] Decompose addmm if it's a dot product on cpu (#110010)
Generated code for dot product is often faster (on CPU) than
dispatching to aten, since it avoids op dispatch overhead and allows fusion
with surrounding ops, which in turn avoids allocations.
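For intuition, the dot-product case can be written as the pointwise mul+sum that fuses with neighbors (a sketch in plain PyTorch, not the actual decomposition):

```python
import torch

def decomposed_addmm(bias, a, b):
    # addmm where the matmul is a dot product: 1xK @ Kx1
    return bias + (a * b.transpose(0, 1)).sum(dim=1, keepdim=True)

a, b, bias = torch.randn(1, 8), torch.randn(8, 1), torch.randn(1, 1)
torch.testing.assert_close(decomposed_addmm(bias, a, b), torch.addmm(bias, a, b))
```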

Differential Revision: [D49595876](https://our.internmc.facebook.com/intern/diff/D49595876/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110010
Approved by: https://github.com/chenyang78, https://github.com/jgong5, https://github.com/mikekgfb
2023-09-28 13:30:14 +00:00
ee8983da70 109605 dynamo scalar ndarray pow gen (#109953)
Fixes #109605

Generated code before:
```
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (8, ), (1, ))
    buf0 = empty_strided((), (), device='cpu', dtype=torch.int64)
    cpp_fused_lift_fresh_0(c_void_p(buf0.data_ptr()))
    # Source Nodes: [wrapped_pow], Original ATen: [aten.lift_fresh, aten.pow]
    buf1 = aten.pow(arg0_1, reinterpret_tensor(buf0, (8, ), (0, ), 0))
    del arg0_1
    del buf0
    buf2 = buf1
    assert_size_stride(buf2, (8, ), (1, ))
    del buf1
    return (buf2, )
```

Generated code now:
```
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (8, ), (1, ))
    buf0 = empty_strided((8, ), (1, ), device='cpu', dtype=torch.int64)
    cpp_fused_pow_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )
```
@lezcano What would be a good way to add a test for this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109953
Approved by: https://github.com/lezcano
2023-09-28 13:11:06 +00:00
5da5e068f3 deprecate constraints in favor of dynamic_shapes (#110143)
Recently we updated the `export` API to take an experimental `dynamic_shapes` argument that was meant to subsume the existing `constraints` argument.

This PR deprecates `constraints` (with a warning on its use, but without actually removing it). Simultaneously it replaces all uses of `constraints` in docs, examples, and tests with corresponding uses of `dynamic_shapes` (preserving behavior). This exercise fortunately revealed some minor bugs in the implementation which have also been fixed in this PR.

Some uses of `constraints` still remain, e.g., when `torch._dynamo.export` is called directly. (Meta-internal uses will be updated in a separate diff.)

Differential Revision: D49676049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110143
Approved by: https://github.com/tugsbayasgalan
2023-09-28 10:26:21 +00:00
419ec3b229 Enable pickling model prepared with QAT qconfig (#109288)
Summary:
Resolving error:

AttributeError: Can't pickle local object '_add_module_to_qconfig_obs_ctr.<locals>.get_factory_kwargs_based_on_module_device'

by moving the nested function out to the main module
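A minimal reproduction of the failure mode, independent of QAT:

```python
import pickle

def outer():
    def inner():
        return 1
    return inner

try:
    pickle.dumps(outer())
except AttributeError as e:
    print(e)  # Can't pickle local object 'outer.<locals>.inner'
```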

Test Plan: Added test to CI

Reviewed By: andrewor14

Differential Revision: D49187352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109288
Approved by: https://github.com/andrewor14
2023-09-28 09:51:19 +00:00
c71a64ccce [aotinductor] Rename if name is prefixed with integer (#110113)
Fixes https://github.com/pytorch/pytorch/issues/109894.
Since in C++ we cannot have variable names that start with a digit, we do some additional handling in Inductor to avoid producing constant tensors with names starting with integers.
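A sketch of the renaming idea (hypothetical helper, not Inductor's actual code):

```python
def legalize_cpp_name(name: str) -> str:
    # C++ identifiers cannot start with a digit, so add a prefix
    return f"constant_{name}" if name[:1].isdigit() else name

assert legalize_cpp_name("0_weight") == "constant_0_weight"
assert legalize_cpp_name("weight") == "weight"
```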

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110113
Approved by: https://github.com/desertfire
2023-09-28 07:26:28 +00:00
e20c35a53b Allow public access for imports (#108914)
Fixes #108776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108914
Approved by: https://github.com/wanchaol
2023-09-28 06:05:59 +00:00
fc1fcc4d17 Enable typechecking for _inductor/fx_passes/group_batch_fusion.py (#110111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110111
Approved by: https://github.com/eellison, https://github.com/Skylion007
ghstack dependencies: #110109
2023-09-28 04:53:09 +00:00
3e7f23e04f [inductor] Actually enable typing for sizevars.py and joint_graph.py (#110109)
The commit message of #107862 says it enabled mypy checking for
sizevars.py, but it seems that it neglected to update .lintrunner.toml.

New type errors appear to have crept in since then, so I've fixed them
accordingly.

A similar mistake happened with #109955 for joint_graph.py, though that
one is more recent and so hasn't had any new type errors to fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110109
Approved by: https://github.com/Skylion007
2023-09-28 04:53:09 +00:00
cyy
a81d083b1c [Reland] Add -Wdeprecated and related fixes (#110019)
This is a reland of PRs https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in TR2_OPTIONAL_ASSERTED_EXPRESSION, since the former syntax was invalid on Apple Clang. Anyway, we could apply this simple fix, hoping that c10::optional will be replaced by std::optional soon.
We also enabled -Wdeprecated on c10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
2023-09-28 03:34:29 +00:00
7f2b51c668 [AOTInductor] ProxyExecutor supports custom op with tuple output (#110140)
Summary:
Extend ProxyExecutor to support custom ops with tuple outputs.

Generated wrapper code for `out3, out4 = torch.ops.fb.fn_with_tuple_output(out2, 1)`

```
    AtenTensorHandle buf5_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf5_handle));
    RAIIAtenTensorHandle buf5(buf5_handle);
    AtenTensorHandle buf6_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf6_handle));
    RAIIAtenTensorHandle buf6(buf6_handle);
    AtenTensorHandle tensor_args_var_3[] = {buf3.get(), buf5.get(), buf6.get()};
    int64_t int_args_var_4[] = {1};
    aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args_var_4, 3, tensor_args_var_3);
```

Test Plan: Test

Differential Revision: D49673994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110140
Approved by: https://github.com/chenyang78
2023-09-28 02:50:39 +00:00
75462fd870 Revert "[1/N] Dynamo skipfiles refactor (#109567)"
This reverts commit f8e0ebec8c6156922026fc2bf6e5a829097b4506.

Reverted https://github.com/pytorch/pytorch/pull/109567 on behalf of https://github.com/huydhn due to Many jobs are failing in trunk after this with FILENAME_ALLOWLIST is not defined error f8e0ebec8c. This looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/109567#issuecomment-1738344950))
2023-09-28 02:22:22 +00:00
68b0db1274 Define the public API for torch.distributed.fsdp (#109922)
Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation
Related: https://github.com/microsoft/pylance-release/issues/2953

This fixes pylance issues for these classes:

```
"FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp"
```

These classes all have public docs:

* [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch)
* [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload)
* [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel)
* [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision)
* [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy)

And it seems like all the newly added classes will have docs once they are released.
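
A minimal sketch of the usual fix for this class of pylance issue, assuming the standard `__all__` convention (the exact list in the PR may differ):

```python
# torch/distributed/fsdp/__init__.py (sketch; the PR's actual list may differ)
__all__ = [
    "BackwardPrefetch",
    "CPUOffload",
    "FullyShardedDataParallel",
    "MixedPrecision",
    "ShardingStrategy",
]
```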

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922
Approved by: https://github.com/wanchaol
2023-09-28 02:15:58 +00:00
1ca68c971c distributed doc fix (#110157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110157
Approved by: https://github.com/awgu
2023-09-28 01:34:02 +00:00
f5a23ca78d Make unbind() overrideable for NT subclass (#109122)
Goal: avoid making unbind composite implicit so we can override it within `__torch_dispatch__()` for the NT subclass.
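
A simplified sketch of what overriding an op inside `__torch_dispatch__()` looks like for a tensor subclass (illustrative only, not the NestedTensor implementation):

```python
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.ops.aten.unbind.int:
            print("intercepted aten.unbind")  # custom handling would go here
        # Unwrap to plain tensors before re-dispatching to avoid recursion.
        def unwrap(a):
            return a.as_subclass(torch.Tensor) if isinstance(a, LoggingTensor) else a
        return func(*[unwrap(a) for a in args], **kwargs)

t = torch.randn(3, 2).as_subclass(LoggingTensor)
parts = t.unbind(0)  # only reaches __torch_dispatch__ once unbind is not composite implicit
```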
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109122
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-09-28 01:26:22 +00:00
f8e0ebec8c [1/N] Dynamo skipfiles refactor (#109567)
This is 1/N of the dynamo skipfiles/allowed_functions refactor; the major changes in this PR include:
* Refactor & define the [skipfiles rules](https://github.com/pytorch/pytorch/pull/109567/files#diff-5aa3ce9db729bf0901ea97a5d3cc51924cc8575d9c516c1c8f572a35de92544aR56) and interface
* For every ```skipfiles.check```, we return both the check result and the skip/inline reason and log them for debugging.
* We found several latent issues/bugs and incorrect implementations in the codebase, but I'm planning to fix them in follow-up PRs to keep the refactor decoupled from bug fixes.
* More details in the inline comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109567
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/anijain2305
2023-09-28 01:21:59 +00:00
22e706f768 [core IR] Add lift_fresh, split.Tensor, and unbind decompositions to core ATen decomp table (#110102)
## Context

Add existing decomps for `lift_fresh`, `split.Tensor`, and `unbind` to the core ATen decomposition table. Do not use them in inductor, since Inductor currently lowers these directly.

One note though is that `lift_fresh`'s decomposition has a note saying it's not correct under autograd. However, my understanding is that these decompositions are registered to the `"post_autograd"` decomposition table, meaning autograd wouldn't be a factor. Would like some confirmation that this premise is correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110102
Approved by: https://github.com/jansel
2023-09-28 01:21:45 +00:00
840bb650f8 [AOTInductor] Update regex rule for symbol (#110184)
Summary:
Update the regex rule to also match the `_` character.

Test Plan:
Included in commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110184
Approved by: https://github.com/desertfire
2023-09-28 01:13:18 +00:00
9399e0b1ff add fp16 support for gemm (#99498)
### Testing

Native matmul vs. mkldnn matmul on SPR (with avx512_fp16 support)

single core:

Input | Naïve impl / ms | oneDNN / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 2010.387 | 64.700 | 31.072
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 4027.116 | 107.780 | 37.364
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 28685868.488 | 90663.008 | 316.401

56 cores:
Input | Naïve impl / ms | oneDNN / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 5.091 | 0.24 | 211.30
M: 128, N: 128, K: 128, trans_a: False, trans_b: True | 5.224 | 0.23 | 220.09
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 10.006 | 0.30 | 330.31
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 29435.372 | 1.770 | 1662.80
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 31464.961 | 1.728 | 18204.76
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 115035.849 | 7.990 | 14396.90
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 122981.023 | 7.725 | 15918.34
Batch: 768, M: 128, N: 64, K: 128 | 2032.523 | 0.705 | 2882.23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99498
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-09-28 01:03:50 +00:00
d796518485 [refs] Fix size check from #108360 (#109083)
PR #108360 uses the same default `last_dim_size` formula from complex-to-real (C2R) transforms for
complex-to-complex (C2C) and real-to-complex (R2C) transforms. However, this is not correct, because for C2R
the input is only half the size of the full tensor, which is not the case for C2C and R2C.

This error is mostly benign since `last_dim_size` was only used for the `>= 1` condition which is
almost always met anyway.

For this PR I now use it as the argument to `_apply_norm` which makes it load-bearing for correctness
and so is thoroughly tested now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109083
Approved by: https://github.com/lezcano
2023-09-27 23:59:29 +00:00
85e408217a [ONNX] Move out onnx bench bash scripts (#103983)
Summary:
- Remove onnx bench related scripts and `_onnx` folder.
- Update `common.py` to include onnx related patches previously under `_onnx` folder.
- Update `merge_rules.json` to include bench files.
- Added quick sanity onnx bench test to onnx CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103983
Approved by: https://github.com/kit1980
2023-09-27 23:54:26 +00:00
60b46d7902 Add ROCm folks as CODEOWNERS for triton.txt (#110108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110108
Approved by: https://github.com/kit1980
2023-09-27 23:31:15 +00:00
40b83d98de fix bugs in export docstrings (#110169)
First error

```
Traceback (most recent call last):
  File "/home/ubuntu/exporty.py", line 8, in <module>
    ep = torch.export.export(MyModule(), torch.randn(5))
  File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/export/__init__.py", line 509, in export
    return export(f, args, kwargs, constraints)
  File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/_export/__init__.py", line 314, in export
    raise UserError(UserErrorType.INVALID_INPUT,
torch._dynamo.exc.UserError: Expecting `args` to be a tuple of example positional inputs, got <class 'torch.Tensor'>
```

Second error

```
(sam) ubuntu@ip-172-31-9-217:~$ python exporty.py
Traceback (most recent call last):
  File "/home/ubuntu/exporty.py", line 13, in <module>
    torch.export.save(ep, 'exported_program.pt2', extra_files=extra_files)
  File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/export/__init__.py", line 566, in save
    save(ep, f, extra_files=extra_files, opset_version=opset_version)
  File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/_export/__init__.py", line 595, in save
    encoded_content = content.encode('utf-8')
AttributeError: 'bytes' object has no attribute 'encode'. Did you mean: 'decode'?
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110169
Approved by: https://github.com/angelayi
2023-09-27 22:56:42 +00:00
bf7307adf8 Support inference_mode decorator (#109274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109274
Approved by: https://github.com/williamwen42
2023-09-27 22:21:42 +00:00
a200bb5e54 [BE] Do not use assert in unit tests (#110179)
One should always use `unittest` assert methods rather than plain `assert`, as the latter can be turned into a no-op if the Python runtime is invoked with optimizations enabled.

Fixes use of `assert` introduced by https://github.com/pytorch/pytorch/pull/105251
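
For illustration, the difference under `python -O`:

```python
import unittest

class ExampleTest(unittest.TestCase):
    def test_sum(self):
        assert sum([1, 2]) == 3           # stripped entirely under `python -O`
        self.assertEqual(sum([1, 2]), 3)  # always runs, with a useful failure message

if __name__ == "__main__":
    unittest.main()
```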

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110179
Approved by: https://github.com/huydhn
2023-09-27 21:53:18 +00:00
2ff9d1fda3 Add size to constant - type dispatche through BaseListVariable.cls_for (#110166)
Differential Revision: D49689895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110166
Approved by: https://github.com/anijain2305
2023-09-27 21:44:16 +00:00
7782108792 [AOTInductor] Fix freeze for AOTInductor (#110055)
Summary:
Add a test for graph freezing in AOTInductor.
Remove unused code path.

Test Plan:
Included in commit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110055
Approved by: https://github.com/angelayi
2023-09-27 21:21:47 +00:00
955298bc40 Use Dr.CI results to classify flaky failures in trymerge (#110054)
After https://github.com/pytorch/test-infra/pull/4589, we can now query Dr.CI to get the list of flaky failures there. This change queries the Dr.CI API endpoint and checks whether a failure is flaky using the `is_flaky` function.

Because the change is relatively large, I'm breaking it down to several smaller PRs in this order:

* [x] This PR queries Dr.CI and adds `is_flaky` check
* [ ] Clean up the flaky rules logic because it has already been implemented on Dr. CI
* [ ] Clean up the broken trunk logic for the same reason

### Testing

* Create a new `drci_mocks.json` file to capture the JSON response from the Dr.CI API endpoint. The API requires `DRCI_BOT_KEY`.
*  `pytest -v test_trymerge.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110054
Approved by: https://github.com/clee2000
2023-09-27 21:21:29 +00:00
213badf632 [dynamo][guards-log] Add debug msg for nn_module_guards only when log is enabled (#110167)
I did not do any benchmarks, but there could be a small overhead from creating the debug_msg, so it is now added only when the guards log is enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110167
Approved by: https://github.com/ezyang
2023-09-27 21:11:44 +00:00
6aae636f69 chore(inductor): Simplify will_fusion_create_cycle and cleanup to node.ancestors (#109976)
recursive_predecessors == ancestors, so rename it.

Improve comments.

Simplify `will_fusion_create_cycle`: make it easier to read and add detailed comments.

A diagram to illustrate the shortcut:
![Inductor Deep Dive](https://github.com/pytorch/pytorch/assets/9093549/7a30e088-8a33-4a9c-a8a7-81199cd086e2)

CC: @ngimel
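
A generic sketch of the ancestor-based check (hypothetical `Node` type; Inductor's real implementation differs in its details):

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    name: str
    ancestors: set = field(default_factory=set)  # transitive predecessors

def will_fusion_create_cycle(a: Node, b: Node, all_nodes: list) -> bool:
    # Fusing a and b closes a cycle iff some outside node sits between them:
    # it depends (transitively) on one of the pair and feeds the other.
    for x in all_nodes:
        if x is a or x is b:
            continue
        if (a in x.ancestors and x in b.ancestors) or (
            b in x.ancestors and x in a.ancestors
        ):
            return True
    return False
```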

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109976
Approved by: https://github.com/jansel
2023-09-27 20:48:53 +00:00
b123fd168a Higher order op for preserving leaf functions through trace, particularly for getting user defined hooks to compiled autograd (#109690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109690
Approved by: https://github.com/ezyang
2023-09-27 20:47:15 +00:00
fe11227764 [dynamo][higher order op] Fix minor bug in error msgs (#110099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110099
Approved by: https://github.com/zou3519
2023-09-27 20:28:17 +00:00
7c1702f099 Keep JSON mocks file in gzip format (#110173)
This is to keep them smaller than the file size limit enforced in fbcode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110173
Approved by: https://github.com/malfet
2023-09-27 20:16:58 +00:00
d4b06dc426 Pass S3 credentials to ios upload workflow (#109222)
This fixes the failed upload to S3 for nightly and release builds. The credentials need to be passed from the caller workflow. We also need to set up the credential in D49291627 before merging this one.

### Testing

Upload successfully https://github.com/pytorch/pytorch/actions/runs/6190836578/job/17125308432?pr=109222#step:13:51

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109222
Approved by: https://github.com/atalman
2023-09-27 20:15:02 +00:00
21ff0cc3ac [xla hash update] update the pinned xla hash (#109999)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109999
Approved by: https://github.com/pytorchbot
2023-09-27 19:52:50 +00:00
ae064ad4c6 Fix XLA update rules (#110177)
Regression introduced during the migration from `bionic` to `focal` by https://github.com/pytorch/pytorch/pull/105260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110177
Approved by: https://github.com/clee2000
2023-09-27 19:25:31 +00:00
5ef5f1ab9a [HigherOrderOp] wrap (and checkpoint) should accept pytree inputs (#109962)
Fixes #109250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109962
Approved by: https://github.com/zou3519
2023-09-27 18:51:09 +00:00
58c33789c6 Fix governance.rst link rendering (#110171)
By adding `__` to the end of the link decorator according to https://sublime-and-sphinx-guide.readthedocs.io/en/latest/references.html#links-to-external-web-pages

Fixes regression introduced by https://github.com/pytorch/pytorch/pull/106863

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110171
Approved by: https://github.com/seemethere, https://github.com/msaroufim, https://github.com/atalman
2023-09-27 18:49:03 +00:00
cyy
36eb1bb548 Use constexpr members in ConstantSymNodeImpl (#110142)
A simple refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110142
Approved by: https://github.com/Skylion007
2023-09-27 18:31:33 +00:00
a8bed7191b [Easy] use BaseListVariable cls_for for all list-y type dispatching (#110159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110159
Approved by: https://github.com/ezyang
2023-09-27 18:21:15 +00:00
ec5bbef8af [AOTInductor] Switch ProxyExecutor to use AtenTensorHandle (#109748)
Summary: Switch ProxyExecutor to use AtenTensorHandle.

Test Plan: E2E Test

Differential Revision: D49471659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109748
Approved by: https://github.com/yifuwang, https://github.com/desertfire, https://github.com/chenyang78
2023-09-27 17:51:30 +00:00
633bd0765e Integrate xpu into torch.Generator and torch.seed (#109866)
Integrate torch.xpu.Generator into torch.Generator
Integrate torch.xpu.seed into torch.seed
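
A hedged usage sketch of the integrated behavior (requires a PyTorch build with XPU support):

```python
import torch

g = torch.Generator(device="xpu")  # routed to the XPU generator after this change
g.manual_seed(42)
torch.seed()  # seeds the default generators, now including XPU devices

```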
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109866
Approved by: https://github.com/ezyang
2023-09-27 17:44:45 +00:00
0511df0ee9 [ROCM] enable skipped test_api cpp tests (#109817)
[ROCM] enable skipped  test_api cpp tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109817
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-09-27 16:52:46 +00:00
063d2572da Revert "Use Dr.CI results to classify flaky failures in trymerge (#110054)"
This reverts commit d0f82cd082fad7243226e0ab68fd995873ea7d76.

Reverted https://github.com/pytorch/pytorch/pull/110054 on behalf of https://github.com/huydhn due to The mock gql_mocks.json file is now bigger than the file size limit on fbcode ([comment](https://github.com/pytorch/pytorch/pull/110054#issuecomment-1737727552))
2023-09-27 16:33:10 +00:00
8791e8697a Print full stack trace on suppressed error (#110106)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110106
Approved by: https://github.com/zou3519, https://github.com/voznesenskym
2023-09-27 16:09:06 +00:00
0721a394b6 [executorch][kernel reg] Allow kernel manual registration (#110086)
Summary:
Exposing a codegen mode that generates a hook for users to register their kernels.

If we pass `--manual-registration` flag to `gen_executorch.py`, we will generate the following files:
1. RegisterKernels.h which declares a `register_all_kernels()` API inside `torch::executor` namespace.
2. RegisterKernelsEverything.cpp which implements `register_all_kernels()` by defining an array of generated kernels.

This way the user can depend on the library declared by the `executorch_generated_lib` macro (with `manual_registration=True`) and include `RegisterKernels.h`. Then they can manually call `register_all_kernels()` instead of relying on the C++ static initialization mechanism, which is not available in some embedded systems.

Test Plan:
Rely on the unit test:

```
buck2 test fbcode//executorch/runtime/kernel/test:test_kernel_manual_registration
```

Reviewed By: cccclai

Differential Revision: D49439673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110086
Approved by: https://github.com/cccclai
2023-09-27 16:04:20 +00:00
1265400ba6 Revert "Reland: implement a function to convert a storage to copy-on-write (#110022)"
This reverts commit dddf07e56a9a798ae27d976d697c3d434cf63a5b.

Reverted https://github.com/pytorch/pytorch/pull/110022 on behalf of https://github.com/atalman due to New tests are failing in internal CI ([comment](https://github.com/pytorch/pytorch/pull/110022#issuecomment-1737584693))
2023-09-27 15:05:41 +00:00
7dbdf3be1e Fix inductor CI (by updating graph break count) (#110160)
There was a vision hash update which led to fewer graph breaks. This
seems expected to me (because the hash update included
https://github.com/pytorch/vision/pull/7944 and nms is used in maskrcnn).

Test Plan:
- wait for ci

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110160
Approved by: https://github.com/ezyang, https://github.com/Chillee
2023-09-27 14:37:36 +00:00
cyy
bf8617c37d Enable function declaration check in Vulkan and Metal backends (#106762)
This PR enables declaration check in Vulkan and Metal backends, so that we can identify unused functions more easily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106762
Approved by: https://github.com/ezyang
2023-09-27 14:29:24 +00:00
774137d506 Add torch.ops.import_module (#110090)
Generally, to extend PyTorch with custom operators, a user will
create a Python module whose import triggers registration of
the custom operators via a torch.ops.load_library call or a call
to one or more torch.library.* APIs.

It is unexpected for Python modules to have side effects, so some
linters and formatters will complain. Use torch.ops.import_module to
import the module without a linter or formatter complaining.

NB: A more robust API would actually check if a custom op was registered
or modified, but this is technically challenging to do. In the future we
can add a warning if a custom op wasn't registered or modified.
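
A hedged usage sketch (the module name is hypothetical):

```python
import torch

# Importing via torch.ops makes the registration side effect explicit,
# so linters don't flag an "unused" import.
my_ops = torch.ops.import_module("my_project.custom_ops")
```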

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110090
Approved by: https://github.com/ezyang
2023-09-27 13:56:47 +00:00
34ded74399 [Dynamo] fix signature in dynamo types (#110081)
The type signature is obsolete. This PR fixes the type signature and leaves comments in the C code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110081
Approved by: https://github.com/jansel
2023-09-27 09:30:04 +00:00
a51b8df261 Add support for event_tracer in codegen layer (#109990)
Summary: Split out from D48975975, this handles the pytorch specific changes to add support for event_tracer in codegen layer.

Test Plan: CI

Reviewed By: dbort

Differential Revision: D49487710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109990
Approved by: https://github.com/Jack-Khuu
2023-09-27 09:09:03 +00:00
10c646295d When doing typed typecheck, also check signature with symint removed (#109727)
See the test case for what we didn't catch (a SymInt vs. const SymInt& mismatch).

It's necessary to test for both, because we will fall back to the
non-SymInt signature if there is no SymInt unboxed kernel available.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109727
Approved by: https://github.com/zou3519
2023-09-27 07:29:46 +00:00
1b51d29b66 [quant][pt2e] Enable constant folding for quantize ops (#109343)
Summary:
This PR adds constant folding for quantize ops so that instead of storing the fp32 weight in the quantized model, we store the int8/int16 etc. weight.
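
A sketch of the effect, using the eager quantize op for illustration (pt2e uses the decomposed quantize ops):

```python
import torch

w_fp32 = torch.randn(4, 4)
scale, zero_point = 0.05, 0
# Before folding, the exported graph holds w_fp32 plus a quantize node.
# Constant folding evaluates the quantize once and stores only the result:
w_q = torch.quantize_per_tensor(w_fp32, scale, zero_point, torch.qint8)
print(w_q.int_repr().dtype)  # torch.int8
```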

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_fold_quantize

also will verify in executorch later

Differential Revision: [D49399210](https://our.internmc.facebook.com/intern/diff/D49399210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109343
Approved by: https://github.com/kimishpatel, https://github.com/jgong5
2023-09-27 06:04:45 +00:00
6138750ab1 [vision hash update] update the pinned vision hash (#110127)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110127
Approved by: https://github.com/pytorchbot
2023-09-27 04:25:39 +00:00
ddbf1aab64 [export] Add dynamic_shapes to _export.aot_compile (#110101)
Summary: Following the new dynamic_shapes API (introduced in https://github.com/pytorch/pytorch/pull/108448), we will also add a dynamic_shapes API to _export.aot_compile
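
A hedged sketch of the resulting API surface (the exact signature and `dynamic_shapes` spelling may differ across versions):

```python
import torch
from torch._export import aot_compile
from torch.export import Dim

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

batch = Dim("batch")
# Marks dim 0 of input `x` as dynamic; returns a path to the compiled .so
so_path = aot_compile(M(), (torch.randn(3, 4),), dynamic_shapes={"x": {0: batch}})
```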

Test Plan: CI

Differential Revision: D49653815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110101
Approved by: https://github.com/gmagogsfm
2023-09-27 04:10:22 +00:00
f7c9ef88f5 Add masked_select abstract impl (#110103)
Fixes https://github.com/pytorch/pytorch/issues/109871

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110103
Approved by: https://github.com/bdhirsh
2023-09-27 04:07:58 +00:00
33d8f5f73e fix typo (#109965)
fix typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109965
Approved by: https://github.com/zou3519, https://github.com/kit1980
2023-09-27 03:32:04 +00:00
869226bf94 Avoid passing generator to parametrize (#110104)
Fixes

```
ValueError: <function TestMeta.test_layer_norm_backward at 0x7f555f56e440>: An empty arg_values was passed to @parametrize. Note that this may result from reuse of a generator.
```
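
The pitfall in miniature:

```python
vals = (x for x in range(3))
print(list(vals))  # [0, 1, 2] -- first consumption drains the generator
print(list(vals))  # []        -- reuse yields nothing, hence "empty arg_values"
# Fix: materialize once (vals = list(range(3))) or rebuild the generator per use.
```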

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110104
Approved by: https://github.com/malfet, https://github.com/jbschlosser, https://github.com/voznesenskym
2023-09-27 02:52:48 +00:00
dec140f1ea [core IR] Add a core decomposition for aten.all (#110093)
## Context

Change the ref implementation of `aten.all` to only use other `torch` operators such that we can use it for the core ATen decomposition table. This will replace the decomposition for `aten.all` that was used specifically by Inductor.
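
A sketch of such a ref, written only in terms of other torch ops (the registered ref may differ in dtype/dim handling):

```python
import torch

def all_ref(x: torch.Tensor) -> torch.Tensor:
    return torch.logical_not(torch.any(torch.logical_not(x)))

assert bool(all_ref(torch.tensor([True, True])))
assert not bool(all_ref(torch.tensor([True, False])))
```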

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110093
Approved by: https://github.com/manuelcandales, https://github.com/peterbell10, https://github.com/lezcano
2023-09-27 01:31:41 +00:00
51a8c166a6 Add test for ShapeEnv recording fallback. (#109944)
This PR adds a test for the previous PR in this stack: #109904. In summary, it calls
functions decorated with `@record_shapeenv_event` that don't have an explicit `ShapeEnv`
parameter, passing arguments that don't hold a `ShapeEnv` instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109944
Approved by: https://github.com/ezyang
2023-09-27 00:50:14 +00:00
9928c10e71 [core IR] Add glu as a core decomposition (#110043)
## Context

Add the decomposition for `aten.glu` as a decomposition in the core ATen decomposition table. Don't use it in the Inductor decomposition table since Inductor has a lowering for it.
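
For reference, the glu decomposition is essentially a chunk plus a sigmoid gate (a sketch, ignoring validation of even sizes):

```python
import torch

def glu_ref(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    a, b = torch.chunk(x, 2, dim=dim)
    return a * torch.sigmoid(b)

x = torch.randn(4, 6)
torch.testing.assert_close(glu_ref(x, dim=1), torch.nn.functional.glu(x, dim=1))
```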

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110043
Approved by: https://github.com/peterbell10, https://github.com/lezcano
ghstack dependencies: #110046
2023-09-27 00:23:05 +00:00
4d0ae7c9da [inductor] support _scaled_dot_product_flash_attention fallback (#110085)
Summary:
This PR supports _scaled_dot_product_flash_attention fallback kernel.
Note that in the abi_compatible mode, we retrieve outputs by passing
output argument pointers rather than relying on std::get.

It also fixes an issue related to dynamic shapes, where we wrongfully
query undefined dynamic symbols.

Test Plan: ci

Reviewed By: frank-wei

Differential Revision: D49620191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110085
Approved by: https://github.com/desertfire
2023-09-27 00:09:56 +00:00
19ca883f8b [pytorch][jit] allow passing in obj loader in unpickle api (#109730)
Summary: We are trying to use wire messages to pass Python objects like KJT. In order for JIT to be able to unpickle them, we need to provide a type resolver as well as an object loader. This diff modifies the interface so we can do that.

Test Plan:
Rely on current CI to make sure existing usage doesn't break.

In the next diff, test e2e

Differential Revision: D49438569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109730
Approved by: https://github.com/davidberard98
2023-09-26 23:50:20 +00:00
3262c5358f Use _check_is_size for validate_dim_length (#109849)
_check_is_size has some extra juice for unbacked SymInts, use it.
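
A small illustration of where it helps:

```python
import torch

def make_row(n: int):
    # Validates n >= 0 eagerly, and under tracing also marks an unbacked
    # SymInt as a size so size-like reasoning (e.g. guards) can proceed.
    torch._check_is_size(n)
    return torch.zeros(n)
```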

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109849
Approved by: https://github.com/yanboliang
2023-09-26 23:33:31 +00:00
27443eadeb [dtensor][7/n] remove reduction rule (#109144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109144
Approved by: https://github.com/fduwjj
ghstack dependencies: #108263, #108264
2023-09-26 22:24:50 +00:00
2dd9a79d22 [dtensor][6/n] refactor reduction to use op strategy (#108264)
This PR refactors the reduction op to use strategy based propagation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108264
Approved by: https://github.com/fduwjj
ghstack dependencies: #108263
2023-09-26 22:24:50 +00:00
986d255db2 [dtensor][5/n] switch random ops to op strategy (#108263)
This PR switches the random ops to use op strategy instead of rule-based
propagation. This is the first in a series of PRs to refactor ops after we
refactored the op dispatch logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108263
Approved by: https://github.com/fduwjj
2023-09-26 22:24:42 +00:00
d0f82cd082 Use Dr.CI results to classify flaky failures in trymerge (#110054)
After https://github.com/pytorch/test-infra/pull/4589, we can now query Dr.CI to get the list of flaky failures there. This change queries the Dr.CI API endpoint and checks whether a failure is flaky using the `is_flaky` function.

Because the change is relatively large, I'm breaking it down to several smaller PRs in this order:

* [x] This PR queries Dr.CI and adds `is_flaky` check
* [ ] Clean up the flaky rules logic because it has already been implemented on Dr. CI
* [ ] Clean up the broken trunk logic for the same reason

### Testing

* Create a new `drci_mocks.json` file to capture the JSON response from the Dr.CI API endpoint. The API requires `DRCI_BOT_KEY`.
*  `pytest -v test_trymerge.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110054
Approved by: https://github.com/clee2000
2023-09-26 21:24:21 +00:00
bb9779ecd2 Revert D49640259: Revert D49615962: [optests] Test names in failure dicts should be prefixed with test class (#110094)
Summary: Revert D49640259: Revert D49615962: [optests] Test names in failure dicts should

Test Plan: revert-hammer

Differential Revision: D49645397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110094
Approved by: https://github.com/izaitsevfb
2023-09-26 21:16:36 +00:00
ac3190c52c [cpu] vectorize atanh (#107786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107786
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/ezyang
2023-09-26 20:20:46 +00:00
194d9aa0f2 Revert "[Dynamo] Match closures by code ID (#109427)"
This reverts commit 3de08575031bc0ea770b5935dec13046d8ba7992.

Reverted https://github.com/pytorch/pytorch/pull/109427 on behalf of https://github.com/voznesenskym due to Fails test `PYTORCH_TEST_WITH_DYNAMO=1 python test_ops.py -k test_out_warning__refs_cat_cpu ([comment](https://github.com/pytorch/pytorch/pull/109427#issuecomment-1736101561))
2023-09-26 18:54:36 +00:00
a7409695bb [export] Verifier for exported program (#109519)
Summary:
X-link: https://github.com/pytorch/executorch/pull/292

Added a verifier for the graph signature in an exported program

Test Plan: CI

Differential Revision: D48926643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109519
Approved by: https://github.com/zhxchen17
2023-09-26 18:47:43 +00:00
0a60219fe3 [foreach] Fix 0-size handling for real for real (#109402)
@crcrpar's last attempt to fix the 0-size problem unfortunately did not pass all cases. See my comment in https://github.com/pytorch/pytorch/issues/100701. When we have a tail tensor of size 0, the old code would mess with the chunk logic to check the previous tensor's length. This is flawed because:
1. if the previous tensor was also 0-sized (so a tensor list of [tensor, tensor, tensor, ..., 0-sized tensor, 0-sized tensor]), chunks would still be 0 and the nested for loop would be skipped.
2. the nested for loop introduces side effects on tensorListMeta that _shouldn't_ be there! This can mess up the computation in unexpected ways that I haven't fully reasoned through.

We noticed through an internal report that the problem had not been fixed. This PR solves the issue by:
- removing the finagling of chunks when the tail tensor is 0-sized
- adding a surefire way for the kernel to be launched in the case where the last tensor is 0-sized AND there's content in the metadata, signifying there is stuff to compute still.

## test plan

As I went through the code, I also added some comments explaining what's up and modified our tensor inputs to ensure that this case is tested in the test_parity test in test_foreach.py. Yes, I do realize there is quite a bit of duplication and that this file could be due for a refactor. That said, the primary goal of this PR is to fix the pretty egregious bug and refactoring can be a followup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109402
Approved by: https://github.com/albanD
2023-09-26 17:38:20 +00:00
317e39a8ad [C10d] Cleanup collective sequence number. (#109136)
Sequence numbers must be associated with a Work object
if we want to use it as a way to report collective progress.

The API surface change is introducing Work::getSequenceNumber, which
should eventually be exposed to python.

The bulk of this change is in gloo: making the sequence number always in use
and weaving it through the dozens of subclasses of Work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109136
Approved by: https://github.com/fduwjj
2023-09-26 17:17:04 +00:00
818f2297e6 Ensure fill_ works when value is a view of self (#109835)
# Summary
PR #109533 introduced a BC-breaking change when `self` is a view of `value`. By using the `copy_()` op inside `fill_`, we were hitting `assert_no_partial_overlap` in TensorIterator.

Ideally we would avoid this check when `value.numel() == 1`, but rather than monkeying around with TensorIterator I just clone the input instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109835
Approved by: https://github.com/mikaylagawarecki
2023-09-26 17:12:48 +00:00
3705e65254 Add pin_memory to torch.Tensor type annotation args (#109797)
Test Plan: Sandcastle

Differential Revision: D49504528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109797
Approved by: https://github.com/jianyuh
2023-09-26 17:12:37 +00:00
1277d0e834 [BE] Add sharding data by default to metrics (#110035)
Extend metric library to allow setting global metrics on a process level which will always be emitted.

Current use case for them is to include shard information every time a metric is emitted by run_test.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035
Approved by: https://github.com/clee2000
2023-09-26 17:06:49 +00:00
d91492a7a4 [MPS] Fix sort with empty tensor. (#109584)
Fixes #107284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109584
Approved by: https://github.com/kulinseth, https://github.com/albanD
ghstack dependencies: #109557, #109574
2023-09-26 16:30:38 +00:00
993530ee4f [aotinductor] Relax the CUDAGuard device index check (#110030)
Summary: Although AOTInductor only supports running on a single CUDA device, it does work when there is a mix of CPU and CUDA ops. So instead of asserting when a CUDA device index appears for the first time, we check that there is only one CUDA device index. This solves https://github.com/pytorch/pytorch/issues/109655

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110030
Approved by: https://github.com/jansel
2023-09-26 16:23:23 +00:00
47adcd412f Increase timeout for slow tests (#109206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109206
Approved by: https://github.com/huydhn
2023-09-26 16:18:38 +00:00
0dcea70bfd fix sfdp pattern 13 accuracy issue (#110001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110001
Approved by: https://github.com/eellison
2023-09-26 15:23:45 +00:00
2393864070 Revert "[optests] Test names in failure dicts should be prefixed with test class (#110045)"
This reverts commit 76fcec74c413af22186f0782f02aca49ab61dc20.

Reverted https://github.com/pytorch/pytorch/pull/110045 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/110045#issuecomment-1735711094))
2023-09-26 14:56:08 +00:00
a5de10d7a5 Remove linux.t4g.2xlarge Usage (#110064)
Switched from linux.t4g.2xlarge to linux.arm64.2xlarge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110064
Approved by: https://github.com/atalman, https://github.com/malfet
2023-09-26 14:30:35 +00:00
ea20db8aa0 [optests] Excise unused operator_compile_check (#110011)
The recommendation is to just use `opcheck`, which has superseded all
uses of `operator_compile_check`.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110011
Approved by: https://github.com/ezyang
ghstack dependencies: #109912
2023-09-26 13:24:21 +00:00
812bf847b7 Revert "Add test for ShapeEnv recording fallback. (#109944)"
This reverts commit a4dec8d306d96637aa4dc1ee9deba289b128c148.

Reverted https://github.com/pytorch/pytorch/pull/109944 on behalf of https://github.com/atalman due to New test failing internally ([comment](https://github.com/pytorch/pytorch/pull/109944#issuecomment-1735512734))
2023-09-26 13:11:22 +00:00
e05eb69c93 Don't link to libcpuinfo on s390x (#109875)
Don't even build it.
It does not support s390x.

This is a follow up for https://github.com/pytorch/pytorch/pull/109496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109875
Approved by: https://github.com/kit1980
2023-09-26 12:43:35 +00:00
92d86cd1ad [inductor] Fix triton compiler error in multilayer any (#109325)
Fixes #109196

When we have a split reduction and the tensor is not an even multiple of the split size,
we use `ops.masked` to pad to an even multiple. In the case here we generated:
```python
tmp5 = tl.where(mask, tmp4, 0)
```

which implicitly promotes our boolean value to `int32`. The fix is to give the default
value the same dtype as `result`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109325
Approved by: https://github.com/lezcano
2023-09-26 12:29:29 +00:00
1b90f07f5a Revert "Reland "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)"
This reverts commit d0fe8fa5db6dd06adfe1246a72b6d3a5215ff86e.

Reverted https://github.com/pytorch/pytorch/pull/109906 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/109906#issuecomment-1735416852))
2023-09-26 12:10:25 +00:00
132a138a01 MAINT: pytorchify torch._numpy tests: core/ and fft/ (#109815)
1. Inherit from TestCase
2. Use pytorch parametrization
3. Use unittest.expectedFailure to mark xfails, also unittest skips

All this to make pytest-less invocation work:

$ python test/torch_np/test_basic.py

cross-ref #109593, #109718, #109775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109815
Approved by: https://github.com/lezcano
2023-09-26 11:04:24 +00:00
8140494afd [3/N][2D] Enable training with new 2D flow (#110034)
Replacing https://github.com/pytorch/pytorch/pull/109553 as it got reverted.

This PR enables training with the new 2D flow and adds an associated test. In addition, this PR moves the fsdp-specific parts of tensor/parallel/_data_parallel_utils.py back to tensor/parallel/fsdp.py to avoid a circular dependency for ddp.py and test/distributed/tensor/parallel/test_ddp_2d_parallel.py.

state_dict related changes will come in later PRs.

cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110034
Approved by: https://github.com/fduwjj
2023-09-26 09:14:15 +00:00
0673aa3d28 [dynamo][guards-log] Print nn module guard saved dict versions for debugging (#110028)
This is the output for nn module guards

~~~
[DEBUG] GUARDS:
[DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # _dynamo/variables/builder.py:1356 in wrap_fx_proxy_cls
[DEBUG] ___check_obj_id(L['self'], 139820807110912)                   # for mod in self.mods:  # examples/graph_break.py:35 in forward
[DEBUG] __nn_module_guard_0(L['self']) # versions(mod=9998, _parameters=1194395, _buffers=1194397, _modules=1194423, _forward_hooks=1194405, _forward_pre_hooks=1194411, _backward_hooks=1194402, _backward_pre_hooks=1194400)  # for mod in self.mods:  # examples/graph_break.py:35 in forward
[DEBUG] ___check_obj_id(L['self'].mods[0], 139817945727568)           # for mod in self.mods:  # examples/graph_break.py:35 in forward
[DEBUG] __nn_module_guard_1(L['self'].mods[0]) # versions(mod=10001, _parameters=1194428, _buffers=1194430, _modules=1194522, _forward_hooks=1194438, _forward_pre_hooks=1194444, _backward_hooks=1194435, _backward_pre_hooks=1194433)  # for mod in self.mods:  # examples/graph_break.py:35 in forward
[DEBUG] ___check_obj_id(L['self'].mods[1], 139817945560640)           # for mod in self.mods:  # examples/graph_break.py:35 in forward
[DEBUG] __nn_module_guard_2(L['self'].mods[1]) # versions(mod=10001, _parameters=1194660, _buffers=1194662, _modules=1194753, _forward_hooks=1194670, _forward_pre_hooks=1194676, _backward_hooks=1194667, _backward_pre_hooks=1194665)  # for mod in self.mods:  # examples/graph_break.py:35 in forward
[DEBUG] ___check_obj_id(L['self'].mods[0].linear, 139817945727856)    # return self.linear(a)  # examples/graph_break.py:24 in helper
[DEBUG] __nn_module_guard_3(L['self'].mods[0].linear) # versions(mod=10004, _parameters=1470004, _buffers=1194467, _modules=1194493, _forward_hooks=1194475, _forward_pre_hooks=1194481, _backward_hooks=1194472, _backward_pre_hooks=1194470)  # return self.linear(a)  # examples/graph_break.py:24 in helper
[DEBUG] ___check_obj_id(L['self'].mods[1].linear, 139817945561120)    # return self.linear(a)  # examples/graph_break.py:24 in helper
[DEBUG] __nn_module_guard_4(L['self'].mods[1].linear) # versions(mod=10004, _parameters=1470008, _buffers=1194699, _modules=1194725, _forward_hooks=1194707, _forward_pre_hooks=1194713, _backward_hooks=1194704, _backward_pre_hooks=1194702)  # return self.linear(a)  # examples/graph_break.py:24 in helper
[DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:373 in init_ambient_guards
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110028
Approved by: https://github.com/ezyang
ghstack dependencies: #110023, #110039
2023-09-26 08:53:07 +00:00
5df8aca994 [core IR] Add a core decomposition for floor_divide (#110046)
## Context

Introduce a core decomposition of `aten.floor_divide` into other `aten` ops, and add it to the core ATen decomposition table.

This replaces the decomposition of `floor_divide` that was used by Inductor. I noticed there was a note on that decomposition

```
# TorchInductor-only decomposition. It should not be taken to core.
# See https://github.com/pytorch/torchdynamo/pull/1120
```

but couldn't discern the reason why this is the case. cc: @lezcano
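
A sketch of such a decomposition in terms of existing aten ops (the registered one may handle dtypes and edge cases differently):

```python
import torch

aten = torch.ops.aten

def floor_divide_decomp(a, b):
    # floor_divide expressed as division with floor rounding
    return aten.div.Tensor_mode(a, b, rounding_mode="floor")

x, y = torch.tensor([7.0, -7.0]), torch.tensor([2.0, 2.0])
torch.testing.assert_close(floor_divide_decomp(x, y), torch.floor_divide(x, y))
```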

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110046
Approved by: https://github.com/peterbell10
2023-09-26 08:39:21 +00:00
26e8cc0465 Add test for ShapeEnv state when not recording. (#109945)
This PR adds a test for checking `ShapeEnv` state when it's built with
`should_record_events=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109945
Approved by: https://github.com/ezyang
ghstack dependencies: #109904, #109944
2023-09-26 07:20:46 +00:00
2ac7e52d34 [dynamo][nn_module_guards] Config flag to disable nn_module_guards (#110039)
This flag is requested by @Chillee, who is seeing recompilations with simple gpt experiments. We are observing recompilations because the `_parameters` ordered dict keeps changing from run to run, and it's unclear why that is happening.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110039
Approved by: https://github.com/Chillee
ghstack dependencies: #110023
2023-09-26 06:35:23 +00:00
dd819138da [pytorch vulkan] add tensor vulkan check for at::cat (#109936)
Summary:
Saw this issue when running PyTorch Vulkan on an LSTM model:

https://www.internalfb.com/phabricator/paste/view/P834993118

Found that we don't always do the Vulkan transfer on `at::cat`.

Test Plan:
(Not running the LSTM model yet, since there are other crashes.)

```
[yipjustin@47884.od /data/sandcastle/boxes/fbsource (3fd2308f8|remote/fbcode/warm_fbcode_od_stable...)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*cat*"
Building: finished in 0.1 sec (100%) 330/330 jobs, 0/330 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *cat*
[==========] Running 43 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 43 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.replication_pad2d
[       OK ] VulkanAPITest.replication_pad2d (102 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions
[       OK ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions (67 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_samebatch_success
[       OK ] VulkanAPITest.cat_4d_dim0_samebatch_success (111 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_diffbatch_success
[       OK ] VulkanAPITest.cat_4d_dim0_diffbatch_success (76 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_singledepth_success
[       OK ] VulkanAPITest.cat_4d_dim0_singledepth_success (40 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_singletensor_success
[       OK ] VulkanAPITest.cat_4d_dim0_singletensor_success (7 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_twotensors_success
[       OK ] VulkanAPITest.cat_4d_dim0_twotensors_success (30 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_negdim_success
[       OK ] VulkanAPITest.cat_4d_dim0_negdim_success (78 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_negdim_success
[       OK ] VulkanAPITest.cat_4d_dim1_negdim_success (130 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim2_negdim_success
[       OK ] VulkanAPITest.cat_4d_dim2_negdim_success (75 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim3_negdim_success
[       OK ] VulkanAPITest.cat_4d_dim3_negdim_success (68 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_texture2d_success
[       OK ] VulkanAPITest.cat_4d_dim1_texture2d_success (2 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_singledepth_success
[       OK ] VulkanAPITest.cat_4d_dim1_singledepth_success (65 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_singletensor_success
[       OK ] VulkanAPITest.cat_4d_dim1_singletensor_success (8 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success
[       OK ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success (9 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success
[       OK ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success (18 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success
[       OK ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success (60 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim2_sameheight_success
[       OK ] VulkanAPITest.cat_4d_dim2_sameheight_success (80 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim2_diffheight_success
[       OK ] VulkanAPITest.cat_4d_dim2_diffheight_success (69 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim2_singledepth_success
[       OK ] VulkanAPITest.cat_4d_dim2_singledepth_success (12 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions
[       OK ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions (63 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions
[       OK ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions (86 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim3_samewidth_success
[       OK ] VulkanAPITest.cat_4d_dim3_samewidth_success (117 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim3_diffwidth_success
[       OK ] VulkanAPITest.cat_4d_dim3_diffwidth_success (72 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim0_mult4ch_success
[       OK ] VulkanAPITest.cat_3d_dim0_mult4ch_success (12 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim0_diff_channel_success
[       OK ] VulkanAPITest.cat_3d_dim0_diff_channel_success (28 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim0_same_channel_success
[       OK ] VulkanAPITest.cat_3d_dim0_same_channel_success (15 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim1_diffheight_success
[       OK ] VulkanAPITest.cat_3d_dim1_diffheight_success (21 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim1_same_height_success
[       OK ] VulkanAPITest.cat_3d_dim1_same_height_success (10 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim2_diffwidth_success
[       OK ] VulkanAPITest.cat_3d_dim2_diffwidth_success (21 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim2_samewidth_success
[       OK ] VulkanAPITest.cat_3d_dim2_samewidth_success (11 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim0_negdim_success
[       OK ] VulkanAPITest.cat_3d_dim0_negdim_success (25 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim1_negdim_success
[       OK ] VulkanAPITest.cat_3d_dim1_negdim_success (23 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim2_negdim_success
[       OK ] VulkanAPITest.cat_3d_dim2_negdim_success (10 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim0_same_height_success
[       OK ] VulkanAPITest.cat_2d_dim0_same_height_success (3 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim0_diff_height_success
[       OK ] VulkanAPITest.cat_2d_dim0_diff_height_success (2 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim1_same_width_success
[       OK ] VulkanAPITest.cat_2d_dim1_same_width_success (3 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim1_diff_width_success
[       OK ] VulkanAPITest.cat_2d_dim1_diff_width_success (4 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim0_negdim_success
[       OK ] VulkanAPITest.cat_2d_dim0_negdim_success (3 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim1_negdim_success
[       OK ] VulkanAPITest.cat_2d_dim1_negdim_success (3 ms)
[ RUN      ] VulkanAPITest.cat_1d_dim0_same_width_success
[       OK ] VulkanAPITest.cat_1d_dim0_same_width_success (52 ms)
[ RUN      ] VulkanAPITest.cat_1d_dim0_diff_width_success
[       OK ] VulkanAPITest.cat_1d_dim0_diff_width_success (0 ms)
[ RUN      ] VulkanAPITest.cat_1d_dim0_negdim_success
[       OK ] VulkanAPITest.cat_1d_dim0_negdim_success (0 ms)
[----------] 43 tests from VulkanAPITest (1717 ms total)

[----------] Global test environment tear-down
[==========] 43 tests from 1 test suite ran. (1717 ms total)
[  PASSED  ] 43 tests.

  YOU HAVE 4 DISABLED TESTS
```

Differential Revision: D49566743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109936
Approved by: https://github.com/SS-JIA
2023-09-26 06:08:17 +00:00
5dcee01c2b Monitor baseline for TD prioritizations (#110031)
For tests that TD prioritizes, we should track what their ordering _would have been_ if none of the TD heuristics had applied to them.

This is useful for two reasons:
1. It lets us better understand how TD may have contributed to that test running sooner
2. it's possible that heuristics actually mark a test as less important than the default sorting would have claimed (the default sorts tests in a fixed order). This will let us track how often that happens
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110031
Approved by: https://github.com/clee2000
2023-09-26 04:27:16 +00:00
ac1e85161e [MPS] Fix nll_loss with default ignore_index (#109574)
`-100` should be a valid `ignore_index` as indicated in the linked issue. This PR also cleans up some unnecessary MPSTensor copies.

Fixes #108148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109574
Approved by: https://github.com/kulinseth
ghstack dependencies: #109557
2023-09-26 04:13:09 +00:00
0087118997 [MPS] Fix mps to cpu copy with storage offset (#109557)
Fix #108978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109557
Approved by: https://github.com/DenisVieriu97
2023-09-26 04:13:08 +00:00
129f535778 [VMAP] Add linspace and logspace batch rules (#105451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105451
Approved by: https://github.com/zou3519
ghstack dependencies: #107958, #104889
2023-09-26 04:08:24 +00:00
5589b81173 Remove redundant change for gloo (#106750)
HIP-deprecated symbols were removed by d74270ece2 and fe2ad9c328, which are already included in PyTorch's gloo.

gloo in pytorch master: 597accfd79

There is no need to fix it in pytorch now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106750
Approved by: https://github.com/jithunnair-amd, https://github.com/kit1980
2023-09-26 03:46:14 +00:00
dddf07e56a Reland: implement a function to convert a storage to copy-on-write (#110022)
Relands #100819

In addition, the `impl_cow_context` library is combined into the base c10 core library, and COW unit tests are combined into just one binary.

Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110022
Approved by: https://github.com/ezyang
2023-09-26 03:33:18 +00:00
76fcec74c4 [optests] Test names in failure dicts should be prefixed with test class (#110045)
We want to use the same failures dict for multiple TestCases. This happens
commonly, e.g. in fbgemm. To move toward that, we need to prefix each test name
with its test class to avoid ambiguity.

Differential Revision: [D49615962](https://our.internmc.facebook.com/intern/diff/D49615962/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110045
Approved by: https://github.com/williamwen42
2023-09-26 03:21:12 +00:00
41bb5c27a2 Enable typechecking for _inductor/fx_passes/joint_graph.py (#109955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109955
Approved by: https://github.com/Skylion007
ghstack dependencies: #109951, #109952, #109954
2023-09-26 02:49:43 +00:00
86762f33d1 Enable typechecking for _inductor/fx_passes/pad_mm.py (#109954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109954
Approved by: https://github.com/Skylion007
ghstack dependencies: #109951, #109952
2023-09-26 02:49:43 +00:00
55f8553078 Enable typechecking for _inductor/fx_passes/pre_grad.py (#109952)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109952
Approved by: https://github.com/Skylion007
ghstack dependencies: #109951
2023-09-26 02:49:42 +00:00
89fc66fb36 Enable typechecking for _inductor/fx_passes/split_cat.py (#109951)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109951
Approved by: https://github.com/Skylion007
2023-09-26 02:49:40 +00:00
ac60638c6c [ndk] Clean up LLVM and libc++ 12 and 13 (#107326)
Reviewed By: simpleton

Differential Revision: D48410595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107326
Approved by: https://github.com/yozhu
2023-09-26 02:05:27 +00:00
f8fcc54f70 Add torch.library.impl_abstract (#109912)
Changelog:
- torch.library.impl_abstract optionally accepts a torch.library.Library
  object. If passed in, then the lifetime of the registration is tied to
  the Library object.
- we've also changed torch.library.impl_abstract to work on all
  operators, including overloads.
- we refactored the `torch._custom_ops.*` and `torch._custom_op.*`
  impl_abstract APIs and put them under torch._library. This is the
  final resting place for them. I will follow up by deleting
  all the `torch._custom_ops.*` stuff later.
- There is a new "SimpleOperatorRegistry" where we actually collect the
  abstract_impl. We will expand this to also hold the other
  torch._custom_ops.* APIs when we move those to torch.library

NB: Previously we had designed
`impl_abstract` assuming a very high-level Python-only custom op API.
We've revisited that since; now, impl_abstract works for all custom ops,
no matter python or C++, no matter the schema. The new refactored design
reflects this better.
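
A hedged usage sketch for a custom op (op name and schema here are hypothetical):

```python
import torch

lib = torch.library.Library("mylib", "DEF")
lib.define("twice(Tensor x) -> Tensor")

@torch.library.impl_abstract("mylib::twice", lib=lib)
def twice_abstract(x):
    # Runs under fake tensors: only produce an output with the right metadata.
    return torch.empty_like(x)
```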

Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109912
Approved by: https://github.com/ezyang
2023-09-26 01:59:50 +00:00
b481349d3c [dynamo][guards-log] Do not print duplicate guard entries (#110023)
Cleans up logs for nn module guards. They always get duplicated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110023
Approved by: https://github.com/ezyang
2023-09-26 01:59:25 +00:00
56659844f9 [profiler] Show shapes for lists of tensors in chrome traces #109263 (#109751)
Summary:
https://github.com/pytorch/pytorch/issues/109263
Show the shapes of tensor lists when the list length is < 30.

Test Plan:
{F1097707985}
and unit tests

Reviewed By: davidberard98

Differential Revision: D49351902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109751
Approved by: https://github.com/davidberard98
2023-09-26 01:03:54 +00:00
4bf1cd6961 [aotinductor] Rename aot_runtime to aoti_runtime (#110007)
Summary: Make the naming more explicit

Differential Revision: D49593528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110007
Approved by: https://github.com/houseroad
2023-09-26 00:46:54 +00:00
b07bebd4bd Add default arguments to sym_constrain_range_for_size (#109858)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109858
Approved by: https://github.com/williamwen42
2023-09-26 00:35:33 +00:00
cyy
bcedbac96a Re-enable more Windows tests (#109847)
Follows the work of #108930.

The commented-out test_custom_classes.py entry was removed since the file doesn't exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109847
Approved by: https://github.com/kit1980
2023-09-26 00:29:31 +00:00
a81cb0de16 [Dynamo] Support python class member_descriptor (#109956)
Fixes Meta internal cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109956
Approved by: https://github.com/jansel
2023-09-26 00:03:41 +00:00
5f6216b12c Add torch.fx.experimental.recording to uninteresting_files() (#109887)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109887
Approved by: https://github.com/Chillee
2023-09-25 23:22:29 +00:00
7af30ea54c [AOTInductor] Bug fix for redefining symbol name (#110041)
Summary:
Bug fix for redefining symbol name.

Test Plan:
python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export-aot-inductor --cold-start-latency --only OPTForCausalLM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110041
Approved by: https://github.com/desertfire
2023-09-25 23:03:06 +00:00
6275f91654 Improved DDP checkpoint documentation (#106985)
Amended the documentation for the specified case.

Fixes #84589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106985
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-09-25 22:54:24 +00:00
7ed06e8317 [inductor] enable mypy checking in torch/_inductor/codegen/cpp.py (#109729)
Summary: Add enough typehints / ignores to enable mypy checking in torch/_inductor/codegen/cpp.py

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109729
Approved by: https://github.com/Skylion007
2023-09-25 22:53:05 +00:00
f87863335c [BE]s/DEFINE_ENUM/DEFINE_ST_ENUM_VAL_/ (#109917)
To avoid potential collisions with other libraries that may define such an enum globally (which is bad practice, but happens sometimes)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109917
Approved by: https://github.com/Skylion007
2023-09-25 22:19:09 +00:00
57cdad2396 [aotinductor] Update benchmark to include compilation time (#109998)
Fixes [comment](https://github.com/pytorch/pytorch/pull/109820#pullrequestreview-1638629777)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109998
Approved by: https://github.com/desertfire
2023-09-25 21:30:22 +00:00
ab70183c53 [RFC] Allow "spawn" start method for torchinductor workers. (#108850)
Context: https://github.com/pytorch/pytorch/issues/108586

This PR adds a config to torchinductor such that users can specify the multiprocessing context for TorchInductor workers in codecache.

This gives users the option of using "spawn" in multithreaded environments instead of the previously hardcoded "fork" default.
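A sketch of how this might look from the user side (the config name here is assumed for illustration, not taken from the PR):
```python
import torch._inductor.config as inductor_config

# Assumed knob: select "spawn" for the parallel-compile worker processes
# instead of the previously hardcoded "fork", e.g. in multithreaded hosts.
inductor_config.worker_start_method = "spawn"
```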

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108850
Approved by: https://github.com/ezyang, https://github.com/zdevito
2023-09-25 21:30:17 +00:00
a4dec8d306 Add test for ShapeEnv recording fallback. (#109944)
This PR adds a test for the previous PR in this stack: #109904. In summary, it calls
functions decorated with `@record_shapeenv_event` that don't have an explicit `ShapeEnv`
parameter, passing arguments that don't hold a `ShapeEnv` instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109944
Approved by: https://github.com/ezyang
ghstack dependencies: #109904
2023-09-25 20:59:41 +00:00
5c4b5baf21 Fix python decomps for OpOverloadPackets and add tests (#107707)
- Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but whose aten overloads do. Additionally, Python decompositions may register `OpOverloadPacket`s, so decompositions need to be tested to ensure all `OpOverload`s still function for the `Meta` key (e.g. if a python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the python function needs to support receiving out arguments)

- Add out parameter wrappers to python decomps for aten ops that have out overloads

CC. @ezyang @albanD @lezcano

Fixes #107713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707
Approved by: https://github.com/lezcano
2023-09-25 20:53:30 +00:00
c1a2f35805 Revert "Disallow skipping dynamo (#109476)"
This reverts commit 7bb1d10c2ff06116506fb190c1b816a5b75f46ff.

Reverted https://github.com/pytorch/pytorch/pull/109476 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/109476#issuecomment-1734402581))
2023-09-25 20:20:50 +00:00
c4f2b6dbd2 [profiler] use PyCFunction_Check to check both PyCMethod_Type and PyC… (#110002)
At https://github.com/pytorch/pytorch/blob/main/torch/csrc/autograd/profiler_python.cpp#L1096, when `what` is PyTrace_C_CALL, Py_TYPE(arg) can only be PyCFunction_Type before Python 3.9. But in Python 3.9 or later, Py_TYPE(arg) can also be PyCMethod_Type.
PyCMethod_Type is a subtype of PyCFunction_Type; ref:
f2eaa92b0c/Objects/methodobject.c (L372).
So PyCFunction_Check should be used there to check arg->ob_type.

Fixes #109877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110002
Approved by: https://github.com/ezyang
2023-09-25 20:17:25 +00:00
83deaa16ed Revert "[1/N] Cleanup header inclusions in torch_cpu by iwyu (#101178)"
This reverts commit b7a95f4fdb8a79dc459cc757dafcdbd0953b1a62.

Reverted https://github.com/pytorch/pytorch/pull/101178 on behalf of https://github.com/atalman due to Break internal CI ([comment](https://github.com/pytorch/pytorch/pull/101178#issuecomment-1734384645))
2023-09-25 20:05:25 +00:00
d6cc3ac8b2 Add PR number to metrics when available (#109406)
Add a new metric for pull request number in `tools/stats/upload_metrics.py`. This allows tracking the CI performance of pull requests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109406
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/clee2000
2023-09-25 19:57:34 +00:00
3de0857503 [Dynamo] Match closures by code ID (#109427)
Closes https://github.com/pytorch/pytorch/issues/107866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109427
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-09-25 19:10:35 +00:00
09c598745c Rename torch._C._TensorBase to TensorBase (#109940)
I have gone ahead and implemented the renaming of the type `torch._C._TensorBase` to a non-private class name `TensorBase`.
The changes also include leaving `torch._C._TensorBase` as an alias to the new type, both in the C++ code: 70458768fb/torch/csrc/autograd/python_variable.cpp (L2196-L2197), and in the corresponding `__init__.pyi.in` file:
70458768fb/torch/_C/__init__.pyi.in (L1522)
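As a quick sanity check of the alias in builds that include this change (minimal sketch):
```python
import torch

# The old private name keeps working as an alias of the new public one.
assert torch._C.TensorBase is torch._C._TensorBase
```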

Fixes #109438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109940
Approved by: https://github.com/ezyang
2023-09-25 19:10:22 +00:00
a565f1bee6 [aotinductor] Skip benchmarks with control flow (#109661)
Since AOTInductor doesn't support control flow yet, we will skip tests that are currently failing because the code contains control flow. Logs taken from https://hud.pytorch.org/benchmark/compilers?startTime=Tue%2C%2012%20Sep%202023%2022%3A56%3A40%20GMT&stopTime=Tue%2C%2019%20Sep%202023%2022%3A56%3A40%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=main&lCommit=2c1554a0323107d821be3ff13df7833b9f0b960d&rBranch=main&rCommit=47be61e12bd51df27182343d312dc3df485d5559

Errors documented in https://github.com/pytorch/pytorch/issues/105217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109661
Approved by: https://github.com/desertfire
2023-09-25 18:49:06 +00:00
6b39cf863f Fix invalid arg to getLogger in torch distributed checkpoint (#110008)
Ran the experimental LOG002 ruff check and found a bug in our codebase. A logger should not be instantiated from `__file__`; it should be instantiated from `__name__`.

https://docs.astral.sh/ruff/rules/invalid-get-logger-argument/
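A minimal illustration of the difference:
```python
import logging

# Bad: __file__ is a filesystem path, not a dotted module name, so the
# logger falls outside the normal logger hierarchy.
bad_logger = logging.getLogger(__file__)

# Good: __name__ is the module's dotted import path, so parent/child
# logger configuration works as expected.
good_logger = logging.getLogger(__name__)
```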
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110008
Approved by: https://github.com/ezyang
2023-09-25 18:21:18 +00:00
7de669f2f9 [core IR] Remove trunc decomp and add trunc to core (#109902)
Following up from [this comment](https://github.com/pytorch/pytorch/pull/109319#discussion_r1330803226). Remove the decomposition for `trunc`, and add it as a core operator.

Going forward, similar treatment should be given to operators that map cleanly to hardware instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109902
Approved by: https://github.com/peterbell10
2023-09-25 18:18:06 +00:00
fe5e63f5db [inductor] Do type promotion in pointless cumsum pattern replacement (#109960)
Fixes #109925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109960
Approved by: https://github.com/Fidget-Spinner, https://github.com/lezcano
2023-09-25 18:17:15 +00:00
4734496a0c Extend storage access error api for untyped_storage() (#109750)
In cudagraph trees, we invalidate tensors at some point and drop their storage. Then, when they are accessed with .data_ptr(), a custom error message is thrown. Previously, this invalidation didn't also make untyped_storage()/storage() raise, which could result in a segfault.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109750
Approved by: https://github.com/zou3519
2023-09-25 17:51:27 +00:00
a5364b12bb Revert "[ONNX] Remove the depreacated function _export (#109763)"
This reverts commit d7c05bb2e8de24386664c01e887357ff50a09842.

Reverted https://github.com/pytorch/pytorch/pull/109763 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/109763#issuecomment-1734201053))
2023-09-25 17:47:21 +00:00
52e14787ae Revert "MAINT: pytorchify torch._numpy tests: core/ and fft/ (#109815)"
This reverts commit 5ad1baf6fa036690786cc45dafb79c6a4656cec5.

Reverted https://github.com/pytorch/pytorch/pull/109815 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing a slow test in trunk 5ad1baf6fa.  Please help fix and reland the change ([comment](https://github.com/pytorch/pytorch/pull/109815#issuecomment-1734137821))
2023-09-25 17:01:27 +00:00
f5886bf352 Revert "[3/N][2D] Enable training with new 2D flow (#109553)"
This reverts commit 217b37c023d58854a7a6117c3726ed44786c9d03.

Reverted https://github.com/pytorch/pytorch/pull/109553 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but those distributed failures look legit and they are failing in trunk https://hud.pytorch.org/pr/109553 ([comment](https://github.com/pytorch/pytorch/pull/109553#issuecomment-1734100546))
2023-09-25 16:37:19 +00:00
837272f150 Python 3.10 Union operator | support for JIT (#109293)
Fixes #101777

- [x] Duplicated the tests from `test/jit/test_union.py` into [`test/jit/test_union_pep604.py`](https://github.com/pytorch/pytorch/pull/109293/files#diff-b981f6493093482b43b0e62057b0c01b004b3e932d4e63a1166c3808c0172b83), using PEP604 style Unions
- [x] Exchanged custom `get_args` and `get_origin` with `typing.get_args` and `typing.get_origin`, which have the same functionality and became part of the standard library in 3.8
- [x] Added utility function `pep604union_to_union` in `tree_views.h` which converts a `BinOP("|")` node into the corresponding `Union`. This function intercepts `ScriptTypeParser::parseTypeFromExpr` and `ScriptTypeParser::parseTypeFromExprImpl` and patches the expression.
- [ ] There is a single failing test; I commented it out for the moment to see if CI complains about anything else. I spent several hours trying to figure out how to patch it, but I am not experienced with C++ development and debugging.

From what I could gather, the following fails:

```python
    def test_union_optional_of_union_return(self):
        @torch.jit.script
        def fn() -> None | str | int:
            y: Optional[int | str] = "foo"
            return y
```

In the section:

75b954b715/torch/csrc/jit/frontend/script_type_parser.cpp (L232-L243)

When using a regular `Union`, the `resolver` path is taken, whereas with the patched PEP 604 union, `resolveType` doesn't work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109293
Approved by: https://github.com/ezyang
2023-09-25 15:35:54 +00:00
d0fe8fa5db Reland "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)
I'm pretty sure this is fixed, but I'll run inductor and trunk CI. The previously failing trunk test was that the recently landed selective activation checkpointing code assumes it can detect whether or not AOTAutograd is running by checking whether the inputs to SAC are C++ `FunctionalTensorWrapper`s.

The previous land broke some inductor trunk tests.

This reverts commit 629a628cc8bb1f62e2cce11bf0c8a00d3d06f896.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109906
Approved by: https://github.com/ezyang
2023-09-25 14:53:54 +00:00
3beed41e12 [Easy] Remove hook warning where source is always guaranteed (#109898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109898
Approved by: https://github.com/ezyang
2023-09-25 14:36:28 +00:00
5565a29568 Release GIL in torch.cuda ops wherever possible. (#109159)
Most `torch.cuda` ops (e.g. `torch.cuda.synchronize`) do not release the GIL in C++ land. This has the potential to cause deadlocks and freeze the Python process. For example, `torch.cuda.synchronize` could hold the GIL and get blocked on some operation; however, that operation might never complete in Python land, since the GIL is held by `torch.cuda.synchronize`.

In this PR, I've tried to release the GIL as much as possible in `torch.cuda` ops.

See https://github.com/pytorch/pytorch/issues/109074 for an example of how holding the GIL causes a deadlock.
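A schematic sketch of what releasing the GIL enables (not the repro from the issue; `torch.cuda._sleep` is an internal helper used here just to keep the device busy):
```python
import threading
import time
import torch

heartbeats = []

def heartbeat():
    # A pure-Python thread: it can only make progress while the GIL is free.
    for _ in range(5):
        heartbeats.append(time.time())
        time.sleep(0.01)

torch.cuda._sleep(1 << 30)  # enqueue a long-running kernel
t = threading.Thread(target=heartbeat)
t.start()
torch.cuda.synchronize()  # with this PR, blocks without holding the GIL
t.join()
print(len(heartbeats))  # heartbeats were recorded while synchronize() waited
```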
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109159
Approved by: https://github.com/ezyang
2023-09-25 14:35:31 +00:00
96a3a7cc82 [pytorch] make IterableDataset of Iterable type (#109645)
Summary: Makes `IterableDataset` of `Iterable` type.

Test Plan: tests in the next diff in the stack are all green

Differential Revision: D49420146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109645
Approved by: https://github.com/DanilBaibak, https://github.com/Skylion007
2023-09-25 14:18:15 +00:00
6a202c36af Minor fixes in semi-structured sparse code (#105595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105595
Approved by: https://github.com/jcaip
2023-09-25 14:06:08 +00:00
829b5c0949 Revert "[Dynamo] Support python class member_descriptor (#109956)"
This reverts commit 12cd776d902dea1ee3f0ef7980bea62ff64096d2.

Reverted https://github.com/pytorch/pytorch/pull/109956 on behalf of https://github.com/jeanschmidt due to multiple slow jobs broken ([comment](https://github.com/pytorch/pytorch/pull/109956#issuecomment-1733706269))
2023-09-25 13:25:45 +00:00
217b37c023 [3/N][2D] Enable training with new 2D flow (#109553)
This PR enables training with new 2D flow and adds associated test.

state_dict related changes would be in later PRs.

cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109553
Approved by: https://github.com/fegin, https://github.com/awgu
2023-09-25 05:32:07 +00:00
12cd776d90 [Dynamo] Support python class member_descriptor (#109956)
Fixes Meta internal cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109956
Approved by: https://github.com/jansel
2023-09-25 03:15:39 +00:00
cyy
265acd4bea Clean up CMake target linking (#109959)
This PR cleans up more CMake target linking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109959
Approved by: https://github.com/ezyang
2023-09-25 01:37:14 +00:00
7c9052165a add fp16 support for native conv and deconv on CPU (#99497)
### Testing

Native conv vs. mkldnn conv on SPR (with avx512_fp16 support)

Single core:

Input | Naïve impl / us | oneDNN / us | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 34676789 | 524199.8 | 66.15185
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 33454125 | 349844.4 | 95.62573
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 317650.1 | 2317.677 | 137.0554
IC: 128, OC: 256, kernel: 3, stride: 1, N: 1, L: 64 | 15334.68 | 167.264 | 91.67952

56 cores:

Input | Naïve impl / us | oneDNN / us | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 1032064 | 11073.58 | 93.20061
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 1000097 | 16371.19 | 61.08883
IC: 256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 981813.4 | 9008.908 | 108.9825
IC: 1024, OC: 256, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 1082606 | 10150.47 | 106.6558
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 319980.6 | 181.598 | 1762.027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99497
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-09-25 01:31:26 +00:00
ca5f3a7436 TST: test that numpy dtypes do not graph break (#109974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109974
Approved by: https://github.com/lezcano
2023-09-25 01:00:39 +00:00
84a67c0665 Use wrapper instead of V.graph.wrapper_code (#109883)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109883
Approved by: https://github.com/voznesenskym, https://github.com/jansel
2023-09-24 23:11:11 +00:00
10f9edc99d Don't -Werror on cast-function-type (#109796)
I recently built PyTorch with clang and we are apparently
not warnings clean on this.  Since we don't have any contbuild
that catches this situation, just get rid of it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109796
Approved by: https://github.com/cpuhrsch
2023-09-24 23:05:10 +00:00
bb74d9104f [PTD][TP] Refactor the test and temporary disable one test case (#109919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109919
Approved by: https://github.com/wz337
2023-09-24 22:52:20 +00:00
5ad1baf6fa MAINT: pytorchify torch._numpy tests: core/ and fft/ (#109815)
1. Inherit from TestCase
2. Use pytorch parametrization
3. Use unittest.expectedFailure to mark xfails, also unittest skips

All this to make pytest-less invocation work:

$ python test/torch_np/test_basic.py

cross-ref #109593, #109718, #109775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109815
Approved by: https://github.com/lezcano
2023-09-24 16:46:01 +00:00
0d3db1048a remove nvfuser test in upstream pytorch (#109918)
Removing nvfuser related tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109918
Approved by: https://github.com/msaroufim
2023-09-24 13:49:37 +00:00
befe60afc2 TST: pytorchify test/torch_np/test_dtype.py (#109967)
This file was missing from https://github.com/pytorch/pytorch/pull/109593

NB: This PR only mechanically converts the test. Will add more tests to see what's going on with `dtype=np.float64` etc under dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109967
Approved by: https://github.com/lezcano
2023-09-24 13:34:02 +00:00
95e2eec9bf Better invariants - always route list/tuple to their requisite VTs instead of ConstantVariable (#109869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109869
Approved by: https://github.com/jansel
2023-09-24 08:52:42 +00:00
e9c9b1ed59 [Inductor] Generalize inductor triton backend device agnostic (#109486)
# Motivation
@jansel As discussed before, we wanted to generalize some CUDA-specific code. This makes Inductor friendlier to third-party backends, so that they can leverage Inductor code as much as possible.

# Solution
To implement this, we introduce a device runtime abstraction. We wrap the runtime APIs inside `DeviceInterface` and use `register_interface_for_device` to register each kind of device with Inductor, then use `get_interface_for_device` to fetch the corresponding runtime for a device type. Usage looks like this:
```python
device_interface = get_interface_for_device("xpu")
device_interface.is_available()  # check if XPU is available
device_interface.device_count()  # check how many XPU devices are available
```
`DeviceInterface` is a simple abstraction that enables third-party backends implementing CUDA-like semantics to be integrated with Inductor. It keeps third-party backends from having to monkey-patch utility functions, like `decode_device`, that are hard-coded for CUDA.
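A sketch of the registration side for a hypothetical backend (the module path and method surface are assumed from the current upstream layout):
```python
from torch._dynamo.device_interface import (
    DeviceInterface,
    get_interface_for_device,
    register_interface_for_device,
)

class MyBackendInterface(DeviceInterface):
    # A third-party backend answers CUDA-like runtime queries here.
    @staticmethod
    def is_available() -> bool:
        return False  # stand-in: a real backend queries its runtime

    @staticmethod
    def device_count() -> int:
        return 0

register_interface_for_device("privateuseone", MyBackendInterface)
assert get_interface_for_device("privateuseone") is MyBackendInterface
```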

# Additional Context
The main code change:
- To leverage AsyncCompile, make it device-agnostic
- Avoid monkey patches, make some utility functions device-agnostic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109486
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/EikanWang
2023-09-24 07:49:20 +00:00
cyy
b7a95f4fdb [1/N] Cleanup header inclusions in torch_cpu by iwyu (#101178)
Following our previous IWYU work (#100304) on C10, it makes more sense to try IWYU on torch_cpu. This PR does exactly that. Meanwhile, it fixes issue #48684.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101178
Approved by: https://github.com/ezyang
2023-09-24 05:01:20 +00:00
cyy
dee100945e [2/N] Move c10::variant to std::variant (#109723)
This PR moves most of c10::variant calls to std::variant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109723
Approved by: https://github.com/ezyang
2023-09-24 02:47:43 +00:00
c13177f2cb [FSDP] Propagate requires_grad attribute to unsharded params (#109892)
Summary:
This preserves `requires_grad` in the case where all parameters within a `FlatParameter` have the same `requires_grad` value.

Currently, unsharded parameters have `requires_grad=True` in some cases where the `FlatParameter` and all original parameters have `requires_grad=False`.

This could be extended to support `FlatParameters` with a mix of `requires_grad` states by extending `ParamInfo` to capture `requires_grad` for each parameter.

Test Plan: test added

Differential Revision: D49517155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109892
Approved by: https://github.com/awgu
2023-09-24 01:30:50 +00:00
ebb30bdd6f Revert "Better invariants - always route list/tuple to their requisite VTs instead of ConstantVariable (#109869)"
This reverts commit 06aa6966a88586d34e6470cc2149121d17971056.

Reverted https://github.com/pytorch/pytorch/pull/109869 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the failed test looks legit as it is also failing in trunk 06aa6966a8 ([comment](https://github.com/pytorch/pytorch/pull/109869#issuecomment-1732424765))
2023-09-23 22:42:23 +00:00
d9627c4264 Revert "[inductor] fix a max-autotune rng state related bug (#109828)"
This reverts commit 3663436db31bd3cebcb76efe05d8355553a05c57.

Reverted https://github.com/pytorch/pytorch/pull/109828 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the rocm failure looks legit. There is also another numpy import error when running dynamo test on CPU ([comment](https://github.com/pytorch/pytorch/pull/109828#issuecomment-1732423883))
2023-09-23 22:35:37 +00:00
b89ce814c0 [FSDP] Remove _set_use_dtensor in post_load_state_dict_hook (#109924)
This is a follow up for https://github.com/pytorch/pytorch/pull/109767.
We only need _set_use_dtensor in pre_state_dict_hook() and pre_load_state_dict_hook() and we do not need _set_use_dtensor in _post_load_state_dict_hook(). This PR removes _set_use_dtensor in post_load_state_dict_hook().

In addition, this PR adjusts the test cases in test_hsdp_dtensor_state_dict.py to capture changes in https://github.com/pytorch/pytorch/pull/109767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109924
Approved by: https://github.com/fegin
2023-09-23 22:34:36 +00:00
7bb1d10c2f Disallow skipping dynamo (#109476)
Based on William's recent diff on preserving node metadata on retracing, we no longer need to skip dynamo on retracing. This softens our previous restriction of not allowing any new constraints from the user side, because we can now use dynamo to analyze through the constraints. As a result, re-export can technically happen with any new constraints. This opens up another question: is it OK to use looser constraints on retracing? If we allow loose constraints, we could technically diverge from eager behaviour because, for example, we could have eliminated unsafe control flow based on a previous assumption. But we can also argue this is OK if we treat the exported callable as independent of its original source code.
We could technically ban loose constraints inside export, but my concern is that doing special-case checks on ExportedProgram breaks the abstraction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109476
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2023-09-23 22:15:18 +00:00
460fc9da62 Disabled UserWarnings for some public functions in torch.overrides (#109890)
Fixes #109842.

This disables the implicit `UserWarning`s that were raised for deprecated `torch` attributes. The filtering was designed to be as specific as possible, in order to not filter any other warnings that may be raised.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109890
Approved by: https://github.com/ezyang
2023-09-23 20:40:04 +00:00
f35cc0fb6f Don't record function call if ShapeEnv is not found. (#109904)
Fix: #109844

- Redirect execution to the original function if no `ShapeEnv` instance is found in its arguments
- Remove `dont_record_shape_env_events`, as it wasn't being used anywhere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109904
Approved by: https://github.com/ezyang
2023-09-23 19:48:24 +00:00
92c49e2168 MAINT/TST: pytorch-ify torch._numpy tests (added tests only, not vendored) (#109593)
1. Inherit from TestCase
2. Use pytorch parametrization
3. Use unittest.expectedFailure to mark xfails

All this to make pytest-less invocation work:

$ python test/torch_np/test_basic.py

Furthermore, tests can now be run under dynamo, and we see the first errors:

```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/torch_np/test_basic.py -k test_toscalar_list_func
.E.
======================================================================
ERROR: test_toscalar_list_func_<function shape at 0x7f9b83a4fc10>_np_func_<function shape at 0x7f9a8dd38af0> (__main__.TestOneArrToScalar)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ev-br/repos/pytorch/torch/testing/_internal/common_utils.py", line 356, in instantiated_test
    test(self, **param_kwargs)
  File "test/torch_np/test_basic.py", line 232, in test_toscalar_list
    @parametrize("func, np_func", one_arg_scalar_funcs)
  File "/home/ev-br/repos/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ev-br/repos/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ev-br/repos/pytorch/torch/_dynamo/eval_frame.py", line 406, in _fn
    return fn(*args, **kwargs)
  File "/home/ev-br/repos/pytorch/torch/fx/graph_module.py", line 726, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/ev-br/repos/pytorch/torch/fx/graph_module.py", line 305, in __call__
    raise e
  File "/home/ev-br/repos/pytorch/torch/fx/graph_module.py", line 292, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/ev-br/repos/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ev-br/repos/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.2", line 5, in forward
    shape = torch._numpy._funcs_impl.shape([[1, 2, 3], [4, 5, 6]])
  File "/home/ev-br/repos/pytorch/torch/_numpy/_funcs_impl.py", line 655, in shape
    return tuple(a.shape)
AttributeError: 'list' object has no attribute 'shape'

----------------------------------------------------------------------
Ran 3 tests in 0.915s

FAILED (errors=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109593
Approved by: https://github.com/lezcano
2023-09-23 18:18:50 +00:00
8d47f90e50 Pytorchify numpy vendored tests in torch_np/lib/ (#109718)
1. Inherit from TestCase
2. Use pytorch parametrization
3. Use unittest.expectedFailure to mark xfails, also unittest skips

All this to make pytest-less invocation work:

$ python test/torch_np/test_basic.py

cross-ref https://github.com/pytorch/pytorch/pull/109593, https://github.com/pytorch/pytorch/pull/109775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109718
Approved by: https://github.com/ezyang
2023-09-23 15:31:03 +00:00
835c18e7ea Avoid saving self for mean.backward (#109935)
Fixes https://github.com/pytorch/pytorch/issues/109876
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109935
Approved by: https://github.com/soulitzer
2023-09-23 11:50:54 +00:00
a13201e857 [DCP] Add unit test for FSDP -> TP checkpoint conversion (#109899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109899
Approved by: https://github.com/rohan-varma
2023-09-23 09:19:45 +00:00
sdp
2872f788aa add path for DPC++ SYCL device code in Float8_e4m3fn (#109911)
Building IPEX-XPU with PyTorch fails with `error: builtin is not supported on this target _BitScanReverse` on Windows.

The root cause is that the `_BitScanReverse` compiler intrinsic is not supported in SYCL target device code with the DPC++ compiler, while it is supported in host code with the MSVC compiler. Thanks to @gujinghui, @xuhancn for the help in identifying the root cause and debugging.

A minimal reproducible script:
```cpp
#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>

#ifdef _MSC_VER
#include <intrin.h>
#endif

void test(
  sycl::queue& q) {

  uint8_t input = 123;
  const uint32_t w = (uint32_t)input << 24;
  const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
  unsigned long nonsign_bsr;
  _BitScanReverse(&nonsign_bsr, (unsigned long)nonsign); // host code, no error

  sycl::range<2> global_range{1, 1};
  sycl::range<2> local_range{1, 1};

  auto e = q.submit([&](auto& h) {
    sycl::stream out(100000, 256, h);
    h.parallel_for(sycl::nd_range<2>{global_range, local_range},
      [=](sycl::nd_item<2> item) {

        #if defined(_MSC_VER)
          uint8_t input = 123;
          const uint32_t w = (uint32_t)input << 24;
          unsigned long nonsign_bsr;
          _BitScanReverse(&nonsign_bsr, (unsigned long)nonsign); // device code, error: builtin is not supported on this target
        #else
          __builtin_clz(nonsign);
        #endif

      // Fix to add a check for SYCL device code:
      /*
      #if defined(__SYCL_DEVICE_ONLY__)
          out << "DPC++ SYCL" << sycl::endl;
          __builtin_clz(nonsign);
      #elif defined(_MSC_VER)
          out << "MSVC" << sycl::endl;
          uint8_t input = 123;
          const uint32_t w = (uint32_t)input << 24;
          unsigned long nonsign_bsr;
          _BitScanReverse(&nonsign_bsr, (unsigned long)nonsign);
      #endif
      */

      });
    });
  q.wait();
}

int main() {
  #if defined(__SYCL_DEVICE_ONLY__)
    std::cout << "DPC++ SYCL" << std::endl;
  #elif defined(_MSC_VER)
    std::cout << "MSVC" << std::endl;
  #endif

  sycl::queue q(sycl::default_selector_v);
  test(q);

  return 0;
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109911
Approved by: https://github.com/ezyang
2023-09-23 07:07:22 +00:00
85ddc985d0 Back out "[pytorch][PR] [Inductor] Extend Pattern Matcher to Match Equivalent Function Invocation" (#109931)
Summary:
Original commit changeset: 3466b85fe0a1

Original Phabricator Diff: D49433268

More context D49536556

bypass-github-pytorch-ci-checks

Test Plan: revertreverthammer

Differential Revision: D49565384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109931
Approved by: https://github.com/houseroad
2023-09-23 05:58:08 +00:00
54faedf5f2 [AOTInductor] Load model on arbitrary device (#109816)
Reviewed By: desertfire

Differential Revision: D49402404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109816
Approved by: https://github.com/chenyang78
2023-09-23 04:45:20 +00:00
bbdce93571 Basic fp8 support in Inductor (#109168)
Add basic fp8 support in Inductor, including:
* Fix fp8 Triton codegen issues;
* Add a min_elements_per_thread requirement for fp8-related dtype conversions. More details on the Triton implementation can be found at 10f59d8ce0/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp (L10).

Note that the current implementation only works for Pointwise. Will create follow-up PRs for Reduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109168
Approved by: https://github.com/drisspg
2023-09-23 04:41:41 +00:00
ff7af15e80 Re-enable max_autotune tests for the CUTLASS backend. (#109831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109831
Approved by: https://github.com/aakhundov
2023-09-23 04:27:40 +00:00
c0d746c90e [ONNX] Relax getting module attributes in ONNX export (#109759)
### Description

This PR fixes a bug with getting module attributes during `torch.onnx.export` when `export_modules_as_functions` is used. With this fix, we can compare the LLaMA-2 models produced by the TorchScript exporter and the [Dynamo exporter](https://github.com/pytorch/pytorch/issues/104903).

### Context
When exporting LLaMA-2 from Hugging Face with `export_modules_as_functions`, the `Embedding` object does not have the `freeze` attribute.
```
  File "/home/kvaishnavi/.local/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 662, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1558, in _call_impl
    args_result = hook(self, args)
  File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1394, in _track_module_attributes_forward_pre_hook
    setattr(module, attr_name, _get_module_attributes(module))
  File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1474, in _get_module_attributes
    return {k: getattr(module, k) for k in annotations}
  File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1474, in <dictcomp>
    return {k: getattr(module, k) for k in annotations}
  File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1696, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Embedding' object has no attribute 'freeze'
```
To get around this issue, we can skip adding keys to the dictionary when the object does not have the corresponding attribute, as sketched below.
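A minimal sketch of the workaround (the real change lives in torch/onnx/utils.py; the annotation lookup here is an assumption for illustration):
```python
import typing

def _get_module_attributes(module):
    annotations = typing.get_type_hints(type(module))
    # Skip annotated names the instance doesn't actually have
    # (e.g. Embedding's `freeze`) instead of raising AttributeError.
    return {k: getattr(module, k) for k in annotations if hasattr(module, k)}
```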
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109759
Approved by: https://github.com/BowenBao
2023-09-23 02:47:51 +00:00
c789ed6e62 [Inductor][FX]support nn.Linear nn.ConvTransposeNd for efficient_conv_bn_eval (#109722)
Using the `functional_call` API, we can handle nn.Linear and nn.ConvTransposeNd just like a normal conv; see the sketch below.

Thanks to @albanD for pointing out the API in https://github.com/pytorch/pytorch/issues/109596.
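A minimal sketch of why `functional_call` helps (the folded values are stand-ins):
```python
import torch
from torch.func import functional_call

linear = torch.nn.Linear(4, 4)
x = torch.randn(2, 4)

# functional_call runs a module with substituted parameters, so the folding
# logic can treat Linear/ConvTransposeNd the same way as a plain conv.
folded = {"weight": linear.weight * 2.0, "bias": linear.bias}  # stand-in values
out = functional_call(linear, folded, (x,))
```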

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109722
Approved by: https://github.com/jansel
2023-09-23 01:12:34 +00:00
3663436db3 [inductor] fix a max-autotune rng state related bug (#109828)
Fix https://github.com/pytorch/pytorch/issues/109736 .

The HF pin move causes a regression on the accuracy check for HF models on the dashboard. Manually reverting the HF PR ( https://github.com/huggingface/transformers/pull/24696/files ) could recover, but this may hide some real issue. I happened to find that using a warm matmul max-autotune cache can work around the issue. Put another way:
- making all calls to check_cache miss the cache repros the issue
- making all calls to check_cache hit the cache works around the issue

I did a sort of 'bisect', forcing the number of cache misses to halve each time while still making sure we could repro. Luckily, reducing to a single cache miss still repros the issue. With more debugging, it turned out that the call to `torch.randn` on a cuda device was causing the problem.

The fix is to make sure we restore the rng state when we generate random inputs for max-autotune benchmarking.
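A sketch of the idea, not the literal patch: generate the benchmarking inputs inside `fork_rng` so the global RNG state is restored afterwards.
```python
import torch

def benchmark_choice(kernel, make_random_inputs):
    devices = [torch.cuda.current_device()] if torch.cuda.is_available() else []
    # fork_rng snapshots and restores the CPU (and listed CUDA) RNG state,
    # so calling torch.randn here can't perturb the model's random behavior.
    with torch.random.fork_rng(devices=devices):
        args = make_random_inputs()
        return kernel(*args)
```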

TBH, I cannot fully explain the root cause, although I know it's caused by the rng state change. AOTAutograd already has some logic to preserve rng state, and I cannot repro the issue in unit tests. I have a few guesses as to why the RNG state is not restored in the first place after we generate random inputs for max-autotune:
- maybe AOTAutograd misses some corner case when preserving the rng state
- maybe for the failing models there are some eager fallbacks that inductor doesn't handle, and if those fallbacks call random-number-related APIs, we see the issue. But again, I haven't found a good way to simulate this.

Repro:

```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 CUDA_VISIBLE_DEVICES=3 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only PLBartForCausalLM --training --cold-start-latency
```

We always repro the issue without the PR but pass the accuracy check with the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109828
Approved by: https://github.com/eellison
2023-09-23 00:58:10 +00:00
d7f3986314 Fix S367052 to unblock ICVR MC3 (#109853)
Summary: Somehow "getitem" started to receive a Tensor starting from ads_ranking:996, which broke the SDD pipelining FX transformer. We need to skip the Tensor node in annotation.

Test Plan:
N4326037

# Before
 {F1099052907}

# With this diff

 {F1099052270}

Differential Revision: D49528046

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109853
Approved by: https://github.com/jackiexu1992, https://github.com/lanza, https://github.com/xush6528
2023-09-23 00:23:42 +00:00
06aa6966a8 Better invariants - always route list/tuple to their requisite VTs instead of ConstantVariable (#109869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109869
Approved by: https://github.com/jansel
ghstack dependencies: #109896
2023-09-22 22:46:29 +00:00
691f8ca4f4 faster build instructions CONTRIBUTING.md (#109900)
Discovered this as I was building pytorch on a fresh g5.4x instance on AWS; building flash attention was bricking my machine

```
Building wheel torch-2.2.0a0+gitd0c8e82
-- Building version 2.2.0a0+gitd0c8e82
cmake --build . --target install --config Release
[1/748] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim96_fp16_sm80.cu.o
FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim96_fp16_sm80.cu.o
/opt/conda/envs/torchbench/bin/ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXPERIMENTAL_CUDNN_V8_API -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ubuntu/pytorch/build/aten/src -I/home/ubuntu/pytorch/aten/src -I/home/ubuntu/pytorch/build -I/home/ubuntu/pytorch -I/home/ubuntu/pytorch/cmake/../third_party/benchmark/include -I/home/ubuntu/pytorch/third_party/onnx -I/home/ubuntu/pytorch/build/third_party/onnx -I/home/ubuntu/pytorch/third_party/foxi -I/home/ubuntu/pytorch/build/third_party/foxi -I/home/ubuntu/pytorch/aten/src/THC -I/home/ubuntu/pytorch/aten/src/ATen/cuda -I/home/ubuntu/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ubuntu/pytorch/build/caffe2/aten/src -I/home/ubuntu/pytorch/aten/src/ATen/.. -I/home/ubuntu/pytorch/build/nccl/include -I/home/ubuntu/pytorch/c10/cuda/../.. -I/home/ubuntu/pytorch/c10/.. -I/home/ubuntu/pytorch/third_party/tensorpipe -I/home/ubuntu/pytorch/build/third_party/tensorpipe -I/home/ubuntu/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ubuntu/pytorch/torch/csrc/api -I/home/ubuntu/pytorch/torch/csrc/api/include -isystem /home/ubuntu/pytorch/build/third_party/gloo -isystem /home/ubuntu/pytorch/cmake/../third_party/gloo -isystem /home/ubuntu/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ubuntu/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ubuntu/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ubuntu/pytorch/third_party/protobuf/src -isystem /home/ubuntu/pytorch/third_party/gemmlowp -isystem /home/ubuntu/pytorch/third_party/neon2sse -isystem /home/ubuntu/pytorch/third_party/XNNPACK/include -isystem /home/ubuntu/pytorch/third_party/ittapi/include -isystem /home/ubuntu/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ubuntu/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ubuntu/pytorch/third_party/ideep/include -isystem /home/ubuntu/pytorch/cmake/../third_party/cudnn_frontend/include -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_86,code=sm_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler=-Wall,-Wextra,-Wno-unused-parameter,-Wno-unused-function,-Wno-unused-result,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-missing-braces,-Wno-maybe-uninitialized -MD -MT 
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim96_fp16_sm80.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim96_fp16_sm80.cu.o.d -x cu -c /home/ubuntu/pytorch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim96_fp16_sm80.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim96_fp16_sm80.cu.o
Killed
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109900
Approved by: https://github.com/drisspg
2023-09-22 22:39:51 +00:00
8ed08e5a7c [dynamo] Graph break on rng get/set state - remove GeneratorStateSource (#109410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109410
Approved by: https://github.com/ezyang
ghstack dependencies: #109411
2023-09-22 22:31:55 +00:00
a902150a1e [Easy] ConstantVariable() -> .create (#109896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109896
Approved by: https://github.com/ezyang
2023-09-22 22:30:15 +00:00
e42d450a55 [core IR] Add div.Tensor_mode, div.Scalar_mode, and copy as core operators (#109812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109812
Approved by: https://github.com/kirklandsign
2023-09-22 22:05:49 +00:00
334ead04a9 Back out "[decomp] Fix baddbmm decomposition (#109714)" (#109855)
Summary:
Original commit changeset: 95c462a380c9

Original Phabricator Diff: D49484954

this diff cause test failure for deterministic ne test see:https://www.internalfb.com/sandcastle/job/18014399565419856/

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test -- --exact 'aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test - aps_models.ads.icvr.tests.icvr_fm_e2e_deterministic_ne_test.ICVR_FM_E2EDeterministicNeTest: test_e2e_deterministic_icvr_fm_pt2_fsdp_multi_gpus'

https://www.internalfb.com/intern/testinfra/testrun/16888498605839953

Differential Revision: D49527271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109855
Approved by: https://github.com/yanboliang
2023-09-22 22:01:38 +00:00
f0d71de4ac Update caffe2 with LLVM-18 API change (#109408)
Summary: https://github.com/llvm/llvm-project/pull/66295 modified some internal LLVM APIs; update these places with the changes under an LLVM version guard

Test Plan: CI

Differential Revision: D49340871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109408
Approved by: https://github.com/Skylion007
2023-09-22 21:40:58 +00:00
c26270c733 [C10D] Even more store scalability work. (#109218)
Fix a bug in socket.cpp timeout detection that only shows up with 10k ranks.

Make the minimum wait time in _store_based_barrier adaptive based on
the number of ranks.

Longer timeouts give more room for the store to do productive work when swamped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217
2023-09-22 21:27:09 +00:00
de1b00abda inductor: tigher upperbound for rblock scaling (#109839)
Previously, when deciding whether to dynamically scale down rblock, we used the following formula to compute the upper bound on the number of blocks per SM:
```
max_threads_per_multi_processor / (32 * num_warps)
```

This is correct, but it's a bit loose, and sometimes the loose upper bound makes us skip optimization opportunities.

The new upper bound is 65536 / n_reg_used_by_each_block. This is tighter and helps when the kernel uses many registers (i.e. much more than 32).
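For an illustrative calculation (numbers assumed, not from the PR): with num_warps = 8, a block has 32 * 8 = 256 threads, so on an SM allowing 2048 threads the old bound is 2048 / 256 = 8 blocks. If each thread uses 64 registers, a block needs 256 * 64 = 16384 registers, and with a 65536-register file the new bound is 65536 / 16384 = 4 blocks, which is tighter and reflects the actual register pressure.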

For the kernel https://gist.github.com/shunting314/59aeafd297ed8ff03aa12030a2dd41ae (a real kernel inductor generates for HF), the change improves its perf from
0.485ms    0.332GB    684.29GB/s
to
0.240ms    0.332GB    1382.70GB/s.

The perf was bad previously because of register spills.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109839
Approved by: https://github.com/jansel
2023-09-22 20:55:51 +00:00
e2cfbca5ab Add clip to dynamo runners (#109840)
CLIP was moved to the canary models because we use the multimodal version, which depends on torchtext, which torchbench deprecated in https://github.com/pytorch/benchmark/pull/1837.

This issue didn't show up before because we hadn't updated the torchbench pin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109840
Approved by: https://github.com/cpuhrsch
2023-09-22 20:50:57 +00:00
2895fbd857 Enable typechecking for _inductor/pattern_matcher.py (#109613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109613
Approved by: https://github.com/Skylion007
2023-09-22 20:50:21 +00:00
411ca10e74 [Pytorch][Vulkan] Add baddbmm (#109851)
Summary:
The implementation is similar to BMM and ADDMM: the bias tensor uses the packed weights, similar to MM, but increments the index via the z-dim to address more matrices in the batch.

Packed bias (input of MM):
```
ivec3 pos(k_, j_, 0);
float v = texelFetch(uInput, pos, 0)
# v.xyzw are 4 numbers in one matrix
# no batch
# k_, j_ have only 1/4 of the range of the original matrix size (an H*W matrix => an H/2 * W/2 * 1 3D image).

```
Packed bias (input of BMM):
```
ivec3 pos(k_, j_, i);
float v = texelFetch(uInput, pos, 0)
# v.xyzw are 4 numbers in one matrix
# i as batch id
```

**To support broadcasting**, the bias packing of `mm` is slightly different from weight packing: it repeats the single element in the height-dim twice to fill the 4 planes (see code for details). The width-dim isn't repeated twice, but the code still works, because stacking 3 planes together with the last one empty yields the same 3D image.
However, this doesn't work for `bmm`, since it's a series of `{4 planes} {4 planes} … {4 planes}` where each `{4 planes}` represents a matrix, so having only 3 planes completely messes up the indexing. Thus, I repeat the single element in the width-dim as well to fill all 4 planes and get the correct indexing.

https://pytorch.org/docs/stable/generated/torch.baddbmm.html

Test Plan:
```
[ttingchulin@27298.od /data/sandcastle/boxes/fbsource (bmm)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin
```

Reviewed By: yipjustin

Differential Revision: D49402181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109851
Approved by: https://github.com/yipjustin
2023-09-22 20:34:38 +00:00
1df14f1bf8 Move has_triton to top level triton utils so that dynamo can also access (#109832)
it without creating cyclic dependencies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109832
Approved by: https://github.com/zou3519
2023-09-22 19:33:41 +00:00
4b0281b32c [BE][foreach] name tests correctly. noncontiguous inputs != fastpath (#109771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109771
Approved by: https://github.com/soulitzer
2023-09-22 19:16:14 +00:00
92de1d3222 [C10D] Push store scalability a bit further. (#109217)
This is a bunch of small changes to improve store scalability:

- stagger client connections to avoid a stampede.
- warn if somaxconn is too small.
- increase the backlog to 16k.

Differential Revision: [D49238587](https://our.internmc.facebook.com/intern/diff/D49238587)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109217
Approved by: https://github.com/XilunWu
2023-09-22 17:23:46 +00:00
c27c56a5c4 [inductor] Add back a missing header include (#109845)
Summary: It was removed in https://github.com/pytorch/pytorch/pull/109678, which regressed GoogleFnet in HF.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109845
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2023-09-22 17:06:06 +00:00
d0c8e8240d Revert "When doing typed typecheck, also check signature with symint removed (#109727)"
This reverts commit 56ef200c2dc8a1f1e269861b7a6e02e99d3b56a1.

Reverted https://github.com/pytorch/pytorch/pull/109727 on behalf of https://github.com/ezyang due to yolov3 problem ([comment](https://github.com/pytorch/pytorch/pull/109727#issuecomment-1731585002))
2023-09-22 15:11:27 +00:00
629a628cc8 Revert "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)"
This reverts commit b5d6e831a9ecbd5b8c126cace5ea8567156365c8.

Reverted https://github.com/pytorch/pytorch/pull/106406 on behalf of https://github.com/malfet due to Broke lots of tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/106406#issuecomment-1731524917))
2023-09-22 14:32:34 +00:00
2512017814 Fix for out of bounds read in torch mobile flatbuffer loader (#108439)
Remove the redundant (and unsafe) `mobile::serialization::ModuleBufferHasIdentifier(data)`, as `mobile::serialization::VerifyModuleBuffer(verifier)` validates the same thing but in a boundary-check-safe manner.

Test Plan: Out of bounds read crash no longer reproduces

Differential Revision: D48914114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108439
Approved by: https://github.com/manuelcandales, https://github.com/malfet
2023-09-22 14:26:33 +00:00
93ce6df931 Fix torch.utils.benchmark API while use privateuse1. (#108548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108548
Approved by: https://github.com/aaronenyeshi
2023-09-22 14:24:18 +00:00
f092eecc92 Handle C++ exceptions raised during finfo/iinfo calls (#109743)
Partially fixes https://github.com/pytorch/pytorch/issues/109737
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109743
Approved by: https://github.com/albanD
ghstack dependencies: #109744
2023-09-22 14:17:58 +00:00
d7dfa91e12 [inductor] Refactor some libtorch c shim interfaces (#109834)
Summary: Move the returned values to the end of the parameter list, because 1) it is more consistent with the AOTInductor runtime API convention; 2) since out-variant ops have the out tensor at the beginning of the parameter list, this makes the return values easier to distinguish from it

Test Plan:
```
buck test mode/opt caffe2/torch/fb/model_transform/experimental/benchmark/test/aotinductor:test_aot_inductor_benchmark
```

Differential Revision: D49522928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109834
Approved by: https://github.com/chenyang78
2023-09-22 12:45:23 +00:00
098d62d278 Add global_step parameter to SummaryWriter.add_hparams (#109572)
Fixes #37738 where all hparam metrics can only be plotted at step 0. This is basically just a resubmission of #50653.
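A sketch of the new parameter (run_name is passed here so repeated calls land in one run; exact usage may differ):
```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
for step, acc in enumerate((0.50, 0.72, 0.81)):
    # global_step lets the metric render as a curve instead of a single
    # point at step 0.
    writer.add_hparams({"lr": 0.1}, {"hparam/accuracy": acc},
                       run_name="trial-0", global_step=step)
writer.close()
```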

Before/after screenshots (hparam metrics plotted at multiple steps instead of only at step 0) are in the PR description.

@ngimel @J0Nreynolds @ezyang @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109572
Approved by: https://github.com/ezyang
2023-09-22 12:37:01 +00:00
b4ede53776 Use constrain_range_as_size for nonzero/repeat_interleave (#109857)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109857
Approved by: https://github.com/tugsbayasgalan
2023-09-22 12:14:46 +00:00
56ef200c2d When doing typed typecheck, also check signature with symint removed (#109727)
See the test case for what we didn't catch (SymInt vs const SymInt&
mismatch).

It's necessary to test for both, because we will fall back to the
non-SymInt signature if there is no SymInt unboxed kernel available.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109727
Approved by: https://github.com/zou3519
2023-09-22 12:12:10 +00:00
24ba4b7059 [dynamo][__torch_function__ 1/n] Add getset descriptor and __get__ vars (#109542)
Adds the MethodWrapperVariable and GetSetDescriptor variable types. These are used in `__torch_function__` tracing to represent attribute reads (`__get__`) and for comparing unbound methods (the `func` argument when `__torch_function__` is dispatched from a method call).

towards tracing for https://github.com/pytorch/pytorch/issues/93723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109542
Approved by: https://github.com/jansel
2023-09-22 10:39:15 +00:00
d7c05bb2e8 [ONNX] Remove the depreacated function _export (#109763)
The `_export` API was deprecated and should be removed after 2.0.

See: https://github.com/pytorch/pytorch/pull/107208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109763
Approved by: https://github.com/thiagocrepaldi
2023-09-22 07:14:13 +00:00
b5d6e831a9 Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)
Now that FunctionalTensor and `FunctionalTensorMode` are lower down in this stack, the changes in this PR are more mechanical: Everywhere in AOTAutograd that I used to use the C++ functionalization API, I now use the python functionalization API.

Note that this doesn't actually cause functionalization to run underneath torch_dispatch. I'm saving that re-ordering for later in the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106406
Approved by: https://github.com/ezyang
ghstack dependencies: #108654, #109662, #109632, #109023
2023-09-22 07:09:04 +00:00
63526a63f5 Make FunctionalTensor subclass to be more like functorch (interaction with ZeroTensor + Conjugate key) (#109023)
I added some tests for Conj, Neg and ZeroTensor for both python and C++ functionalization. This also fixes a nasty segfault when running a functorch `jacfwd` test with `torch.compile`, once AOTAutograd is using `FunctionalTensor`.

Changes:

(1) I use Jeffrey's `make_wrapper_subclass(extra_dispatch_keys)` kwarg to plumb extra dispatch keys onto the wrapper, mirroring what C++ functionalization does (C++ functionalization will mirror all dispatch keys from the inner tensor to the wrapper, except for python and functorch keys).

(2) FunctionalTensorMode will decompose CompositeImplicitAutograd ops, since (for example) ZeroTensor kernels can send ops like `.to()` directly to the Python key. We'll need a way to toggle this later for pre-dispatch functionalization

(3) Bound `_ForceDispatchKeyGuard` and BatchedTensorImpl's dispatch keyset to python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109023
Approved by: https://github.com/zou3519
ghstack dependencies: #108654, #109662, #109632
2023-09-22 07:09:04 +00:00
7a21e960c6 fix infinite loop with primtorch and .to(meta) (#109632)
Fixes https://github.com/pytorch/pytorch/issues/103532, which I needed in order to more easily/properly test that python functionalization is at parity with C++ functionalization for conj/neg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109632
Approved by: https://github.com/ezyang
ghstack dependencies: #108654, #109662
2023-09-22 07:09:04 +00:00
46b0b7bff7 _return_and_correct_aliasing: fix for schemas with mutable tensor in kwargs (#109662)
I missed a few tests the first time around - this fixes out= op handling for `_return_and_correct_aliasing`, which failed a few tests in the python functionalization <> AOTAutograd PR above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109662
Approved by: https://github.com/ezyang
ghstack dependencies: #108654
2023-09-22 07:09:04 +00:00
dae9aa8925 fix subclass custom sizes dynamic shapes caching (#108654)
This PR fixes the ownership/lifetime handling for tensor subclasses that override sizes/strides, when tensors get resized.

This is needed now, because `FunctionalTensor` is a subclass that has a custom size/stride (so it can plumb requests to its inner tensor), and is also a core piece of infra (it's used during tracing in AOTAutograd, which means that metadata mutation and resizing that happens to work with torch.compile today needs to work with FunctionalTensor).

After a bunch of discussion with @ezyang and @soulitzer, I updated `PyInterpreter::sym_sizes()` (and friends) so that:
(1) They allocate a py::capsule buffer and stash it on the tensor on the first call to size/stride
(2) On a size/stride call where we notice that the number of **dimensions** on the tensor has changed (so our buffer is stale), we re-allocate the buffer
(3) On a size/stride call where we notice that the number of dimensions is the same, but the values are different (this happens whenever a tensor experiences a metadata mutation, like `.transpose_()`), we inplace-modify the buffer and put the new ints/symints into it

I also ended up doing the SmallVector optimization, which was required to fix some tests in AOTAutograd. Ideally we should look into those tests, and nail down the parts of our codebase that rely on SmallVector not re-allocating on a resize... but I'm saving this for a followup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108654
Approved by: https://github.com/ezyang
2023-09-22 07:09:04 +00:00
ebc7039bcb New export API with dynamic shape specifications instead of constraints (#108448)
Our experience using `constraints` / `dynamic_dim` with the existing export API has found it to be (subjectively) clunky and (objectively) verbose in common cases.

This PR implements a new design for the export API that replaces the use of `constraints` / `dynamic_dim` with a new way of specifying dynamic shapes, involving the following concepts:
* a constructor `Dim` for first-class named dynamic dimensions with ranges (similar to `functorch.dim`, and analogous to internal symbolic sizes)
* a mechanism that uses the above in `export` calls to associate inputs to their dynamic shape specifications (`dynamic_shapes`)

Design doc: https://docs.google.com/presentation/d/168U7XK72C_WSsZpGESP6Cho9udh193fi0gfjxCNcJ4E/edit#slide=id.p (Meta-only). Note that we only implement Option 1 in that doc. An older version of this PR also implemented Option 3, which is an alternative way of specifying dynamic shapes using tensor type annotations on the exported callable; but we have moved that to future work for now.

See docs for these new features in `torch.export`. The existing `torch.export.export` is modified to use the new API, `torch._export.export__RC__`, whenever `constraints=None`. We have not deprecated the existing API yet, but will do so in a follow-up.

Constraint violation errors arising through use of the new API will now contain suggested fixes using the new API. No longer do we need to report all specializations for static dimensions and suggest all constraints over dynamic dimensions to fix such errors. Instead, due to the redesign, the suggested fixes are much more concise, only involving modifying the definitions of relevant `Dim`s.
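
For illustration, a minimal sketch in the new style (the exact entry points were still stabilizing at the time of this PR, so treat the call below as indicative rather than authoritative):

```python
import torch
from torch.export import Dim, export  # Dim: first-class named dynamic dimension

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

batch = Dim("batch", min=1, max=1024)            # named dimension with a range
ep = export(M(), (torch.randn(4, 8),),
            dynamic_shapes={"x": {0: batch}})    # dim 0 of input x is dynamic
```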

Differential Revision: [D48919204](https://our.internmc.facebook.com/intern/diff/D48919204/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108448
Approved by: https://github.com/suo, https://github.com/gmagogsfm
2023-09-22 06:58:26 +00:00
cyy
cd99cdc3af fix std::move warnings from gcc (#105780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105780
Approved by: https://github.com/Skylion007
2023-09-22 05:55:21 +00:00
4ff294522a [Inductor] Extend Pattern Matcher to Match Equivalent Function Invocation (#107832)
Fixes #104391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107832
Approved by: https://github.com/jansel
2023-09-22 05:26:08 +00:00
8124a6c40c [TORCH_LIBRARY] Add impl_abstract_pystub (#109529)
We want users to be able to define custom ops in C++ but put the
abstract impl in Python (since it is easier to write them in Python and
the abstract impl better models device semantics and data-dependent
operators).

`m.impl_abstract_pystub(opname, python_module, context)` declares the
abstract_impl of the operator to exist in the given python module.
When the abstract_impl needs to be accessed (either via FakeTensor or
Meta), and it does not exist, the PyTorch Dispatcher will yell
with a descriptive error message.
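
As a rough sketch of the Python side this points at (the toy op below is an illustrative stand-in for one declared via TORCH_LIBRARY in C++; it is not code from this PR):

```python
import torch

# Stand-in for an operator that would really be defined in C++.
lib = torch.library.Library("mylib", "DEF")
lib.define("my_op(Tensor x) -> Tensor")

@torch.library.impl_abstract("mylib::my_op")  # would live in the declared pystub module
def my_op_abstract(x):
    # Abstract/meta impl: propagate shape and dtype without touching data.
    return x.new_empty(x.shape)
```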

Some details:
- We construct a new global AbstractImplPyStub mapping in
  Dispatcher.cpp. Read/write to this map is protected by the Dispatcher
  lock.
- We add a new Meta Tensor fallback kernel. The fallback errors out if there is
  no meta kernel, but also offers a nicer error message if we see that there is
  a pystub.
- We create a `torch._utils_internal.throw_abstract_impl_not_imported_error`
  helper function to throw errors. This way, we can throw different error
  messages in OSS PyTorch vs internal PyTorch. To invoke this from C++, we
  added a PyInterpreter::throw_abstract_impl_not_imported_error.

Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109529
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-09-22 04:55:36 +00:00
3268b039ec Handle unbacked symints in Triton size hints (#109609)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109609
Approved by: https://github.com/yf225
2023-09-22 03:16:53 +00:00
abd9b763ca [RFC] Add debug log as we lower each FX node (#109602)
I found this useful for orienting myself when I threw an error
mid-lowering.  What do other people think?

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109602
Approved by: https://github.com/malfet, https://github.com/voznesenskym
2023-09-22 03:10:22 +00:00
e1d71231e2 [Pytorch][Vulkan] Add bmm op (#109360)
Summary:
BMM is developed on top of the MM methodology; the main difference is that the 1st input matrix changed from a standard Vulkan 2D tensor to a Vulkan 3D tensor, so the indexing is quite different. The matrices of a batch are appended on the z-dimension, with a channel of size 4 (texel).

The 2nd input matrix remains the same as a packed format (fit a `H*W` matrix into a `H/2 * W/2 * 1` 3D image texture by utilizing all 4 values in the texel), but appends more matrices of the batch on the z-dimension (which has only 1 element in the case of MM).

**Vulkan 2D Basic (1st input of MM & output):**
```
ivec3 pos(j, i, 0);
float v = texelFetch(uInput, pos, 0)[0];
# no batch
```

**Vulkan 3D Basic (1st input of BMM & output):**
```
ivec3 pos(k, j, i/4);
float v = texelFetch(uInput, pos, 0)[i % 4];
# i as batch id
```

**Packed weights (2nd input of MM):**
```
ivec3 pos(k_, j_, 0);
float v = texelFetch(uInput, pos, 0)
# v.xyzw are 4 numbers in one matrix
# no batch
# k_, j_ has only 1/4 of the range as the original matrix size (H*W matrix i=> H/2*W/2*1 3D Image).
```

**Packed weights (2nd input of BMM):**
```
ivec3 pos(k_, j_, i);
float v = texelFetch(uInput, pos, 0)
# v.xyzw are 4 numbers in one matrix
# i as batch id
```

Based on the different indexing of MM & BMM, I modified the MM methodology to produce the desired output image.

Test Plan:
```
[ttingchulin@27298.od /data/sandcastle/boxes/fbsource (bmm)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*<test>*" eg.  -- --gtest_filter="*mm*"
Building: finished in 0.1 sec (100%) 328/3361 jobs, 0/3361 updated
  Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *mm*
[==========] Running 8 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 8 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.addmm
[       OK ] VulkanAPITest.addmm (125 ms)
[ RUN      ] VulkanAPITest.addmm_expand
[       OK ] VulkanAPITest.addmm_expand (76 ms)
[ RUN      ] VulkanAPITest.addmm_expand2
[       OK ] VulkanAPITest.addmm_expand2 (0 ms)
[ RUN      ] VulkanAPITest.bmm
[       OK ] VulkanAPITest.bmm (152 ms)
[ RUN      ] VulkanAPITest.bmm_large
[       OK ] VulkanAPITest.bmm_large (4818 ms)
[ RUN      ] VulkanAPITest.bmm_small
[       OK ] VulkanAPITest.bmm_small (4 ms)
[ RUN      ] VulkanAPITest.bmm_one
[       OK ] VulkanAPITest.bmm_one (0 ms)
[ RUN      ] VulkanAPITest.mm
[       OK ] VulkanAPITest.mm (55 ms)
[----------] 8 tests from VulkanAPITest (5233 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (5233 ms total)
[  PASSED  ] 8 tests.

```

Differential Revision: D49306279

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109360
Approved by: https://github.com/yipjustin
2023-09-22 02:52:45 +00:00
8856c1628e [inductor] Change AOTInductor to return output tensors (#109790)
Summary:
Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to return the results. This gives several benefits:

* It makes sure AOTInductor has the same behavior when managing the output tensors as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases where we still have to codegen extra copy_ ops to fill the pre-allocated output tensors, which doesn't make sense for performance.
* With the coming enhanced memory planning, this again will make sure the memory planning logic is the same between AOTInductor and Inductor, which will greatly simplify the problem and improve the reliability.

This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.

Differential Revision: D49502318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
2023-09-22 02:31:52 +00:00
d43f9f7707 Add redirect links to the contributor wiki (#106863)
* Update contribution guide links to the wiki page

---------

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2023-09-21 22:01:20 -04:00
8dcdc74915 torch->onnx export support: quantized::linear_relu (#109755)
- Adds support for quantized::linear_relu
  - Adds weight unpacking pattern matcher
  - Adds to export for opset 10 and 13.
- Adds QAT test modeled after conv2d+relu fusion test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109755
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-09-21 23:24:20 +00:00
175ccfc4c8 Verify flatbuffer module fields are initialized (#109794)
Fixes #109793

Add validation on flatbuffer module fields to prevent segfault

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109794
Approved by: https://github.com/malfet
2023-09-21 23:19:17 +00:00
d65e067baa Updates to attn_mask handling in mem_eff (#109620)
# Summary
Align internal changes to what is in xformers: a67cd57531

We have actually already removed the bias 4d view, so this, in theory, is a no-op and really just increases the safety checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109620
Approved by: https://github.com/cpuhrsch
2023-09-21 22:40:58 +00:00
b5fde4c382 Revert "[Reland] Remove calls of c10::either (#109708)"
This reverts commit 0735f6c0d5857d9ae7893d23c5a4b53bdf887967.

Reverted https://github.com/pytorch/pytorch/pull/109708 on behalf of https://github.com/atalman due to Broke windows periodic tests ([comment](https://github.com/pytorch/pytorch/pull/109708#issuecomment-1730356321))
2023-09-21 22:04:25 +00:00
255d1a776a [MPS] Add support for Mish to MPS backend (#109786)
Fixes [#77764](https://github.com/pytorch/pytorch/issues/77764#issuecomment-1712894444)

Adds the mish activation function to the mps backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109786
Approved by: https://github.com/kulinseth
2023-09-21 21:01:20 +00:00
f7ddc54503 [aotinductor] Update performance benchmark code (109560) (#109820)
Summary: Same as #109560, made a new PR because we need to land from internal

Previously during performance benchmark testing, we would create an AOTInductorModelContainerHandle every time the compiled function is run with new inputs. However after https://github.com/pytorch/pytorch/pull/108473 we now load the constants needed in the runtime when initializing the AOTInductorModelContainerHandle. This resulted in our benchmarks displaying a ~0.4x speedup.

This diff moves the initialization of AOTInductorModelContainerHandle outside of the code where we run the compiled function with different inputs.

For example,
```
python benchmarks/dynamo/huggingface.py --performance --cold-start-latency --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda --total-partitions 3 --partition-id 0 --only AlbertForMaskedLM
```
results in `1.359x` speedup.

Specifically, this adds `create_container_handle` and `delete_container_handle` functions which need to be called before and after `run`. We call `create_container_handle` to initialize the AOTInductorModelContainerHandle, call `run` to run the compiled .so with different inputs, and then `delete_container_handle` to delete it.
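
In sketch form, the intended call pattern is (function signatures are assumptions based on the description above, not the exact benchmark-harness API):

```python
# Handle creation is hoisted out of the timed region so constant loading
# (which happens at handle init since #108473) is not counted in the benchmark.
handle = create_container_handle(so_path)       # initialize once; loads constants
try:
    for example_inputs in all_inputs:
        outputs = run(handle, example_inputs)   # only this part is benchmarked
finally:
    delete_container_handle(handle)             # tear down once
```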

[Updated dashboard results](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2013%20Sep%202023%2021%3A03%3A55%20GMT&stopTime=Wed%2C%2020%20Sep%202023%2021%3A03%3A55%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/aot_inductor_benchmark&lCommit=f9aa49c4c9a1a140b6f0c4520d1d6d99b57e12fa&rBranch=main&rCommit=015be4cedba357eb931e24bf188479235db7c5c8)

Test Plan: CI

Differential Revision: D49513934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109820
Approved by: https://github.com/desertfire
2023-09-21 20:49:41 +00:00
8dedc9dd9b Add meta tests for layer/group/batch norm backward (#109591)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109591
Approved by: https://github.com/ezyang
2023-09-21 18:58:51 +00:00
83b4aab5bc Allow zero sized tensors to be resized with meta_randperm (#109721)
Failure will be handled by `_maybe_resize_out`

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109721
Approved by: https://github.com/ezyang
2023-09-21 18:41:29 +00:00
8207118d55 MAINT/TST: pytorch-ify test_linalg, vendored from NumPy (#109775)
1. Inherit from TestCase
2. Use pytorch parametrization
3. Use unittest.expectedFailure to mark xfails, also unittest skips

All this to make pytest-less invocation work:

$ python test/torch_np/test_basic.py

cross-ref https://github.com/pytorch/pytorch/pull/109593, https://github.com/pytorch/pytorch/pull/109718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109775
Approved by: https://github.com/ezyang
2023-09-21 18:36:19 +00:00
cyy
e9e93c5350 [Reland] Move torch::make_unique to std::make_unique (#109780)
We can first try to move torch::make_unique to std::make_unique despite the revert of #108866.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109780
Approved by: https://github.com/ezyang
2023-09-21 18:30:21 +00:00
c6b9481c15 Update type hint for Tensor.__getitem__. (#109531)
Better type-hint that's similar in spirit to `numpy.ndarray.__getitem__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109531
Approved by: https://github.com/ezyang
2023-09-21 18:19:38 +00:00
b1f1b39feb Revert "Add PR number to metrics when available (#109406)"
This reverts commit 5e19216a6e0e6ee322b7416f9a793a51b1ff8c82.

Reverted https://github.com/pytorch/pytorch/pull/109406 on behalf of https://github.com/atalman due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/109406#issuecomment-1730049340))
2023-09-21 17:59:12 +00:00
09622d8d49 Allow inferring size-nature from sizes passed to empty constructor (#109720)
This removes the need for many constrain_as_size calls as we now
infer them from error checking for sizes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109720
Approved by: https://github.com/aakhundov
2023-09-21 17:57:40 +00:00
6ca964b410 Remove torchtext from Build Official Docker images (#109799)
Fixes nightly official Docker image build.
Failures: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=Build%20Official

### 🤖 Generated by Copilot at 8671bfc

Remove `torchtext` installation from `Dockerfile` for arm64. This fixes the arm64 build of the PyTorch Docker image.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109799
Approved by: https://github.com/seemethere
2023-09-21 17:07:45 +00:00
0351e2042b Avoid throwing exception in ClosingTHPObjectPtr (#109758)
Previously, if ClosingTHPObjectPtr was destructed because we
were unwinding the stack from an exception, we would attempt to call
close() which just isn't going to work.  Two fixes:

1. Detect if we're unwinding due to a Python error, and don't try
   to do more Python stuff if so.

2. If close() fails somehow, write an unraisable exception; don't
   try to throw, because throwing while already unwinding an exception
   will terminate the process.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109758
Approved by: https://github.com/jansel
2023-09-21 17:04:14 +00:00
2cd0b94533 Hide __getattr__ from type checkers (#109683)
Visibility of this causes type checkers to conservatively assume that all attributes are defined on the torch module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109683
Approved by: https://github.com/ngimel, https://github.com/ezyang, https://github.com/malfet
2023-09-21 17:01:23 +00:00
ef8d461b09 Fix torchbench --multiprocess (#109657)
`python benchmarks/dynamo/torchbench.py --multiprocess` currently fails due to initializing distributed multiple times:

```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:6789 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:6789
 (errno: 98 - Address already in use).
```

Because torchbench calls itself via mp.spawn, there is a parent run (with `--multiprocess`) and child runs (with `--multiprocess --only <model>`).

This PR addresses this by fixing two issues:
1) distributed is initialized once in the parent run and once in the child runs; it should be initialized only in the child runs, where we have accurate rank and world size info
2) torchbench overrides CUDA_VISIBLE_DEVICES/world_size sometimes, but it shouldn't for distributed use cases where we want to use all available GPUs
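
A sketch of the shape of fix 1 (names and the port are illustrative): initialize distributed only in the spawned children, where rank and world size are known:

```python
import os
import torch.distributed as dist

def child_main(rank: int, world_size: int):
    # Only the children initialize the process group; the parent never does,
    # so there is no double bind on the rendezvous port.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "6789")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # ... run the model under test ...
    dist.destroy_process_group()
```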

I am also adding a CI test to cover this type of issue in #109311

### Test plan
parent run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess`
child run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess --only simple_gpt`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109657
Approved by: https://github.com/H-Huang
2023-09-21 16:53:07 +00:00
cyy
ba0362a09e Remove unused build system checks and definitions (#109711)
Remove some outdated checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109711
Approved by: https://github.com/ezyang
2023-09-21 16:52:16 +00:00
5e19216a6e Add PR number to metrics when available (#109406)
### 🤖 Generated by Copilot at 780bfa6

Add a new metric for pull request number in `tools/stats/upload_metrics.py`. This allows tracking the CI performance of pull requests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109406
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/clee2000
2023-09-21 16:47:05 +00:00
6b7b9c796e Fix registering jit decompositions for jvp for out wrapped decomps (#109367)
Python decompositions wrapped by `out_wrapper` need to be unwrapped before compiling with TorchScript since:
- `out_wrapper` extends the decomposition's signature with an `out` parameter; however, this `out` parameter is not present in the source code of the original decomposition, so the resulting `ScriptFunction` will not have an `out` parameter
- `out_wrapper` is in the `torch._prims_common.wrappers` module, so its `globals()` are different from the globals of the decomposition being wrapped. This may cause symbol resolution to fail with the TorchScript compiler, since it is compiling the unwrapped decomp's source code rather than the wrapper

The python decomposition for `aten.trace` is wrapped as an example; other decompositions are to be fixed in https://github.com/pytorch/pytorch/pull/107707
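
A self-contained illustration of the signature mismatch (toy stand-ins, not the real `torch._prims_common.wrappers` code):

```python
import functools
import inspect

def out_wrapper(fn):                      # toy version of the real wrapper
    @functools.wraps(fn)
    def wrapper(*args, out=None, **kwargs):
        result = fn(*args, **kwargs)
        return result if out is None else out.copy_(result)
    return wrapper

@out_wrapper
def trace_decomp(x):
    return x.sum()

# A compiler that unwraps the decorator (as inspect does via __wrapped__)
# sees the original source, whose signature has no `out` parameter:
print(inspect.signature(trace_decomp))   # (x) -- the added out kwarg is invisible
```
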
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109367
Approved by: https://github.com/lezcano
2023-09-21 16:36:51 +00:00
406b8412c2 Revert "[inductor] Use _unsafe_view decompostion (#109669)"
This reverts commit 90a2026cd12065994eb234e8c5f332143d9d9468.

Reverted https://github.com/pytorch/pytorch/pull/109669 on behalf of https://github.com/clee2000 due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/109669#issuecomment-1729906056))
2023-09-21 16:25:00 +00:00
3f3e353885 torch.compile + selective activation checkpointing (#105489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105489

NOTE: this PR is tagged "not user facing", because it's not ready to be announced externally yet.

This PR implements torch.compile + selective activation checkpoint (SAC) integration, by using `TagActivationCheckpoint` (same backend as torch.compile + full activation checkpoint integration).

TorchDispatchMode based implementation cannot support including inplace ops in the checkpointed region at the moment (the reason for this needs investigation), and there is also no way to ban them (because TorchDispatchMode now only sees "after-functionalization" ops, so can't detect if an op is in-place). Hence we hide torch.compile + SAC behind a flag (`torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint`) and will only use it internally for cases that are known to not have in-place ops. This state won't last too long, because in-place op will at least be able to be detected after Brian's mode reordering and related functionalization changes.
So next steps after this PR:
1. Wait for Brian's mode reordering and related functionalization changes to land, and then try to enable the "inplace ops" unit test for torch.compile + selective activation checkpoint (if it doesn't work, investigate why).
2. Unify selective- and full-checkpoint under TorchDispatchMode based implementation.

Differential Revision: D47497145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105489
Approved by: https://github.com/anijain2305
2023-09-21 16:24:11 +00:00
a5145364d9 [FSDP] Fix _use_dtensor not automatically turn on for model state dict when using DeviceMesh (#109767)
Fixes #109648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109767
Approved by: https://github.com/fegin
2023-09-21 15:15:45 +00:00
62555930a0 [inductor] Enable mypy checking for codegen/triton_foreach (#109643)
Summary: Add enough typehints to enable mypy checking for codegen/triton_foreach. Also fixed a bug in a dtype param.

Test Plan:
* `python test/inductor/test_foreach.py`
* `lintrunner`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109643
Approved by: https://github.com/mlazos
ghstack dependencies: #109146
2023-09-21 14:30:00 +00:00
4eada253e1 [inductor] Set CUDA_VISIBLE_DEVICES for multi-device subprocess autotuning (#109500)
Summary: The current parallel autotune implementation sets the CUDA_VISIBLE_DEVICES env var too late -- after the benchmarking subprocess has started -- and the torch libraries don't recognize the change. Since the multiprocessing library doesn't support providing an environment for the subprocess, temporarily set CUDA_VISIBLE_DEVICES in the parent process so that the change is inherited by the subprocess.
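
The fix's shape, sketched (names are illustrative; the real change lives in the inductor autotuner):

```python
import os
import multiprocessing as mp

def benchmark_choice(choice):
    # Hypothetical subprocess entry point: CUDA initializes here and therefore
    # sees the CUDA_VISIBLE_DEVICES value inherited from the parent.
    ...

def benchmark_on_device(choice, device: int):
    saved = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device)   # set BEFORE spawning
    try:
        p = mp.get_context("spawn").Process(target=benchmark_choice, args=(choice,))
        p.start()
        p.join()
    finally:                                           # restore the parent's env
        if saved is None:
            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        else:
            os.environ["CUDA_VISIBLE_DEVICES"] = saved
```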

Test Plan:
* New unit test to verify the env var is set in the sub-process and fail the benchmark if it's not.
* Ran multiprocess autotuning and looked at the output from `nvidia-smi pmon` to make sure that all GPUs were assigned processes.

Snippet:
```
    1    3442314     C     2     1     -     -   python
    2    3442318     C     2     1     -     -   python
    3    3442320     C     8     2     -     -   python
    4    3442323     C     9     4     -     -   python
    5    3442325     C    10     4     -     -   python
    6    3442327     C    10     4     -     -   python
    7    3442329     C     2     0     -     -   python
    0    3434906     C     0     0     -     -   python
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109500
Approved by: https://github.com/eellison, https://github.com/shunting314
2023-09-21 14:29:30 +00:00
169ae7540d Revert "Handle unbacked symints in Triton size hints (#109609)"
This reverts commit 654731a52b6bbe0b12f7c5aaac005f8a08c6816f.

Reverted https://github.com/pytorch/pytorch/pull/109609 on behalf of https://github.com/ezyang due to this seems to regress HF perf ([comment](https://github.com/pytorch/pytorch/pull/109609#issuecomment-1729688883))
2023-09-21 14:25:42 +00:00
ac967e9dad [export] Fix tree spec matching behavior. (#109679)
Summary:

Test Plan:
Internal test.
Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109679
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2023-09-21 14:24:09 +00:00
d38379f9f1 Update dynamic shapes documentation (#109764)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109764
Approved by: https://github.com/gchanan
2023-09-21 13:53:43 +00:00
86a9534165 Upgrade nightly wheels to rocm5.7 (#109571)
Follow-up to https://github.com/pytorch/builder/pull/1541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109571
Approved by: https://github.com/ezyang
2023-09-21 13:41:23 +00:00
600d0d0284 Add "cuda" to MPI backend capabilities (#109614)
Summary: Fixes https://github.com/pytorch/pytorch/issues/109543

Test Plan: We need to run CUDA aware MPI in PyTorch to actually test this change, we currently have no MPI tests.

Differential Revision: D49420438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109614
Approved by: https://github.com/XilunWu
2023-09-21 13:34:58 +00:00
b91ba226ce Don't use cpuinfo on s390x (#109496)
It doesn't support s390x and just crashes pytorch on init.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109496
Approved by: https://github.com/huydhn
2023-09-21 12:20:49 +00:00
772e104dfd [inductor] visualize fused ops in svg graph (#107752)
example usage
* `TORCH_COMPILE_DEBUG=1 INDUCTOR_ORIG_FX_SVG=1 INDUCTOR_POST_FUSION_SVG=1 python trig.py`: show the original fx node name, file, and code; see snapshot 2, where we have origin_0, 1, 2
* trig.py can be found in P816304818

Implementation
* keep the original fx graph in GraphLowering: ```self.orig_gm: torch.fx.GraphModule = gm.__copy__()```
* draw the original fx graph with origins in `ir_post_fusion`: ```V.debug.draw_orig_fx_graph(self.orig_gm, self.scheduler.nodes)```; node.meta["buff_meta"] tracks buf_name

<img width="350" alt="Screenshot 2023-08-29 at 12 40 24 PM" src="https://github.com/pytorch/pytorch/assets/134637289/c4e197cb-ab3b-4a09-a584-c1356376accb">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107752
Approved by: https://github.com/mlazos
2023-09-21 08:03:05 +00:00
cyy
f5b753bab1 Fix inline_container_test on Windows (#109754)
Fix the failure mentioned in https://github.com/pytorch/pytorch/pull/109393. The reason is that IO streams were not opened in binary mode while binary data was written and read. Interestingly, the test passed on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109754
Approved by: https://github.com/malfet
2023-09-21 07:46:25 +00:00
b780b246eb Use a reduction implementation for unique when dtype is bool on CPU (#109695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109695
Approved by: https://github.com/lezcano
2023-09-21 06:56:10 +00:00
cddd0db241 Add finfo properties for float8 dtypes (#109744)
Add float8 finfo checks to `test_type_info.py`
Fixes https://github.com/pytorch/pytorch/issues/109737
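Quick usage illustration (the exact values depend on the float8 format):

```python
import torch

fi = torch.finfo(torch.float8_e4m3fn)   # works after this PR
print(fi.max, fi.min, fi.eps)
```
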
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109744
Approved by: https://github.com/drisspg
2023-09-21 03:41:48 +00:00
e2e9d15726 Unblock float16 dtype for xla autocasting (#109554)
`torch.autocast` with `xla` backend has been restricted to `torch.bfloat16`. This shouldn't be the case anymore.

This works with `xla::cast( ..., type=f16)`
```
IR {
  %0 = f32[] prim::Constant(), xla_shape=f32[], value=1
  %1 = f32[3,2]{1,0} aten::expand(%0), xla_shape=f32[3,2]{1,0}, size=(3, 2), dynamic_dims=(0, 0)
  %2 = f16[3,2]{1,0} xla::cast(%1), xla_shape=f16[3,2]{1,0}, type=f16, dtype=Half, stype=Float
  %3 = f32[] prim::Constant(), xla_shape=f32[], value=1
  %4 = f32[2,3]{1,0} aten::expand(%3), xla_shape=f32[2,3]{1,0}, size=(2, 3), dynamic_dims=(0, 0)
  %5 = f16[2,3]{1,0} xla::cast(%4), xla_shape=f16[2,3]{1,0}, type=f16, dtype=Half, stype=Float
  %6 = f16[2,2]{1,0} aten::mm(%5, %2), xla_shape=f16[2,2]{1,0}, ROOT=0
}
```

This will allow PyTorch/XLA to extend its autocast implementation to use `xla` backend for `float16` type as well.
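
A hedged usage sketch (requires torch_xla; device setup is illustrative):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.ones(3, 2, device=device)
b = torch.ones(2, 3, device=device)
with torch.autocast(device_type="xla", dtype=torch.float16):
    c = b @ a   # matmul runs in f16 via xla::cast, matching the IR above
```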

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109554
Approved by: https://github.com/JackCaoG, https://github.com/bdhirsh
2023-09-21 03:19:44 +00:00
13bd4ed933 Add docs for torch.compile(numpy) (#109710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109710
Approved by: https://github.com/ev-br, https://github.com/gchanan, https://github.com/peterbell10
2023-09-21 03:05:21 +00:00
7a04ae6fba [export] Remove redundant no_grad() for exported program execution. (#109686)
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109686
Approved by: https://github.com/angelayi
2023-09-21 01:20:54 +00:00
e4d8ec9fe8 inductor: only do the conv+bn folding for the freezing path (#109587)
Re-enable PR: https://github.com/pytorch/pytorch/pull/109270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109587
Approved by: https://github.com/eellison
2023-09-21 00:47:37 +00:00
9e2b07ac9d [Inductor] Break the loop fusion when node2 depends on node1 mutations (#109172)
**Summary**
Fix the issue: https://github.com/pytorch/pytorch/issues/108963. After this PR, loop fusion should break when node2 depends on node1's buffer mutation. Take the UT as an example:

- Before this PR, the generated code is:
```
cpp_fused_div_index_add_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const double* in_ptr0,
                       const long* in_ptr1,
                       const double* in_ptr2,
                       double* out_ptr0,
                       double* out_ptr1)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        out_ptr0[static_cast<long>(0L)] = tmp0;
    }
    {
        auto tmp0 = in_ptr1[static_cast<long>(0L)];
        auto tmp1 = in_ptr2[static_cast<long>(0L)];
        auto tmp4 = out_ptr0[static_cast<long>(0L)];
        auto tmp2 = static_cast<double>(2.0);
        auto tmp3 = decltype(tmp1)(tmp1 * tmp2);
        auto tmp5 = tmp4 / tmp2;
        atomic_add(&out_ptr0[static_cast<long>(0L)], tmp3);
        out_ptr1[static_cast<long>(0L)] = tmp5;
    }
}
''')
```

- After this PR, the generated code is:
```
cpp_fused_div_index_add_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const double* in_ptr0,
                       const long* in_ptr1,
                       const double* in_ptr2,
                       double* out_ptr0,
                       double* out_ptr1)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        out_ptr0[static_cast<long>(0L)] = tmp0;
    }
    {
        auto tmp0 = in_ptr1[static_cast<long>(0L)];
        auto tmp1 = in_ptr2[static_cast<long>(0L)];
        auto tmp2 = static_cast<double>(2.0);
        auto tmp3 = decltype(tmp1)(tmp1 * tmp2);
        atomic_add(&out_ptr0[static_cast<long>(0L)], tmp3);
    }
    {
        auto tmp0 = out_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<double>(2.0);
        auto tmp2 = tmp0 / tmp1;
        out_ptr1[static_cast<long>(0L)] = tmp2;
    }
}
''')
```

**Test Plan**
```
python -u -m pytest -s -v test_torchinductor.py -k test_mutations_loop_fusion
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109172
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-09-21 00:30:51 +00:00
9c2715bbb2 [inductor] Clean up AOTInductor runtime ABI (#109678)
Summary: Change the AOTInductor runtime interface to avoid referring to aten data structures directly, mostly at::Tensor and ProxyExecutor. This is a combination of https://github.com/pytorch/pytorch/pull/109436,  https://github.com/pytorch/pytorch/pull/109498, https://github.com/pytorch/pytorch/pull/109450, https://github.com/pytorch/pytorch/pull/109606, plus a few internal build changes.

Reviewed By: frank-wei

Differential Revision: D49374820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109678
Approved by: https://github.com/frank-wei, https://github.com/chenyang78
2023-09-21 00:25:24 +00:00
4e3b03217d [BE] Replace 8 with CHAR_BIT (#109740)
Defined in [limits.h](https://en.cppreference.com/w/c/types/limits) as the number of bits per byte

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109740
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2023-09-20 23:42:25 +00:00
6e3a7473cf Trace calls with Python Enum values. (#109507)
Fix: #82135
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109507
Approved by: https://github.com/ezyang
2023-09-20 22:18:11 +00:00
55685d57c0 [JIT] Fix typed enum handling in 3.11 (#109717)
In Python-3.11+, typed enums (such as `enum.IntEnum`) retain `__new__`, `__str__`, and other methods of the base class via the `__init_subclass__()` method (see https://docs.python.org/3/whatsnew/3.11.html#enum ), i.e. the following code
```python
import sys
import inspect
from enum import Enum

class IntColor(int, Enum):
    RED = 1
    GREEN = 2

class Color(Enum):
    RED = 1
    GREEN = 2

def get_methods(cls):
    def predicate(m):
        if not inspect.isfunction(m) and not inspect.ismethod(m):
            return False
        return m.__name__ in cls.__dict__
    return inspect.getmembers(cls, predicate=predicate)

if __name__ == "__main__":
    print(sys.version)
    print(f"IntColor methods {get_methods(IntColor)}")
    print(f"Color methods {get_methods(Color)}")
```

Returns an empty list for both cases on older Python, but on Python-3.11+ it returns a list containing enum constructors and other methods:
```shell
% conda run -n py310 python bar.py
3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:41:52) [Clang 15.0.7 ]
IntColor methods []
Color methods []
% conda run -n py311 python bar.py
3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:21:25) [Clang 14.0.4 ]
IntColor methods [('__format__', <function Enum.__format__ at 0x105006ac0>), ('__new__', <function Enum.__new__ at 0x105006660>), ('__repr__', <function Enum.__repr__ at 0x1050068e0>)]
Color methods []
```

This change allows typed enums to be scriptable on 3.11 by explicitly marking several `enum.Enum` methods to be dropped by jit script, and adds a test that typed enums are jit-scriptable.

Fixes https://github.com/pytorch/pytorch/issues/108933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109717
Approved by: https://github.com/atalman, https://github.com/davidberard98
2023-09-20 22:09:41 +00:00
7ce69d5dbe [RELAND] Remove some unnecessary <iostream> includes from headers (#108150)
In almost all cases this is only included for writing the output formatter, which
only uses `std::ostream`, so including `<ostream>` is sufficient.

The istream header is ~1000 lines so the difference is non-trivial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108150
Approved by: https://github.com/albanD, https://github.com/malfet
ghstack dependencies: #108149
2023-09-20 21:55:15 +00:00
05b3a4dd88 Fix test_libtorch.bat not exiting on error (#109393)
For some weird reason, the batch file gets rid of the `exit /b 1` inside the for loop, so failures never actually get surfaced.  Add skips for the tests that were failing.
Also don't run the Windows CPU build on main since it's in trunk. This is what currently works for the ROCm build.

The temp file failure originates from https://github.com/pytorch/pytorch/pull/108508 (got fixed before I merged this PR)

I'm not sure when the ChunkRecordIteratorTest started failing, but it was after the above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109393
Approved by: https://github.com/malfet
2023-09-20 21:34:40 +00:00
cyy
0735f6c0d5 [Reland] Remove calls of c10::either (#109708)
While there were FB-internal issues encountered when removing c10::either in #109299, we should be able to change the OSS code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109708
Approved by: https://github.com/clee2000
2023-09-20 21:23:10 +00:00
cadb566bbc [RELAND] [ATen] Update pre-compiled header (#108149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108149
Approved by: https://github.com/albanD
2023-09-20 20:38:30 +00:00
8bc00dfffd Hashing for constant and singleton SymInt/SymBool (#109170)
Bugfix:
- previously, SymBool did not implement `__eq__`, so Python fell back to the default `__eq__` and `__hash__`
- in this PR, we make SymBool implement `__eq__`
- symbolic SymBool now raises an error when hashed, just like SymInt/SymFloat

New feature:
- previously, SymInt and SymFloat were unhashable (even when singleton or constant)
- in this PR, SymInt and SymBool are hashable if singleton/constant

Stay the same:
- SymNode are hashable due to default Python behavior
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109170
Approved by: https://github.com/ezyang
ghstack dependencies: #109169
2023-09-20 20:37:15 +00:00
5252fcb133 Handle constant SymBool in unary and binary operations (#109169)
In this PR:
- When constant SymNodes are detected in unary/binary ops, demote them to plain int/bool before proceeding. Sometimes this means doing a unary op with a constant SymNode results in a plain bool.
- Introduce an is_symbolic method, only available from Python. We need this because isinstance(x, SymInt) is no longer sufficient to check whether a given int/SymInt is symbolic or not. See a later PR in the stack for how this is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109169
Approved by: https://github.com/ezyang
2023-09-20 20:37:15 +00:00
8597d37536 Implement numpy(force=True) (#109636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109636
Approved by: https://github.com/ezyang
ghstack dependencies: #109634
2023-09-20 20:06:13 +00:00
1f6828ca99 Fix numpy(force=False) (#109634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109634
Approved by: https://github.com/ezyang
2023-09-20 20:06:13 +00:00
9a1b6d44bb [C10d] Add PG::enableCollectivesTiming to make it dynamically enabled. (#108814)
Collectives timing gates the tracking of when a collective starts on the device.

Currently it's enabled by setting the NCCL_ENABLE_TIMING env var.

The goal of this PR is to make it possible to dynamically enable that flag so users of the PG hooks don't have to set that flag in order to have their hooks work.

The design is that once set, all new collectives will have such behavior so we track it on each Work object.

We make enableTiming_ atomic in PGNCCL to avoid races on non-TSO hardware.

To ensure consistency, we copy its value during Work construction and replace all previous usage of enableTiming_ from the PG with usages from the Work, which now has an immutable value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108814
Approved by: https://github.com/wconstab, https://github.com/fduwjj
ghstack dependencies: #108813
2023-09-20 19:47:41 +00:00
3add22b716 Created nested utils.cpp (#109304)
# Summary
This refactors the preprocessing for nestedtensors that glue into SDPA. This is done in order to aid with reviewing:
https://github.com/pytorch/pytorch/pull/97485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109304
Approved by: https://github.com/cpuhrsch
2023-09-20 19:38:34 +00:00
559d1f94a0 Revert "[Dynamo][Test] reland testcase with state (#109713)"
This reverts commit 5c897eacff8bc8f559d336d02f5c627c0045ac9d.

Reverted https://github.com/pytorch/pytorch/pull/109713 on behalf of https://github.com/PaliC due to creates a out of memory error for macos tests ([comment](https://github.com/pytorch/pytorch/pull/109713#issuecomment-1728314478))
2023-09-20 19:34:07 +00:00
f9947830bb [ONNX] Remove the deprecated function in symbolic_helper (#109681)
These three functions in symbolic_helper are deprecated and should be removed after PyTorch 2.0.

The clean up job will be separated into several patches to ensure the safety. See: https://github.com/pytorch/pytorch/pull/107208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109681
Approved by: https://github.com/thiagocrepaldi
2023-09-20 19:31:39 +00:00
f3c12f5aa2 [DCP][test] Update test_dtensor_resharding.py (#109619)
Remove @parametrize and replace it with a for loop. This is because @parametrize makes the test name too complicated for the internal test infrastructure to recognize.

@fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109619
Approved by: https://github.com/fegin
2023-09-20 19:05:07 +00:00
7e05cd4eca [autotuning] move logging logic into logging function (#109155)
Summary: move check for use_global_cache into logging functions

Test Plan: sandcastle/ci

Differential Revision: D49211797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109155
Approved by: https://github.com/jansel
2023-09-20 18:53:59 +00:00
90a2026cd1 [inductor] Use _unsafe_view decompostion (#109669)
As per the old comment, decomposing is better than lowering because patterns for
`view` would apply to `_unsafe_view` as well.

fc47ba2794/torch/_inductor/decomposition.py (L89)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109669
Approved by: https://github.com/lezcano
ghstack dependencies: #109667, #109668
2023-09-20 18:45:56 +00:00
6f0cf5a837 [decomp] Decompose unsafe_split{,_with_sizes} into safe variants (#109668)
The "safety" aspect refers to the output not being registered as aliasing the
input, but after AOTAutograd I don't think this distinction matters. However,
we shouldn't use the same decomposition as the safe variant in case the backend
doesn't want to decompose split.
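
A simplified sketch of the direction described (not the registered decomposition itself, which goes through the decomposition tables):

```python
import torch

def unsafe_split_decomp(x: torch.Tensor, split_size: int, dim: int = 0):
    # Route the "unsafe" variant to the safe op: post-AOTAutograd the aliasing
    # distinction no longer matters, and a backend that lowers `split` itself
    # still sees `split` instead of a fully expanded decomposition.
    return torch.split(x, split_size, dim)
```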

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109668
Approved by: https://github.com/lezcano
ghstack dependencies: #109667
2023-09-20 18:45:56 +00:00
9e629dd73c [decomp] Add all std and std_mean overloads to core decompostions (#109667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109667
Approved by: https://github.com/lezcano
2023-09-20 18:45:56 +00:00
36a8105f54 [decomp] Fix baddbmm decomposition (#109714)
The decomposition is currently registered without the pw_cast_for_opmath
decorator, due to the ordering of decorators being meaningful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109714
Approved by: https://github.com/lezcano
2023-09-20 18:40:21 +00:00
b60a7c59ea Refactor check_fast_path_restriction in preparation for has_empty_tensor variant (#109534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109534
Approved by: https://github.com/albanD
2023-09-20 18:24:30 +00:00
5c897eacff [Dynamo][Test] reland testcase with state (#109713)
Reland the PR https://github.com/pytorch/pytorch/pull/108750 reverted by https://github.com/pytorch/pytorch/issues/108838 , since https://github.com/pytorch/pytorch/pull/108969 has been merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109713
Approved by: https://github.com/anijain2305
2023-09-20 18:19:18 +00:00
be712a02e9 Trace pytree calls inside vmap implementation. (#109107)
This PR fixes the `expectedFailure` introduced in the previous PR.

**Problem:** container variables, such as `ConstDictVariable`, aren't registered nodes anymore. But we still have to process the tensors inside them.

**Solution:** wrap the pytree functions in a `UserFunctionVariable`, and call it. This should inline the given pytree function, and return the expected processed arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109107
Approved by: https://github.com/zou3519
ghstack dependencies: #109201, #108533
2023-09-20 18:11:10 +00:00
654731a52b Handle unbacked symints in Triton size hints (#109609)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109609
Approved by: https://github.com/yf225
ghstack dependencies: #109603
2023-09-20 18:03:54 +00:00
1c4e811565 replace data_ptr with aoti_torch_get_data_ptr for cpp codegen (#109615)
Summary:
in cpp codegen, we should use aoti_torch_get_data_ptr
for retrieving aten tensor pointers if abi_compatible is true

Test Plan: ci

Reviewed By: bertmaher

Differential Revision: D49411392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109615
Approved by: https://github.com/bertmaher, https://github.com/desertfire, https://github.com/jansel
2023-09-20 17:26:17 +00:00
cdb51d2ad0 Revert "[2/N] Add -Wdeprecated and related fixes (#109564)"
This reverts commit 5b50641bac49e00ad05060f0b9fe3dcc5d73bc9b.

Reverted https://github.com/pytorch/pytorch/pull/109564 on behalf of https://github.com/atalman due to Need to revert as followup revert of first PR 108626 ([comment](https://github.com/pytorch/pytorch/pull/109564#issuecomment-1728137207))
2023-09-20 17:15:57 +00:00
af3741745c [CI] Add torch.compile works without numpy test (#109624)
Fixes https://github.com/pytorch/pytorch/issues/109387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109624
Approved by: https://github.com/albanD
2023-09-20 17:07:20 +00:00
b771c04d6e Handle unbacked symints in buffer reuse calculation (#109603)
This is rewritten from https://github.com/pytorch/pytorch/pull/106655 to land faster, with peterbell10's comments.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109603
Approved by: https://github.com/yf225
2023-09-20 16:54:57 +00:00
63025d4218 Do not redundantly min start with new_size[dim], since end is already min'ed with it (#109599)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109599
Approved by: https://github.com/aakhundov
2023-09-20 16:53:42 +00:00
1cc052bcab Revert "[1/N] Add -Wdeprecated and related fixes (#108626)"
This reverts commit a53a677b4d8b9f4b9abbfeed2a6d4c00e9ee2252.

Reverted https://github.com/pytorch/pytorch/pull/108626 on behalf of https://github.com/clee2000 due to I'm getting errors internally that look like the below on x86_64-apple-ios-simulator with clang 16 ([comment](https://github.com/pytorch/pytorch/pull/108626#issuecomment-1728102447))
2023-09-20 16:49:11 +00:00
db6e9f66f1 Use pretty print for checking no duplicated pattern (#109066)
The pretty print is faster and more concise because it memoizes objects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109066
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917, #109142, #109156
2023-09-20 16:44:09 +00:00
d24ba7a634 Add 3d Attn Pattern to match HF Whisper (#109156)
Adds a 3d pattern that improves perf of HF Whisper from 1.3 -> 4.1. We could be matching more generally on 3d, but I'll leave that for another PR.

Thanks to @drisspg for helping me write the pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109156
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917, #109142
2023-09-20 16:39:31 +00:00
881bfbf21d [c10d] Add tests for usig libuv through init_process_group. (#108661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108661
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-09-20 16:02:20 +00:00
cyy
567e8ebf94 [1/N] Move c10::variant to std::variant (#103675)
This PR moves some calls of c10::variant to std::variant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103675
Approved by: https://github.com/ezyang
2023-09-20 15:21:24 +00:00
e87bd9f588 [aot inductor] Make unit tests work on CPU (#109625)
Summary: AOT inductor is only sort-of supported on CPU right now, but it works
with a few hacks (the .so needs to be compiled and run with CUDA present,
because we haven't excised the CUDA deps; also there's an `is_cpu` flag that
needs to be plumbed into the call, or else all the weights are erroneously
allocated on GPU).

But, with those hacks in place, it currently works, so it's worth having the
current state of it continue working (and at some point we'll remove the
hacks).

Test Plan:
```
python test_aot_inductor -k test_simple_cpu
```

Reviewers: binbao

Subscribers:

Tasks:

Tags:

Differential Revision: [D49427400](https://our.internmc.facebook.com/intern/diff/D49427400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109625
Approved by: https://github.com/mikekgfb, https://github.com/chenyang78, https://github.com/desertfire
2023-09-20 14:51:44 +00:00
e73efbffab [Test][ShardedTensor] Add test for corner case for chunk sharding spec (#109626)
## Description
Add a test case to cover the corner case of empty shards when creating ShardedTensor.
Original fix contributed by a user.
https://github.com/pytorch/pytorch/pull/108915

## Test
With the fix, the test added runs fine.
Without the fix in https://github.com/pytorch/pytorch/pull/108915, the test case added would throw the following assertion error.
```
(/home/irisz/local/a/pytorch-env) [irisz@devgpu051.cln3 ~/local/pytorch (add_test_for_corner_case_for_chunk_sharding_spec)]$ python3 test/distributed/_shard/sharded_tensor/test_sharded_tensor.py TestShardTensor.test_shard_tensor_with_empty_shard
Fail to import hypothesis in common_utils, tests are not derandomized
INFO:numba.cuda.cudadrv.driver:init
Fail to import hypothesis in common_utils, tests are not derandomized
Fail to import hypothesis in common_utils, tests are not derandomized
Fail to import hypothesis in common_utils, tests are not derandomized
Fail to import hypothesis in common_utils, tests are not derandomized
NCCL version 2.18.3+cuda12.0
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] Caught exception:
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 658, in run_test
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 544, in wrapper
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]     fn()
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/common_utils.py", line 2406, in wrapper
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 94, in wrapper
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]     func(self, *args, **kwargs)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py", line 258, in test_shard_tensor_with_empty_shard
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]     st = _shard_tensor(tensor, spec)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/distributed/_shard/api.py", line 68, in _shard_tensor
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]     st = sharding_spec.shard(tensor, src_rank=src_rank, process_group=process_group)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/distributed/_shard/sharding_spec/chunk_sharding_spec.py", line 170, in shard
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]     assert local_tensor is not None
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] AssertionError
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR]  exiting process 3 with exit code: 10
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] Caught exception:
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 658, in run_test
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 544, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     fn()
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/common_utils.py", line 2406, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 94, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     func(self, *args, **kwargs)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py", line 258, in test_shard_tensor_with_empty_shard
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     st = _shard_tensor(tensor, spec)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/distributed/_shard/api.py", line 68, in _shard_tensor
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     st = sharding_spec.shard(tensor, src_rank=src_rank, process_group=process_group)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/distributed/_shard/sharding_spec/chunk_sharding_spec.py", line 179, in shard
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     dist.scatter(
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/distributed/c10d_logger.py", line 68, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/distributed/distributed_c10d.py", line 3143, in scatter
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     _check_tensor_list(scatter_list, "scatter_list")
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]   File "/data/users/irisz/pytorch/torch/distributed/distributed_c10d.py", line 808, in _check_tensor_list
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]     raise TypeError(
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] TypeError: Invalid function argument. Expected parameter `scatter_list` to be of type List[torch.Tensor].
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/_shard/sharded_tensor/test_sharded_tensor.py -k test_shard_tensor_with_empty_shard
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
Process 3 terminated with exit code 10, terminating remaining processes.
E
======================================================================
ERROR: test_shard_tensor_with_empty_shard (__main__.TestShardTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 542, in wrapper
    self._join_processes(fn)
  File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 761, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 811, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 658, in run_test
    getattr(self, test_name)()
  File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 544, in wrapper
    fn()
  File "/data/users/irisz/pytorch/torch/testing/_internal/common_utils.py", line 2406, in wrapper
    method(*args, **kwargs)
  File "/data/users/irisz/pytorch/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 94, in wrapper
    func(self, *args, **kwargs)
  File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
    return func(*args, **kwargs)
  File "/data/users/irisz/pytorch/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py", line 258, in test_shard_tensor_with_empty_shard
    st = _shard_tensor(tensor, spec)
  File "/data/users/irisz/pytorch/torch/distributed/_shard/api.py", line 68, in _shard_tensor
    st = sharding_spec.shard(tensor, src_rank=src_rank, process_group=process_group)
  File "/data/users/irisz/pytorch/torch/distributed/_shard/sharding_spec/chunk_sharding_spec.py", line 170, in shard
    assert local_tensor is not None
AssertionError
----------------------------------------------------------------------
Ran 1 test in 21.207s

FAILED (errors=1)
```

cc. @fduwjj @wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109626
Approved by: https://github.com/fduwjj
2023-09-20 14:40:07 +00:00
a019e5cbff s390x onnx: byteswap data when serializing it (#107963)
This change fixes test_pad, test_pad_with_dynamic_input_shape, test_reshape, test_resize and test_resize_after_concat in test/onnx/test_pytorch_onnx_shape_inference.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107963
Approved by: https://github.com/justinchuby
2023-09-20 14:27:45 +00:00
40b2c796dc [Decomposition] baddbmm (#108534)
Summary:
Moving the decomposition of baddbmm out of _inductor/decomposition.py and including it in core_aten_decompositions

ff38c0e2f9/torch/_inductor/decomposition.py (L203)
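For reference, the math this decomposition implements is the standard batched affine combination; a minimal illustrative sketch (not the literal registered decomp):

```python
import torch

def baddbmm_decomp(input, batch1, batch2, *, beta=1, alpha=1):
    # baddbmm = beta * input + alpha * (batch1 @ batch2), batched over dim 0
    result = torch.bmm(batch1, batch2) * alpha
    if beta == 0:
        # skip the input term entirely so zeros don't propagate nan/inf
        return result
    return input * beta + result
```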

Test Plan: Phabricator + OSS Tests

Differential Revision: D48871741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108534
Approved by: https://github.com/SherlockNoMad
2023-09-20 12:49:32 +00:00
b30ee35a6f [Inductor][FX]Support efficient conv bn eval (#108757)
This PR adds an `efficient_conv_bn_eval_graph_transform` pass to the inductor. It tries to identify consecutive conv + bn **computation** with bn in eval mode, and changes it to a more efficient implementation. It does not modify parameters, which makes it **support training** without any pain. If no such patterns are identified, it does nothing. Therefore, it is backward compatible.
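For intuition, the transform exploits the fact that a conv followed by an eval-mode BN is algebraically a single conv with rescaled parameters; a minimal sketch of that folding arithmetic (illustrative names, not the pass itself):

```python
import torch

def fold_conv_bn_eval(w, b, bn_mean, bn_var, bn_w, bn_b, eps=1e-5):
    # bn(conv(x)) = bn_w * (conv(x) - mean) / sqrt(var + eps) + bn_b
    # which equals conv(x) with weight w * scale and bias (b - mean) * scale + bn_b
    scale = bn_w / torch.sqrt(bn_var + eps)
    folded_w = w * scale.reshape(-1, 1, 1, 1)
    folded_b = (b - bn_mean) * scale + bn_b
    return folded_w, folded_b
```

Because the folding can be recomputed on the fly from the current parameters, the original conv and bn parameters stay untouched, which is what makes training still work.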

It has great benefit in terms of memory footprint:

For resnet50 with input batchsize 64, image size 224, forward + backward training:

| Technique                   | Memory Footprint (GB)      | Remarks                                   |
|-------------------------------|----------------------------|-------------------------------------------|
| Eager Mode  | 5.18          |                                           |
| torch.compile                 | 5.46         | Strangely, not saving memory              |
| torch.compile with this PR                       | 2.88          | **Saves about 50% memory!**         |

The script to measure the memory footprint:

```python
from torchvision.models.resnet import resnet50
import torch

net = resnet50().eval().cuda()

input = torch.randn(64, 3, 224, 224).cuda()

opt_net = torch.compile(net) # Use torch.compile
# opt_net = net # Eager mode

current_memory = torch.cuda.memory_allocated()
torch.cuda.reset_peak_memory_stats()

for i in range(10):
    opt_net.zero_grad()
    output = opt_net(input)
    output.sum().backward()
    del output

peak_memory = torch.cuda.max_memory_allocated()
additional_peak_memory = peak_memory - current_memory
print(f"Additional peak memory used: {additional_peak_memory / (1024 ** 3)} GB")
```

More results can be found in the corresponding paper (this method is called Tune Mode in the tables).

<img width="709" alt="image" src="https://github.com/pytorch/pytorch/assets/23236638/db4815b0-d93e-4726-b1d5-e6651f256484">

<img width="653" alt="image" src="https://github.com/pytorch/pytorch/assets/23236638/22e5e1ab-6129-4c3d-a875-3c7343293b2e">

Note: the difference between this PR and https://github.com/pytorch/pytorch/pull/106372 is that, https://github.com/pytorch/pytorch/pull/106372 tries to fix and change the implementation of `torch.fx.experimental.optimization.fuse`, which causes compatibility issues; this PR only introduces a new graph transform passes, and does not break the previous code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108757
Approved by: https://github.com/jansel
2023-09-20 08:10:02 +00:00
595af261b2 [ao] Support Subclasses of FloatFunctional in eager mode prepare (#109646)
Summary: As the title says, if a module subclasses `nnq.FloatFunctional`, also add observers to it, just as is done for `nnq.FloatFunctional` itself

Test Plan: CI

Differential Revision: D49431968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109646
Approved by: https://github.com/jerryzh168
2023-09-20 08:09:55 +00:00
293205c54b [AOTInductor] Fix aot_inductor/test:test_custom_ops (#109660)
Summary: Fix aot_inductor/test:test_custom_ops, which was broken by https://github.com/pytorch/pytorch/pull/109391

Test Plan: buck2 run mode/dev-nosan //deeplearning/aot_inductor/test:test_custom_ops

Differential Revision: D49438928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109660
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-09-20 07:44:39 +00:00
cyy
5b50641bac [2/N] Add -Wdeprecated and related fixes (#109564)
This PR follows #108626.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109564
Approved by: https://github.com/ezyang
2023-09-20 07:03:25 +00:00
cyy
d137b620c5 Fix c10_tempfile_test failure on Windows (#109680)
Fixes c10_tempfile_test indicated by #109393.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109680
Approved by: https://github.com/clee2000
2023-09-20 07:01:42 +00:00
ad53b53518 Generate patterns in fp16 and fp32 (#109142)
aten.softmax will generate a different decomposition for fp16/bf16 and fp32 because when invoked in lower precision it will upcast the inputs to fp32 and then downcast after. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as, I'm sure, do many other models).
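Concretely, the lower-precision path traces to roughly the following shape, which is why an fp32-traced pattern fails to match it (a sketch, not the actual decomp source):

```python
import torch

def softmax_half(x: torch.Tensor, dim: int) -> torch.Tensor:
    # fp16/bf16 inputs are upcast before the softmax and downcast after,
    # so the traced graph contains extra convert nodes that fp32 lacks
    return torch.softmax(x.to(torch.float32), dim).to(x.dtype)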

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917
2023-09-20 06:38:02 +00:00
122264a0c0 [generate_opcheck_tests] tests should ignore meta/FakeTensors (#109641)
These tests generally don't work on meta tensors because they need to
compare the data of the Tensors. For example, SchemaCheckMode errors out
if any inputs are meta or Fake because it needs to check their storages
to see if any mutation occurred and those do not have storages.
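A minimal sketch of the kind of guard this implies (hypothetical helper, not the actual test-suite code):

```python
import torch
from torch._subclasses.fake_tensor import FakeTensor

def has_real_data(t: torch.Tensor) -> bool:
    # meta and fake tensors carry no real storage to compare against
    return not (t.is_meta or isinstance(t, FakeTensor))
```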

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109641
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #109637, #109638, #109639, #109640
2023-09-20 06:33:37 +00:00
d3d71367b9 [generate_opcheck_tests] Always print a repro (#109640)
On failure of a test, we will always print a "repro". This repro isn't
really runnable but gives the user a sense of how to actually reproduce
the test without the test suite, because using the test suite is a bit
convoluted.

If the user passes PYTORCH_OPCHECK_PRINT_BETTER_REPRO, we will print a
fuller repro that saves the exact problematic test inputs to disk and
reads them back out.

Test Plan:
- expecttests on the generate_repro helper function
- tried this out locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109640
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #109637, #109638, #109639
2023-09-20 06:33:37 +00:00
af900fe228 [generate_opcheck_tests] flip unified_diff order (#109639)
It was reversed. As written this is a bit difficult to test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109639
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #109637, #109638
2023-09-20 06:33:37 +00:00
7564f04389 [generate_opcheck_tests] add type checking (#109638)
Test Plan:
- lintrunner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109638
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #109637
2023-09-20 06:33:37 +00:00
10d575911e [generate_opcheck_tests] rename "success" to "xsuccess" (#109637)
Not BC breaking because no existing failures dict have "success" in
them.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109637
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2023-09-20 06:33:37 +00:00
d271a5c796 [minimizer]skip mode for minimizer (#109399)
Summary: skip known-issue nodes in the minimizer and check the whole graph

Reviewed By: siyan-lin

Differential Revision: D48990707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109399
Approved by: https://github.com/jfix71
2023-09-20 06:23:46 +00:00
067f172930 Serialize Remaining Patterns (#108917)
Serializes the remaining traced patterns.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108917
Approved by: https://github.com/davidberard98
ghstack dependencies: #109663, #108894
2023-09-20 05:39:23 +00:00
16d608d70d Add Python serialization to Pattern Matcher patterns (#108894)
Adds a Python Pretty Printer to the pattern matcher that serializes patterns as python. Generating our fuse attention patterns was taking 4 seconds of compile time, which will only get worse as we add more variants (which I will do in the rest of this stack). To write out patterns, build pytorch, then run `gen_attention_patterns.py`.

Since there is a line limit for PRs, I'm only including _sdpa_pattern1 in this first diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108894
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663
2023-09-20 05:36:52 +00:00
1a5e0edf56 [dynamo] Avoid divided by zero error when printing out choices (#109328)
Summary: We hit this problem in practice. I think whenever it occurs, something bad has probably already happened upstream, e.g. the run instance returned immediately due to an error. Throwing here should let us catch the real issue earlier.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109328
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-09-20 05:27:20 +00:00
76dd38b591 add back in unsafe view decomp (#109663)
This decomp makes pattern matching easier, and was only just excluded from the decomp set in https://github.com/pytorch/pytorch/pull/108713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109663
Approved by: https://github.com/davidberard98, https://github.com/yanboliang
2023-09-20 05:23:59 +00:00
238fb66085 python functionalization: support higher order ops (#108656)
We now have two types of functionalization, C++ Functionalization (through the `Functionalize` dispatch key), and python functionalization (through the `FunctionalTensorMode` torch_dispatch mode).

This means that all higher order ops need custom functionalization rules for the python variant too. I added them here, as well as a helper function `dispatch_functionalize()` - equivalent to `torch.func.functionalize()`, except that it uses `FunctionalTensorMode`.

In theory we could have secretly switched `torch.func.functionalize` to use `FunctionalTensorMode`. This would be BC-breaking, though, since `FunctionalTensorMode` isn't composable with the other functorch transforms (the functorch layer-mode stack doesn't know how to re-order torch_dispatch modes arbitrarily).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108656
Approved by: https://github.com/zou3519
ghstack dependencies: #109024, #109248
2023-09-20 04:37:31 +00:00
d9342cde6e custom ops: don't error if autograd input is a tensor subclass (#109248)
This is needed to allow the custom ops in our custom op autograd tests to accept `FunctionalTensor` arguments as inputs that we compute gradients for. Previously, custom ops would raise an error if you tried to pass in a tensor subclass when using autograd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109248
Approved by: https://github.com/zou3519
ghstack dependencies: #109024
2023-09-20 04:37:31 +00:00
c9b60a691b functorch: fallthrough on calls to custom size/stride/storage_offset calls (#109024)
The problem (that @zou3519 pointed out) is that functorch assumes that when it creates a TensorImpl (like `TensorWrapper`, [code](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/functorch/TensorWrapper.cpp#L43)), it doesn't re-enter the dispatcher.

However, if the inner tensor that we hold is a tensor subclass with custom size/strides, then calls like `sym_storage_offset()` get plumbed to `__torch_dispatch__` as `torch.ops.aten.sym_storage_offset.default`, which is a real op registered to the dispatcher ([here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/runtime/register_prim_ops.cpp#L526)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109024
Approved by: https://github.com/zou3519
2023-09-20 04:37:31 +00:00
0317626df5 [MPS] adding weight_norm_interface support for mps (#108008)
Fixes #104513

Adds support for aten::_weight_norm_interface to the mps backend.

Also adds a consistency test for the output and the grad.
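For context, `_weight_norm_interface` implements the weight-normalization reparameterization w = g * v / ||v||; a reference sketch of the forward math (assumed, for illustration only):

```python
import torch

def weight_norm_ref(v: torch.Tensor, g: torch.Tensor, dim: int = 0):
    # norm taken over every dim except `dim`, matching the (w, norm) outputs
    dims = [d for d in range(v.dim()) if d != dim]
    norm = v.norm(2, dim=dims, keepdim=True)
    return v * (g / norm), norm
```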
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108008
Approved by: https://github.com/kulinseth
2023-09-20 02:18:28 +00:00
1b3e5b53f3 [FSDP][optim_state_dict] Add device to _shard_utils.py to explicitly use the device from fsdp_state (#109631)
_get_pg_default_device does not always get the device we want. This PR lets the user explicitly tell us the correct device.

Differential Revision: [D49425743](https://our.internmc.facebook.com/intern/diff/D49425743/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109631
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/wz337
2023-09-20 01:59:38 +00:00
6b760ffd6c improve unique performance on CPU (#107846)
Fix https://github.com/pytorch/pytorch/issues/107098, improve `unique` performance on CPU.

The algorithm is taken from the NumPy implementation at https://github.com/numpy/numpy/blob/main/numpy/lib/arraysetops.py#L323: it first does a sort on the input sequence and then uses a `mask` to record the first element of each consecutive section of equal values.

We don't currently have a parallel sort for 1-dimensional float tensors; that will be enabled in a next step. A parallel radix sort is used for 1-dimensional int tensors.
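A compact sketch of that sort-plus-mask algorithm (illustrative, single-threaded Python; the PR's C++ version fuses and parallelizes these loops):

```python
import torch

def unique_sorted(x: torch.Tensor):
    sorted_x, perm = torch.sort(x)
    # mark the first element of each consecutive run of equal values
    mask = torch.ones_like(sorted_x, dtype=torch.bool)
    mask[1:] = sorted_x[1:] != sorted_x[:-1]
    unique = sorted_x[mask]
    # unique id of each sorted element, scattered back to original order
    ids = torch.cumsum(mask, 0) - 1
    inverse = torch.empty_like(perm).scatter_(0, perm, ids)
    counts = torch.bincount(ids)
    return unique, inverse, counts
```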

The following data is collected with the script in the issue, on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.5GHz using a single socket (20 cores):

#### before (dtype int64)
```
Numpy just sort: 0.4271528720855713 s
Numpy sort + indexes: 6.383563041687012 s
Torch just sort: 0.46924352645874023 s
Torch sort + indexes: 1.8140404224395752 s
```

#### after (dtype int64)
```
Torch just sort: 0.2540090084075928 s
Torch sort + indexes: 0.2766146659851074 s
```

#### before (float32)
```
Numpy just sort: 0.41129398345947266 s
Numpy sort + indexes: 6.422696590423584 s
Torch just sort: 9.109549283981323 s
Torch sort + indexes: 37.59021711349487 s
```

#### after (float32)
```
Torch just sort: 3.5369982719421387 s
Torch sort + indexes: 3.582240581512451 s
```

If we enable the parallel sort for 1-dimensional float tensors, the performance is:
```
Torch just sort: 0.3212606906890869 s
Torch sort + indexes: 0.36211371421813965 s
```

Since I have fused the `inverse_indices` and `count` calculations into a single parallel loop (the algorithm is identical to NumPy's but better optimized), they add only a small amount of extra time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107846
Approved by: https://github.com/jgong5, https://github.com/nikitaved, https://github.com/peterbell10
2023-09-20 01:38:19 +00:00
518308a740 Trace through pytree API with dynamo. (#108533)
Fix: #107315

This PR enables dynamo to trace through the `pytree` API by inlining its functions. In
order to do so, a few details of `pytree` had to be changed.

In summary, this PR:

- Introduces `TreeSpecVariable` for representing `TreeSpec` instances
- Specializes `<type>.__bases__` call, returning a `TupleVariable`
- Enables the call to `id` builtin function for every variable that implements
  `as_python_constant` method
- Specializes `ConstantVariable.call_method` for its (un)flatten functions
- Implements `UserDefinedObjectVariable.as_python_constant`
- Modifies `pytree` by:
    - Making `SUPPORTED_NODES` a map of ids (instead of types) to `NodeDef`
    - Removing the `functools.wraps` call, since it can't be inlined

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108533
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
ghstack dependencies: #109201
2023-09-20 00:04:56 +00:00
103260a43b Re-define check for typing classes. (#109201)
This PR fixes the `is_typing` function, which checks whether a value is an instance of a class
from the `typing` package.

This reverts commit b09c09f7bb3adb6a5b8a107a5b96757b569daa8d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109201
Approved by: https://github.com/ezyang
2023-09-20 00:04:56 +00:00
85d26f7868 [inductor] Enable mypy checking for torch/_inductor/codegen/triton.py (#109146)
Summary: enable mypy checking for torch/_inductor/codegen/triton.py and make the minimum number of fixes/ignores needed to get the linter to pass

Test Plan: `lintrunner -a`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109146
Approved by: https://github.com/peterbell10
2023-09-19 23:01:03 +00:00
8705fc1bbd Revert "Add Python serialization to Pattern Matcher patterns (#108894)"
This reverts commit 7db175b6f628ed18f98eeb41d8b15c85c40e0f51.

Reverted https://github.com/pytorch/pytorch/pull/108894 on behalf of https://github.com/eellison due to land race ([comment](https://github.com/pytorch/pytorch/pull/108894#issuecomment-1726649151))
2023-09-19 23:00:03 +00:00
8b4b1817c8 Revert "Serialize Remaining Patterns (#108917)"
This reverts commit 7bf08b77f378e5b540fb08dd0c61326fe3ab5583.

Reverted https://github.com/pytorch/pytorch/pull/108917 on behalf of https://github.com/eellison due to land race ([comment](https://github.com/pytorch/pytorch/pull/108917#issuecomment-1726646267))
2023-09-19 22:54:52 +00:00
b1d2028eb0 Add compiled optimizer test for nadam (#109548)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109548
Approved by: https://github.com/janeyx99
2023-09-19 22:54:36 +00:00
c2f5d4d8f0 Revert "Generate patterns in fp16 and fp32 (#109142)"
This reverts commit 14994cc9780cc66e03f8ce6720996e798dd85e19.

Reverted https://github.com/pytorch/pytorch/pull/109142 on behalf of https://github.com/eellison due to MESSAGE ([comment](https://github.com/pytorch/pytorch/pull/109142#issuecomment-1726641232))
2023-09-19 22:52:05 +00:00
11c6a98bca [torch] add use_buffers to swa_utils interface (#109078)
Summary: As title, this already exists in swa_utils.py

Differential Revision: D49155243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109078
Approved by: https://github.com/janeyx99
2023-09-19 21:30:59 +00:00
14994cc978 Generate patterns in fp16 and fp32 (#109142)
aten.softmax will generate a different decomposition for fp16/bf16 and fp32 because when invoked in lower precision it will upcast the inputs to fp32 and then downcast after. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as, I'm sure, do many other models).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142
Approved by: https://github.com/yanboliang
ghstack dependencies: #108894, #108917
2023-09-19 20:59:42 +00:00
7b53303d3c Improved the docs for torch.std, torch.var, torch.std_mean, torch.var_mean and torch.cov (#109326)
Fixes #109186.

This PR updates the docs for
- `torch.var`
- `torch.var_mean`
- `torch.std`
- `torch.std_mean`
- `torch.cov`

to reflect the actual implementation behavior when `correction >= N`. The math for `torch.cov` should probably be double checked before merging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109326
Approved by: https://github.com/albanD
2023-09-19 20:47:24 +00:00
7bf08b77f3 Serialize Remaining Patterns (#108917)
Serializes the remaining traced patterns.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108917
Approved by: https://github.com/davidberard98
ghstack dependencies: #108894
2023-09-19 20:45:52 +00:00
7db175b6f6 Add Python serialization to Pattern Matcher patterns (#108894)
Adds a Python Pretty Printer to the pattern matcher that serializes patterns as python. Generating our fuse attention patterns was taking 4 seconds of compile time, which will only get worse as we add more variants (which I will do in the rest of this stack). To write out patterns, build pytorch, then run `gen_attention_patterns.py`.

Since there is a line limit for PRs, I'm only including _sdpa_pattern1 in this first diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108894
Approved by: https://github.com/yanboliang
2023-09-19 20:36:52 +00:00
5845fc2fa6 [PyTorch][Coreml] Bubble up NSError from loadModel (#109444)
Summary: This can help debug issues, especially fc/bc issues with coreml tools, when a model fails to load.

Test Plan:
On a macbook fbsource,
```
arc focus2 -b pp-ios -a ModelRunner -a //xplat/caffe2/c10:c10Apple -a //xplat/caffe2/fb/dynamic_pytorch:dynamic_pytorch_implApple -a //xplat/caffe2:coreml_delegateApple --auto-test-schemes --force-with-wrong-xcode
```
It builds and runs the Playground app using a bunch of coreml models on my iPhone. Here is one for example,
https://pxl.cl/3nSPn

I also forcefully triggered an MLModel ctor failure to test this code by setting `modelURL = nil`, and as expected got this:
```
libc++abi: terminating due to uncaught exception of type c10::Error: Error loading MLModel Error details:  Localized_description: nil value for URL Domain: com.apple.CoreML Code: 3 User Info: {
    NSLocalizedDescription = "nil value for URL";
} Input Shapes: N/A

Exception raised from compile at xplat/caffe2/torch/csrc/jit/backends/coreml/objc/PTMCoreMLBackend.mm:162 (most recent call first):
(no backtrace available)
```

Instead, the previous message would have been:
```
Loading MLModel failed
```

Unrelated issues
* P829736691 - with running MaskRCNN on Coreml with the Playground app. Only happens some times.
* P829741377 - with Metal Operator Tests with the Playground app.

Differential Revision: D49349726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109444
Approved by: https://github.com/kimishpatel
2023-09-19 20:08:37 +00:00
a86727a06b [Pytorch][Vulkan] rewrite available() check and add tests for them (#109541)
Summary: As suggested by liuk22 [[here](https://www.internalfb.com/diff/D49306279?dst_version_fbid=3583458958608887&transaction_fbid=282478474429100)], rewrote the `available()` check and added tests to ensure it works.

Test Plan:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin
```

Differential Revision: D49388848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109541
Approved by: https://github.com/yipjustin
2023-09-19 18:59:01 +00:00
964b79c813 [EASY] Update dynamo dependency installing Makefile (#107229)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107229
Approved by: https://github.com/bdhirsh
2023-09-19 18:58:37 +00:00
caf4376349 [PyTorch] remove branch in isIntrusivePtr (#109273)
There is a code comment in ivalue.h that is intended to explain the motivation for this change fully; please request changes if it doesn't.

Differential Revision: [D49245910](https://our.internmc.facebook.com/intern/diff/D49245910/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D49245910/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109273
Approved by: https://github.com/ezyang
ghstack dependencies: #109272
2023-09-19 17:51:41 +00:00
e29330deab [PyTorch] clang-format ivalue.h (#109272)
I don't know how this got out of format, but now it's formatted.

Differential Revision: [D49245911](https://our.internmc.facebook.com/intern/diff/D49245911/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109272
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-09-19 17:51:41 +00:00
cd31c170c9 Revert "[ONNX] Remove deprecated functions (#107208)"
This reverts commit 263ca7d69bb9b3b58ae0f9b4d27864587611389c.

Reverted https://github.com/pytorch/pytorch/pull/107208 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107208#issuecomment-1726183104))
2023-09-19 17:26:48 +00:00
70f2adaec3 Setup_context does not contain default values of forward() (#108561)
Fixes #108529

As the title says.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108561
Approved by: https://github.com/soulitzer
2023-09-19 16:23:52 +00:00
1427b8149c Revert "Eliminate c10::guts::make_unique_base (#109429)"
This reverts commit 6b1a15d1bb465b9f0f07a7a7c8dc5d88d086438a.

Reverted https://github.com/pytorch/pytorch/pull/109429 on behalf of https://github.com/clee2000 due to Sorry its me again, I'm getting that this caused an instruction count regression internally ([comment](https://github.com/pytorch/pytorch/pull/109429#issuecomment-1725923294))
2023-09-19 15:47:00 +00:00
a68280e2c3 [cpu] Vectorize nan_to_num (#98329)
Locally I see a roughly 4x speedup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98329
Approved by: https://github.com/lezcano
2023-09-19 15:25:41 +00:00
1895bd9bb5 [inductor] Decompose torch.ops.quantized.embedding_bag_byte_unpack (#109398)
This would be cleaner if we had support for u8->float32 views
(bitcasts) in inductor, but it works for now.

Differential Revision: [D49329910](https://our.internmc.facebook.com/intern/diff/D49329910/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109398
Approved by: https://github.com/hl475, https://github.com/jansel, https://github.com/jgong5
2023-09-19 14:11:47 +00:00
d0cc623192 [Decomposition] _unsafe_view (#108713)
Summary:
Decomp already exists so just add it to core_aten_decompositions

https://www.internalfb.com/code/fbsource/[9d5eabd7b213d1a356d4e7bb400355d574ea924b]/fbcode/caffe2/torch/_decomp/decompositions.py?lines=3091

Differential Revision: D48619079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108713
Approved by: https://github.com/larryliu0820, https://github.com/SherlockNoMad
2023-09-19 13:37:35 +00:00
deea268e43 Update aten_fill to avoid d2h sync (#109533)
Fixes #109115

### Before:
<img width="1526" alt="Screenshot 2023-09-18 at 11 57 32 AM" src="https://github.com/pytorch/pytorch/assets/32754868/394a4c51-7cae-4d05-b9ad-b17d02beaf72">

### After:
<img width="1550" alt="Screenshot 2023-09-18 at 11 57 25 AM" src="https://github.com/pytorch/pytorch/assets/32754868/e2f774f5-5374-49c3-95ec-dd3a85f74a2e">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109533
Approved by: https://github.com/mikaylagawarecki
2023-09-19 13:34:49 +00:00
2e721aab98 [Decomposition] Trunc (#109319)
Summary:
Add Decomp for Trunc and add it to core_aten_decompositions
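A minimal sketch of what such a decomposition can look like in terms of existing core ops (illustrative, not necessarily the exact registered decomp):

```python
import torch

def trunc_decomp(x: torch.Tensor) -> torch.Tensor:
    # round toward zero: floor for non-negative values, ceil for negative
    return torch.where(x >= 0, torch.floor(x), torch.ceil(x))
```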

Differential Revision: D49042033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109319
Approved by: https://github.com/SherlockNoMad
2023-09-19 13:30:13 +00:00
ae66d0b3bf [Decomposition] clamp_max (#108718)
Summary:
Decomp already exists so just add it to core_aten_decompositions

https://www.internalfb.com/code/fbsource/[abda43a5a268e83fef6d62b49531a390ce915ad2]/fbcode/caffe2/torch/_refs/__init__.py?lines=1855

Differential Revision: D48880026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108718
Approved by: https://github.com/SherlockNoMad
2023-09-19 13:25:35 +00:00
25e81f19f3 reland "python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)" (#109518)
Reland - the previous PR was reverted by internal with this error:
```
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/363cd7e240f5d021/caffe2/torch/fb/trainer/data_modules/tests/__test_dataloader__/test_dataloader#link-tree/torch/__init__.py", line 29, in <module>
    from ._utils_internal import _functionalize_sync as _sync
ImportError: cannot import name '_functionalize_sync' from 'torch._utils_internal'
```

I couldn't figure out why internal was unhappy with the import. One potential reason is that I see a build rule for *another* `_utils_internal.py` in the fb folder here ([link](https://www.internalfb.com/code/fbsource/[30ed85cd88409af98b7490be137aaa5dfd7afd01]/fbcode/caffe2/TARGETS?lines=444))

Rather than burn more time investigating, I confirmed internally that the error goes away if I move the util from `torch/_utils_internal.py` to `torch/_utils.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109518
Approved by: https://github.com/albanD
2023-09-19 13:25:24 +00:00
677a1010e6 Implement traceable torch.tensor when you have SymInt/SymFloat inputs (#109515)
I just ported the C++ torch.tensor implementation to Python, swapping out the inner bits to successively stack tensors together, so that we can trace through `scalar_tensor`.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109515
Approved by: https://github.com/voznesenskym
ghstack dependencies: #109513
2023-09-19 13:19:57 +00:00
8ed906030c add fp16 support for mkldnn conv and deconv on CPU (#99496)
The PR is part of https://github.com/pytorch/pytorch/issues/97068, which is to add fp16 support for mkldnn conv and mkldnn deconv to leverage  avx_ne_convert, avx512-fp16, and amx-fp16 via the oneDNN library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99496
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-09-19 12:37:28 +00:00
54c28c564f add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/mingfeima
2023-09-19 10:43:33 +00:00
2f53bca0fc [Docs] Fix typo in torch.unflatten (#109588)
Fixes https://github.com/pytorch/pytorch/issues/109559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109588
Approved by: https://github.com/lezcano
2023-09-19 10:37:45 +00:00
af867c2d14 [Docs] Fix compiler.list_backends invocation (#109568)
s/torch.compile.list_backends/torch.compiler.list_backends/

Fixes https://github.com/pytorch/pytorch/issues/109451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109568
Approved by: https://github.com/msaroufim, https://github.com/svekars
2023-09-19 10:00:04 +00:00
cyy
a53a677b4d [1/N] Add -Wdeprecated and related fixes (#108626)
This PR adds -Wdeprecated to CMake warnings and fixes related issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108626
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-09-19 09:24:04 +00:00
4a60bd22b2 [Quant][Inductor] Enable quantization dynamic batch size support (#108550)
**Summary**
This diff enables dynamic batch size support for the quantization use case in Inductor. Taking the UT in this PR as an example, after this PR the generated code assumes a dynamic input batch size.
```
cpp_fused_quantize_per_tensor_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const float* in_ptr0,
                       unsigned char* out_ptr0,
                       const long ks0,
                       const long ks1)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(3L); i1+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i2=static_cast<long>(0L); i2<static_cast<long>(static_cast<long>(ks1*ks1)); i2+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(i2 + (i1*(static_cast<long>(ks1*ks1))) + (3L*i0*(static_cast<long>(ks1*ks1))))];
                    auto tmp1 = static_cast<float>(40.36037717834931);
                    auto tmp2 = decltype(tmp0)(tmp0 * tmp1);
                    auto tmp3 = std::nearbyint(tmp2);
                    auto tmp4 = static_cast<float>(97.0);
                    auto tmp5 = tmp3 + tmp4;
                    auto tmp6 = static_cast<float>(0.0);
                    auto tmp7 = max_propagate_nan(tmp5, tmp6);
                    auto tmp8 = static_cast<float>(255.0);
                    auto tmp9 = min_propagate_nan(tmp7, tmp8);
                    auto tmp10 = static_cast<unsigned char>(tmp9);
                    out_ptr0[static_cast<long>(i1 + (3L*i2) + (3L*i0*(static_cast<long>(ks1*ks1))))] = tmp10;
                }
            }
        }
    }
}
''')

cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0,
                       unsigned char* out_ptr1,
                       const long ks0,
                       const long ks1)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L))
        {
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(16L); i1+=static_cast<long>(16L))
            {
                {
                    #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out = omp_out + omp_in) initializer(omp_priv={at::vec::Vectorized<float>(0)})
                    float tmp_acc0 = 0;
                    at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L)))); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i1 + (16L*i0) + (16L*i2) + (16L*i0*(static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L))))) + (32L*i0*(at::native::div_floor_integer(ks1, 2L)))));
                        auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                        auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                        auto tmp3 = tmp1 - tmp2;
                        auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.010429476387798786));
                        auto tmp5 = tmp3 * tmp4;
                        tmp_acc0_vec = tmp_acc0_vec + tmp5;
                    }
                    tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i1 + (16L*i0)));
                }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(1L))
        {
            auto tmp0 = out_ptr0[static_cast<long>(i0)];
            auto tmp1 = static_cast<float>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L))));
            auto tmp2 = tmp0 / tmp1;
            auto tmp3 = static_cast<float>(168.09128392896545);
            auto tmp4 = decltype(tmp2)(tmp2 * tmp3);
            auto tmp5 = std::nearbyint(tmp4);
            auto tmp6 = static_cast<float>(0.0);
            auto tmp7 = tmp5 + tmp6;
            auto tmp8 = max_propagate_nan(tmp7, tmp6);
            auto tmp9 = static_cast<float>(255.0);
            auto tmp10 = min_propagate_nan(tmp8, tmp9);
            auto tmp11 = static_cast<unsigned char>(tmp10);
            out_ptr1[static_cast<long>(i0)] = tmp11;
        }
    }
}
''')

cpp_fused_dequantize_per_tensor_2 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0,
                       const long ks0)
{
    {
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i0));
            auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
            auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
            auto tmp3 = tmp1 - tmp2;
            auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.0056716203689575195));
            auto tmp5 = tmp3 * tmp4;
            tmp5.store(out_ptr0 + static_cast<long>(i0));
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg8_1, arg9_1, arg10_1 = args
    args.clear()
    s0 = arg8_1
    s2 = arg9_1
    assert_size_stride(arg10_1, (s0, 3, s2, s2), (3*(s2*s2), s2*s2, s2, 1))
    buf0 = empty_strided((s0, 3, s2, s2), (3*(s2*s2), 1, 3*s2, 3), device='cpu', dtype=torch.uint8)
    cpp_fused_quantize_per_tensor_0(c_void_p(arg10_1.data_ptr()), c_void_p(buf0.data_ptr()), c_long(s0), c_long(s2))
    del arg10_1
    buf1 = torch.ops.onednn.qconv2d_pointwise(buf0, 0.024776775389909744, 97, constant5, constant2, constant3, constant0, [1, 1], [1, 1], [1, 1], 1, 95.88209060714476, 0, False, 'relu', [], '')
    assert_size_stride(buf1, (s0, 16, 1 + s2, 1 + s2), (16 + (16*(s2*s2)) + (32*s2), 1, 16 + (16*s2), 16))
    del buf0
    # Source Nodes: [quantize_per_tensor_default_2], Original ATen: [quantized_decomposed.quantize_per_tensor]
    buf2 = torch.ops.quantized.max_pool2d(buf1, [3, 3], [2, 2], [1, 1], [1, 1], False)
    del buf1
    buf3 = buf2
    assert_size_stride(buf3, (s0, 16, 1 + (s2 // 2), 1 + (s2 // 2)), (16 + (16*((s2 // 2)*(s2 // 2))) + (32*(s2 // 2)), 1, 16 + (16*(s2 // 2)), 16))
    del buf2
    buf4 = empty_strided((s0, 16, 1, 1), (16, 1, 16*s0, 16*s0), device='cpu', dtype=torch.float32)
    buf5 = empty_strided((s0, 16), (16, 1), device='cpu', dtype=torch.uint8)
    cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1(c_void_p(buf3.data_ptr()), c_void_p(buf4.data_ptr()), c_void_p(buf5.data_ptr()), c_long(s0), c_long(s2))
    del buf3
    buf6 = torch.ops.onednn.qlinear_pointwise(buf5, 0.005949148442596197, 0, constant6, constant4, constant3, constant1, 176.31645543014483, 100, False, 'none', [], '')
    assert_size_stride(buf6, (s0, 16), (16, 1))
    del buf5
    buf7 = reinterpret_tensor(buf4, (s0, 16), (16, 1)); del buf4  # reuse
    cpp_fused_dequantize_per_tensor_2(c_void_p(buf6.data_ptr()), c_void_p(buf7.data_ptr()), c_long(s0))
    return (buf7, )

```

**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_maxpool2d_linear_dynamic
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108550
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-09-19 08:30:16 +00:00
cyy
ac603bc2f8 [Reland] Eliminate invocations of c10::stoi,c10::stod,c10::stoull,c10::stoll (#109566)
This is reland of #87603 with definitions of c10::stoXX kept for further investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109566
Approved by: https://github.com/huydhn
2023-09-19 07:15:25 +00:00
2c1554a032 Make SymFloat behave symmetrically with float in torch.tensor (#109513)
Previously, SymFloat would force double precision.  That's wrong;
instead, we must respect default dtype.
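In other words, a symbolic float should now behave like a plain Python float under `torch.tensor`; roughly:

```python
import torch

torch.set_default_dtype(torch.float32)
# a plain Python float follows the default dtype...
assert torch.tensor(3.14).dtype == torch.float32
# ...and after this change a SymFloat input does too,
# instead of being forced to float64
```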

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109513
Approved by: https://github.com/voznesenskym
2023-09-19 01:52:41 +00:00
e8ab8c877d [exir] Add lift constant tensors passes after aten_to_edge (#109382)
Summary:
X-link: https://github.com/pytorch/executorch/pull/359

When exporting using enable_aot (through the torch.export path), we want to lift all constant tensors as buffers to the exported program. The ScalarToTensor pass in EXIR's aten_to_edge passes will create some constant tensors in the graph, so we will need to run a lift_constant_tensors pass afterwards.

Note that this only needs to be applied when exporting using the torch.export path because in the original path, nothing is lifted.

Test Plan: CI

Differential Revision: D49207492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109382
Approved by: https://github.com/cccclai
2023-09-19 01:34:58 +00:00
0ec9f59f70 Loudly Error in dynamo bench if eager fails (#109536)
Helps debug https://github.com/pytorch/benchmark/issues/1901

I will wait until the ONNX beartype sev is fixed before merging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109536
Approved by: https://github.com/xuzhao9
2023-09-19 00:40:42 +00:00
98208e5160 [export] Update deserialized FakeTensorMode/ShapeEnv with same configs as export (#109522)
Summary: Deserialized FakeTensorMode/ShapeEnv should have the same configs as export: https://fburl.com/code/y7jxf5qw

Test Plan: CI

Differential Revision: D49377410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109522
Approved by: https://github.com/zhxchen17
2023-09-19 00:34:30 +00:00
a44cf44067 improved type hints ScriptModule (#109535)
Added properties

- "code"
- "code_with_constants"
- "graph"
- "inlined_graph"
- "original_name"

with appropriate type hints to the `ScriptModule` stub, and removed them from the child class `RecursiveScriptModule`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109535
Approved by: https://github.com/ezyang
2023-09-19 00:13:15 +00:00
871b5caae7 Fix hpu deserialization bug (#109499)
# Motivation
fix an hpu deserialization bug. It should check for the hpu model if and only if the location starts with "hpu". Otherwise, it always raises an AssertionError when hpu is not imported, which breaks serialization/deserialization for other third-party backends like IPEX.

# Solution
only assert on the hpu model when the location starts with "hpu"
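A sketch of the intended guard (hypothetical helper names; the actual check lives in PyTorch's serialization code):

```python
def hpu_available() -> bool:
    # assumed package name for the hpu backend; illustration only
    try:
        import habana_frameworks.torch  # noqa: F401
        return True
    except ImportError:
        return False

def validate_location(location: str) -> None:
    # only assert hpu support when the tag actually starts with "hpu",
    # so other backends (e.g. IPEX) deserialize untouched
    if location.startswith("hpu"):
        assert hpu_available(), "hpu backend requested but not available"
```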

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109499
Approved by: https://github.com/ezyang
2023-09-19 00:10:51 +00:00
5b13f74e9b [export] Update how we input kwargs (#109160)
Previously, the code for passing inputs to exported program was:
```
if kwargs:
    return (args, kwargs)
else:
    return args
```

However, this causes some inconsistency where if the original input contains args and kwargs, the treespec would be a tuple containing a tuple of arguments, and a dictionary of keyword arguments. But if the original input only contained args, the treespec would just be a tuple of arguments. This inconsistency causes some inconveniences in the runtime.

So I updated the code to just always keep the kwargs around.
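i.e. roughly (a sketch of the revised shape, not the literal diff):

```python
def combine_inputs(args, kwargs):
    # always return the (args, kwargs) pair so the treespec has the
    # same shape whether or not kwargs is empty
    return (args, kwargs or {})
```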

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109160
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
2023-09-19 00:04:32 +00:00
a6d34c60a1 Fixing searchsorted doc (#109364)
Removing ambiguous description

Fixes #109298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109364
Approved by: https://github.com/colesbury
2023-09-18 23:12:53 +00:00
6f4b9cc9ab [export] Skip noop runtime assertion pass. (#109395)
Summary:
If no inline constraints were added, just return the original graph.
We want to do this because this pass sometimes messes up the node names;
until we actually fix that, we can make the behavior a bit less buggy
by skipping no-op passes.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109395
Approved by: https://github.com/angelayi
2023-09-18 22:37:28 +00:00
550b0ec3d4 Release GIL around VariableInfo::zeros to avoid deadlocks (#109454)
See https://github.com/pytorch/pytorch/issues/109074#issue-1891369807 and https://github.com/pytorch/pytorch/issues/109074#issuecomment-1718825855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109454
Approved by: https://github.com/albanD
2023-09-18 22:28:48 +00:00
0e2b22c451 [ONNX] switch from onnxscript-preview to onnxscript (#109139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109139
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-09-18 22:24:47 +00:00
9863286abf [ROCM] Enable bwd cross_entropy on ROCM now that eps tolerance update (#109384)
Follow up to https://github.com/pytorch/pytorch/pull/109038

The fix in the PR above also fixes this test on rocm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109384
Approved by: https://github.com/jeffdaily, https://github.com/albanD
2023-09-18 22:20:38 +00:00
0bf30c140a [pytree] Use OpTree for PyTree manipulation (#93139)
Split from #92679. Use C++-based PyTree implementation.

## Highlights

1. High performance (20x speedup than the pure-Python implementation, 10%-20% overall speedup for `torch.fx`)
2. Multi-input tree-map support
3. Custom tree node registry with namespace isolation

Refs:

- #65761
- #91323
- #92679

From https://github.com/pytorch/pytorch/issues/65761#issuecomment-1334746366:

> ### 0. Out-of-box compatible with JAX's pytree, provides the same interfaces and functions (and more).
>
> ### 1. High performance: `optree`'s tree operations are comparably fast to JAX's pytree (~0.9x for `dict`s and ~2.5x for `OrderedDict`s), and 20x faster than `torch.utils._pytree`.
>
> `optree` implements some common Python container types in C++ (e.g., `OrderedDict`) and achieves 2.5x the performance of JAX's pytree. Check out section [Built-in PyTree Node Types](https://github.com/metaopt/optree#built-in-pytree-node-types) and [Benchmark](https://github.com/metaopt/optree#benchmark) for more details.
>
> | Module    | Nodes | OpTree (μs) | JAX XLA (μs) | PyTorch (μs) | DM-Tree (μs) | Speedup (J / O) | Speedup (P / O) | Speedup (D / O) |
> | :-------- | ----: | ----------: | -----------: | -----------: | -----------: | --------------: | --------------: | --------------: |
> | TinyMLP   |    53 |       26.40 |        68.19 |       586.87 |        34.14 |            2.58 |           22.23 |            1.29 |
> | AlexNet   |   188 |       84.28 |       259.51 |      2182.07 |       125.12 |            3.08 |           25.89 |            1.48 |
> | ResNet18  |   698 |      288.57 |       807.27 |      7881.69 |       429.39 |            2.80 |           27.31 |            1.49 |
> | ResNet34  |  1242 |      580.75 |      1564.97 |     15082.84 |       819.02 |            2.69 |           25.97 |            1.41 |
> | ResNet50  |  1702 |      791.18 |      2081.17 |     20982.82 |      1104.62 |            2.63 |           26.52 |            1.40 |
> | ResNet101 |  3317 |     1603.93 |      3939.37 |     40382.14 |      2208.63 |            2.46 |           25.18 |            1.38 |
> | ResNet152 |  4932 |     2446.56 |      6267.98 |     56892.36 |      3139.17 |            2.56 |           23.25 |            1.28 |
> | ViT-H/14  |  3420 |     1681.48 |      4488.33 |     41703.16 |      2504.86 |            2.67 |           24.80 |            1.49 |
> | Swin-B    |  2881 |     1565.41 |      4091.10 |     34241.99 |      1936.75 |            2.61 |           21.87 |            1.24 |
> |           |       |             |              |              |  **Average** |        **2.68** |       **24.78** |        **1.38** |
>
> <div align="center">
>   <img src="https://user-images.githubusercontent.com/16078332/200494435-fd5bb385-59f7-4811-b520-98bf5763ccf3.png" width="90%" />
> </div>
>
> ### 2. Namespace Isolation for the PyTree Type Registry
>
> In addition to the JAX's pytree registry for custom node type registration, `optree` adds `namespace` isolation to the registry. Users can register the same type multiple times for different flatten/unflatten behavior. It also provides module-level isolation for safety reasons. For example, you can add a unique prefix to your namespace to isolate your registry with other modules (e.g., `torch.xxx`, `torch.functorch.xxx`):
>
> ```python
> # Register a Python type into a namespace
> import torch
>
> optree.register_pytree_node(
>     torch.Tensor,
>     # (tensor) -> (children, metadata)
>     flatten_func=lambda tensor: (
>         (tensor.cpu().numpy(),),
>         dict(dtype=tensor.dtype, device=tensor.device, requires_grad=tensor.requires_grad),
>     ),
>     # (metadata, children) -> tensor
>     unflatten_func=lambda metadata, children: torch.tensor(children[0], **metadata),
>     namespace='torch.torch2numpy',
> )
> ```
>
> ```python
> >>> tree = {'weight': torch.ones(size=(1, 2)).cuda(), 'bias': torch.zeros(size=(2,))}
> >>> tree
> {'weight': tensor([[1., 1.]], device='cuda:0'), 'bias': tensor([0., 0.])}
>
> # Flatten without specifying the namespace
> >>> tree_flatten(tree)  # `torch.Tensor`s are leaf nodes
> ([tensor([0., 0.]), tensor([[1., 1.]], device='cuda:0')], PyTreeSpec({'bias': *, 'weight': *}))
>
> # Flatten with the namespace
> >>> leaves, treespec = optree.tree_flatten(tree, namespace='torch.torch2numpy')
> >>> leaves, treespec
> (
>     [array([0., 0.], dtype=float32), array([[1., 1.]], dtype=float32)],
>     PyTreeSpec(
>         {
>             'bias': CustomTreeNode(Tensor[{'dtype': torch.float32, 'device': device(type='cpu'), 'requires_grad': False}], [*]),
>             'weight': CustomTreeNode(Tensor[{'dtype': torch.float32, 'device': device(type='cuda', index=0), 'requires_grad': False}], [*])
>         },
>         namespace='torch.torch2numpy'
>     )
> )
>
> # `entries` are not defined and use `range(len(children))`
> >>> optree.tree_paths(tree, namespace='torch.torch2numpy')
> [('bias', 0), ('weight', 0)]
>
> # Unflatten back to a copy of the original object
> >>> optree.tree_unflatten(treespec, leaves)
> {'bias': tensor([0., 0.]), 'weight': tensor([[1., 1.]], device='cuda:0')}
> ```
>
> Check out section [Registering a Container-like Custom Type as Non-leaf Nodes](https://github.com/metaopt/optree#notes-about-the-pytree-type-registry) for more details.
>
> ### 3. Support both `None` as Non-leaf Node and `None` as Leaf
>
> In JAX's implementation, `None` is always an internal non-leaf node with an arity 0, which is like an empty tuple. This limits the usage of JAX's pytree utilities for PyTorch. For example, the `nn.Module` uses `_parameters` and `_buffers` (`OrderedDict[str, Optional[Tensor]]`) to hold the tensors, while the value can be a tensor or `None`.
>
> `optree` supports both `None` as Non-leaf Node (JAX's default) and `None` as Leaf (PyTorch's default). Check out section [None is Non-leaf Node vs. None is Leaf](https://github.com/metaopt/optree#none-is-non-leaf-node-vs-none-is-leaf) for more details.
>
> ### 4. Some other improvements and bug fixes
>
> 1. Adds in-place version of treemap (`tree_map_`), which reduces redundant unflatten operation for better performance.
> 2. Adds support for tree flatten and tree map with paths. (useful for `functorch` module extraction).
> 3. Improves the JAX's pytree sorting support for `dict`s.
> 4. Better string representation `repr(PyTreeSpec)`.
> 5. Fixes some bugs for JAX's pytree of hashing, pickle serialization, segmentation fault for infinite recursion, and tree-compose/tree-transpose.

From https://github.com/pytorch/pytorch/pull/92679#issuecomment-1398778481:

> ```python
> # pytree_make_fx_bench.py
> import time
>
> import torch
> from torch.fx.experimental.proxy_tensor import make_fx
>
> def f(x):
>     for _ in range(10000):
>         x = x + x
>     return x
>
> # Reset the timer before each run so every tracing mode is timed independently.
> begin = time.time()
> out = make_fx(f, tracing_mode="real")(torch.randn(20))
> print(f'tracing_mode="real" {time.time() - begin:.2f}')
>
> begin = time.time()
> out = make_fx(f, tracing_mode="fake")(torch.randn(20))
> print(f'tracing_mode="fake" {time.time() - begin:.2f}')
>
> begin = time.time()
> out = make_fx(f, tracing_mode="symbolic")(torch.randn(20))
> print(f'tracing_mode="symbolic" {time.time() - begin:.2f}')
> ```
>
> This seems to run around 10-20% faster with the optree implementation:
>
> ```
> # Optree
> python pytree_make_fx_bench.py
> tracing_mode="real" 0.00
> tracing_mode="fake" 6.32
> tracing_mode="symbolic" 27.13
> ```
>
> ```
> # torch.utils._pytree
> python pytree_make_fx_bench.py
> tracing_mode="real" 0.00
> tracing_mode="fake" 7.66
> tracing_mode="symbolic" 31.07
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93139
Approved by: https://github.com/malfet
2023-09-18 21:24:56 +00:00
8a567bb59d [HigherOrderOp] Should automatically pop modes (#109157)
Fixes #108282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109157
Approved by: https://github.com/zou3519
2023-09-18 20:54:09 +00:00
73ac814148 [Pytorch][quant] Move xnnpack quantizer to use aten.linear (#109254)
Summary:
Now that quantization works on pre-dispatch aten IR, moving to the full set
of aten ops is OK. Plus, when tracing models like ViT, the linear
projections of k, q, v use functional.linear rather than nn.Linear,
which results in not being able to extract the nodes corresponding to linear.

Test Plan:
quant tests

Differential Revision: [D49252194](https://our.internmc.facebook.com/intern/diff/D49252194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109254
Approved by: https://github.com/jerryzh168
2023-09-18 20:26:44 +00:00
77d745666b Add TORCH_CHECK_ALWAYS_SHOW_CPP_STACKTRACE (#109373)
Unlike TORCH_CHECK, these always show a C++ stacktrace on error. Use them
for errors where you frequently need this information.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109373
Approved by: https://github.com/bdhirsh
ghstack dependencies: #109372
2023-09-18 19:46:32 +00:00
8a1bbf383d Out-of-line cannot call with symbolic error test (#109372)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109372
Approved by: https://github.com/bdhirsh
2023-09-18 19:46:32 +00:00
050c56d0a5 [dynamo][ci] Pin beartype to 0.15.0 (#109510)
CIs are failing because of https://github.com/beartype/beartype/issues/282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109510
Approved by: https://github.com/thiagocrepaldi
2023-09-18 19:08:32 +00:00
4d44d8c00a Revert "Eliminate c10::stoi,c10::stod,c10::stoull,c10::stoll (#109179)"
This reverts commit 852f1b8417e80b72a7d1c4a772f66af28da02913.

Reverted https://github.com/pytorch/pytorch/pull/109179 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this is breaking periodic buck build, so please fix the issue and reland the change https://github.com/pytorch/pytorch/actions/runs/6207458526/job/16852695272 ([comment](https://github.com/pytorch/pytorch/pull/109179#issuecomment-1724168571))
2023-09-18 18:41:12 +00:00
70ca3ee951 Revert "inductor: only do the conv+bn folding for the freezing path (#109270)"
This reverts commit c7017fff38e73210541124739ed9404492ddd68c.

Reverted https://github.com/pytorch/pytorch/pull/109270 on behalf of https://github.com/malfet due to Broke slow test, see c7017fff38 ([comment](https://github.com/pytorch/pytorch/pull/109270#issuecomment-1724132526))
2023-09-18 18:15:31 +00:00
5cd8a6d40a Enable typechecking for _inductor/fx_utils.py (#109415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109415
Approved by: https://github.com/Skylion007
ghstack dependencies: #109269, #109347, #109335
2023-09-18 18:12:23 +00:00
fe452108fb Enable typechecking for _inductor/debug.py (#109335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109335
Approved by: https://github.com/eellison
ghstack dependencies: #109269, #109347
2023-09-18 18:12:23 +00:00
9172c9f03f Fix spelling / capitalization in freezing.py error message (#109347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109347
Approved by: https://github.com/eellison
ghstack dependencies: #109269
2023-09-18 18:12:20 +00:00
bab627073a Enable typechecking for _inductor/freezing.py (#109269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109269
Approved by: https://github.com/eellison
2023-09-18 18:12:18 +00:00
282aa26764 Update the instruction to enable dynamo logs (#109409)
```
   torch._dynamo.config.log_level = logging.INFO
   torch._dynamo.config.output_code = True
```

These settings were replaced with the module-level log control introduced in https://github.com/pytorch/pytorch/pull/94858.
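For reference, a minimal sketch of the replacement (assuming the `torch._logging.set_logs` API and `TORCH_LOGS` environment variable from that PR; option names may vary by version):

```python
import logging

import torch

# Module-level log control replaces the old torch._dynamo.config flags.
torch._logging.set_logs(dynamo=logging.INFO, output_code=True)

# Equivalently, set an environment variable before launching Python:
#   TORCH_LOGS="dynamo,output_code" python my_script.py
```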
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109409
Approved by: https://github.com/msaroufim
2023-09-18 17:49:40 +00:00
a399f839ac Revert "Add PR number to metrics when available (#109406)"
This reverts commit f0fb4b3897e9cac3b99ee5b9b2ecab255e9e2da3.

Reverted https://github.com/pytorch/pytorch/pull/109406 on behalf of https://github.com/ZainRizvi due to breaks trunk ([comment](https://github.com/pytorch/pytorch/pull/109406#issuecomment-1724061024))
2023-09-18 17:35:37 +00:00
05c31b3b69 typo in DispatchKeySet.h (#109431)
Fixes #108641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109431
Approved by: https://github.com/Skylion007
2023-09-18 17:34:36 +00:00
cddeceb6b6 [inductor] scale down RBLOCK for occupancy (#109275)
For large reductions (with large xnumel and rnumel), we potentially need to run a large number of thread blocks. Occupancy matters here: with larger occupancy we can run more blocks on each SM and may need fewer waves to run the entire kernel on the GPU. The number of registers used by each thread can limit occupancy. For A100, it is safe to say that register usage does not limit occupancy only if each thread uses <= 32 registers. This PR leverages that observation and reduces RBLOCK (thus reducing the registers used by each thread) when register usage would limit occupancy for large reductions.
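A rough sketch of the heuristic (names, threshold, and scaling rule are illustrative assumptions, not the actual inductor code):

```python
MAX_REGS_FULL_OCCUPANCY = 32  # assumed A100 budget: <=32 regs/thread keeps occupancy unconstrained

def scale_down_rblock(rblock, regs_per_thread, xnumel, rnumel):
    # Occupancy only matters for large reductions that need many thread blocks.
    if xnumel * rnumel < 2**20:  # hypothetical "large reduction" threshold
        return rblock
    # Halve RBLOCK until the estimated per-thread register use fits the budget;
    # fewer reduction elements per thread roughly means fewer live registers.
    while rblock > 1 and regs_per_thread > MAX_REGS_FULL_OCCUPANCY:
        rblock //= 2
        regs_per_thread //= 2
    return rblock
```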

The scenario mentioned can happen for the softmax kernel used in transformers. Here are some results obtained from a devgpu:
- PLBartForCausalLM we improve from 1.88x (58.7ms) to 2.00x (55.82ms)
- TrOCRForCausalLM we improve from 1.45x (92.9ms) to 1.51x (89.12ms)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109275
Approved by: https://github.com/jansel
2023-09-18 17:29:30 +00:00
cyy
5d5990fc49 Remaining replacement of c10::stoi with std::stoi (#109482)
PR #109179 replaced c10::stoi with std::stoi. However, some files were missed; this patch fixes them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109482
Approved by: https://github.com/vfdev-5, https://github.com/Skylion007
2023-09-18 16:05:09 +00:00
6ffa59031a [inductor] Fix CudaStreamGuard in AOTInductor ABI compatible mode (#109471)
Summary: Use an RAII class to wrap at::cuda::CUDAStreamGuard. The previous implementation didn't follow the exact CUDAStreamGuard behavior.

Test Plan: CI

Differential Revision: D49355542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109471
Approved by: https://github.com/chenyang78
2023-09-18 15:54:58 +00:00
d2ca5fa6c5 [lintrunner] Capture mypy internal error (#109421)
Mypy internal errors are reported to stderr rather than stdout and do not contain a column number.

This should prevent internal errors from creeping into the code and occluding other legitimate errors.

Test plan: Checkout 5cd861fcf7 apply this change and see `lintrunner` run to report internal error

Fixes https://github.com/pytorch/pytorch/issues/104940

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109421
Approved by: https://github.com/Skylion007
2023-09-18 15:48:14 +00:00
591c01995b Add CONDA_CMAKE=yes for all ROCm docker configs (#109334)
This ensures all BUILD_ENVIRONMENT ROCm configs will have LAPACK/MKL support enabled due to using conda cmake. This should have no impact on pytorch/pytorch CI builds though, since those do not fall in the catch-all condition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109334
Approved by: https://github.com/pruthvistony, https://github.com/kit1980
2023-09-18 15:08:06 +00:00
88600e7d2e [RELAND] Force synced KJT to trace unbacked SymInt (#108960) (#109216)
Summary:

The basic concept behind this diff is to modify Dynamo's tracing behavior when it encounters a KeyedJaggedTensor that is synced (aka has `_length_per_key` and `_offset_per_key` populated). These fields are lists of integers; ordinarily, Dynamo will optimistically try to specialize on integers; however, for KJTs, we know that these integers will definitely vary from run to run. Furthermore, we would ordinarily also specialize these integers if they are 0/1, but we frequently expect features in KJTs to be 0/1.

The fix is to detect KJTs and treat these integers as *unbacked integers*. This is NOT a universally sound optimization: when treating these integers as unbacked, we never report them as equal to zero or one. In return, we always generate graphs that generalize no matter the length of values on features. This is enough to trace through APS sparse arch, torchrec_dlrm and some small split-cat examples.

The special integer behavior is triggered by a dynamically scoped `force_unspec_int_unbacked_size_like` variable on TracingContext, which we trigger when we wrap a KJT. There probably are other ways to do this, but this was simple and worked.

Test Plan:
```
buck2 test mode/dev-nosan //pytorch/benchmark/fb/test_gpu:run_test_gpu
```

from aakhundov

1. first build feed_lower_benchmark:
```
buck2 build --show-output mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true hpc/new/models/feed/benchmark:feed_lower_benchmark
```
2. then run the lowering of the model with it:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" TORCH_COMPILE_DEBUG=1 ../buck-out/v2/gen/fbcode/79c6b019ee0f9469/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --load=manifold://ig_inference_model/tree/user/facebook/fblearner/predictor/960999465/60/gpu_lowering/input.predictor --skip-trt --skip-ait --sync-mode=0 --enable-aot-inductor --lower-presets="ig_stories" --gpu-trace
```
cf https://docs.google.com/document/d/1yD30xYrdmM8r2HTdmXnZTg0-MHVexfVrAa0294m1AUE/edit?pli=1#heading=h.qiv3fp7e6zg0

From torchrec: https://www.internalfb.com/intern/wiki/Torchrec/Development/Testing_production_models/

From ge0405
baseline (without your diff): f477293168
your diff: f477292363

```
buck2 test //caffe2/test/dynamo:test_dynamo_torchrec
buck2 run 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'
```

Differential Revision: D49236757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109216
Approved by: https://github.com/voznesenskym
2023-09-18 14:39:44 +00:00
1a361e4e9f [inductor] realize_into should not alias src and dst (#108126)
Fixes #107995

In the reproducer we have the fx graph:
```python
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[1]):
        # File: <ipython-input-1-5f62cb746ad5>:10, code: return self.layer1(inputs)
        gt: b8[1] = torch.ops.aten.gt.Scalar(arg0_1, 0)
        mul: f32[1] = torch.ops.aten.mul.Tensor(arg0_1, 5.2955089)
        where: f32[1] = torch.ops.aten.where.self(gt, arg0_1, mul);  gt = mul = None

        # No stacktrace found for following nodes
        copy_: f32[1] = torch.ops.aten.copy_.default(arg0_1, where);  arg0_1 = None
        return (where,)
```

The `where` node is both copied into `arg0_1` and returned as the output of the
function. Currently `realize_into` converts `where`'s storage into a
`MutationLayout` of `arg0_1`, for which no tensor named `buf0` is allocated.

This is incorrect, as `where` and `arg0_1` shouldn't share storage. It also
breaks the wrapper code generation, which references `buf0` directly in the
return but never allocates a `buf0`.

This issue only appears for size zero tensors, because otherwise the `src`
buffer becomes a user of `arg0_1` which forces this copy to happen anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108126
Approved by: https://github.com/jansel
2023-09-18 14:16:43 +00:00
fc47ba2794 [Decomposition] clamp_min (#108717)
Summary:
Decomp already exists so just add it to core_aten_decompositions

https://www.internalfb.com/code/fbsource/[abda43a5a268e83fef6d62b49531a390ce915ad2]/fbcode/caffe2/torch/_refs/__init__.py?lines=1846

Differential Revision: D48880080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108717
Approved by: https://github.com/SherlockNoMad
2023-09-18 12:43:58 +00:00
a6d4cca7c0 [Decomposition] unsafe_split.Tensor (#108544)
Summary:
Include decomp in core_aten_decompositions

Decomp already exists

https://www.internalfb.com/code/fbsource/[03ff511cad587fc27ed8fd6a54b87845246e8e0c]/fbcode/caffe2/torch/_decomp/decompositions.py?lines=1209

Test Plan: OSS + Phabricator Tests

Differential Revision: D48940445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108544
Approved by: https://github.com/larryliu0820, https://github.com/SherlockNoMad
2023-09-18 12:43:07 +00:00
af93b29c5e [Decomposition] std.correction (#108733)
Summary:
Include decomp in core_aten_decompositions

Decomp:
https://www.internalfb.com/code/fbsource/[e69bf00ff87a55c9a30bd7905881661ff05fa211]/fbcode/caffe2/torch/_refs/__init__.py?lines=2398

Differential Revision: D48940402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108733
Approved by: https://github.com/larryliu0820, https://github.com/SherlockNoMad
2023-09-18 11:38:23 +00:00
a683bc54fd move transform_bias_rescale_qkv vectorized code to cpu sub folder (#109095)
`at::vec::Vectorized<>` will not be properly vectorized under `aten/src/ATen/native/transformers`; move the vectorized code to `aten/src/ATen/native/cpu`, where the macros `CPU_CAPABILITY_AVX2`, `CPU_CAPABILITY_AVX512`, etc. are defined.

Here is the vtune log before and after this patch on `transform_bias_rescale_qkv_cpu`
1. before:
![transformer_bioas_rescale_qkv_before](https://github.com/pytorch/pytorch/assets/20233731/582f6873-d86e-47a6-bd2a-620b97acc5b1)
2. after:
![transformer_bioas_rescale_qkv_after](https://github.com/pytorch/pytorch/assets/20233731/949004ab-3cbc-4a1d-a03d-9a17efa981ae)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109095
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-09-18 08:40:06 +00:00
f0fb4b3897 Add PR number to metrics when available (#109406)
### <samp>🤖 Generated by Copilot at 780bfa6</samp>

Add a new metric for pull request number in `tools/stats/upload_metrics.py`. This allows tracking the CI performance of pull requests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109406
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/clee2000
2023-09-18 03:17:54 +00:00
9e86a093e4 add torch.device to python type (#108116)
Fixes #107856

This PR adds a torch.device instance check in the python_type method for torch variables in dynamo.

@ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108116
Approved by: https://github.com/msaroufim, https://github.com/ezyang
2023-09-18 02:20:30 +00:00
6d725e7d66 [BE]: enable ruff rules PLR1722 and PLW3301 (#109461)
Enables two ruff rules derived from pylint:
* PLR1722 replaces any exit() calls with sys.exit(). exit() is only designed to be used in REPL contexts and may not always be available by default; this rule always uses the version in the sys module, which is better.
* PLW3301 replaces nested min/max calls with simplified versions (i.e. `min(a, min(b, c))` => `min(a, b, c)`). The new version is more idiomatic and more efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109461
Approved by: https://github.com/ezyang
2023-09-18 02:07:21 +00:00
cyy
a9a0f7a4ad Build CUDA image for lintrunner (#109456)
Following the recent work, it is necessary to add CUDA files to the Docker container so that we can lint CUDA code in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109456
Approved by: https://github.com/malfet
2023-09-18 02:05:17 +00:00
0cae3b5df5 Revert "[PyTorch] Add Expanded call stack to nodes (#108426)" (#109468)
This reverts commit c657d9ecc555facb18cb0eecd8ffe15141394aa1. https://github.com/pytorch/pytorch/pull/108426

The diff got reverted internally via a backout diff without getting exported to github.

Do not import this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109468
Approved by: https://github.com/kit1980
2023-09-17 23:46:20 +00:00
f9e72acc8f Guard default dtype in torchdynamo (#109459)
Fixes https://github.com/pytorch/pytorch/issues/109458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109459
Approved by: https://github.com/ezyang
2023-09-17 22:51:33 +00:00
71420a98ab Revert "Remove c10::either (#109299)"
This reverts commit 9d297cc77374f3a7e4bddd9b04fbe7ed64b838be.

Reverted https://github.com/pytorch/pytorch/pull/109299 on behalf of https://github.com/clee2000 due to sorry but there are a few internal usages and when I tried swapping them out, I got some errors.  I will get someone to look at them on Monday ([comment](https://github.com/pytorch/pytorch/pull/109299#issuecomment-1722579387))
2023-09-17 22:05:47 +00:00
525e4f42d0 Revert "replace torch::make_unique with std::make_unique (#108866)"
This reverts commit 03e35efbf733da28d9e1c5a4b1b203fe335b5f94.

Reverted https://github.com/pytorch/pytorch/pull/108866 on behalf of https://github.com/clee2000 due to Sorry but I found more usages of `torch::make_unique` internally, I can go change all of these, but I'd prefer if that gets done before this gets merged ([comment](https://github.com/pytorch/pytorch/pull/108866#issuecomment-1722577925))
2023-09-17 21:57:30 +00:00
07f2efa285 Revert "[HigherOrderOp] Should automatically pop modes (#109157)"
This reverts commit f03b8abd4706e53b3fb6aefbd4304884e537616d.

Reverted https://github.com/pytorch/pytorch/pull/109157 on behalf of https://github.com/clee2000 due to broke internal builds D49346922 ([comment](https://github.com/pytorch/pytorch/pull/109157#issuecomment-1722571262))
2023-09-17 21:19:52 +00:00
49b18ae546 Revert "python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)"
This reverts commit 0ad595954a1766f26aa55b0f72814d55865bb1dc.

Reverted https://github.com/pytorch/pytorch/pull/107917 on behalf of https://github.com/clee2000 due to breaking internal builds D49346637 ([comment](https://github.com/pytorch/pytorch/pull/107917#issuecomment-1722566885))
2023-09-17 20:57:41 +00:00
17193faf1a Revert "Created nested utils.cpp (#109304)"
This reverts commit 924723bda7e3b7dfba1612027ecd3f7af10fb449.

Reverted https://github.com/pytorch/pytorch/pull/109304 on behalf of https://github.com/clee2000 due to sorry but this is breaking internal builds due to the new header file D49346814 ([comment](https://github.com/pytorch/pytorch/pull/109304#issuecomment-1722561958))
2023-09-17 20:32:49 +00:00
c8e4e08c8d [inductor] Forward fix a windows test error (#109449)
Summary: forward fix a windows test error from https://github.com/pytorch/pytorch/pull/109391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109449
Approved by: https://github.com/mikekgfb
2023-09-17 19:54:21 +00:00
cyy
75b954b715 [4/N] Enable clang-tidy in torch/csrc/autograd (#109455)
The PR enables clang-tidy checks in torch/csrc/autograd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109455
Approved by: https://github.com/Skylion007
2023-09-17 17:11:50 +00:00
c7017fff38 inductor: only do the conv+bn folding for the freezing path (#109270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109270
Approved by: https://github.com/eellison
2023-09-17 12:36:49 +00:00
cyy
51d2d825ab [3/N] apply clang-tidy in torch/csrc/autograd (#109368)
This PR applies clang-tidy fixes in torch/csrc/autograd/FunctionsManual.cpp. There are also other fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109368
Approved by: https://github.com/Skylion007
2023-09-17 07:26:59 +00:00
d8da2a7c85 Switch to CUDA event based profiling (#109338)
In https://github.com/pytorch/pytorch/pull/107901, the CUDA event based
profiling is changed to profiler based profiling to avoid counting CPU-side
kernel launch overhead in final latency numbers. However, it turns out that
torch.profile() is significantly slower than CUDA events, which affects model
compilation speed quite significantly. This PR changes back to CUDA event
based profiling.
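For reference, a minimal sketch of CUDA event based timing with the public `torch.cuda.Event` API (a generic harness, not the exact profiling code in this PR):

```python
import torch

def cuda_event_time_ms(fn, *args, warmup=3, iters=10):
    # Warm up so compilation/caching costs do not pollute the measurement.
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()  # wait until both events have completed
    return start.elapsed_time(end) / iters  # average milliseconds per iteration
```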

Follow-ups:
* Try CUDA event profiling with CUDAGraphs;
* Multi-GPU profiling;

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109338
Approved by: https://github.com/frank-wei
2023-09-17 06:04:41 +00:00
cyy
92b0db2967 Don't find MKL if it isn't used (#109426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109426
Approved by: https://github.com/Skylion007
2023-09-17 03:39:39 +00:00
cyy
6b1a15d1bb Eliminate c10::guts::make_unique_base (#109429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109429
Approved by: https://github.com/Skylion007
2023-09-17 00:04:09 +00:00
4e4314da7f [dynamo] remove DummyGlobalSource (#109411)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109411
Approved by: https://github.com/ezyang
2023-09-16 23:11:11 +00:00
9a95b4bc7b [dtensor] quick fix to #109306 (#109428)
Looks like the op argument schema type check is not reliable: for ops like
aten.div.Tensor(Tensor, Tensor), the second argument can still be
a float/scalar for some reason. Switch to checking the instance type
directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109428
Approved by: https://github.com/awgu, https://github.com/fegin
2023-09-16 20:53:55 +00:00
f15adf204b [BE]: Replace undocumented constant in logging (#109434)
Replaces the undocumented alias with the proper constant WARNING

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109434
Approved by: https://github.com/ezyang
2023-09-16 20:17:32 +00:00
0aedacb4f7 [2D][1/N] Add _enable_extension to fsdp state (#109242)
Add _enable_extension to fsdp state. We will use this to determine whether we should enable the extension or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109242
Approved by: https://github.com/fegin
2023-09-16 19:03:10 +00:00
322bf50dbe [2D][2/N][DeviceMesh] Add get_parent_mesh_dim() in _MeshEnv class (#109330)
Adding some additional APIs that are needed for 2D workflow.

Since each parallelism is only aware of its own mesh when we are constructing the 2D state_dict, we need to know the mesh_dim of the child mesh within the parent mesh, so we can use it to create a DTensor that is 2D-sound.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109330
Approved by: https://github.com/fegin, https://github.com/fduwjj, https://github.com/wanchaol
2023-09-16 19:03:04 +00:00
b275a902d3 Small type hint fix (#109414)
# Summary
Adds these types to the type hint list for better IDE experience

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109414
Approved by: https://github.com/Skylion007
2023-09-16 18:46:46 +00:00
247e2f8461 [BE]: Update ruff to v0.0.290 (#109435)
Updates our ruff linter to the latest and fixes a few false negatives along the way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109435
Approved by: https://github.com/ezyang
2023-09-16 18:43:34 +00:00
0f646b1d15 [inductor] Add a C shim layer for libtorch (#109391)
Summary:
This PR adds a limited C shim layer for libtorch. The ultimate goal is to ban any direct reference to aten/c10 data structures or functions, to avoid ABI breakage by providing stable C interfaces.

To make the review and landing easier, we broke the changes into several steps. In this PR (a combination of https://github.com/pytorch/pytorch/pull/109022 and https://github.com/pytorch/pytorch/pull/109351), we add C interfaces for certain libtorch functions and modify the wrapper codegen to generate calls to those interfaces. There are a few other items to be addressed in future PRs:

* The AOTInductor runtime interface still takes lists of aten tensors as input and output
* The interaction with ProxyExecutor (general fallback support) needs to move away from aten tensor
* Remove all references to aten/c10 headers in the AOTInductor-generated code

Differential Revision: D49302669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109391
Approved by: https://github.com/chenyang78
2023-09-16 16:46:26 +00:00
d860313903 Improve can't call type() error message (#109378)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109378
Approved by: https://github.com/Skylion007
2023-09-16 16:12:58 +00:00
58bdc63dd6 [inductor] Remove a bunch of check_gradient=False in opinfo tests (#109417)
Despite what the comments say, they do not seem to segfault nor cause
CUDA errors any more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109417
Approved by: https://github.com/lezcano
ghstack dependencies: #109359, #109416
2023-09-16 13:31:05 +00:00
1e4f2b576d Have inductor tests call output_process_fn_grad (#109416)
This is similar to what's done in test_ops.py.

Fixes #109353.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109416
Approved by: https://github.com/lezcano
ghstack dependencies: #109359
2023-09-16 13:31:05 +00:00
7f3885137f Add meta function for _segment_reduce (#109359)
This fixes numerous tests which were xfailing. For instance, the
`_segment_reduce.lengths` OpInfo test, which was previously relying on
the fallback kernel to determine the shape of the meta tensor. The
fallback kernel would fail with

    segment_reduce(): Expected all rows of lengths along axis to sum to data.size(lengths.dim()-1) when !unsafe.

as it was trying to read the values of a meta tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109359
Approved by: https://github.com/ezyang
2023-09-16 13:31:03 +00:00
55c19a3c6d Inductor: Increase multiplier to 3 for Inductor AMP benchmark correctness check (#109097)
**Summary**
As reported in https://github.com/pytorch/pytorch/issues/108333, we found that some of the models fail the benchmark's correctness check. However, each model's end-to-end accuracy ([test script](https://gist.github.com/leslie-fang-intel/aac8b3c2b450532fd0517c758bb845e0)) when comparing AMP with FP32 is within a difference of less than 0.1%, so the correctness-check failures for these models are likely false alarms. We use a multiplier of 3 instead of 2 in this PR to avoid these false alarms. The model end-to-end accuracy test results are:

| Model (SPR) | FP32 Imperative TOP1 Accuracy | FP32 Imperative TOP5 Accuracy | BF16 AMP Inductor TOP1 Accuracy | BF16 AMP Inductor TOP5 Accuracy | BF16/FP32 Relative Loss TOP1 Accuracy | BF16/FP32 Relative Loss TOP5 Accuracy |
| -- | -- | -- | -- | -- | -- | -- |
| gluon_inception_v3 | 73.262 | 90.774 | 73.256 | 90.802 | -0.01% | 0.03% |
| mobilenetv2_100 | 72.89 | 90.996 | 72.826 | 90.946 | -0.09% | -0.05% |
| mobilenetv3_large_100 | 75.72 | 92.55 | 75.764 | 92.554 | 0.06% | 0.00% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109097
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-09-16 10:02:56 +00:00
cyy
7014ef0f43 Eliminates c10::guts::array (#109423)
Follow the work of #106810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109423
Approved by: https://github.com/Skylion007
2023-09-16 09:18:03 +00:00
b03ef1d969 [Dynamo] Fix numpy error in test_numpy_torch_operators (#109087)
When you in-place matmul two one-dimensional numpy arrays, numpy=="1.24.3" gives
```
TypeError: In-place matrix multiplication is not (yet) supported. Use 'a = a @ b' instead of 'a @= b'.
```
but numpy=="1.25.2" gives
```
ValueError: inplace matrix multiplication requires the first operand to have at least one and the second at least two dimensions.
```

This diff makes it so that newer versions of numpy do not fail on this test, because we do not catch ValueError.

An alternative solution would be to update the test cases to be 2 dimensional, but that would have impact on other operators being tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109087
Approved by: https://github.com/jansel
2023-09-16 07:37:07 +00:00
cyy
852f1b8417 Eliminate c10::stoi,c10::stod,c10::stoull,c10::stoll (#109179)
We can remove these functions in favor of std ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109179
Approved by: https://github.com/colesbury
2023-09-16 07:22:50 +00:00
393fe9339a Back out "Revert D49107540: [pytorch][PR] split by tag" (#109332)
Summary:
Original commit changeset: 6391a068640b

Original Phabricator Diff: D49107540

Test Plan: same as D49107540

Differential Revision: D49297522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109332
Approved by: https://github.com/842974287
2023-09-16 05:29:16 +00:00
cyy
7bce7f50f3 Add torchgen path in gen_vulkan_spy (#108980)
Fixes the CMake building error
```
    from torchgen.code_template import CodeTemplate
ModuleNotFoundError: No module named 'torchgen'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108980
Approved by: https://github.com/ezyang
2023-09-16 04:09:56 +00:00
706d8e2230 [dynamo] Respect shape dynamism of SymInt sized tensor (#109331)
Before this PR, if we run the following code:
```python
def true_fn(x):
    return x - x.cos()

def false_fn(x):
    return x + x.sin()

def foo(x):
    return cond(x.shape[0] == 4, true_fn, false_fn, [x])
gm = make_fx(foo, tracing_mode='symbolic')(torch.ones(3, 4))
gm = make_fx(foo, tracing_mode='symbolic')(torch.ones(4, 5))
```
we'll have the following error:
```python
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/make_fx.py", line 16, in <module>
    gm = make_fx(foo, tracing_mode='symbolic')(torch.ones(4, 5))
  File "/home/yidi/local/pytorch/torch/fx/experimental/proxy_tensor.py", line 841, in wrapped
    t = dispatch_trace(wrap_key(func, args, fx_tracer, pre_dispatch), tracer=fx_tracer, concrete_args=tuple(phs))
  File "/home/yidi/local/pytorch/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/fx/experimental/proxy_tensor.py", line 461, in dispatch_trace
    graph = tracer.trace(root, concrete_args)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/fx/_symbolic_trace.py", line 817, in trace
    (self.create_arg(fn(*args)),),
  File "/home/yidi/local/pytorch/torch/fx/experimental/proxy_tensor.py", line 497, in wrapped
    out = f(*tensors)
  File "/home/yidi/local/pytorch/make_fx.py", line 13, in foo
    return control_flow.cond(x.shape[0] == 4, true_fn, false_fn, [x])
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 151, in cond
    return torch.compile(cond_op, backend="eager", fullgraph=True)(
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 545, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 380, in _convert_frame_assert
    return _compile(
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 561, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 483, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 432, in transform
    tracer = InstructionTranslator(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
    self.symbolic_locals = collections.OrderedDict(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2035, in <genexpr>
    VariableBuilder(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 229, in __call__
    vt = self._wrap(value).clone(**self.options())
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 374, in _wrap
    return type_dispatch(self, value)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 808, in wrap_listlike
    output = [
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 809, in <listcomp>
    VariableBuilder(self.tx, GetItemSource(self.get_source(), i))(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 229, in __call__
    vt = self._wrap(value).clone(**self.options())
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 374, in _wrap
    return type_dispatch(self, value)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 808, in wrap_listlike
    output = [
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 809, in <listcomp>
    VariableBuilder(self.tx, GetItemSource(self.get_source(), i))(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 229, in __call__
    vt = self._wrap(value).clone(**self.options())
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 374, in _wrap
    return type_dispatch(self, value)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1040, in wrap_tensor
    tensor_variable = wrap_fx_proxy(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1267, in wrap_fx_proxy
    return wrap_fx_proxy_cls(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1382, in wrap_fx_proxy_cls
    example_value = wrap_to_fake_tensor_and_record(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1652, in wrap_to_fake_tensor_and_record
    dynamic_dims, constraint_dims = _automatic_dynamic(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1550, in _automatic_dynamic
    if dim is not None and e.size()[i] != dim:
  File "/home/yidi/local/pytorch/torch/__init__.py", line 352, in __bool__
    return self.node.bool_()
  File "/home/yidi/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1019, in bool_
    return self.guard_bool("", 0)
  File "/home/yidi/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1001, in guard_bool
    r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
  File "/home/yidi/local/pytorch/torch/fx/experimental/recording.py", line 227, in wrapper
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3793, in evaluate_expr
    assert orig_expr == hint, f"{orig_expr} != {hint}"
AssertionError: False != True

from user code:

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

This happens because we record the SymInt in the frame state in _automatic_dynamic the first time we compile the function. The second time, when we are given a SymInt-sized input with different hints, the comparison fails.

Implementation:
This PR derives shape dynamism from the dynamism of the inputs: if a dimension is a SymInt, return DYNAMIC, else return STATIC.
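A minimal sketch of that rule (illustrative only, not the actual dynamo code):

```python
import torch

def dim_dynamism(t):
    # A dimension already backed by a SymInt is dynamic; plain ints stay static.
    return ["DYNAMIC" if isinstance(s, torch.SymInt) else "STATIC" for s in t.size()]
```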

Test Plan:
Add a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109331
Approved by: https://github.com/ezyang
2023-09-16 02:56:53 +00:00
fb58a72d96 Use torch.cumsum instead of numpy one (#109400)
`s/list(numpy.cumsum(foo))/torch.cumsum(torch.tensor(foo), 0).tolist()/`
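For illustration, the two forms agree:

```python
import numpy
import torch

foo = [1, 2, 3]
# Both produce [1, 3, 6].
assert list(numpy.cumsum(foo)) == torch.cumsum(torch.tensor(foo), 0).tolist()
```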

Test plan: ` python3 ../test/inductor/test_split_cat_fx_passes.py -v`

Partially addresses https://github.com/pytorch/pytorch/issues/109387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109400
Approved by: https://github.com/ezyang
2023-09-16 02:52:49 +00:00
4ee179c952 Fix ConstantVariable init method if NumPy is missing (#109388)
By adding `np is not None` check before `isinstance(value, np.number)`

Partially addresses https://github.com/pytorch/pytorch/issues/109387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109388
Approved by: https://github.com/ezyang
2023-09-16 00:07:19 +00:00
b904432e82 [dynamo] preserve some FX node metadata of GraphModules (#107067)
Requested from @tugsbayasgalan: we want dynamo to preserve some FX node metadata when we trace `GraphModule`s (`nn_module_stack`, `source_fn`, `stack_trace`). This is helpful for the case when we export an aten-level `GraphModule`, add some (possibly non-torch or non-aten) ops, and we want to transform the graph back into an aten-level graph. Without preserving metadata, future passes that look at metadata (e.g. quantization passes) won't work.

This feature also has the additional benefit of being able to preserve origin line of code when `print_readable`'ing a `GraphModule`. This is helpful when debugging graphs that have passed through dynamo several times.

The added unit test demonstrates the added functionality of this PR.

~This PR is currently a proof-of-concept implementation that shows that preserving node metadata across dynamo is possible.~ This PR preserves node metadata across dynamo by doing the following:
- ~inject a counter variable into the `GraphModule` source code, which is incremented every time a node is run~
- Construct a line number -> node index map in `GraphModule` as the source code is being generated.
- pass a list of node metadata and the line number map to dynamo's bytecode analyzer
- ~dynamo traces the counter as a `ConstantVariable`, so when we create a new proxy, we can determine which original node index this proxy corresponds by looking at the value of the traced counter~
- When we create a new proxy, get the current instruction's line number, and get the node index using the line number map
- index into the original node metadata ~using the counter variable's tracked value.~

~Some things that should be addressed off the top of my head:~
- ~Is this feature even desirable? (Do we really want Dynamo to have special behavior for `GraphModules`? Should we expect users to re-export `GraphModules`?)~
- ~Is there a better approach than to use a counter? We considered using node names, line numbers, and assuming that proxies are created in the same order as the nodes, but each of these 3 have shortcomings. For node names, we only have access to new node names, not the old ones. Using line number is fragile. The third is problematic since not all created nodes go through `create_proxy` (e.g. inputs). We currently generate a line number to node index map when the `GraphModule`'s code is generated.~
- ~What's the best way to send data across the "CPython gap"? That is, it is not obvious how to cleanly pass data from dynamo's `eval_frame.py:_TorchDynamoContext.__call__` to `symbolic_convert.py:InstructionTranslatorBase.__init__`. In this PR, we use a global.~

Differential Revision: [D49257108](https://our.internmc.facebook.com/intern/diff/D49257108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107067
Approved by: https://github.com/jansel
2023-09-15 23:29:14 +00:00
7af792ab05 Revert "[inductor][Optimus]Improve logging for group batch fusion (#109314)"
This reverts commit afad0d074b5504c87aa1dc9ae352686a8dd3a8eb.

Reverted https://github.com/pytorch/pytorch/pull/109314 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/109314#issuecomment-1722015037))
2023-09-15 23:28:50 +00:00
cyy
a14d30d8d1 [1/N] apply clang-tidy in torch/csrc/autograd (#109032)
This PR begins a new series of patches for enabling clang-tidy checks in torch/csrc/autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109032
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-09-15 23:28:43 +00:00
b4ea3260d7 [JIT] Document torch.jit.interface (#109356)
Good option for replacing "Callable" types; we should document it so
it's searchable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109356
Approved by: https://github.com/eellison, https://github.com/gmagogsfm
2023-09-15 23:23:47 +00:00
ec8b58f5ba Add support for tolist on AsyncCollectiveTensor (#109377)
This has to be done by hand because tolist isn't supported on tensor subclasses.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109377
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-09-15 21:48:13 +00:00
806c52b4c9 Update chunk_sharding_spec.py (#108915)
Fixes #108869

Implements the first solution proposed in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108915
Approved by: https://github.com/wanchaol, https://github.com/wz337
2023-09-15 21:43:15 +00:00
afad0d074b [inductor][Optimus]Improve logging for group batch fusion (#109314)
Summary: Log the graph with Everpaste for debugging and to find more patterns to fuse

Test Plan: to add logs

Differential Revision: D49284640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109314
Approved by: https://github.com/yanboliang
2023-09-15 20:46:08 +00:00
71b4b32014 return_and_correct_aliasing: massage some schemas to work with torchgen (#108897)
The issue is that `str(torch.ops.aten.conv2d.default._schema)` does not return the same schema that is in native_functions.yaml ([link](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L1654)).

Torchscript appears to change the default arg string `int[2] strides=1` to `int[2] strides=[1, 1]`. If you try to parse that with torchgen, torchgen is unhappy (it tries to split arguments on comma, but now we have a comma inside of the default argument).

Fixing the issue directly in torchgen was a bit more painful, so I opted just to undo the transformation that torchscript made: convert `=[1, 1]` back into `=1`.
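A hedged sketch of that normalization (the regex and helper name are invented for illustration, not the PR's actual code):

```python
import re

def contract_list_defaults(schema: str) -> str:
    # Undo TorchScript's expansion: "=[1, 1]" with all-equal entries becomes "=1",
    # so torchgen can safely split the argument list on commas again.
    def repl(m):
        items = [s.strip() for s in m.group(1).split(",")]
        return f"={items[0]}" if len(set(items)) == 1 else m.group(0)
    return re.sub(r"=\[([^\]]+)\]", repl, schema)

print(contract_list_defaults("conv2d(Tensor input, int[2] stride=[1, 1]) -> Tensor"))
# conv2d(Tensor input, int[2] stride=1) -> Tensor
```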

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108897
Approved by: https://github.com/ezyang
ghstack dependencies: #106404, #107917
2023-09-15 20:19:25 +00:00
0ad595954a python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)
Added two new utils to help with turning python functionalization on in AOTAutograd (next PR):

(1) updated `torch._sync()`. Previously, this API could only handle `torch.Tensor` instances that had a `FunctionalTensorWrapper` TensorImpl. It now needs to handle python `FunctionalTensor`s. In theory I can probably break BC and change this API (since it's private?), but I decided not to do it in this PR stack to minimize the chance of reverts. Instead of updating that API directly (which is in C++), I just added a python shim that first tries to unwrap the python `FunctionalTensor` if there is one, then calls the existing C++ logic.

(2) `mirror_autograd_meta` is now a standalone API that tries to mirror the `requires_grad` and `is_leaf` autograd metadata from one tensor to another. Previously this was hardcoded into `torch._to_functional_tensor()`. But I now need to use it in a more standalone way: later in AOTAutograd when we unwrap and re-wrap a tensor subclasses, we need to manually mirror the autograd metadata from the original to the updated version of the subclass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107917
Approved by: https://github.com/ezyang
ghstack dependencies: #106404
2023-09-15 20:19:25 +00:00
f22b303f65 Add TorchDispatch version of functionalization (#106404)
This PR adds a new `FunctionalTensor` subclass, and `FunctionalTensorMode` torch dispatch mode. Together, this class/mode are a lightweight wrapper around our existing C++ functionalization logic.

This idea came from Ed - later in the stack, I want to be able to run functionalization **underneath** torch_dispatch, when performing tracing in AOTAutograd. I can't do this easily with vanilla C++ functionalization, because it has a dedicated dispatch key that always runs before TorchDispatch. However, by adding a torch_dispatch mode shim around functionalization, we can use functionalization as a torch_dispatch mode, which will make it easier to run underneath other modes later.

This PR provides the basic new classes, and some light testing.
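As a conceptual illustration of what functionalization does to a mutating program (plain tensor ops, not this PR's classes):

```python
import torch

def mutating(x):
    x.add_(1)  # in-place ops mutate their input
    x.mul_(2)
    return x

def functionalized(x):
    # Every in-place op becomes its out-of-place equivalent...
    t0 = x.add(1)
    t1 = t0.mul(2)
    # ...and a single copy back preserves the observable input mutation.
    x.copy_(t1)
    return t1

a, b = torch.ones(3), torch.ones(3)
assert torch.equal(mutating(a), functionalized(b))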

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106404
Approved by: https://github.com/ezyang
2023-09-15 20:19:25 +00:00
504dceacb1 [ONNX] Fix indexing issue of meshgrid op (#109350)
Should unpack tensor_list before swapping the elements for indexing 'xy'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109350
Approved by: https://github.com/thiagocrepaldi
2023-09-15 19:49:43 +00:00
cyy
4c208c1475 Remove unneeded linking in CMake targets (#109192)
This PR removes unused library dependencies, helping future refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109192
Approved by: https://github.com/ezyang
2023-09-15 19:43:25 +00:00
d3a64ff249 Display subclass name when tolist() fails due to tensor subclass (#109376)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109376
Approved by: https://github.com/wanchaol
2023-09-15 19:42:39 +00:00
cyy
9d297cc773 Remove c10::either (#109299)
We can replace it with std::variant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109299
Approved by: https://github.com/colesbury, https://github.com/ezyang
2023-09-15 19:34:31 +00:00
cc03e3a892 [AOTInductor] Do not hardcode directory with .cubin files (#109151)
Reviewed By: frank-wei, chenyang78

Differential Revision: D49081883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109151
Approved by: https://github.com/chenyang78
2023-09-15 18:38:05 +00:00
7da3c938cf [quant][be] Move QAT tests to its own file (#108061)
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT
python test/test_quantization.py TestQuantizePT2EQATModels

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108061
Approved by: https://github.com/jerryzh168
2023-09-15 18:34:44 +00:00
369a84e5c4 [core][sparse][pruning] Add (i8i8)-> fp16 support to cuSPARSELt matmul (#109214)
Summary:

This PR adds support for sparse matmul using cuSPARSELt with int8
inputs and fp16 outputs.

It does so by adding an out_dtype flag to `torch_cslt_sparse_mm`.
Because the only mixed-dtype support present in cuSPARSELt is for int8
input and fp16 output, we error out if:

* out_dtype is set and the input tensors are not int8.
* out_dtype is set to any value other than fp16

Test Plan:

python test/test_sparse_semi_structured -k int8_in_fp16_out

Reviewers:

@cphursh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109214
Approved by: https://github.com/cpuhrsch
2023-09-15 18:14:40 +00:00
ab99a95470 Update planner.py (#107998)
Fixes #107997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107998
Approved by: https://github.com/wz337
2023-09-15 18:12:45 +00:00
86e6bd3e53 [inductor] Enable mypy checking for torch/_inductor/bounds.py (#109271)
Summary: Add type hints and enable mypy checking for torch/_inductor/bounds.py

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109271
Approved by: https://github.com/lezcano
2023-09-15 17:47:24 +00:00
a9bf1031d4 [BE] Do not use numpy in torch._inductor.codegen.cpp (#109324)
`s/numpy.iinfo(numpy.int32)/torch.iinfo(torch.int32)/` as those two are interchangeable

Partially addresses https://github.com/pytorch/pytorch/issues/109387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109324
Approved by: https://github.com/albanD
2023-09-15 17:29:10 +00:00
653c1564bf Fix broadcasting cosine_similarity (#109363)
Fixes https://github.com/pytorch/pytorch/issues/109333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109363
Approved by: https://github.com/peterbell10
2023-09-15 17:12:35 +00:00
aed9bee041 [inductor] Lower masked_scatter on CUDA (#108803)
This decomposes masked_scatter into `aten.cumsum` and a single pointwise kernel,
which is similar to what is done in eager. I only do this for CUDA because on CPU
it isn't split into two passes like this, so it would cause a slowdown.
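A conceptual sketch of the decomposition (reference semantics assuming `self` and `mask` share a shape; not inductor's actual lowering):

```python
import torch

def masked_scatter_ref(self, mask, source):
    # cumsum over the flattened mask gives, at each True position, the index
    # of the source element to read (this is the aten.cumsum pass).
    idx = torch.cumsum(mask.reshape(-1).to(torch.int64), dim=0) - 1
    gathered = source.reshape(-1)[idx.clamp(min=0)].reshape(self.shape)
    # The rest is a single pointwise select between self and gathered values.
    return torch.where(mask, gathered, self)

x = torch.zeros(5)
m = torch.tensor([True, False, True, False, True])
src = torch.tensor([1.0, 2.0, 3.0])
assert torch.equal(masked_scatter_ref(x, m, src), x.masked_scatter(m, src))
```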

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108803
Approved by: https://github.com/lezcano
2023-09-15 16:36:06 +00:00
3943afc94e [quant][be] Remove unused APIs (#109342)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109342
Approved by: https://github.com/kimishpatel, https://github.com/andrewor14
2023-09-15 16:07:01 +00:00
f3d1401843 Fix cond branches take no arguments (#109308)
For code like this:
```python
import torch
from functorch.experimental import control_flow
def exportdb_example2(x):
    def true_fn():
        return torch.sin(x)

    def false_fn():
        return torch.cos(x)

    return control_flow.cond(x.sum() > 0, true_fn, false_fn, [])
ep = torch._export.export(exportdb_example2, (torch.randn(4, 5),))
```
Before the PR, when the branches take an empty list/tuple of inputs, we'll get an error like the following:
```python
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test_cond.py", line 11, in <module>
    ep = torch._export.export(exportdb_example2, (torch.randn(4, 5),))
  File "/home/yidi/local/pytorch/torch/_export/__init__.py", line 340, in export
    gm_torch_level, _ = torch._dynamo.export(
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 1207, in inner
    result_traced = opt_f(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/test_cond.py", line 3, in exportdb_example2
    def exportdb_example2(x):
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 1173, in result_capturing_wrapper
    graph_captured_result = torch.func.functional_call(
  File "/home/yidi/local/pytorch/torch/_functorch/functional_call.py", line 143, in functional_call
    return nn.utils.stateless._functional_call(
  File "/home/yidi/local/pytorch/torch/nn/utils/stateless.py", line 264, in _functional_call
    return module(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 725, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 305, in __call__
    raise e
  File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 292, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.2", line 10, in forward
  File "/home/yidi/local/pytorch/torch/_ops.py", line 301, in __call__
    return wrapper()
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_ops.py", line 297, in wrapper
    return self.dispatch(
  File "/home/yidi/local/pytorch/torch/_ops.py", line 280, in dispatch
    return kernel(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/utils.py", line 52, in inner
    return autograd_not_implemented_inner(op, deferred_error, *args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/utils.py", line 25, in autograd_not_implemented_inner
    result = operator(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_ops.py", line 301, in __call__
    return wrapper()
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_ops.py", line 297, in wrapper
    return self.dispatch(
  File "/home/yidi/local/pytorch/torch/_ops.py", line 255, in dispatch
    return self.python_key_mode_table[type(curr_mode)](*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 310, in cond_fake_tensor_mode
    flat_false_outs, _ = pytree.tree_flatten(false_fn(*operands))
  File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 725, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 305, in __call__
    raise e
  File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 292, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward() takes 2 positional arguments but 3 were given
```

Thanks to @williamwen42 for spotting this error! We fix it by addressing the case when add_after is -1.

Test Plan:
See newly added tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109308
Approved by: https://github.com/williamwen42
2023-09-15 15:35:46 +00:00
1aba61e977 Allow cond to have more dynamo cache beyond limit (#109318)
This is a short-term workaround for https://github.com/pytorch/pytorch/issues/108500. In the long term, we should have separate caches when cond appears at different places in user code, or a per-true_fn/false_fn cache.

Test Plan:
See the added test; it checks that cond can go beyond the cache limit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109318
Approved by: https://github.com/ezyang
2023-09-15 15:33:36 +00:00
dfdc0b63c9 Bisect FX node asserts on ValidationException. (#107493)
This PR introduces binary search for finding smaller validation errors, when they occur.

We do that by bisecting the sequence of `torch._assert` FX nodes recorded as the source
expression of the translation validator (TV) by `ShapeEnv.evaluate_expr` calls. Then, we
raise the error caused by the earliest node.

In summary, the changes are:
- Call `bisect` on `ValidationError` @ _torch/_dynamo/convert_frame.py_
- Implement the binary search @ _torch/fx/experimental/symbolic_shapes.py_

Edit: moved `ShapeEnv` replay-recording to #107989
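
A minimal sketch of the bisection step (the callable names here are assumptions for illustration, not the actual implementation):

```python
# Given the ordered list of torch._assert FX nodes recorded by the translation
# validator, find the earliest node whose prefix already fails validation.
# Assumes failures are monotone: once a prefix fails, longer prefixes fail too.
def bisect_failing_assert(assert_nodes, prefix_fails):
    lo, hi = 0, len(assert_nodes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_fails(assert_nodes[: mid + 1]):
            hi = mid          # failure already reproducible with this prefix
        else:
            lo = mid + 1      # need more asserts to trigger the error
    return assert_nodes[lo]   # the earliest node introducing the failure
```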

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107493
Approved by: https://github.com/ezyang
ghstack dependencies: #107989
2023-09-15 15:18:12 +00:00
a873f523ba [aarch64][caffe2/torch/csrc/profiler] Support aarch64 in inline assembly (#104707)
Summary:
Port x86 inline assembly to aarch64:
- Use `sp` instead of `%rsp` for stack pointer; move to second caller-
  saved register `x1` instead of `%rsi`
- Use `x29` instead of `%rbp` for base pointer; move to third caller-
   saved register `x2` instead of `%rdx`

Test Plan:
```
$ buck2 build fbcode//mode/opt fbcode//caffe2/torch/fb/model_transform/fx2trt/packaging:generate_merge_net_file
```

Reviewed By: jasonjk-park

Differential Revision: D47242468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104707
Approved by: https://github.com/aaronenyeshi
2023-09-15 14:34:55 +00:00
faf3de35db Fix max/min.reduction_with_dim opinfo test for bool tensors (#109264)
The tests were failing because the input tensors were getting
incorrectly promoted to int64, due to `transform_args` incorrectly
considering the `dim` argument when determining the type to promote to.

It doesn't seem like type promotion is necessary in general for max and
min, so I've switched them to `type_promotion_kind=None`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109264
Approved by: https://github.com/eellison
ghstack dependencies: #109165
2023-09-15 12:37:17 +00:00
19f8b05afe Disable gradient check for linalg.eig (#109165)
Both the eager and compiled versions fail with the following message
when trying to compute the grad:

    RuntimeError: linalg_eig_backward: The eigenvectors in the complex
    case are specified up to multiplication by e^{i phi}. The specified
    loss function depends on this quantity, so it is ill-defined.

I'm not sure if there's a way to adapt the OpInfo such that the grad is
computable, but we should at least check that the forward pass is
correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109165
Approved by: https://github.com/eellison
2023-09-15 12:37:17 +00:00
66fdea606d Enable typing for _inductor/exc.py (#109176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109176
Approved by: https://github.com/eellison
ghstack dependencies: #109173
2023-09-15 12:36:59 +00:00
bd89f80bae Add more types for inductor_prims.py (#109173)
Also fix a grammatical issue in the docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109173
Approved by: https://github.com/eellison
2023-09-15 12:36:59 +00:00
1cc0921eb6 Add tensorboard to pip requirements (#109349)
https://github.com/pytorch/pytorch/pull/108351/files is failing on Mac and Windows because we don't have the dependency there.
It is available on Linux because it is included in .ci/docker/requirements-docs.txt.

Adding skips to make it green.

Here are some outputs for future debugging
https://github.com/pytorch/pytorch/actions/runs/6192933622/job/16813841625
https://ossci-raw-job-status.s3.amazonaws.com/log/16813841625
```

2023-09-15T02:09:43.2397460Z =================================== FAILURES ===================================
2023-09-15T02:09:43.2397650Z ______________________ TestTensorBoardSummary.test_audio _______________________
2023-09-15T02:09:43.2397830Z Traceback (most recent call last):
2023-09-15T02:09:43.2398090Z   File "/Users/ec2-user/runner/_work/pytorch/pytorch/test/test_tensorboard.py", line 417, in test_audio
2023-09-15T02:09:43.2398390Z     self.assertTrue(compare_proto(summary.audio('dummy', tensor_N(shape=(42,))), self))
2023-09-15T02:09:43.2398720Z   File "/Users/ec2-user/runner/_work/_temp/conda_environment_6192933622/lib/python3.9/unittest/case.py", line 688, in assertTrue
2023-09-15T02:09:43.2399100Z ##[endgroup]
2023-09-15T02:09:43.2399240Z     raise self.failureException(msg)
2023-09-15T02:09:43.2399400Z AssertionError: False is not true
2023-09-15T02:09:43.2399490Z
2023-09-15T02:09:43.2399590Z To execute this test, run the following from the base repo dir:
2023-09-15T02:09:43.2399820Z      python test/test_tensorboard.py -k test_audio
2023-09-15T02:09:43.2399930Z
```

https://github.com/pytorch/pytorch/actions/runs/6192933622/job/16814065258
https://ossci-raw-job-status.s3.amazonaws.com/log/16814065258
```

2023-09-15T02:38:44.6284979Z ================================== FAILURES ===================================
2023-09-15T02:38:44.6285295Z ______________________ TestTensorBoardNumpy.test_scalar _______________________
2023-09-15T02:38:44.6285556Z Traceback (most recent call last):
2023-09-15T02:38:44.6285915Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_tensorboard.py", line 794, in test_scalar
2023-09-15T02:38:44.6286325Z     res = make_np(np.float128(1.00008 + 9))
2023-09-15T02:38:44.6286705Z   File "C:\Jenkins\Miniconda3\lib\site-packages\numpy\__init__.py", line 315, in __getattr__
2023-09-15T02:38:44.6287700Z     raise AttributeError("module {!r} has no attribute "
2023-09-15T02:38:44.6288060Z AttributeError: module 'numpy' has no attribute 'float128'
2023-09-15T02:38:44.6288241Z
2023-09-15T02:38:44.6288390Z To execute this test, run the following from the base repo dir:
2023-09-15T02:38:44.6288679Z      python test\test_tensorboard.py -k test_scalar
2023-09-15T02:38:44.6288846Z
```

https://github.com/pytorch/pytorch/actions/runs/6193449301/job/16815113985
https://ossci-raw-job-status.s3.amazonaws.com/log/16815113985
```
2023-09-15T03:25:53.7797550Z =================================== FAILURES ===================================
2023-09-15T03:25:53.7797790Z __________________ TestTensorBoardSummary.test_histogram_auto __________________
2023-09-15T03:25:53.7798000Z Traceback (most recent call last):
2023-09-15T03:25:53.7798310Z   File "/Users/ec2-user/runner/_work/pytorch/pytorch/test/test_tensorboard.py", line 426, in test_histogram_auto
2023-09-15T03:25:53.7798690Z     self.assertTrue(compare_proto(summary.histogram('dummy', tensor_N(shape=(1024,)), bins='auto', max_bins=5), self))
2023-09-15T03:25:53.7799090Z   File "/Users/ec2-user/runner/_work/_temp/conda_environment_6193449301/lib/python3.9/unittest/case.py", line 688, in assertTrue
2023-09-15T03:25:53.7799430Z     raise self.failureException(msg)
2023-09-15T03:25:53.7799610Z AssertionError: False is not true
2023-09-15T03:25:53.7799720Z
2023-09-15T03:25:53.7799840Z To execute this test, run the following from the base repo dir:
2023-09-15T03:25:53.7800170Z      python test/test_tensorboard.py -k test_histogram_auto
2023-09-15T03:25:53.7800310Z
2023-09-15T03:25:53.7800430Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-09-15T03:25:53.7800870Z - generated xml file: /Users/ec2-user/runner/_work/pytorch/pytorch/test/test-reports/python-pytest/test_tensorboard/test_tensorboard-aef95b5e2d69c061.xml -
2023-09-15T03:25:53.7801200Z =========================== short test summary info ============================
```

https://github.com/pytorch/pytorch/actions/runs/6193576371/job/16815396352
https://ossci-raw-job-status.s3.amazonaws.com/log/16815396352
```
2023-09-15T03:47:02.9430070Z _________________ TestTensorBoardSummary.test_histogram_doane __________________
2023-09-15T03:47:02.9430250Z Traceback (most recent call last):
2023-09-15T03:47:02.9430520Z   File "/Users/ec2-user/runner/_work/pytorch/pytorch/test/test_tensorboard.py", line 433, in test_histogram_doane
2023-09-15T03:47:02.9430850Z     self.assertTrue(compare_proto(summary.histogram('dummy', tensor_N(shape=(1024,)), bins='doane', max_bins=5), self))
2023-09-15T03:47:02.9431180Z   File "/Users/ec2-user/runner/_work/_temp/conda_environment_6193576371/lib/python3.9/unittest/case.py", line 688, in assertTrue
2023-09-15T03:47:02.9431390Z     raise self.failureException(msg)
2023-09-15T03:47:02.9431550Z AssertionError: False is not true
2023-09-15T03:47:02.9431640Z
2023-09-15T03:47:02.9431730Z To execute this test, run the following from the base repo dir:
2023-09-15T03:47:02.9432000Z      python test/test_tensorboard.py -k test_histogram_doane
2023-09-15T03:47:02.9432120Z
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109349
Approved by: https://github.com/huydhn
2023-09-15 10:39:48 +00:00
9456de937b [dtensor] Fix and improve the sharding cache behavior (#109306)
resolves https://github.com/pytorch/pytorch/issues/109101

The problem is essentially that we were hashing all the arguments, including
scalars (i.e. aten.div(tensor, scalar)). In the optimizer, the scalar might
change every time we call the op, causing a cache miss on every call.

This PR improves the sharding cache behavior by introducing a
RuntimeSchemaInfo, used to record the hashing information needed at runtime
during op registration time. This enables us to:
* only hash arguments that are tensors or have static_argnum, so that cases
like aten.div.Tensor(tensor, 0.23231) hit the cache; currently we hash all
args, which excludes those cases (see the sketch below)
* with the correct cache behavior, optimizers will hit the cache again
and resolve the high CPU overhead issue.
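
A minimal sketch of the hashing idea (the helper name and RuntimeSchemaInfo's fields are assumptions for illustration):

```python
import torch

def sharding_cache_key(op, args, schema_info):
    parts = [op]
    for i, arg in enumerate(args):
        if isinstance(arg, torch.Tensor):
            parts.append((arg.shape, arg.dtype))  # hash tensor metadata only
        elif i in schema_info.static_argnum:
            parts.append(arg)  # explicitly-static scalars still participate
        # other scalars (e.g. the changing scalar in aten.div(tensor, scalar))
        # are skipped, so they no longer cause a cache miss on every call
    return hash(tuple(parts))
```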

A simple MLP shows all cache hits, and a single addmm drops to 0.319ms (from 0.341ms), showing some hashing improvements:
<img width="1172" alt="Screenshot 2023-09-14 at 11 06 07 AM" src="https://github.com/pytorch/pytorch/assets/9443650/3406d673-dd8d-4ad9-9b80-9d4721c430e3">

The Adam optimizer shows aten.div hitting the sharding cache again:
<img width="1016" alt="Screenshot 2023-09-14 at 11 02 10 AM" src="https://github.com/pytorch/pytorch/assets/9443650/4280e8e3-af44-4fc2-8360-ea80b768f1d9">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109306
Approved by: https://github.com/fduwjj
2023-09-15 10:32:49 +00:00
0cbca85707 Add check to prevent NumPy ndarray from being treated as tuple when indexing (#108954)
Fixes #108689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108954
Approved by: https://github.com/lezcano
2023-09-15 08:51:58 +00:00
f786fbdebd Reland 3rd try [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#109323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109323
Approved by: https://github.com/huydhn, https://github.com/voznesenskym
2023-09-15 08:44:14 +00:00
cyy
af7d79923c Remove thrift from Docker builds (#109344)
Thrift is not used in Pytorch. Outdated gcc7 configs are removed too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109344
Approved by: https://github.com/malfet
2023-09-15 08:28:32 +00:00
34ddf08f27 [inductor] update fbcode skips for AOTInductor (#109313)
Summary: It seems like the `if __name__ == "__main__":` part doesn't run in fbcode; instead, just add skips.

Differential Revision: D49258492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109313
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-09-15 04:28:25 +00:00
2b6d983b8b Reland [dynamo][activation checkpointing] Trace through ActivationWrapper (#109327)
Fixes https://github.com/pytorch/pytorch/issues/108269
Original reverted PR - https://github.com/pytorch/pytorch/pull/108599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109327
Approved by: https://github.com/aakhundov
2023-09-15 03:43:59 +00:00
924723bda7 Created nested utils.cpp (#109304)
# Summary
This refactors the preprocessing for nested tensors that glues into SDPA. This is done in order to aid with reviewing:
https://github.com/pytorch/pytorch/pull/97485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109304
Approved by: https://github.com/cpuhrsch
2023-09-15 03:33:11 +00:00
2d4924db32 Remove S3 Update Workflow (#109317)
Simpler workflow to update S3 management is done in https://github.com/pytorch/builder/pull/1531. We can remove this job from here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109317
Approved by: https://github.com/huydhn
2023-09-15 03:17:38 +00:00
b3272b2c00 Trace attention inference patterns with p=0, cleanup (#109118)
When dropout is traced in inference, it creates a clone() instead of the training pattern of rand() etc. This was partially addressed manually by https://github.com/pytorch/pytorch/pull/108141; however, that did not cover all of the patterns that include dropout, and there is no reason we should have to specify them manually.

This updates the inference patterns generated to trace with dropout_p = 0.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109118
Approved by: https://github.com/drisspg, https://github.com/Valentine233
2023-09-15 01:40:04 +00:00
5349615240 [dynamo] Unblock a model with jit.isinstance (#109178)
prevents this error

```
File "/tmp/jetter.azp5q59y/torch/fx/proxy.py", line 291, in create_arg
python/0     raise NotImplementedError(f"argument of type: {type(a)}")
python/0 torch._dynamo.exc.InternalTorchDynamoError: argument of type: <class 'typing._GenericAlias'>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109178
Approved by: https://github.com/yanboliang
2023-09-15 01:19:46 +00:00
2bca5f2af7 [C10D] Track pg name in c++. (#108813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108813
Approved by: https://github.com/wconstab
2023-09-15 01:10:29 +00:00
58a883093f [quant][pt2e] Add test for serialize and deserialize quantized model (#109158)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_save_load

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109158
Approved by: https://github.com/andrewor14
ghstack dependencies: #108924, #108925
2023-09-15 00:50:55 +00:00
cyy
36b8ca4e48 [2/N] apply clang-tidy in torch/csrc/autograd (#109277)
This PR follows the work of PR #109032.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109277
Approved by: https://github.com/albanD
2023-09-15 00:39:12 +00:00
ec3c748fa2 Document Triton dependency for the release process (#109296)
Document triton dependency for the Pytorch Release

### 🤖 Generated by Copilot at e9773f0

Add documentation for Triton dependency in `RELEASE.md`. The documentation covers how to install and use Triton for various PyTorch builds and platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109296
Approved by: https://github.com/malfet
2023-09-14 23:48:45 +00:00
cyy
8cb96f5f2c [Reland]Use cpuinfo to determine c10::ThreadPool thread number (#107339)
Relands PR #107010 and fixes BUCK builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107339
Approved by: https://github.com/ezyang
2023-09-14 23:44:23 +00:00
fa62308673 [tensorboard] Fix TensorBoard summary encoding for torch.bfloat16 tensors (#108351)
Summary:
The `tensor_proto` function in the TensorBoard summary writer code doesn't correctly encode `torch.bfloat16` tensors; it tries to use a data type of `DT_BFLOAT` when creating the protobuf, but `DT_BFLOAT` is not a valid enum value (see `types.proto`). The correct value to use when encoding tensors of this type is `DT_BFLOAT16`. This diff updates the type map in the summary code to use the correct type.

While fixing this error, I also noticed the wrong field of the protobuf was being used when encoding tensors of this type; per the docs in the proto file, the DT_HALF and DT_BFLOAT16 types should use the `half_val` field, not `float_val`. Since this might confuse folks trying to read this data from storage in the future, I've updated the code to correctly use the `half_val` field for these cases. Note that there's no real size advantage from doing this, since both the `half_val` and `float_val` fields are 32 bits long.
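
A minimal sketch of the corrected mapping (the dict name is an assumption; the real table lives in torch/utils/tensorboard/summary.py):

```python
import torch

# dtype -> (proto enum label, TensorProto field to populate)
_TYPE_MAP = {
    torch.float32: ("DT_FLOAT", "float_val"),
    torch.half: ("DT_HALF", "half_val"),          # torch.half is torch.float16
    torch.bfloat16: ("DT_BFLOAT16", "half_val"),  # was "DT_BFLOAT"/"float_val"
}
```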

Test Plan:
Added a parameterized unit test that tests encoding tensors with `torch.half`, `torch.float16`, and `torch.bfloat16` data types.

# Before this change
The test fails with an `ValueError` due to the incorrect enum label:
```
======================================================================
ERROR: test_bfloat16_tensor_proto (test_tensorboard.TestTensorProtoSummary)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/jcarreiro/fbsource/buck-out/v2/gen/fbcode/f88b3f368c9334db/caffe2/test/__tensorboard__/tensorboard#link-tree/torch/testing/_internal/common_utils.py", line 2382, in wrapper
    method(*args, **kwargs)
  File "/data/users/jcarreiro/fbsource/buck-out/v2/gen/fbcode/f88b3f368c9334db/caffe2/test/__tensorboard__/tensorboard#link-tree/test_tensorboard.py", line 871, in test_bfloat16_tensor_proto
    tensor_proto(
  File "/data/users/jcarreiro/fbsource/buck-out/v2/gen/fbcode/f88b3f368c9334db/caffe2/test/__tensorboard__/tensorboard#link-tree/torch/utils/tensorboard/summary.py", line 400, in tensor_proto
    tensor_proto = TensorProto(**tensor_proto_args)
ValueError: unknown enum label "DT_BFLOAT"

To execute this test, run the following from the base repo dir:
     python test/__tensorboard__/tensorboard#link-tree/test_tensorboard.py -k test_bfloat16_tensor_proto

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
```

# After this change
The test passes.

Reviewed By: tanvigupta17

Differential Revision: D48828958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108351
Approved by: https://github.com/hamzajzmati, https://github.com/XilunWu
2023-09-14 23:12:22 +00:00
bf5622e965 Revert "split by tag (#108892)"
This reverts commit 89b6276be9f1b04491625cc0d05de01c15f75597.

Reverted https://github.com/pytorch/pytorch/pull/108892 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/108892#issuecomment-1720249148))
2023-09-14 22:43:03 +00:00
be9f73f031 Revert "Add meta and OpInfo for _embedding_bag_dense_backward (#109211)"
This reverts commit fe14e43d14420a53426215a5fff30113da6d216a.

Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing 492a93d185 https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918))
2023-09-14 22:29:12 +00:00
28169193b4 [TD] Improve heuristic metrics collection (#109305)
Fixes a bug with heuristic metrics collection where the metrics would sometimes inaccurately claim a heuristic to have ranked a test more highly than any other heuristic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109305
Approved by: https://github.com/clee2000
2023-09-14 22:20:34 +00:00
89b6276be9 split by tag (#108892)
Differential Revision: D49107540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108892
Approved by: https://github.com/842974287
2023-09-14 21:49:11 +00:00
2bf7a283cb Remove expected test failures for cond (#108709)
Remove the expected failure in def test_control_flow_tracing(self) by changing the error message to `Expected pred to be bool or tensor, but got Proxy\(eq\)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108709
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #107662, #107850
2023-09-14 21:34:31 +00:00
6140facf00 Support SymBool input to torch.compile (#107850)
We could have SymBool inputs for torch.compile, e.g. in the following situation:
```
def f(x: torch.Tensor):
    pred = x.size(0) == 3        # a SymBool under symbolic tracing
    torch.compile(f)(pred, x)    # schematic: the compiled call receives a SymBool input

make_fx(f, tracing_mode="symbolic")(x)
```

The idea of this PR (credit to @ezyang) is to support SymBool by re-using the infra we've already had for SymInt so that we don't need to replicate a lot of stuff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107850
Approved by: https://github.com/ezyang
ghstack dependencies: #107662
2023-09-14 21:34:31 +00:00
ea94344821 [ROCm] Enable Lerp tests for complex32 (#108100)
Enables previously disabled "lerp" opinfo tests for chalf on ROCm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108100
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/kit1980
2023-09-14 21:21:29 +00:00
54c5f474a7 Forward rank and world size info to Torchbench models when using dynamo runner (#108438)
Adding support to pass rank and world_size to torchbench model, via its extra_args parameter: https://github.com/pytorch/benchmark/blob/main/torchbenchmark/util/model.py#L83C80-L83C90

This is used for models which distribute over multiple GPUs e.g. simple_gpt https://github.com/pytorch/benchmark/pull/1867

Also add an option to skip multiprocess-only GPU models.

Testing via `python benchmarks/dynamo/torchbench.py -d cuda --output=benchmark_logs/performance.csv --inference --performance --timing --print-memory --multiprocess --only simple_gpt`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108438
Approved by: https://github.com/Chillee
2023-09-14 21:01:20 +00:00
cyy
03e35efbf7 replace torch::make_unique with std::make_unique (#108866)
It should be safe to remove the old torch::make_unique functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108866
Approved by: https://github.com/albanD
2023-09-14 20:52:26 +00:00
f03b8abd47 [HigherOrderOp] Should automatically pop modes (#109157)
Fixes #108282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109157
Approved by: https://github.com/zou3519
2023-09-14 20:46:26 +00:00
492a93d185 [HSDP] Updating HSDP test - test_hsdp_init_with_device_mesh (#109202)
Remove DeviceMesh import dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109202
Approved by: https://github.com/fegin
2023-09-14 20:03:22 +00:00
602413a0a0 Refactor test_foreach.py (#107869)
## Summary
- Change the default of `supports_autograd` and `supports_forward_ad` of `ForeachFuncInfo` to `True`
- Add `test_zero_size_tensor_inputs` to make sure that foreach functions can handle 0-size Tensor inputs
- Add `test_parity` to check the consistency between the outputs of a foreach function and a for-loop over the native function.
- Add `test_autodiff` to check forward-mode and reverse-mode AD
- Keep the corner cases that are not covered by the newly introduced methods

rel:
- #58833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107869
Approved by: https://github.com/janeyx99
2023-09-14 19:39:26 +00:00
f7574ea43f torch.load: Replaced multiple one byte read() calls during the _is_zipfile check with a single call (#109119)
Fixes #108955.

Right now, the `_is_zipfile` check in `torch.load` performs multiple `read()` calls, reading 1 byte at a time in a loop. This is rather wasteful and leads to performance problems when accessing files on a network share (see #108955).
This PR replaces those 1-byte calls with a single big call. Functionally, this is equivalent, as `read(n)` only reads up to `n` bytes, so even if the file is shorter there should not be any problems.
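
A minimal sketch of the change (function names assumed; the real check lives in torch/serialization.py):

```python
# Before: one read() call per magic byte.
def _is_zipfile_old(f):
    magic = b"PK\x03\x04"  # local file header signature of a zip archive
    return all(f.read(1) == bytes([b]) for b in magic)

# After: a single read. read(n) returns at most n bytes, so a file shorter
# than the magic simply fails the comparison instead of erroring.
def _is_zipfile_new(f):
    magic = b"PK\x03\x04"
    return f.read(len(magic)) == magic
```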
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109119
Approved by: https://github.com/mikaylagawarecki
2023-09-14 19:39:10 +00:00
c382ad47dd Deprecate torch.cross default behaviour (#108760)
Long overdue this one. We may be able to change it in a few years :hopeful:.

**BC-breaking note**

This PR deprecates `torch.cross`'s default dim in favor of
`torch.linalg.cross`.
An upgrade guide is added to the documentation for `torch.cross`.

Note this PR DOES NOT remove `torch.cross`.
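
A short illustration of the deprecated default (a hedged sketch; the old default picks the first dimension of size 3 rather than the last):

```python
import torch

a, b = torch.randn(3, 5), torch.randn(3, 5)
torch.linalg.cross(a, b, dim=0)  # explicit dim, recommended
torch.cross(a, b)                # deprecated default: silently uses dim=0 here,
                                 # the first dimension whose size is 3
```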

Fixes https://github.com/pytorch/pytorch/issues/108664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108760
Approved by: https://github.com/albanD
2023-09-14 19:36:29 +00:00
78cd86c552 NT support for gt (#109121)
Needed for mask thresholding in SAM. TODO: Hook into existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109121
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-09-14 19:35:27 +00:00
263ca7d69b [ONNX] Remove deprecated functions (#107208)
The usage of some functions is deprecated. This PR drops them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107208
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-09-14 19:09:56 +00:00
0edb616793 Add test/onnx_caffe2 to ONNX Exporter merge rule (#109295)
As we deprecate the TorchScript ONNX exporter, we need to refactor the onnx_caffe2 tests to start using private functions instead of public ones.

That requires changes to the merge rules to allow the ONNX exporter to drive the deprecation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109295
Approved by: https://github.com/malfet
2023-09-14 19:06:12 +00:00
fe14e43d14 Add meta and OpInfo for _embedding_bag_dense_backward (#109211)
The sample inputs are a bit involved because there are a lot of
shenanigans in the derivative formula.  Check comments.

This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-09-14 18:49:32 +00:00
b121e4df92 Increase tolerances for baddbmm opinfo test (#109164)
The compiled `baddbmm` deviates from the eager `baddbmm` due to its
decomp into badd + bmm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109164
Approved by: https://github.com/eellison
2023-09-14 18:37:04 +00:00
9187559e75 [quant][be] Remove test/quantization/pt2e/test_quantize_pt2e_fx.py (#108925)
Summary:
this is no longer needed since we have the quantizer api now

Test Plan:
.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108925
Approved by: https://github.com/andrewor14
ghstack dependencies: #108924
2023-09-14 18:35:17 +00:00
900288f138 Revert "[inductor] Lower masked_scatter on CUDA (#108803)"
This reverts commit e4036ed7068cdcbe07470c1740ca25ab8ead7a3b.

Reverted https://github.com/pytorch/pytorch/pull/108803 on behalf of https://github.com/peterbell10 due to Bot merged after aborted rebase ([comment](https://github.com/pytorch/pytorch/pull/108803#issuecomment-1719918831))
2023-09-14 18:12:27 +00:00
d4990ad5a1 Fix the example in the extending.func.rst (#109279)
As the title shown ,the `backward` function is missing the definition of `ind` and `ind_inv`, which will lead to error when calling backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109279
Approved by: https://github.com/zou3519
2023-09-14 17:29:39 +00:00
9021fb8dac [dynamo] implement custom dict variable as a general solution for HF's ModelOutput class (#105044)
Before the PR, for HF's ModelOutput class, we used dicts.py/DataClassVariable with our own implementations of __getitem__, __setattr__, and __setitem__. There is a risk that the ModelOutput logic may change, since it is user code.

After the PR, we inline __getitem__, __setattr__, and __setitem__ using dicts.py/CustomizedDictVariable, so the logic always stays in sync with the user code.

unit test
* python test/dynamo/test_model_output.py -k test_HF_bert_model_output

test on HF benchmark
* python benchmarks/dynamo/huggingface.py -d cuda --inference --accuracy --progress --inductor --print-dataframe-summary 2>&1
* all metric are the same before/after the PR, including pass rate, unique_graphs, graph_breaks, unique_graph_breaks
  * before the PR: P790393916
  * after the PR: P790368991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105044
Approved by: https://github.com/jansel
2023-09-14 17:15:50 +00:00
e4036ed706 [inductor] Lower masked_scatter on CUDA (#108803)
This decomposes masked_scatter into `aten.cumsum` and a single pointwise kernel,
which is similar to what is done in eager. I only do this for CUDA because on CPU
it isn't split into two passes like this, so it would cause a slowdown.
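
A conceptual sketch of the decomposition (assumes self and mask share a shape; not the actual inductor lowering):

```python
import torch

def masked_scatter_decomp(self, mask, source):
    # The prefix count of True entries gives, for each masked position,
    # the index of the source element to pick (pass 1: cumsum).
    idx = torch.cumsum(mask.reshape(-1).to(torch.int64), 0) - 1
    # Pass 2: a single pointwise select between gathered source and self.
    gathered = source.reshape(-1)[idx.clamp(min=0)].reshape(self.shape)
    return torch.where(mask, gathered, self)
```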

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108803
Approved by: https://github.com/lezcano
ghstack dependencies: #108802
2023-09-14 17:07:53 +00:00
800c665618 Revert "[inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)"
This reverts commit 5976a08eea1656a0f5420661b33e0937248f2097.

Reverted https://github.com/pytorch/pytorch/pull/106581 on behalf of https://github.com/peterbell10 due to This combined with #108803 uncovered a triton bug openai/triton#2298 ([comment](https://github.com/pytorch/pytorch/pull/106581#issuecomment-1719811113))
2023-09-14 16:58:52 +00:00
1b502139f3 Added a flag is_cpu to the AOTInductor runtime (#109300)
Summary:
added a flag is_cpu that can be specified by the user to
indicate whether the AOTInductor runtime is for CPU. It's
false by default.

Test Plan: ci

Reviewed By: hl475, aakhundov

Differential Revision: D49253826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109300
Approved by: https://github.com/aakhundov
2023-09-14 16:24:09 +00:00
3acccb3aa0 [AOTInductor] Add is_cpu for AOTInductorModelContainer (#109287)
Summary:
If is_cpu is set for the model container, there is no need to move the weights from CPU to the device.

Reviewed By: bertmaher

Differential Revision: D49252595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109287
Approved by: https://github.com/aakhundov
2023-09-14 16:24:01 +00:00
b226373d16 Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit b6a1d3fb97ca8eeccf15a4c495fdd1af4b197f88.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to I'm very sorry but it looks like #106543 was not fixed, I still see it failing on main b6a1d3fb97 https://github.com/pytorch/pytorch/actions/runs/6185704949/job/16793975677 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1719747065))
2023-09-14 16:13:34 +00:00
94a54b89aa [dynamo] Add BACKEND_MATCH guard to detect and recompile when backend changes (#107337)
**Motivation:**
We try to make torch.cond use torch.compile automatically so that we can error out when there are side effects in the branches and correctly handle the closures.

Before this PR, we have a warning if we don't turn on a config raise_on_backend_change (turning it on gives us an error) for the following code:
```python
def foo(*args): ...

# Inside torch.cond, we'd like to do something like
torch.compile(foo, backend="eager", fullgraph=True)(...)
...
# Users may then call torch.compile somewhere else.
# Dynamo will use the cached code of foo for "eager" backend
# but we expect dynamo to recompile with "inductor" backend.
torch.compile(foo, backend="inductor")(...)
```

This PR adds a BACKEND_MATCH guard. Effectively, it implements a per-backend cache. In the above example, the cached code for "eager" won't work for "inductor" due to guard check failures and the second torch.compile will do a re-compilation. In the future, it might be useful to have something like a configuration guard that guards against dynamo configuration changes across different compiles (e.g. compile a function with fullgraph=False then compile it again with fullgraph=True).
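
A hedged illustration of the resulting behavior:

```python
import torch

def foo(x):
    return x.sin()

x = torch.randn(3)
torch.compile(foo, backend="eager")(x)     # compiles and caches for "eager"
torch.compile(foo, backend="inductor")(x)  # BACKEND_MATCH guard fails on the
                                           # cached entry, so foo recompiles
```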

**Implementation:**
1. We add a guarded_backend_cache and check the most_recent_backend against the backend associated with cached code. We also remove the raise_on_backend_change flag.

Note: More lines are printed in the debug log due to the newly added context manager and guard additions.

**Test Plan:**
Removed original tests that raise on different backend and add a new test to test whether the BACKEND_MATCH guard can guard against backend change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107337
Approved by: https://github.com/jansel
2023-09-14 15:49:30 +00:00
9b3f5823f3 Added test for interpolate nearest exact (#108558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108558
Approved by: https://github.com/mikaylagawarecki
2023-09-14 15:17:33 +00:00
111b9ef390 [ROCM] Enable test_fn_fwgrad_..._functional_binary_cross_entropy on ROCM (#109038)
Fixes #98431
This change addresses a hardware assertion that was triggered in ROCm-only tests.

Description of the problem:
Assertion triggered
```
Device-side assertion `target_val >= zero && target_val <= one' failed.
```
The issue in question is due to a GPU-side assertion in `binary_cross_entropy_out_cuda` where a `target_val` gets passed to the kernel that does not fall between 0 and 1. The value in question that triggers the assertion is -0.000000000810. The origin of this negative value is one of the tensors generated for the test. In this tensor, one of the values (on ROCm) is 0.000000999190, which adheres to the restriction that it is between 0 and 1. However, this value is eventually passed as a single-entry tensor to gradcheck.py::_compute_numerical_gradient
( https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L347)

This function perturbs the tensor value in-place by subtracting `v` from it and then adding it back. The value of `v` comes from the default `eps` value defined here https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L2119

Currently pegged at `1e-6`. So what occurs is that when an input is less than the default eps (like 0.000000999190), the perturbation calculation causes an entry in the tensor to flip to negative, i.e. 0.000000999190 - 1e-6 = -0.000000000810 (due to the subtraction here: https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L364), which then triggers the device-side assertion in `binary_cross_entropy_out_cuda`.
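
A quick numeric check of the flip described above:

```python
eps = 1e-6
target = 0.000000999190   # valid BCE target in [0, 1]
print(target - eps)       # ~ -8.1e-10: the perturbed value leaves [0, 1]
```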

This PR loosens the EPS by an order of magnitude to get around the error. Since this issue has not been caught in the field in any meaningful way, I find this to be an adequate solution, though am happy to hear opposing viewpoints.

It is important to mention that while this error was only occurring on ROCm platforms, the issue described is also present in CUDA-based environments. The difference is that CUDA doesn't seem to generate a tensor with any values less than `1e-6`. When injecting the small value on an Nvidia box, the same device-side assertion was triggered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109038
Approved by: https://github.com/jeffdaily, https://github.com/albanD
2023-09-14 15:17:29 +00:00
7f1f5afc91 Run only one pytest parametrization when generating optest (#108936)
Richard, I'm curious to see what you think of this. I'm trying to use optest on the torchvision test suite, and after hacking up pytest support in https://github.com/pytorch/pytorch/pull/108929 I noticed that this was 5x'ing the test time... for no good reason.

* torchvision nms tests before optests: 60 passed, 4 skipped, 1206 deselected in 11.47s
* after optests: 300 passed, 20 skipped, 1206 deselected in 49.85s

There's no good reason for this: torchvision has parametrized the tests to get a spread of various random values, but for checking schemas or fake tensors, we don't actually need to test different values.

This PR hacks up the codegen to replace pytest parametrize markers so that, instead of sampling many values, we sample only one value if you mark it with `opcheck_only_one`. There's a carveout for device parametrization, where we always run all those variants.
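
A sketch of how a test opts in (usage inferred from the marker named above; hedged):

```python
import pytest

@pytest.mark.parametrize("seed", range(10))
@pytest.mark.opcheck_only_one()  # generated optests sample a single `seed`;
                                 # device parametrizations still run in full
def test_nms(seed):
    ...
```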

With this PR:

* reduced optests: 88 passed, 4 skipped, 1206 deselected in 13.89s

Companion torchvision PR which uses this at https://github.com/pytorch/vision/pull/7961

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108936
Approved by: https://github.com/zou3519
2023-09-14 14:54:57 +00:00
7f7f6267e9 [AOTInductor] Skip pre_grad_passes for exported graph. (#109246)
Summary:
We skip pre_grad_passes if the graph comes from export (aten IR), since
pre_grad_passes (i.e. remove_identity) would not preserve meta["val"] in aten IR.

Differential Revision: D49246374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109246
Approved by: https://github.com/aakhundov
2023-09-14 13:30:12 +00:00
b6a1d3fb97 add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-14 12:23:59 +00:00
41e2189843 [quant] Remove reference representation rewrite for adaptive_avg_pool2d (#108924)
Summary:
Integer adaptive_avg_pool2d is not well defined due to different possible ways of rounding an fp32 value to an integer value, and
this op isn't too critical for numerics (since it appears not too often), so we'll skip this for now.

We might need to revert the changes that add the integer impl for the adaptive_avg_pool op as well.

Test Plan:
python test/test_quantization.py TestQuantizePT2ERepresentation

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108924
Approved by: https://github.com/kimishpatel
2023-09-14 10:18:36 +00:00
a6fadf643f Re-do D48544397: [TGIF Inplace] [xlv2][1/n] Expose a couple APIs from inline_container that will be used for chunk read" (#109183)
Summary:
Original commit changeset: 4a5f31518ad0

Original Phabricator Diff: D48544397

fix easycla

Differential Revision: D49221088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109183
Approved by: https://github.com/wqfish
2023-09-14 08:17:14 +00:00
9cd4548f01 AOTInductor dynamic shape (#109012)
Summary: This PR adds dynamic-shape support for AOTInductor

* On the runtime/interface side, we added two structs, StaticDimInfo
and DynamicDimInfo, to hold values for static and dynamic dimensions,
respectively. Dynamic dimensions are tracked by an unordered map field
defined in AOTInductorModelBase. At inference time, the inference run
method will assign the current real dimensional value to each dynamic
dimension before executing any kernel.

* On the CUDA wrapper codegen side, we generate dynamic symbols
appropriately for shape computations. We simulate kernel launch grids
in the C++ land by re-using the grid functions from the Python world.
The returned grid configs, which may contain symbolic expressions,
are printed out in their C++ forms via the CppPrinter. Note that
when dynamic shapes are involved, we have to compute grid configs
for each kernel at runtime in the same way as we do for launching
the corresponding Triton kernel. Otherwise, we may end up with
memory-access failures or mis-computations caused by invalid indices
for fetching or storing data in device memory.

Differential Revision: D49100472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475
2023-09-14 08:00:30 +00:00
f4e96df60a [export] Preserve shape dynamism for unused inputs. (#109239)
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109239
Approved by: https://github.com/ydwu4
2023-09-14 07:43:36 +00:00
25bf1a49c0 [FSDP][Wrap] ModuleWrapPolicy callable (#109117)
Makes ModuleWrapPolicy callable; in my case this is needed for
composition with or_policy. We should also make or_policy a public interface,
IMO.

Differential Revision: [D49175112](https://our.internmc.facebook.com/intern/diff/D49175112/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109117
Approved by: https://github.com/fegin
ghstack dependencies: #109116
2023-09-14 07:14:18 +00:00
f558e86fa0 [FSDP] continue if param not exist in sharded load (#109116)
If I add a param and then wrap with FSDP and load a state dict, don't hard
error here when strict=False.

Differential Revision: [D49170812](https://our.internmc.facebook.com/intern/diff/D49170812/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109116
Approved by: https://github.com/fegin
2023-09-14 07:14:18 +00:00
6898754401 [ONNX] bump ort-nightly==1.16.0.dev20230908001 (#109212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109212
Approved by: https://github.com/malfet
2023-09-14 05:04:18 +00:00
90068ab30a Fix CUDA-12 wheel loading on AmazonLinux (#109244)
Or any other distro that has different purelib and platlib paths. The regression was introduced when the small-wheel base dependency was migrated from CUDA-11 to CUDA-12.

Not sure why, but the minor version of the package is no longer shipped with the following CUDA-12 packages:
 - nvidia_cuda_nvrtc_cu12-12.1.105
 - nvidia-cuda-cupti-cu12-12.1.105

But those were present in the CUDA-11 release, e.g.:
``` shell
bash-5.2# curl -OL 922c5996aa/nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl |grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.11.2   OK
bash-5.2# curl -OL c64c03f49d/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl|grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.12.1   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.12   OK
```

Fixes https://github.com/pytorch/pytorch/issues/109221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109244
Approved by: https://github.com/huydhn
2023-09-14 03:13:32 +00:00
47f79e9a2b Revert "Support SymBool input to torch.compile (#107850)"
This reverts commit 9f6d70b2fdbc4847dbff7c807c5620b4b408bb59.

Reverted https://github.com/pytorch/pytorch/pull/107850 on behalf of https://github.com/huydhn due to Sorry for reverting this, but test_export_with_symbool_inputs is failing in trunk a08e1370ef ([comment](https://github.com/pytorch/pytorch/pull/107850#issuecomment-1718675877))
2023-09-14 02:53:36 +00:00
de76c88d90 Revert "Remove expected test failures for cond (#108709)"
This reverts commit a08e1370ef8cb13cfbf18d9663427a57fa8657f2.

Reverted https://github.com/pytorch/pytorch/pull/108709 on behalf of https://github.com/huydhn due to Sorry for reverting this, but test_export_with_symbool_inputs is failing in trunk a08e1370ef ([comment](https://github.com/pytorch/pytorch/pull/108709#issuecomment-1718669964))
2023-09-14 02:47:28 +00:00
05170b0b73 Reformat line of code header to put co_name after (#109233)
I find this more intuitive as it matches the default Python traceback
formatting.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109233
Approved by: https://github.com/williamwen42
2023-09-14 02:07:16 +00:00
c914ca7577 [quant][be] Add TestPT2ERepresentation test case (#108923)
Summary:
att

Test Plan:
python test/test_quantization.py TestPT2ERepresentation
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108923
Approved by: https://github.com/andrewor14
2023-09-14 02:01:38 +00:00
064ae9ff33 Support register_hook on input tensors (#108903)
The strategy in this PR is pretty straightforward.

There are 2 kinds of hooks:

1) Hooks on objects with sources (inputs, params)
2) Hooks on objects w/o sources (intermediaries, and outputs).

Note: As outputs can be made simple by how dynamo handles residuals, they could actually be handled as if they were inputs, but, for the sake of this PR, we will refer to hooks as either hooks on inputs (sourced), or hooks on intermediaries (not sourced).

The plan:

**For tensors w/ a source:**
We record registered hooks, store them as a global, and associate them with the tensor in residuals. This means that when dynamo goes to create the frame, where we produce bytecode to stitch together our PT2 modified bytecode with the original eager code, we call `register_hook`. This registration of hooks in residuals is sound because (a) it happens right after a PT2 frame region ends and (b) we know that the tensor is alive in f_locals, f_globals, or a module in the user's invoking frame. This means we can soundly know it will be around to invoke `register_hook` on. As long as we guard on the identity of the lifted function, this is sound to do.

**For tensors w/o a source:**
Graph break - we will support this in a subsequent PR

**Handles:**

An interesting new component here is the creation of a `STORE_FAST`->`LOAD_FAST` pair associated with the handle, the return result of `register_hook`. If the user code stored the result of `register_hook` in a handle, we need to honor that. We do so by interceding in `STORE_FAST` and recording the name of the local variable as directed by the user code. We then honor that same name in the reconstructed bytecode. If the user did not store a hook, we merely pop the produced value to preserve the stack.
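
A hedged example of the handle case being honored:

```python
import torch

def fn(x):
    h = x.register_hook(lambda grad: grad * 2)  # user stores the handle
    y = x.sin()
    h.remove()  # dynamo must keep `h` addressable under the same local name
    return y

fn(torch.randn(3, requires_grad=True))
```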

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108903
Approved by: https://github.com/ezyang
ghstack dependencies: #108846, #109092
2023-09-14 01:52:21 +00:00
50a084070f [inductor][easy] Enable mypy checking for all inductor files that already pass (#109238)
Summary: Let's just enable mypy checking wherever it already passes. I checked all entries in the exclude list and enabled any that individually pass. Also needed one trivial change to a file already enabled.

Test Plan: `lintrunner torch/_inductor/*.py torch/_inductor/*/*.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109238
Approved by: https://github.com/eellison
2023-09-14 01:45:25 +00:00
acad84ba6c Disable cutlass tests in fbcode (#109241)
Summary: ATT, fbcode requires different cutlass path setup.

Test Plan: CI

Differential Revision: D49242138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109241
Approved by: https://github.com/DanilBaibak, https://github.com/chenyang78
2023-09-14 01:41:10 +00:00
62732bdcdb [ez][inductor][fx passes] quick fix for invalid nodes (#109234)
Summary: As title. Need to check whether a node is valid before fusion.

Test Plan: To add test

Differential Revision: D49241525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109234
Approved by: https://github.com/yanboliang
2023-09-14 01:40:49 +00:00
5edbee9404 [export] Normalize nn_module_stack paths. (#109231)
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109231
Approved by: https://github.com/angelayi
2023-09-14 01:34:31 +00:00
109ab6a0df Support str() on user defined functions (#108973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108973
Approved by: https://github.com/anijain2305
2023-09-14 01:32:02 +00:00
a08e1370ef Remove expected test failures for cond (#108709)
Remove the expected failure in def test_control_flow_tracing(self) by changing the error message to `Expected pred to be bool or tensor, but got Proxy\(eq\)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108709
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #107662, #107850
2023-09-14 01:16:29 +00:00
9f6d70b2fd Support SymBool input to torch.compile (#107850)
We could have SymBool inputs for torch.compile, e.g. in the following situation:
```
def f(x: torch.Tensor):
    pred = x.size(0) == 3        # a SymBool under symbolic tracing
    torch.compile(f)(pred, x)    # schematic: the compiled call receives a SymBool input

make_fx(f, tracing_mode="symbolic")(x)
```

The idea of this PR (credit to @ezyang) is to support SymBool by re-using the infra we've already had for SymInt so that we don't need to replicate a lot of stuff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107850
Approved by: https://github.com/ezyang
ghstack dependencies: #107662
2023-09-14 01:16:29 +00:00
025d1a18ab [export] Separate out exported_program.py (#109147)
Test Plan: CI

Differential Revision: D49205011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109147
Approved by: https://github.com/zhxchen17
2023-09-14 01:14:46 +00:00
4a09ed5459 [inductor] Parallelize Max Autotune step 2: Use multiple GPUs (#109127)
Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109127
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #109126
2023-09-14 00:37:39 +00:00
ce4283933f [inductor] Parallelize Max Autotune step 1: refactor autotune_process (#109126)
Summary: Step 1 in revamping subprocess autotune to support multiple GPUs. This diff just does some refactoring to autotune_process.py in order to prepare for the next diff:
* Move all logic for managing the sub-process (like detecting sub-process crashes) into the TuningProcess class.
* Use log.debug statements instead of print statements

Test Plan: python test/inductor/test_max_autotune.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109126
Approved by: https://github.com/shunting314, https://github.com/eellison
2023-09-14 00:37:39 +00:00
dbddf1816a Remove include_0d from sample_inputs_gather (#109125)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109125
Approved by: https://github.com/lezcano
ghstack dependencies: #108879, #108880, #109120
2023-09-13 23:13:09 +00:00
61f0578787 Update take_along_dim docs to include dim=None case (#109120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109120
Approved by: https://github.com/lezcano
ghstack dependencies: #108879, #108880
2023-09-13 23:13:09 +00:00
d046376c4f Dispatch numpy.take_along_axis to torch.take_along_dim (#108880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108880
Approved by: https://github.com/lezcano
ghstack dependencies: #108879
2023-09-13 23:13:09 +00:00
49e3d76684 Add SymInt support to torch.take_along_dim (#108879)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108879
Approved by: https://github.com/Skylion007, https://github.com/lezcano, https://github.com/Chillee
2023-09-13 23:13:09 +00:00
aca3bd44d1 Fix failing inductor test (#109220)
Summary: This broke as a result of the flashv2 PR. The tests couldn't be listed except on an A100 machine, which is weird.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:fused_attention

Differential Revision: D49239716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109220
Approved by: https://github.com/eellison
2023-09-13 23:12:32 +00:00
33c94b8b16 Better error handling for cond (#108817)
## Exception in cond:
For code below:
```python
import torch
import functorch.experimental.control_flow as control_flow
def true_fn(x):
    return x.sin()

def false_fn(x):
    return x, x

def f(x, y):
    return control_flow.cond(y, true_fn, false_fn, [x])

f(torch.ones(3, 4), torch.tensor(False))
```
The original exception stack trace is:
```python
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test_exc.py", line 33, in <module>
    f(torch.ones(3, 4), torch.tensor(False))
  File "/home/yidi/local/pytorch/test_exc.py", line 31, in f
    return control_flow.cond(y, true_fn, false_fn, [x])
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 154, in cond
    return torch.compile(cond_op, backend="eager", fullgraph=True)(
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 365, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 513, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 380, in _convert_frame_assert
    return _compile(
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 560, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 197, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 482, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 449, in transform
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2083, in run
    super().run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 733, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 696, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 397, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1164, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars.items)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 570, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 418, in call_function
    (false_r, false_graph, false_lifted_freevars) = speculate_branch(False)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 410, in speculate_branch
    raise UncapturedHigherOrderOpError(
torch._dynamo.exc.UncapturedHigherOrderOpError: Expected branch to return a single tensor

from user code:
   File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```
After this PR we get:
```python
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 50, in graph_break_as_hard_error
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 429, in call_function
    (false_r, false_graph, false_lifted_freevars) = speculate_branch(False)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 421, in speculate_branch
    unimplemented(
  File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 187, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: Expected branch to return a single tensor

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test_exc.py", line 33, in <module>
    f(torch.ones(3, 4), torch.tensor(False))
  File "/home/yidi/local/pytorch/test_exc.py", line 31, in f
    return control_flow.cond(y, true_fn, false_fn, [x])
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 154, in cond
    return torch.compile(cond_op, backend="eager", fullgraph=True)(
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 338, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 500, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 382, in _convert_frame_assert
    return _compile(
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 562, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 484, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 451, in transform
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2088, in run
    super().run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 728, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 691, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1159, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars.items)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 565, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 53, in graph_break_as_hard_error
    raise UncapturedHigherOrderOpError(reason + msg) from e
torch._dynamo.exc.UncapturedHigherOrderOpError: Cond doesn't work unless it is captured completely with torch.compile. Scroll up to find out what causes the graph break.

from user code:
   File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```
## Exception during speculating branches
The example code below has an in-place buffer mutation error:
```python
import torch
import functorch.experimental.control_flow as control_flow

class Foo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.ones(6, 4))

    def forward(self, x):
        def true_fn(x):
            self.buffer += 1
            return self.buffer.sum() + x.sum()

        def false_fn(x):
            return (x - 1).sum()

        return control_flow.cond(x.shape[0] > 4, true_fn, false_fn, [x])

mod_for_compile = torch.compile(Foo(), backend="eager", dynamic=True)
mod_for_compile(torch.ones(3, 4))
```

Before this PR the exception looks like:
```python
[2023-09-08 15:20:03,332] [0/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting cond, we were unable to trace function `true_fn` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2023-09-08 15:20:03,332] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] Can't inplace modify module params/buffers inside HigherOrderOp
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 163, in speculate_subgraph
    output = f.call_function(tx, args, sub_kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 606, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2200, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2316, in inline_call_
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 733, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 696, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1219, in STORE_ATTR
    .call_function(self, [obj, ConstantVariable(inst.argval), val], {})
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builtin.py", line 618, in call_function
    result = handler(tx, *args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builtin.py", line 1169, in call_setattr
    raise AttributeMutationError(
torch._dynamo.exc.AttributeMutationError: Can't inplace modify module params/buffers inside HigherOrderOp

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 394, in speculate_branch
    ret_val, ret_graph, ret_lifted_freevars = speculate_subgraph(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 222, in speculate_subgraph
    raise Unsupported(
torch._dynamo.exc.Unsupported: speculate_subgraph: while introspecting cond, we were unable to trace function `true_fn` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. Scroll up for the stack trace of the initial exception. The reason was: Can't inplace modify module params/buffers inside HigherOrderOp

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test_exc.py", line 20, in <module>
    mod_for_compile(torch.ones(3, 4))
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 365, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 513, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 632, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 380, in _convert_frame_assert
    return _compile(
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 560, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 197, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 482, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 449, in transform
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2083, in run
    super().run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 733, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 696, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 397, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1124, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 570, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 261, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 606, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2200, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2316, in inline_call_
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 733, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 696, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 397, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1124, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 570, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 415, in call_function
    (true_r, true_graph, true_lifted_freevars) = speculate_branch(True)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 405, in speculate_branch
    raise UncapturedHigherOrderOpError(
torch._dynamo.exc.UncapturedHigherOrderOpError: Cond doesn't work unless it is captured completely with torch.compile

from user code:
   File "/home/yidi/local/pytorch/test_exc.py", line 16, in forward
    return control_flow.cond(x.shape[0] > 4, true_fn, false_fn, [x])
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 127, in cond
    return cond_op(pred, true_fn, false_fn, operands)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

After this PR, the only difference is that the error message of `UncapturedHigherOrderOpError` changes from `Cond doesn't work unless it is captured completely with torch.compile` to `Cond doesn't work unless it is captured completely with torch.compile. Scroll up to find out what causes the graph break`.

```python
[2023-09-08 15:17:02,052] [0/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting cond, we were unable to trace function `true_fn` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2023-09-08 15:17:02,052] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] Can't inplace modify module params/buffers inside HigherOrderOp
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 177, in speculate_subgraph
    output = f.call_function(tx, args, sub_kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 601, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2193, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2300, in inline_call_
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 728, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 691, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1214, in STORE_ATTR
    .call_function(self, [obj, ConstantVariable(inst.argval), val], {})
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builtin.py", line 618, in call_function
    result = handler(tx, *args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/builtin.py", line 1169, in call_setattr
    raise AttributeMutationError(
torch._dynamo.exc.AttributeMutationError: Can't inplace modify module params/buffers inside HigherOrderOp

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 50, in graph_break_as_hard_error
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 426, in call_function
    (true_r, true_graph, true_lifted_freevars) = speculate_branch(True)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 410, in speculate_branch
    ret_val, ret_graph, ret_lifted_freevars = speculate_subgraph(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 236, in speculate_subgraph
    raise Unsupported(
torch._dynamo.exc.Unsupported: speculate_subgraph: while introspecting cond, we were unable to trace function `true_fn` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. Scroll up for the stack trace of the initial exception. The reason was: Can't inplace modify module params/buffers inside HigherOrderOp

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test_exc.py", line 20, in <module>
    mod_for_compile(torch.ones(3, 4))
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 338, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 500, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 634, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 382, in _convert_frame_assert
    return _compile(
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 562, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 484, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 451, in transform
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2088, in run
    super().run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 728, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 691, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1119, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 565, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 261, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 601, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2193, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2300, in inline_call_
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 728, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 691, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1119, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 565, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 53, in graph_break_as_hard_error
    raise UncapturedHigherOrderOpError(reason + msg) from e
torch._dynamo.exc.UncapturedHigherOrderOpError: Cond doesn't work unless it is captured completely with torch.compile. Scroll up to find out what causes the graph break.

from user code:
   File "/home/yidi/local/pytorch/test_exc.py", line 16, in forward
    return control_flow.cond(x.shape[0] > 4, true_fn, false_fn, [x])
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 127, in cond
    return cond_op(pred, true_fn, false_fn, operands)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108817
Approved by: https://github.com/zou3519
2023-09-13 23:03:59 +00:00
04a765f95d Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit 6065e7a97cfad4c2ae2b8722969648a53265fa13.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to sorry it looks like this is causing an unexpected success for `test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_nn_functional_batch_norm_cpu_float16` 6065e7a97c https://github.com/pytorch/pytorch/actions/runs/6178069462/job/16770849782 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1718402208))
2023-09-13 22:38:42 +00:00
c44f816960 Disable tests mentioned in 109213 (#109232)
#109213
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109232
Approved by: https://github.com/huydhn
2023-09-13 22:29:00 +00:00
2d26364fb3 [caffe2][cuda] Fix instrumentation of malloc/free SDTs for CUDACachingAllocator (#108907)
Summary:
There's currently a bug in `CUDACachingAllocator` which makes it impossible to determine whether a `malloc`ed sample has been deallocated (introduced in D48229150).

It happens because we currently instrument the `malloc` SDT **before** a block of memory has been allocated by either `cudaMalloc` or a local caching allocator `malloc` call. Since this is a static tracepoint, it receives arg values at the point of instrumentation. Currently, it receives the memory pointer, `void* p`, which is NULL.

Changes in this diff:
1) Move this SDT to right before the `allocate` function returns, so that memory has already been allocated and the `p` pointer points to a valid, non-NULL address.
2) Enable tracing of `cudaMalloc` calls, in addition to `NativeCachingAllocator::malloc`
3) Rename a poorly-named local var: `r` --> `devPtr` (pointer to the allocated memory block)

Test Plan:
Tested with a local PyTorch script that leaks memory. Verified the following:
* prior to this fix (prod), malloc samples are **not** marked as "freed"
* with the fix (branch), samples **are** marked as "freed"
* results are comparable with the current uprobe implementation to sample PyTorch malloc events in `gpusnoop`

Reviewed By: chaekit

Differential Revision: D48873734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108907
Approved by: https://github.com/chaekit
2023-09-13 22:15:41 +00:00
faa5985dfe Fix issue when input/output buffer of functional collective (e.g. allreduce / allgather) is incorrectly reused later (#108811)
For this program:
```python
def func(a, *, tag, ranks, group_size):
    ar = torch.ops.c10d_functional.all_reduce(a, "sum", tag, ranks, group_size)
    ar = torch.ops.c10d_functional.wait_tensor(ar)
    c = torch.relu(a)
    # c = a
    d = torch.matmul(c, c)
    e = d + ar
    return (e,)
```
the generated code is:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (4, 4), (4, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1) # no-op to ensure context
        buf0 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        buf0.copy_(arg0_1) #no reuse
        buf1_pg = c10d._find_or_create_pg_by_ranks_and_tag('', [0, 1], 2)
        buf1 = buf0
        buf1_work = dist.all_reduce(buf1, async_op=True, group=buf1_pg, op=fun_col_impl._str_to_reduce_op('sum'))
        fun_col_impl._register_tensor_work(buf1, buf1_work)
        del buf1
        buf0 = _wait_tensor(buf0)
        buf2 = buf0
        buf3 = buf0; del buf0  # reuse
        # Source Nodes: [relu], Original ATen: [aten.relu]
        stream1 = get_cuda_stream(1)
        triton_poi_fused_relu_0.run(arg0_1, buf3, 16, grid=grid(16), stream=stream1)
        del arg0_1
        buf4 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        # Source Nodes: [add, relu], Original ATen: [aten.add, aten.relu]
        extern_kernels.addmm(buf2, buf3, buf3, alpha=1, beta=1, out=buf4)
        return (buf4, )
```
We can notice that the allreduce input (`buf1`, which is an alias of `buf0`) is incorrectly reused as the input (`buf3`) to the in-place Triton kernel `triton_poi_fused_relu_0`, diverging from eager-mode logic.

In general, we should make it so that Inductor doesn't try to reuse the input buffer to an inplace functional collective.

We have a similar problem for output buffer of out-of-place functional collectives, see https://github.com/pytorch/pytorch/issues/108780#issuecomment-1714921994.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108811
Approved by: https://github.com/Chillee, https://github.com/wconstab
2023-09-13 21:39:37 +00:00
54dd65f93a [FSDP] Only check exec order if DETAIL (#109049)
The execution order check seems to have been causing more problems than it prevents. Motivated by an internal issue, we move this check so that it only runs under `DISTRIBUTED_DEBUG_LEVEL=DETAIL`.
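
A minimal usage sketch (assumption: the standard `TORCH_DISTRIBUTED_DEBUG` environment variable is what selects this debug level):

```python
import os

# Hypothetical usage sketch: enable DETAIL before process group init so the
# FSDP execution-order check runs; any other level skips it.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch.distributed as dist

dist.init_process_group(
    "gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
```
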
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109049
Approved by: https://github.com/fegin
2023-09-13 20:40:38 +00:00
916183a012 [MPS] Fix crash if nonzero is called concurrently (#108996)
Surrounds the `stream->synchronize()` call with `dispatch_sync(stream->queue(), ^{});`, which is a no-op for a single-threaded program but serializes synchronize calls across threads that use the same stream.

Prevents the non-recoverable `[IOGPUMetalCommandBuffer validate]:215: failed assertion 'commit an already committed command buffer'` exception, which is triggered every time one uses PyCharm to inspect tensors on the MPS device

Fixes https://github.com/pytorch/pytorch/issues/100285
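
A hedged repro sketch of the concurrent pattern this serializes (assumes an MPS-capable machine; before the fix, a run like this could trip the assertion quoted above):

```python
import threading
import torch

x = torch.randint(0, 2, (1024,), device="mps")

def worker():
    for _ in range(100):
        torch.nonzero(x)  # each call synchronizes the MPS stream

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
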
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 1662ce2</samp>

> _Sing, O Muse, of the swift and skillful coders_
> _Who fixed the dreadful deadlock of the stream_
> _That crashed the mighty tensors of the MPS_
> _When they sought out the nonzero elements._

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108996
Approved by: https://github.com/kulinseth
2023-09-13 19:28:47 +00:00
35aeb6aa85 Do not use a specific LOC in link (#108957)
The line of code (LOC) a link points to can change, so a specific LOC should not be used when creating a link. A specific LOC is also not needed here, since the function name is what the overall documentation refers to.
Previously, a fix updated the line number for the issue mentioned in this PR, but the LOC eventually changed again, resulting in a broken link.

Fixes #102183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108957
Approved by: https://github.com/ezyang
2023-09-13 19:21:45 +00:00
32f50b7021 Improve type annotations for jit.script (#108782)
Fixes #108781

- [x] added `@overload` for `jit.script`
- [x] added typing unittest in `test/typing/pass/jit.py`
    - NOTE: unittest is not automatically checked by mypy when executing lintrunner currently. (how to fix?)
- [x] used `stubgen` to create [torch/jit/_script.pyi](https://github.com/pytorch/pytorch/pull/108782/files#diff-738e66abee2523a952b3ddbaecf95e187cce559473cf8c1b3da7c247ee5d1132) and added overloads there. (adding them inside `_script.py` itself interfered with the JIT engine)
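
A simplified sketch of the `@overload` pattern this adds (hypothetical signatures for illustration; the real stubs live in `torch/jit/_script.pyi`):

```python
from typing import Any, Callable, TypeVar, overload

T = TypeVar("T", bound=Callable[..., Any])

@overload
def script(obj: T) -> T: ...  # decorating a function preserves its type for IDEs
@overload
def script(obj: str) -> Callable[..., Any]: ...  # scripting source code

def script(obj):
    ...  # implementation dispatches on the argument's runtime type
```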

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108782
Approved by: https://github.com/ezyang
2023-09-13 19:20:25 +00:00
8851603a9c Back out "[Inductor] Extend Pattern Matcher to Match Equivalent Function Invocation (#107832)" (#109174)
Summary:
Original commit changeset: ad8e1321811a

Original Phabricator Diff: D49151331

Test Plan: Sandcastle

Differential Revision: D49218851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109174
Approved by: https://github.com/hl475, https://github.com/yanboliang
2023-09-13 18:17:59 +00:00
c657d9ecc5 [PyTorch] Add Expanded call stack to nodes (#108426)
Summary:
To get a Node's call stack we currently loop on the InlinedCallStack graph and follow the "callee" chain. Since the node's inlined stack does not change, we can optimize this by expanding the node's inlined stack once and reusing it. This is particularly useful when reading the node's stack from another process (e.g. BPF) as it simplifies the memory traversal process.

The new data structure (NodeSourceInfo) only holds pointers to the function name and file name variables, and assumes these objects will be alive throughout the lifetime of the process.

Each Node has an extended attribute that has an index to a vector of stack frames `expanded_node_stacks_`

`node_stack_attr_symbol_` is only needed to make accessing the stack vector index attribute easier from BPF.

Test Plan:
- Performance Impact: The cost of expanding the call stack is between 500 - 1000 ns and is incurred only once per instruction node at initialization time.
- Verified using BPF Program in subsequent diffs

Reviewed By: zdevito

Differential Revision: D46578700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108426
Approved by: https://github.com/zdevito
2023-09-13 17:48:47 +00:00
00908475e6 Use global variables to register the return_types namedtuples (#108832)
Fixes #69221. Builds on top of #107000, fixing the buck build issue linked [here](https://github.com/pytorch/pytorch/pull/107000#issuecomment-1708857375).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108832
Approved by: https://github.com/zou3519
2023-09-13 17:42:46 +00:00
6065e7a97c add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343
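
Below is a minimal sketch of the path this PR enables (illustrative only; Half BatchNorm forward and backward on CPU, mirroring the first benchmark row):

```python
import torch

bn = torch.nn.BatchNorm2d(4).to(torch.half)
x = torch.randn(1, 4, 256, 256, dtype=torch.half, requires_grad=True)
y = bn(x)            # fp16 forward on CPU
y.sum().backward()   # fp16 backward on CPU
```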

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-13 17:30:16 +00:00
f6d8ecf9b3 Use the correct channel token when uploading nightly triton conda (#109073)
This fixes 2 bugs on triton build workflow:

* Use the wrong conda credential when `UPLOAD_CHANNEL` is not set https://github.com/pytorch/pytorch/actions/runs/6129675580/job/16691419329#step:7:18
* Upload wheel and conda packages when pushing to main in addition to nightly.  This is needed because the binary wheel build on trunk also looks for torchtriton package after the triton pin is updated.

### Testing

https://github.com/pytorch/pytorch/actions/runs/6152447684/job/16694843862?pr=109073#step:7:38 looks correct now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109073
Approved by: https://github.com/atalman
2023-09-13 17:12:33 +00:00
c9fdfafb00 Allow marking multiple unstable configs of the same job name (#109185)
This is a bug that has stayed for a surprisingly long period of time (my fault).  When there are multiple unstable configurations (`inductor`, `inductor_huggingface`, `inductor_huggingface_dynamic`) of the same job (`inductor / cuda12.1-py3.10-gcc9-sm86`), only the first one was marked as unstable.  The for loop returned too early and missed the other two, even though they were also marked as unstable; see for example https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json

### Testing

* Add an unit test
* CI run https://github.com/pytorch/pytorch/actions/runs/6169798353 shows that the configs below are all marked as unstable:
  * https://github.com/pytorch/pytorch/issues/107079
  * https://github.com/pytorch/pytorch/issues/109153
  * https://github.com/pytorch/pytorch/issues/109154
* Manually run the script to verify the test matrix output:
```
python .github/scripts/filter_test_configs.py \
    --workflow "inductor" \
    --job-name "cuda12.1-py3.10-gcc9-sm86 / build," \
    --test-matrix "{ include: [
    { config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_huggingface_dynamic", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm_dynamic", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm_dynamic", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench_dynamic", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
  ]}
  " \
    --pr-number "" \
    --tag "" \
    --event-name "push" \
    --schedule "" \
    --branch ""
::set-output name=keep-going::False
::set-output name=is-unstable::False
::set-output name=reenabled-issues::
::set-output name=test-matrix::{"include": [{"config": "inductor", "shard": 1, "num_shards": 1, "runner": "linux.g5.4xlarge.nvidia.gpu", "unstable": "unstable"}, {"config": "inductor_huggingface", "shard": 1, "num_shards": 1, "runner": "linux.g5.4xlarge.nvidia.gpu", "unstable": "unstable"}, {"config": "inductor_timm", "shard": 1, "num_shards": 2, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "inductor_timm", "shard": 2, "num_shards": 2, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "inductor_torchbench", "shard": 1, "num_shards": 1, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "inductor_huggingface_dynamic", "shard": 1, "num_shards": 1, "runner": "linux.g5.4xlarge.nvidia.gpu", "unstable": "unstable"}, {"config": "inductor_timm_dynamic", "shard": 1, "num_shards": 2, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "inductor_timm_dynamic", "shard": 2, "num_shards": 2, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "inductor_torchbench_dynamic", "shard": 1, "num_shards": 1, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "inductor_distributed", "shard": 1, "num_shards": 1, "runner": "linux.g5.12xlarge.nvidia.gpu"}]}
::set-output name=is-test-matrix-empty::False
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109185
Approved by: https://github.com/clee2000
2023-09-13 17:06:37 +00:00
fe198f3141 inductor/test_max_autotune serial in CI (#109209)
Fixes #ISSUE_NUMBER
Trying to figure out why this keeps timing out; wondering if it's due to parallelization weirdness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109209
Approved by: https://github.com/huydhn
2023-09-13 17:04:43 +00:00
d05a6e5ade Add missing DeviceMesh import (#109187)
The test is broken after https://github.com/pytorch/pytorch/pull/107533#issuecomment-1709529759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109187
Approved by: https://github.com/clee2000
2023-09-13 16:50:35 +00:00
f2639a2c37 Back out "Dynamo support for autograd.Function w/ once_differentiable (#108686)" (#109199)
Summary:
Original commit changeset: e11cddf1fecc

Original Phabricator Diff: D49064185

Test Plan:
Comparing PT1 and PT2 performance on the IG Feed Model with this diff backed out: N4274204

Comparing the PT1 and PT2 performance on IG Feed with this diff committed: N4271093

Reviewed By: zou3519

Differential Revision: D49230047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109199
Approved by: https://github.com/zou3519, https://github.com/xw285cornell
2023-09-13 15:43:20 +00:00
264f1e7b4c [inductor] Enable Mypy Checking for torch/_inductor/codecache.py (#108789)
Summary: Add type annotations to torch/_inductor/codecache.py and enable mypy checking

Test Plan:
`lintrunner torch/_inductor/*.py`
`python test/inductor/test_max_autotune.py`
`python test/inductor/test_aot_inductor.py`
`python test/inductor/test_torchinductor.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108789
Approved by: https://github.com/Skylion007, https://github.com/eellison
2023-09-13 14:05:35 +00:00
ad90ab31f2 Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao
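
For orientation, a hedged sketch of routing SDPA to this kernel through the public API (not part of this PR's diff; assumes a CUDA device with sm80+ and fp16 inputs):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))
# Restrict the SDPA dispatcher to the flash backend only
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```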

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFlops are reported here are on a100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-09-13 13:59:05 +00:00
55f956f1d2 optests improvements based on torchvision usage on nms (#108929)
- Update cross-ref FakeMode test to use ShapeEnv.  Dynamic ops can now
  return an unbacked SymInt.  We always accept this as equal to whatever
  the real value was.
- Relax test so it works on all classes, not just unittest.TestCase
- Properly wrap the original method, so things like
  pytest.mark.parametrize are carried over
- Support dynamic shapes by default for make_fx `tracing_mode="fake"` without symbolifying everything else
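
A small sketch of the `tracing_mode="fake"` usage from the last bullet; per the summary, a data-dependent op such as `nonzero` should now trace and yield an unbacked SymInt (illustrative, based on the description above):

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    return x.nonzero()  # data-dependent output size

gm = make_fx(f, tracing_mode="fake")(torch.tensor([0, 1, 0, 1]))
print(gm.graph)
```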

Fixes https://github.com/pytorch/pytorch/issues/108927

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108929
Approved by: https://github.com/zou3519
2023-09-13 13:26:15 +00:00
bfa8429c6a [optests] Changed failures_dict format to json; automatic update of failures_dict (#109110)
We changed the failures_dict format from .py to json and added a way to
automatically update the failures dict (the user can set
PYTORCH_OPCHECK_ACCEPT=1 to do so), assuming the tests don't crash in the
process.

Some details:
- We introduced a FailuresDict class that handles save/load and from which one
can query a test status ("xfail", "skip", etc).
- PYTORCH_OPCHECK_ACCEPT=1 does not override everything. In particular: it
doesn't try to update the failures dict for a test marked as "skip", but it
will update it for tests marked as "xfail" or "success".
- PYTORCH_OPCHECK_ACCEPT=1 also does not override the "comment" field, unless
it is flipping an "xfail" into "success".
- I'll update the gdoc linked in the comments with how to actually use
PYTORCH_OPCHECK_ACCEPT=1 internally (it's not trivial).

Note that this isn't multithreading-safe, the current recommendation is to run
the tests sequentially if the user wants to use PYTORCH_OPCHECK_ACCEPT=1.

Differential Revision: D49167181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109110
Approved by: https://github.com/ezyang
2023-09-13 13:24:15 +00:00
db48bc80d9 Check index size during decomp of index_add (#108826)
This partially fixes the `test_index_add_correctness` test (#108181)
when run under inductor: it causes an exception to be raised [here][1]
as expected.

The test as a whole still cannot be made to pass under inductor because
the [last assert][2] still fails, likely due to #108798.

[1]: dec2b267d4/test/test_torch.py (L6049)
[2]: dec2b267d4/test/test_torch.py (L6051)
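
For context, a sketch of the kind of out-of-bounds index the added check rejects (illustrative; eager already raises here, and the decomp should now match):

```python
import torch

x = torch.zeros(5)
index = torch.tensor([7])    # out of bounds for a dimension of size 5
src = torch.ones(1)
x.index_add_(0, index, src)  # raises in eager; the decomp now checks too
```
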
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108826
Approved by: https://github.com/eellison
2023-09-13 13:06:26 +00:00
d2d36aad6f Enable typechecking for _inductor/virtualized.py (#108916)
Also add a few more type annotations to utils.py (some of its functions
are called from virtualized.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108916
Approved by: https://github.com/eellison
2023-09-13 13:04:51 +00:00
c5e7588613 Revert "[dynamo] preserve some FX node metadata of GraphModules (#107067)"
This reverts commit 1d42148fee45e5bdb6c96a1ff45b8d4d326138ee.

Reverted https://github.com/pytorch/pytorch/pull/107067 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/107067#issuecomment-1717321061))
2023-09-13 09:59:33 +00:00
aee5dec3aa torch/csrc/profiler/README.md - stubs, RecordFunction, Autograd interaction (#108470)
Technical details about the profiler - stubs for the stuff I haven't had time to fill out yet, plus details about RecordFunction and the profiler's interaction with autograd.

reviewers - see 06c41eea9e/torch/csrc/profiler/README.md for rendered markdown
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108470
Approved by: https://github.com/aaronenyeshi
2023-09-13 07:46:01 +00:00
de0b18fad9 Use user directed names for variables where possible (#109092)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109092
Approved by: https://github.com/ezyang
ghstack dependencies: #108846
2023-09-13 07:44:04 +00:00
015be4cedb Forward fix lint (#109177)
After https://github.com/pytorch/pytorch/pull/109075
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109177
Approved by: https://github.com/angelayi
2023-09-13 06:10:34 +00:00
3d8d59e68b Update inductor ci_expected_accuracy (#109148)
Changes due to updating the HF pin: [107400](https://github.com/pytorch/pytorch/pull/107400)
Somehow during the previous PR it didn't need these changes...probably a CI bug

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109148
Approved by: https://github.com/clee2000, https://github.com/desertfire
2023-09-13 05:12:33 +00:00
3ac2396e00 Fix torch._numpy.random (#108944)
Fix several issues with `torch._numpy.random` functions on eager

1. actually return scalars when `size is None`
2. fix dispatch with USE_NUMPY_STREAM
3. make tnp.random functions composable: make numpy functions receive numpy arguments, not `tnp.ndarray`s
4. fix random.shuffle for e.g. lists

The main need for these gymnastics is that `np.random` functions return an ndarray or a Python scalar depending on the `size` argument. We decided a while ago to replicate this behavior in `tnp.random` but not elsewhere, where we always return 0D arrays instead.
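
An illustrative sketch of the `size`-dependent return type described above (assumes `torch._numpy` is importable; the types shown follow the summary):

```python
import torch._numpy as tnp

s = tnp.random.uniform(0.0, 1.0)             # size=None -> Python scalar
a = tnp.random.uniform(0.0, 1.0, size=(3,))  # explicit size -> tnp.ndarray
print(type(s), type(a))
```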

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108944
Approved by: https://github.com/lezcano
2023-09-13 05:08:19 +00:00
41e5d410cf Symintify repeat_interleave (#109133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109133
Approved by: https://github.com/ezyang, https://github.com/voznesenskym, https://github.com/bdhirsh
2023-09-13 04:55:56 +00:00
a09539f454 Add torch.export.register_dataclass API (#109152)
`register_dataclass` allows a dataclass to be used as a valid input/output type of torch.export.export
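
A minimal usage sketch (illustrative; assumes the dataclass fields are themselves valid export input types):

```python
from dataclasses import dataclass
import torch

@dataclass
class Inputs:
    x: torch.Tensor
    y: torch.Tensor

torch.export.register_dataclass(Inputs)

class M(torch.nn.Module):
    def forward(self, inp: Inputs):
        return inp.x + inp.y

ep = torch.export.export(M(), (Inputs(torch.ones(2), torch.ones(2)),))
```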

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109152
Approved by: https://github.com/ydwu4
2023-09-13 04:17:12 +00:00
375d2ca6c9 [dtensor][4/n] don't use make_fx for strategy propagation (#108262)
We were using make_fx for strategy-based propagation so that we could get
a graph and the shape-related metadata, but this is overkill
for sharding propagation. This change refactors the strategy
propagation to remove the graph-based propagation and instead just uses the
op to index into the strategy functions.

We also just use a fake shape prop instead of relying on fx tracing for
the shape/stride propagation.

For a possible future decomposed propagation, we will exercise a different
codepath to enable that.

NOTE that this also greatly reduces latency in two places:
1. first-time dtensor operations when populating the cache; the first
iter becomes faster again!
2. the test_dtensor_ops.py run time; right now the
whole test finishes within 2-3 mins again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108262
Approved by: https://github.com/fduwjj
ghstack dependencies: #107306, #108261
2023-09-13 04:08:02 +00:00
09f3e08bcc [dtensor][3/n] use dedicated TensorMeta instead of the fx one (#108261)
This PR switches from fx's shape-prop TensorMetadata to
DTensor's own dedicated TensorMeta, because DTensor
only cares about three fields: shape/stride/dtype; all other fields are not
necessary and can be inferred from the local_tensor directly. This
significantly simplifies how we deal with the tensor metadata by
dropping the other fields.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108261
Approved by: https://github.com/fduwjj
ghstack dependencies: #107306
2023-09-13 04:08:02 +00:00
fc1dcfb9ab [dtensor][2/n] use op overload instead of function schema (#107306)
The function schema doesn't provide us anything, since we can also get the schema from `op._schema`. Including the op directly in op_schema makes it easier for sharding prop to do fake execution, and in principle it should also make the hash comparison faster: we don't need to hash the function schema, we just hash `id(op)`, which is constant.

This PR is just a refactor to include the op in OpSchema instead of the function schema; there are no other logic changes.
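
A conceptual sketch of the hashing idea (simplified stand-in, not the actual OpSchema definition):

```python
class OpSchema:
    def __init__(self, op, args_schema):
        self.op = op  # the OpOverload itself, not its function schema
        self.args_schema = args_schema

    def __hash__(self):
        # id(op) is constant for the lifetime of the process, so this is
        # cheaper than hashing the full function schema
        return hash((id(self.op), self.args_schema))

    def __eq__(self, other):
        return self.op is other.op and self.args_schema == other.args_schema
```
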
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107306
Approved by: https://github.com/fduwjj
2023-09-13 04:08:02 +00:00
48e6ffbe30 [DCP][Test] Fix device assignment in test/distributed/checkpoint/test_file_system_checkpoint_cpu.py (#109141)
Device should always be "cpu" for cpu tensor types.

This will fix the fb buck test failure when running internally.
```
buck2 test '@fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:file_system_checkpoint_cpu -- --exact 'caffe2/test/distributed/checkpoint:file_system_checkpoint_cpu - test_switch_between_sharded_tensor_to_tensor_thread_count_1 (test_file_system_checkpoint_cpu.TestDistributedReshardOnLoad)'
```

This will unblock [D48667323](https://www.internalfb.com/diff/D48667323).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109141
Approved by: https://github.com/fegin
2023-09-13 03:04:14 +00:00
91e154fcd7 [ONNX] Support None in fx.args as torchlib inputs (#108708)
Prior to this PR, if None is returned from intermediate nodes, it crashes the export, because None is not expected to be passed into `_fill_tensor_shape_type`, and raises a beartype roar. That function fills in the shape and type of a TorchScriptTensor according to its info from the FX graph.

This was discovered after https://github.com/microsoft/onnxscript/pull/1043 was supported. The op specifically generates None in one of its inputs, but the only output from it that is consumed is the first one (not None).

Reference test from a TorchBench model:
```python

    def test_nanogpt(self):
        import sys

        sys.path.append("/home/titaiwang")

        from nanoGPT.model import GPT, GPTConfig

        # Load the model
        kwargs = {
            "block_size": 256,
            "vocab_size": 8096,  # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
            "n_layer": 2,
            "n_head": 2,
            "n_embd": 128,
            "dropout": 0.0,
            "bias": False,  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
        }
        config = GPTConfig(**kwargs)
        with torch.backends.cuda.sdp_kernel(
            enable_flash=True, enable_mem_efficient=True
        ):
            model = GPT(config)
        print("Done loading model")
        inputs = torch.arange(128).view(2, 64)
        targets = torch.arange(128).view(2, 64)

        self.run_test_with_fx_to_onnx_exporter_and_onnx_runtime(
            model,
            (inputs,),
            input_kwargs={
                "targets": targets,
            },
            verbose=True,
        )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108708
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-09-13 02:47:16 +00:00
a2ff345416 [HigherOrderOp] Support SymInt as input to body function (#108967)
Fixes #108283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108967
Approved by: https://github.com/zou3519
2023-09-13 02:14:16 +00:00
4667a5c948 Update SingletonSymNode to allow more comparisons (#108315)
In this PR:
- {in,}equality between singleton and plain ints returns false instead of erroring
- Morally define the semantics of j0 > c to be as if j0 represented an array [s_0, s_1, ... s_n] and s_k > c for all k
- Just like for equality, we don't actually want to do the comparison one by one, instead j0 is constrained to some range [min, max]. By default this range is [2, int64_t::max] so that it acts like a size and passes 0/1 specialization checks.
- In the future, we can define some API to allow users to constrain the range of their singletons
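
A conceptual sketch of these comparison semantics (illustrative stand-in, not the actual SingletonSymNode code):

```python
class Singleton:
    # j0 morally stands for an array [s_0, ..., s_n] with each s_k in [lo, hi];
    # the default range [2, int64 max] passes 0/1 specialization checks
    def __init__(self, lo=2, hi=2**63 - 1):
        self.lo, self.hi = lo, hi

    def gt(self, c):
        if self.lo > c:
            return True   # every s_k > c
        if self.hi <= c:
            return False  # every s_k <= c
        raise RuntimeError("j0 > c is not decidable for this range")

    def eq_plain_int(self, c):
        return False      # equality with a plain int is defined to be False

j0 = Singleton()
assert j0.gt(1) and not j0.eq_plain_int(5)
```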

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108315
Approved by: https://github.com/ezyang
2023-09-13 01:58:02 +00:00
a46df6ebce [pytorch-vulkan] add aten::randn_like & aten::normal_ (#109075)
Summary:
Implemented `aten::normal_` shader and used it to create `aten::randn_like`.

Op definitions:
https://pytorch.org/docs/stable/generated/torch.randn_like.html
https://pytorch.org/docs/stable/generated/torch.Tensor.normal_.html

Test Plan:
```
[ttingchulin@53491.od /data/sandcastle/boxes/fbsource (randn)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*<test>*" eg.  -- --gtest_filter="*randn_like*"

[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.randn_like
[       OK ] VulkanAPITest.randn_like (230 ms)
[ RUN      ] VulkanAPITest.randn_like_large
[       OK ] VulkanAPITest.randn_like_large (570 ms)
[----------] 2 tests from VulkanAPITest (801 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (801 ms total)
[  PASSED  ] 2 tests.

[ttingchulin@53491.od /data/sandcastle/boxes/fbsource (randn)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*<test>*" eg.  -- --gtest_filter="*normal_*"
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.normal_
[       OK ] VulkanAPITest.normal_ (222 ms)
[ RUN      ] VulkanAPITest.normal_large
[       OK ] VulkanAPITest.normal_large (136 ms)
[ RUN      ] VulkanAPITest.normal_error
[       OK ] VulkanAPITest.normal_error (37 ms)
[----------] 3 tests from VulkanAPITest (396 ms total)

[----------] Global test environment tear-down
[==========] 3 tests f.
```

Reviewed By: yipjustin

Differential Revision: D48814024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109075
Approved by: https://github.com/yipjustin
2023-09-13 01:07:34 +00:00
e5f300f085 Make mutation test work with quantized tensors (#108935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108935
Approved by: https://github.com/zou3519
2023-09-13 00:54:01 +00:00
687f027896 [submodule] Fix eltwise share buffer issue in ideep (#108038)
Fix [#107876 ](https://github.com/pytorch/pytorch/issues/107876).

This PR fixes [#107876](https://github.com/pytorch/pytorch/issues/107876), whose root cause is that eltwise lacks the logic for dealing with src and diff_src of different shapes. By initializing a new diff_src and reordering back into diff_src's buffer, as inner_product and matmul do, the issue https://github.com/pytorch/pytorch/issues/107876 is addressed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108038
Approved by: https://github.com/jgong5, https://github.com/mingfeima
2023-09-13 00:53:57 +00:00
e027de2c86 Add torch.distributed get_rank and get_world_size to constant_fold_functions (#109029)
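
An illustrative sketch of what this enables (hypothetical example; uses a one-process gloo group for demonstration):

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

@torch.compile
def shard_scale(x):
    # get_rank/get_world_size are evaluated at trace time and baked into
    # the graph as constants instead of causing a graph break
    return x * (dist.get_rank() + 1) / dist.get_world_size()

print(shard_scale(torch.ones(4)))
```
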
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109029
Approved by: https://github.com/bdhirsh
2023-09-13 00:52:43 +00:00
12e8530b35 Record and replay for ShapeEnv. (#107989)
This PR introduces record and replay functionality for `ShapeEnv` instances. In short,
throughout the execution of a program, we record events (e.g. function calls that modify
its state) so that, in the future, we are able to reproduce any intermediary state of the
instance.

In summary, this PR introduces the following changes (they mostly belong to
_symbolic_shapes.py_ unless otherwise stated):

- Create `ShapeEnvEvent` class for recording function calls + arguments
- Create `record_shapeenv_event` decorator and decorate every function that changes the
  state of a `ShapeEnv`: it creates an appropriate event and adds it to the available
  ShapeEnv instance (which sometimes has to be extracted from `SymTypes`).
- Create `SymNode.with_shape_env` convenient function for replacing `ShapeEnv` references
- Wraps `ShapeEnv` initialization method: so that we also save the exact way a `ShapeEnv`
  was constructed, i.e. arguments
- Introduces a way to compare two `ShapeEnv` instances, defining a concept of state for
  that class. In short, the state of `ShapeEnv` is every variable that may change the
  execution flow
- Create `check_shape_env_recorded_events` dynamo configuration for enabling the check for
  equality the state of `ShapeEnv` with another one that was constructed by replaying all
  the recorded events. This check takes place inside `produce_guards`
- Create `replay_shape_env_events` function for replaying given events. It assumes the
  first event is `ShapeEnv` initialization function
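
For intuition, here is a minimal, self-contained sketch of the record/replay pattern (`ToyShapeEnv`, `Event`, `record_event`, and `replay` are illustrative stand-ins, not the real `_symbolic_shapes.py` names beyond those cited above):

```python
import functools
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)

def record_event(fn):
    # Log every state-mutating call on the instance before executing it.
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        self.events.append(Event(fn.__name__, args, kwargs))
        return fn(self, *args, **kwargs)
    return wrapper

class ToyShapeEnv:
    def __init__(self):
        self.events = [Event("__init__")]  # first event: construction
        self.guards = []

    @record_event
    def add_guard(self, expr):
        self.guards.append(expr)

def replay(events):
    env = ToyShapeEnv()  # assumes events[0] is the construction event
    for e in events[1:]:
        getattr(env, e.name)(*e.args, **e.kwargs)
    return env

env = ToyShapeEnv()
env.add_guard("s0 > 1")
assert replay(env.events).guards == env.guards  # states compare equal
```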

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107989
Approved by: https://github.com/ezyang
2023-09-13 00:22:38 +00:00
max
e066056414 fix 'Node' object is not iterable in functorch.compile.minifier (#103011)
Fixes #102169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103011
Approved by: https://github.com/Chillee
2023-09-12 23:47:40 +00:00
063a62622b Add memory overlap check to meta_copy_ (#108989)
Fixes `test_copy_many_to_one`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108989
Approved by: https://github.com/eellison
2023-09-12 23:28:14 +00:00
f08885287f Fix cumprod f16 opinfo test via ref-in-float + increasing tolerances (#109128)
Without setting `reference_in_float`, cumprod's single sample case
passes (i.e. the compiled f16 result matches the eager mode f16 result;
in fact they are identical because they both call into aten). However,
the grad calculation does not line up.

Turning on `reference_in_float` causes the grad check to pass (i.e. we
are closer to the more accurate f64 grad calculation) but causes the
single sample case to fail. Since the compiled f16 case is no less
accurate than the eager f16 case for the single sample, relaxing the
tolerances here seems fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109128
Approved by: https://github.com/eellison
ghstack dependencies: #109081, #109089
2023-09-12 23:19:59 +00:00
6869b25f1b Fix a bunch of opinfo tests by using reference_in_float (#109089)
I set reference_in_float to be always True, ran the full opinfo test
suite, and observed which tests were now unexpectedly passing. However,
I didn't turn on reference_in_float by default in this diff because it
also creates some new failures.

Related: https://github.com/pytorch/pytorch/issues/105534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109089
Approved by: https://github.com/eellison
ghstack dependencies: #109081
2023-09-12 23:19:59 +00:00
baefe47161 Fix std_mean f16 opinfo test by using reference_in_float (#109081)
It seems that the compiled f16 op is more accurate than the eager f16
op:

**Compiled float16 vs Eager float64**

    Mismatched elements: 25 / 25 (100.0%)
    Greatest absolute difference: 3.718038455710615e-05 at index (1, 0) (up to 1e-07 allowed)
    Greatest relative difference: 0.0018021699903143316 at index (0, 4) (up to 1e-07 allowed)

**Eager float16 vs Eager float64**

    Mismatched elements: 25 / 25 (100.0%)
    Greatest absolute difference: 7.280254198286512e-05 at index (3, 3) (up to 1e-07 allowed)
    Greatest relative difference: 0.004104326045245938 at index (0, 4) (up to 1e-07 allowed)

**Compiled float16 vs Eager float16**

    Mismatched elements: 7 / 25 (28.0%)
    Greatest absolute difference: 7.62939453125e-05 at index (3, 3) (up to 1e-05 allowed)
    Greatest relative difference: 0.00588226318359375 at index (0, 4) (up to 0.001 allowed)
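
For concreteness, a reference_in_float-style check in user code might look like this (a sketch assuming a CUDA build; the tolerances are illustrative, not the ones OpInfo uses):

```python
import torch

def std_mean(x):
    return torch.std_mean(x)

# Instead of comparing compiled float16 against eager float16, compare
# both against an eager float64 reference (cast back to float16), with
# float16-appropriate tolerances.
x = torch.randn(5, 5, dtype=torch.float16, device="cuda")
ref = tuple(t.to(torch.float16) for t in std_mean(x.double()))
out = torch.compile(std_mean)(x)
torch.testing.assert_close(out, ref, rtol=2e-3, atol=1e-4)
```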

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109081
Approved by: https://github.com/eellison
2023-09-12 23:19:59 +00:00
4c5e43574c Reland 2: Add PyObject preservation for UntypedStorage (#109039)
Relands #103907 after it was reverted. This PR makes the new `ignore_hermetic_tls` argument of `check_pyobj` optional to avoid causing a compilation error in torchdistx

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109039
Approved by: https://github.com/ezyang
2023-09-12 22:26:05 +00:00
6dc56d3490 [DTensor] Remove compute_local_offset from _utils.py (#109096)
Separating internal changes from OSS changes. This PR removes compute_local_offset from the OSS directory only.

This replaces https://github.com/pytorch/pytorch/pull/108965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109096
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-09-12 21:55:15 +00:00
cf26e5575d [quant][be] Reduce warnings in tests (#108922)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108922
Approved by: https://github.com/andrewor14
ghstack dependencies: #108920, #108921
2023-09-12 21:54:33 +00:00
9118073fe7 assign var for "not populated" str (#108844)
Minor cleanup: assign a variable for the 'not populated' string value referenced in several places in `vmapify_autograd_function`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108844
Approved by: https://github.com/zou3519
2023-09-12 20:53:48 +00:00
91aab161d0 Revert "[inductor] Lower masked_scatter on CUDA (#108803)"
This reverts commit c8e577bf409591910f9667a51f2cf92b3c5455e0.

Reverted https://github.com/pytorch/pytorch/pull/108803 on behalf of https://github.com/lezcano due to makes test_comprehensive_masked_scatter_cuda_int64 flaky ([comment](https://github.com/pytorch/pytorch/pull/108803#issuecomment-1716407433))
2023-09-12 20:49:06 +00:00
b01b934aca [quant][be] Cleanup xnnpack_quantizer implementation (#108921)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108921
Approved by: https://github.com/andrewor14
2023-09-12 19:28:41 +00:00
bde75eb9a8 [Gloo] Properly pass op type to Work (#108812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108812
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-09-12 18:21:09 +00:00
a2d5f13310 [Inductor CUTLASS backend] Step 5: Gemm CUTLASS templates (#108015)
This is the step 5 to add cutlass as an alternative inductor backend.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108015
Approved by: https://github.com/kadeng, https://github.com/jansel, https://github.com/aakhundov
ghstack dependencies: #107802, #107847, #107901, #107931
2023-09-12 17:44:38 +00:00
097fd43f8c [Inductor CUTLASS backend] Step 4: CUDA (template) kernels (#107931)
This is the step 4 to add cutlass as an alternative inductor backend.
Full tests can be found from the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107931
Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng
ghstack dependencies: #107802, #107847, #107901
2023-09-12 17:44:38 +00:00
b2d764ece0 [Inductor CUTLASS backend] Step 3: autotune_process, and CUDABenchmarkRequest (#107901)
This is the step 3 to add cutlass as an alternative inductor backend.
Full tests can be found from the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107901
Approved by: https://github.com/jansel, https://github.com/aakhundov, https://github.com/kadeng
ghstack dependencies: #107802, #107847
2023-09-12 17:44:36 +00:00
102fefac21 [Inductor CUTLASS backend] Step 2: CUDACodeCache (#107847)
This is the step 2 to add cutlass as an alternative inductor backend.
Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107847
Approved by: https://github.com/jansel, https://github.com/kadeng, https://github.com/aakhundov
ghstack dependencies: #107802
2023-09-12 17:44:34 +00:00
a14761b68a [Inductor CUTLASS backend] Step 1: Inductor config for cuda / cutlass, util functions. (#107802)
This is the step 1 to add cutlass as an alternative inductor backend.
Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107802
Approved by: https://github.com/jansel, https://github.com/aakhundov, https://github.com/kadeng
2023-09-12 17:44:32 +00:00
15b13d3cff Revert "CI Sev - pin docker images for A100 workers (#108871)" (#109071)
This reverts commit 89eb7a75a251c41c4bee86e9ede1001b0d3998af.

No longer required since the issue was addressed by https://github.com/pytorch/test-infra/pull/4563.
Deploying normally, though, to get a proper green signal for the deployment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109071
Approved by: https://github.com/huydhn
2023-09-12 17:22:04 +00:00
cd46b5db76 make sure all torch._numpy tests run on CI (#108762)
- Add `if __name__ == "__main__": run_tests()` stanzas to test files in `torch_np` folder so that these tests run on CI
- Skip / xfail things smoked out by this change
- remove a stray python file which should not have been added to tests in the first place.
- fix einsum if opt_einsum is present
- add skips for older numpies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108762
Approved by: https://github.com/lezcano
2023-09-12 17:12:21 +00:00
abd83ce180 Small fix in SDPA docstring codeblock (#109086)
Fix https://github.com/pytorch/pytorch/issues/109072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109086
Approved by: https://github.com/drisspg
2023-09-12 16:48:46 +00:00
1b9b3a2d15 [MPS] Adding lgamma, digamma, and polygamma implementations (#106292)
Fixes issue mentioned in #77764

e.g. https://github.com/pytorch/pytorch/issues/77764#issuecomment-1654111744

Adds MPS support for the following ops:

- lgamma
- mvlgamma
- digamma
- polygamma

The lgamma function does not yet have an MPS backend implementation. I've added one using a custom Metal kernel (following John D. Cook's C++ implementation of the log gamma function: https://www.johndcook.com/blog/cpp_gamma/). For the backward pass op, I've added a digamma kernel that follows the cpu+cuda digamma implementation, and for the backward pass of the digamma op, I've added a polygamma + trigamma kernel following, again, the cpu+cuda implementations.

NOTE:

The cpu implementation of the polygamma function incorrectly (as far as I can tell) outputs a finite number for order = 1 and x in the negative integers. The mps implementation correctly outputs infinity. (see https://github.com/pytorch/pytorch/issues/106692)

The polygamma tests currently don't pass because of the error in the cpu+cuda kernels, but also because there are smallish discrepancies near the negative integers between the cpu+cuda and the mps polygamma and trigamma kernels. I'm not sure exactly why this is, but let me know if the discrepancies are too big.
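
A small usage sketch of the kernel chain (lgamma's backward is digamma, digamma's backward is trigamma); assumes an MPS-enabled build:

```python
import torch

x = torch.rand(4) + 0.5
x_mps = x.to("mps").requires_grad_()
torch.lgamma(x_mps).sum().backward()
# d/dx lgamma(x) = digamma(x), which is why the backward pass needs the
# new digamma kernel (and digamma's backward needs trigamma).
torch.testing.assert_close(x_mps.grad.cpu(), torch.digamma(x))
```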

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106292
Approved by: https://github.com/kulinseth
2023-09-12 16:43:37 +00:00
c8e577bf40 [inductor] Lower masked_scatter on CUDA (#108803)
This decomposes masked_scatter into `aten.cumsum` and a single pointwise kernel,
which is similar to what is done in eager. I only do this for CUDA because on CPU
it isn't split into two passes like this so would cause a slowdown.
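
The decomposition can be sketched in plain PyTorch as a cumsum plus a pointwise gather/where (illustrative only; the real lowering emits `aten.cumsum` plus a single fused pointwise kernel):

```python
import torch

def masked_scatter_decomp(inp, mask, source):
    # The running count of True positions gives, for each element, the
    # index of the source element that should land there (clamped so
    # masked-off lanes index safely; torch.where discards them anyway).
    idx = (mask.reshape(-1).cumsum(0) - 1).clamp(min=0)
    gathered = source.reshape(-1)[idx].reshape(inp.shape)
    return torch.where(mask, gathered, inp)

inp = torch.zeros(2, 3)
mask = torch.tensor([[True, False, True], [False, True, False]])
src = torch.tensor([1.0, 2.0, 3.0])
assert torch.equal(masked_scatter_decomp(inp, mask, src),
                   inp.masked_scatter(mask, src))
```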

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108803
Approved by: https://github.com/lezcano
ghstack dependencies: #108802
2023-09-12 16:16:05 +00:00
464f9c3725 [meta] Add meta implementation for aten.masked_scatter (#108802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108802
Approved by: https://github.com/lezcano
2023-09-12 16:16:05 +00:00
c3945b5f84 Update HF version to commit hash (6c26faa) (#107400)
Some [errors](https://ossci-raw-job-status.s3.amazonaws.com/log/15968424899) in the [torchinductor hf benchmarks](https://hud.pytorch.org/benchmark/huggingface/inductor_aot_inductor?startTime=Thu,%2010%20Aug%202023%2018:05:47%20GMT&stopTime=Thu,%2017%20Aug%202023%2018:05:47%20GMT&granularity=hour&mode=inference&dtype=bfloat16&lBranch=main&lCommit=384e0d104fd077d31efafc564129660e9b7a0f25&rBranch=main&rCommit=03414081ff7ee011e17ee10f9ddb2584811bf965) should be fixed in the most recent release (for example, this [line](c036c814f4/src/transformers/models/opt/modeling_opt.py (L688)) no longer exists). Additionally, I landed a [commit (6c26faa)](6c26faa159) to the HF transformers repro to fix one of the graph breaks. This PR results in [76% pass rate for the export + aot inductor HF benchmark!](https://hud.pytorch.org/benchmark/compilers?startTime=Thu%2C%2010%20Aug%202023%2022%3A45%3A09%20GMT&stopTime=Thu%2C%2017%20Aug%202023%2022%3A45%3A09%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/hf_version&lCommit=0accaaca2fa70ca2f78c1a587dd4b6750448dd90&rBranch=main&rCommit=03414081ff7ee011e17ee10f9ddb2584811bf965)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107400
Approved by: https://github.com/ezyang, https://github.com/desertfire, https://github.com/malfet
2023-09-12 15:25:28 +00:00
58391aeaf1 [export] Lift constant tensors as buffers (reland) (#109040)
Summary:
When we retrace the graph containing constant tensors, they get lifted as buffer inputs.
AotInductor also wants to lift all the constants as inputs.
Keeping the constants as a separate category adds complexity: we would have to track three kinds of inputs (params, buffers, constants).

Cons: people might care about specifically which buffers are or are not constants.

If people want to know specifically which buffers are constants, we can add an additional field in the graph signature to mark this.

Test Plan: CI

Differential Revision: D49153367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109040
Approved by: https://github.com/zhxchen17
2023-09-12 15:23:00 +00:00
1d32c9c7f2 Revert "Force synced KJT to trace unbacked SymInt (#108960)"
This reverts commit f9a250c35bd061e2e6f4c2d92e2b1b16390e8636.

Reverted https://github.com/pytorch/pytorch/pull/108960 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/108960#issuecomment-1715850779))
2023-09-12 14:37:36 +00:00
8c981c8c4b [ONNX] bump submodule to onnx==1.14.1 (#108895)
Bump the pip and submodule ONNX dependencies to official stable 1.14.1; there were no code changes between 1.14.1rc2 and 1.14.1.

Also bump ORT to run tests against ort-nightly==1.16.0.dev20230908001.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108895
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-09-12 14:20:22 +00:00
5a7c008b30 Revert "[ROCm] Add ROCm AMDGPU support for inductor cpp codegen (#105141)"
This reverts commit 8ff00360a4daab7848307a9a0b1c81b1da873d0c.

Reverted https://github.com/pytorch/pytorch/pull/105141 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/105141#issuecomment-1715629007))
2023-09-12 12:29:55 +00:00
5531a23b20 Don't set requires_grad inside meta function (#108988)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108988
Approved by: https://github.com/lezcano, https://github.com/bdhirsh
2023-09-12 12:24:13 +00:00
bc3f0d341a LazyBatchNorm{1-3}d support dict&set (#109015)
Fixes #105292

As the title shows, LazyBatchNorm doesn't support dict & set; this change keeps it consistent with BatchNorm{1-3}d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109015
Approved by: https://github.com/mikaylagawarecki
2023-09-12 09:09:59 +00:00
41bd0fde7e Revert "Remove fixed skips (#108674)"
This reverts commit ab9fb03d6f674e3592910a0c4cc8208517a71084.

Reverted https://github.com/pytorch/pytorch/pull/108674 on behalf of https://github.com/huydhn due to Sorry for picking this up a bit late, but with https://github.com/pytorch/pytorch/pull/108647 reverted, these tests are failing again. So we need to wait for the PR to reland before we can land this change ([comment](https://github.com/pytorch/pytorch/pull/108674#issuecomment-1715202692))
2023-09-12 08:04:32 +00:00
65a3d398f1 [Pytorch][Vulkan] Call binary_op_scalar when 'other' is a 0-dim tensor (#109035)
Summary:
0-dim tensors are not supported in Vulkan.

If a binary_op_tensor is called with 'other_arg' being a 0-dim tensor, then we extract the scalar out and call binary_op_scalar.
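
In pseudo-Python, the dispatch described above looks roughly like this (`binary_op` / `binary_op_scalar` are stand-ins for the C++ Vulkan ops):

```python
import torch

def binary_op_scalar(self_t, scalar):
    return self_t + scalar

def binary_op(self_t, other):
    # Vulkan has no 0-dim tensors, so a 0-dim `other` is unwrapped to a
    # Python scalar and routed to the scalar variant of the op.
    if isinstance(other, torch.Tensor) and other.dim() == 0:
        return binary_op_scalar(self_t, other.item())
    return self_t + other

x = torch.ones(3)
assert torch.equal(binary_op(x, torch.tensor(2.0)), x + 2.0)
```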

Used to run the [FD model](
https://www.internalfb.com/manifold/explorer/wrist-camera-ml/tree/models/fd-ted-pi/fd-hybrid/fd_hybrid_vulkan.ptl) on [CLI](https://www.internalfb.com/intern/wiki/Malibu/Software/Machine_Learning/PyTorch_On_Device_Catalog/#build-and-run-native-pyt)

Test Plan:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
...

[ RUN      ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:6891: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 339 tests from VulkanAPITest (5308 ms total)

[----------] Global test environment tear-down
[==========] 339 tests from 1 test suite ran. (5308 ms total)
[  PASSED  ] 338 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS
```

Differential Revision: D48672535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109035
Approved by: https://github.com/manuelcandales
2023-09-12 07:35:59 +00:00
59f605be57 Revert "Reland 2: Add PyObject preservation for UntypedStorage (#109039)"
This reverts commit 419e4e17a2c991d17685754a7fb0ddcf7dfdac87.

Reverted https://github.com/pytorch/pytorch/pull/109039 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing linter job in trunk, probably due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/109039#issuecomment-1715147020))
2023-09-12 07:26:11 +00:00
47be61e12b untracked inputs in constraints (#109037)
Differential Revision: [D49157009](https://our.internmc.facebook.com/intern/diff/D49157009/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109037
Approved by: https://github.com/zhxchen17
2023-09-12 06:50:01 +00:00
f9a250c35b Force synced KJT to trace unbacked SymInt (#108960)
Summary:
The basic concept behind this diff is to modify Dynamo's tracing behavior when it encounters a KeyedJaggedTensor that is synced (aka has `_length_per_key` and `_offset_per_key` populated). These fields are lists of integers; ordinarily, Dynamo will optimistically try to specialize on integers, however, for KJTs, we know that these integers will definitely vary from run-to-run. Furthermore, ordinarily, we would also specialize these integers if they are 0/1, but we will frequently expect features in KJTs to be 0/1.

The fix is to detect KJTs and treat these integers as *unbacked integers*. This is NOT a universally sound optimization: when treating these integers as unbacked, we never report them as equal to zero or one. In return, we always generate graphs that generalize no matter the length of values on features. This is enough to trace through APS sparse arch, torchrec_dlrm and some small split-cat examples.

The special integer behavior is triggered by a dynamically scoped `force_unspec_int_unbacked_size_like` variable on TracingContext, which we trigger when we wrap a KJT. There probably are other ways to do this, but this was simple and worked.
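
The "dynamically scoped" flag can be pictured as a context manager flipped around the KJT wrapping code. A toy sketch (only the flag name comes from this diff; TracingContext is stood in by a SimpleNamespace):

```python
import contextlib
from types import SimpleNamespace

ctx = SimpleNamespace(force_unspec_int_unbacked_size_like=False)

@contextlib.contextmanager
def force_unbacked_ints(ctx):
    # While wrapping a KJT's length/offset lists, treat their ints as
    # unbacked size-like symbols instead of specializing on 0/1 values.
    prev = ctx.force_unspec_int_unbacked_size_like
    ctx.force_unspec_int_unbacked_size_like = True
    try:
        yield
    finally:
        ctx.force_unspec_int_unbacked_size_like = prev

with force_unbacked_ints(ctx):           # e.g. while wrapping a KJT
    assert ctx.force_unspec_int_unbacked_size_like
assert not ctx.force_unspec_int_unbacked_size_like
```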

Test Plan:
```
buck2 test mode/dev-nosan //pytorch/benchmark/fb/test_gpu:run_test_gpu
```

from aakhundov

1. first build feed_lower_benchmark:
```
buck2 build --show-output mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true hpc/new/models/feed/benchmark:feed_lower_benchmark
```
2. then run the lowering of the model with it:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" TORCH_COMPILE_DEBUG=1 ../buck-out/v2/gen/fbcode/79c6b019ee0f9469/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --load=manifold://ig_inference_model/tree/user/facebook/fblearner/predictor/960999465/60/gpu_lowering/input.predictor --skip-trt --skip-ait --sync-mode=0 --enable-aot-inductor --lower-presets="ig_stories" --gpu-trace
```
cf https://docs.google.com/document/d/1yD30xYrdmM8r2HTdmXnZTg0-MHVexfVrAa0294m1AUE/edit?pli=1#heading=h.qiv3fp7e6zg0

From torchrec: https://www.internalfb.com/intern/wiki/Torchrec/Development/Testing_production_models/

From ge0405
baseline (without your diff): f477293168
your diff: f477292363

Differential Revision: D49019987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108960
Approved by: https://github.com/voznesenskym
2023-09-12 03:44:24 +00:00
6c8b0dfba6 [export] Add a private interface for customizing decomp. (#109058)
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109058
Approved by: https://github.com/angelayi
2023-09-12 03:05:46 +00:00
15202cc80c [caffe2] Remove cxx override to c++17 (#108687)
Summary: Allow the user to specify the cxx version to use when compiling. For applications that compile with C++20, we wish to also compile this library with C++20 to avoid subtle ODR violations with using different library standards.

Test Plan: Built the project successfully.

Reviewed By: smeenai

Differential Revision: D48636406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108687
Approved by: https://github.com/davidberard98
2023-09-12 02:54:59 +00:00
b1f21399c8 Prerequisite of ATen/native/utils header for C++ extension (#109013)
# Motivate
Without this PR, if we would like to include a header file like ```#include <ATen/native/ForeachUtils.h>``` in our C++ extension, it will raise an error: ```/home/xxx/torch/include/ATen/native/ForeachUtils.h:7:10: fatal error: 'ATen/native/utils/ParamsHash.h' file not found```. We should fix it.

# Solution
Add the ATen/native/utils header files to the build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109013
Approved by: https://github.com/ezyang
2023-09-12 02:30:45 +00:00
85428f5ea5 Fix 0-sized views of tensors in cudagraphs (#109055)
Fixes an internal model. If a tensor with real storage is viewed by a 0-sized tensor, the storage is still kept alive and needs to be accounted for in our storage tracking.
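
A sketch of the aliasing pattern the fix covers (assumes a CUDA build; the assertion value follows from 4 float32 elements):

```python
import torch

base = torch.ones(4, device="cuda")
zero_view = base[2:2]   # 0-sized view into base's real storage
del base
# zero_view has no elements, yet it still keeps the 16-byte storage
# alive, so cudagraphs' storage tracking must account for it.
assert zero_view.numel() == 0
assert zero_view.untyped_storage().nbytes() == 4 * 4
```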

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109055
Approved by: https://github.com/ezyang, https://github.com/xw285cornell
2023-09-12 01:24:43 +00:00
419e4e17a2 Reland 2: Add PyObject preservation for UntypedStorage (#109039)
Relands #103907 after it was reverted. This PR makes the new `ignore_hermetic_tls` argument of `check_pyobj` optional to avoid causing a compilation error in torchdistx

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109039
Approved by: https://github.com/ezyang
2023-09-12 01:19:40 +00:00
2039f30c06 Revert "[inductor] Parallelize Max Autotune step 1: Use Popen (#107982)"
This reverts commit d6856680039e5557b45e4cd6e95f82ca64f6435a.

Reverted https://github.com/pytorch/pytorch/pull/107982 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/107982#issuecomment-1714818307))
2023-09-12 01:12:22 +00:00
c36c2bfcb2 Revert "[inductor] Parallelize Max Autotune step 2: Use all GPUs (#107983)"
This reverts commit 2c61313ff3b9ca585f04a4bb78263f301a8cec27.

Reverted https://github.com/pytorch/pytorch/pull/107983 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/107983#issuecomment-1714816358))
2023-09-12 01:08:08 +00:00
cyy
f150f96255 [Reland] increase clang-tidy coverage in torch/csrc (#108875)
Reland  PR #103058 since there was a time gap between this PR and other PRs in terms of torch/csrc modifications

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108875
Approved by: https://github.com/Skylion007
2023-09-12 00:54:53 +00:00
b6f9d4dbc4 [DCP] Enable nD device_mesh resharding DTensor in DCP and add associated tests (#106230)
This PR:
     1. Drop assert for 1D DeviceMesh check to allow DTensor with nD DeviceMesh when creating write_item.
     2. Add tests for both placement changes and mesh changes for both 1D and 2D scenarios.

cc. @kumpera  @wanchaol  @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106230
Approved by: https://github.com/kumpera
2023-09-12 00:47:58 +00:00
cyy
8025b193a9 Re-enable some Windows tests (#108930)
The tests were disabled long ago. It should be fine to enable them now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108930
Approved by: https://github.com/kit1980
2023-09-12 00:33:19 +00:00
4691cb26b3 Disable compile for massive data pipe test (#109063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109063
Approved by: https://github.com/clee2000
ghstack dependencies: #108846
2023-09-12 00:15:52 +00:00
55a204ebc8 [Easy] log graphs in compiled_autograd if TORCH_LOGS=compiled_autograd (#108991)
[Easy] log graphs in compiled_autograd if TORCH_LOGS=compiled_autograd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108991
Approved by: https://github.com/ezyang
ghstack dependencies: #108846
2023-09-12 00:15:02 +00:00
33c1136f89 Added limit on number of warps for coordesc autotuner (#108997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108997
Approved by: https://github.com/shunting314
2023-09-12 00:14:38 +00:00
241e84bf98 [quant][be] Rewrite xnnpack_quantizer_utils.py to use decorators (#108920)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108920
Approved by: https://github.com/kimishpatel
2023-09-12 00:09:13 +00:00
b2cba439b4 Introduce Tensor overload to linspace and logspace (#104889)
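A usage sketch of what the Tensor overload presumably enables (the values shown assume the same semantics as the scalar overload):

```python
import torch

# start/end may now be 0-dim tensors instead of Python scalars (sketch).
start, end = torch.tensor(0.0), torch.tensor(1.0)
torch.linspace(start, end, steps=5)  # tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])
torch.logspace(start, end, steps=3)  # 10**linspace: tensor([ 1.0000,  3.1623, 10.0000])
```
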
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889
Approved by: https://github.com/zou3519
ghstack dependencies: #107958
2023-09-11 23:30:40 +00:00
405f014c26 [jit] Skip NNAPI, test_ivalue, CPU NNC tests in fbcode (#108937)
Summary:
NNAPI: Internal test infra can't find test_nnapi.py. Easiest solution is to just skip these tests if test_nnapi.py can't be found
test_ivalue: fails due to qscheme op not implemented for CPU backend. In OSS, it doesn't run because it's not included in test_jit.py.
CPU NNC tests: test_working_byte_cpu_float32 is failing, but hard to repro; we don't use CPU NNC internally, so let's just skip CPU NNC tests internally.

Differential Revision: D48041615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108937
Approved by: https://github.com/eellison
2023-09-11 22:42:30 +00:00
293d3b89d8 Add Opinfos for the Tensor overload of linspace/logspace (#107958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107958
Approved by: https://github.com/zou3519
2023-09-11 22:30:19 +00:00
03fd3544a2 fixed lgamma documentation error (#108719)
Fixes #108527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108719
Approved by: https://github.com/zou3519
2023-09-11 22:29:06 +00:00
97d9188178 Special treatment to build AOTInductor with cuda-12 from Meta internal (#108831)
Differential Revision: D49042577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108831
Approved by: https://github.com/bertmaher
2023-09-11 22:16:23 +00:00
29c29339e5 Add torch_lazy_enable_device_data_cache to disable lazy device data cache (#107827)
### Add python binding variables for enabling and disabling

These changes will be used in the pytorch/xla repository for lowering HLO for the AWS Neuron compiler.  For correct tensor lowerings the device cache size must be set to zero.

It is advantageous to be able to enable and disable the cache without deleting it.  This allows use of the XLA device, and HLO lowering in a single python file, isolating cache disablement to a python context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107827
Approved by: https://github.com/JackCaoG, https://github.com/wconstab, https://github.com/bdhirsh
2023-09-11 22:14:39 +00:00
03bf745e1d Fix the parameter error in test_device_mesh.py (#108758)
Fix the parameter error in test_device_mesh.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108758
Approved by: https://github.com/awgu
2023-09-11 21:39:13 +00:00
bb14805bcd fix an incorrect indent in documentation (#108273)
doc for `torch.distributed.send(tensor, dst, group=None, tag=0)` was rendering incorrectly here: https://pytorch.org/docs/stable/distributed.html due to lack of indent (it was interpreting the continuation as a new argument).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108273
Approved by: https://github.com/awgu, https://github.com/kit1980
2023-09-11 21:27:52 +00:00
a4138b1f99 [ez] Fix small type error in run_test (#109036)
This is really small but it has tripped me up at least 3 times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109036
Approved by: https://github.com/kit1980
2023-09-11 21:11:20 +00:00
5c8efa6077 [export] Fix export arg type declaration (#109060)
Summary: It's an arbitrary-length tuple of anything; Tuple[Any] means exactly one element.

Test Plan: ci

Differential Revision: D49161625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109060
Approved by: https://github.com/angelayi
2023-09-11 20:54:05 +00:00
b0656ac81f [pytorch-vulkan] move glsl random utils to Random.h (#108724)
Summary:
I plan to use the Box-Muller method for sampling from the normal distribution to implement `aten::randn_like`, which can use the existing uniform functions, so I move them out to a `random.h`.

https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform

Test Plan:
```
[ttingchulin@95660.od /data/sandcastle/boxes/fbsource (rand_lib)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*uniform*"

BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *uniform*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform
[       OK ] VulkanAPITest.uniform (120 ms)
[----------] 1 test from VulkanAPITest (120 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (120 ms total)
[  PASSED  ] 1 test.
```
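
For reference, a Python sketch of the Box-Muller transform the summary mentions (the eventual implementation will presumably be a GLSL shader; this is just the math):

```python
import math, random

def box_muller():
    # Two independent uniforms on (0, 1] -> two independent standard normals.
    u1 = 1.0 - random.random()   # shift to (0, 1] to avoid log(0)
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

print(box_muller())
```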

Reviewed By: yipjustin

Differential Revision: D48750679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108724
Approved by: https://github.com/yipjustin
2023-09-11 20:27:43 +00:00
e7bd9c5315 [CUDA][CUDA Graphs] Fix CUDAGraph::reset function (#108896)
The following two cases fail due to a small oversight in `CUDAGraph::reset()` that causes failures in the graph destructor
```Python
import torch

x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    x = x + 1

g.reset()
del g
```
that fails with:
```
terminate called after throwing an instance of 'c10::Error'
  what():  uc >= 0 INTERNAL ASSERT FAILED at ".../pytorch/c10/cuda/CUDACachingAllocator.cpp":2157, please report a bug to PyTorch.
```

and reset and subsequent re-capture
```Python
import torch

x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    x = x + 1

g.reset()

with torch.cuda.graph(g):
    x = x + 1
g.replay()
```
which fails with:
```
Traceback (most recent call last):
  File "test_graph.py", line 11, in <module>
    with torch.cuda.graph(g):
  File ".../pytorch/torch/cuda/graphs.py", line 192, in __enter__
    self.cuda_graph.capture_begin(
  File ".../pytorch/torch/cuda/graphs.py", line 77, in capture_begin
    super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
RuntimeError: This CUDAGraph instance already owns a captured graph. To capture a new graph, create a new instance.

```

This PR fixes the `CUDAGraph::reset()` function for the above two use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108896
Approved by: https://github.com/ezyang
2023-09-11 19:49:31 +00:00
fb288aa99b Add Bfloat16 support to CrossKernel.cu (#108941)
Fixes #108940
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108941
Approved by: https://github.com/mikaylagawarecki
2023-09-11 19:05:01 +00:00
5976a08eea [inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.

Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
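
A usage sketch of what now lowers through `ir.Scan` (CUDA only, per the above; the pointwise multiply is the kind of epilogue that can fuse):

```python
import torch

@torch.compile
def prefix_stats(x):
    # cumsum lowers to ir.Scan; the multiply fuses as a pointwise epilogue.
    return x.cumsum(0) * 0.5

prefix_stats(torch.randn(1024, device="cuda"))
```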

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
2023-09-11 18:44:10 +00:00
2bcff92540 Add NestedTensor python subclass (#108314)
Description coming soon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108314
Approved by: https://github.com/jbschlosser
ghstack dependencies: #108808
2023-09-11 18:29:20 +00:00
4a4a2fc1a5 Enable Mypy Checking for torch/_inductor/fx_passes/fuse_attention.py (#107369)
Fixes #105230

Summary:

As suggested in https://github.com/pytorch/pytorch/issues/105230 mypy checking is enabled in torch/_inductor/fx_passes/fuse_attention.py

After Fix:

mypy --follow-imports=skip torch/_inductor/fx_passes/fuse_attention.py Success: no issues found in 1 source file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107369
Approved by: https://github.com/mikaylagawarecki
2023-09-11 18:08:26 +00:00
e276d70451 Revert "Add Opinfos for the Tensor overload of linspace/logspace (#107958)"
This reverts commit 106e0a0ef19c8dad088fc1ec10d7d93d76409352.

Reverted https://github.com/pytorch/pytorch/pull/107958 on behalf of https://github.com/clee2000 due to I think the newly added test test_mps.py::TestConsistencyCPU::test_output_match_logspace_tensor_overload_cpu_complex64 is broken, probably a landrace since the mergebase seems to be 21 days old 106e0a0ef1 https://github.com/pytorch/pytorch/actions/runs/6149523234/job/16685849126 ([comment](https://github.com/pytorch/pytorch/pull/107958#issuecomment-1714309905))
2023-09-11 17:38:04 +00:00
a7f5abeade Revert "Introduce Tensor overload to linspace and logspace (#104889)"
This reverts commit 57e52393213b6b4fba3b334654b96396a2904087.

Reverted https://github.com/pytorch/pytorch/pull/104889 on behalf of https://github.com/clee2000 due to sorry have to revert this to revert https://github.com/pytorch/pytorch/pull/107958 ([comment](https://github.com/pytorch/pytorch/pull/104889#issuecomment-1714305768))
2023-09-11 17:33:48 +00:00
1d42148fee [dynamo] preserve some FX node metadata of GraphModules (#107067)
Requested from @tugsbayasgalan: we want dynamo to preserve some FX node metadata when we trace `GraphModule`s (`nn_module_stack`, `source_fn`, `stack_trace`). This is helpful for the case when we export an aten-level `GraphModule`, add some (possibly non-torch or non-aten) ops, and we want to transform the graph back into an aten-level graph. Without preserving metadata, future passes that look at metadata (e.g. quantization passes) won't work.

This feature also has the additional benefit of being able to preserve origin line of code when `print_readable`'ing a `GraphModule`. This is helpful when debugging graphs that have passed through dynamo several times.

The added unit test demonstrates the added functionality of this PR.

~This PR is currently a proof-of-concept implementation that shows that preserving node metadata across dynamo is possible.~ This PR preserves node metadata across dynamo by doing the following:
- ~inject a counter variable into the `GraphModule` source code, which is incremented every time a node is run~
- Construct a line number -> node index map in `GraphModule` as the source code is being generated.
- pass a list of node metadata and the line number map to dynamo's bytecode analyzer
- ~dynamo traces the counter as a `ConstantVariable`, so when we create a new proxy, we can determine which original node index this proxy corresponds by looking at the value of the traced counter~
- When we create a new proxy, get the current instruction's line number, and get the node index using the line number map
- index into the original node metadata ~using the counter variable's tracked value.~

~Some things that should be addressed off the top of my head:~
- ~Is this feature even desirable? (Do we really want Dynamo to have special behavior for `GraphModules`? Should we expect users to re-export `GraphModules`?)~
- ~Is there a better approach than to use a counter? We considered using node names, line numbers, and assuming that proxies are created in the same order as the nodes, but each of these 3 have shortcomings. For node names, we only have access to new node names, not the old ones. Using line number is fragile. The third is problematic since not all created nodes go through `create_proxy` (e.g. inputs). We currently generate a line number to node index map when the `GraphModule`'s code is generated.~
- ~What's the best way to send data across the "CPython gap"? That is, it is not obvious how to cleanly pass data from dynamo's `eval_frame.py:_TorchDynamoContext.__call__` to `symbolic_convert.py:InstructionTranslatorBase.__init__`. In this PR, we use a global.~
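
A rough sketch of the re-export scenario this enables (the exact torch._dynamo.export calling convention varies across versions, so treat this as illustrative):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.lin(x).relu()

gm, _ = torch._dynamo.export(M(), torch.randn(2, 4))
# Re-trace the GraphModule; with this PR, metadata such as
# nn_module_stack / source_fn / stack_trace should survive.
gm2, _ = torch._dynamo.export(gm, torch.randn(2, 4))
for node in gm2.graph.nodes:
    print(node.op, node.name, node.meta.get("nn_module_stack"))
```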

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107067
Approved by: https://github.com/jansel
2023-09-11 17:11:51 +00:00
ba4782e3c0 cleanup typos; redundant parentheses (#109003)
- minor spelling fixes in `aten/src/ATen/core/TransformationHelper.h`
- remove redundant parentheses in control statements in `torch/distributed/algorithms/_quantization/quantization.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109003
Approved by: https://github.com/davidradl, https://github.com/H-Huang
2023-09-11 17:09:17 +00:00
3b265e021f Support Optional typehint without graph breaking (#108970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108970
Approved by: https://github.com/anijain2305
2023-09-11 16:42:44 +00:00
090fe45e1c Revert "make sure all torch._numpy tests run on CI (#108762)"
This reverts commit 7abeb92796635bd3ee216a0872bddd0395e97d10.

Reverted https://github.com/pytorch/pytorch/pull/108762 on behalf of https://github.com/clee2000 due to sorry but I think the asan test_scalarmath failure is real 7abeb92796 https://github.com/pytorch/pytorch/actions/runs/6132913963/job/16645381921 ([comment](https://github.com/pytorch/pytorch/pull/108762#issuecomment-1714214523))
2023-09-11 16:29:20 +00:00
3efc1882e8 Update CopySlices to not internal assert when grad_output is undefined (#108353)
Fixes https://github.com/pytorch/pytorch/issues/107928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108353
Approved by: https://github.com/albanD
ghstack dependencies: #107296, #107349
2023-09-11 16:26:05 +00:00
e8a402c56e [quant][pt2] Fix and rename move_model_to_eval (#108891)
Summary:
This commit fixes two silent correctness problems with
the current implementation of `move_model_to_eval`:

(1) Previously the user had to manually call `eliminate_dead_code`
before calling `move_model_to_eval`, otherwise the dropout pattern
won't actually get eliminated. This is because subgraph rewriter
complains the match is not self-contained, and so silently does
not do the replacement.

(2) We wish to error when the user calls `model.train()` or
`model.eval()` on an exported model. This error is raised
correctly immediately after export today, but no longer raised
after the user calls prepare or convert.

We fix (1) by moving the `eliminate_dead_code` call into
`move_model_to_eval`, and fix (2) by ensuring the respective
errors are thrown after prepare and convert as well.

Additionally, this commit renames `move_model_to_eval` to
`move_exported_model_to_eval` to be more explicit.

bypass-github-export-checks

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_disallow_eval_train
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_to_eval

Imported from OSS

Differential Revision: D49097293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108891
Approved by: https://github.com/jerryzh168
2023-09-11 15:37:01 +00:00
57e5239321 Introduce Tensor overload to linspace and logspace (#104889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889
Approved by: https://github.com/zou3519
ghstack dependencies: #107958
2023-09-11 15:29:39 +00:00
106e0a0ef1 Add Opinfos for the Tensor overload of linspace/logspace (#107958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107958
Approved by: https://github.com/zou3519
2023-09-11 15:29:39 +00:00
e19a855b4d [HSDP] Fix Node 1 unable receive parameters from Node 0 (#108331)
When using hybrid_shard mode FSDP, state.process_group contains only gpu_0...gpu_7 on node 0, so GPUs on node 1 cannot receive parameters; setting process_group to the default (global) group fixes this issue.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108331
Approved by: https://github.com/awgu
2023-09-11 15:13:28 +00:00
cyy
9a492fc27f Fix unknown c++ flag detection in CMake (#109000)
Unknown -Wno-XXX flags are still appended to GCC via append_cxx_flag_if_supported because of the behavior described in the GCC documentation:
```
When an unrecognized warning option is requested (e.g., -Wunknown-warning),
GCC emits a diagnostic stating that the option is not recognized.
However, if the -Wno- form is used, the behavior is slightly different:
no diagnostic is produced for -Wno-unknown-warning unless other diagnostics are being produced.
This allows the use of new -Wno- options with old compilers,
but if something goes wrong, the compiler warns that an unrecognized option is present.
```
This PR fixes this by detecting the flag in its -WXXX form instead. Unfortunately, third_party/fbgemm/CMakeLists.txt redefines append_cxx_flag_if_supported and our version is overwritten. As a result, we have to re-include utils.cmake to overwrite it again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109000
Approved by: https://github.com/malfet
2023-09-11 08:32:07 +00:00
18225cc6aa inductor: add custom pass hooks in post_grad_passes (#108615)
Supports #107921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108615
Approved by: https://github.com/jansel
2023-09-11 04:13:32 +00:00
a6b153b311 inductor: remove redundant memory copy when view a ExternKernelAlloc buffer (#108635)
When viewing an ExternKernelAlloc buffer, there is always a redundant memory copy:
```
buf0: ExternKernelSchedulerNode(MKLPackedLinear)
buf0.writes = [StarDep(name='buf0')]
buf0.unmet_dependencies = []
buf0.met_dependencies = [StarDep(name='arg1_1'), StarDep(name='constant0'), StarDep(name='constant1')]
buf0.users = [NodeUser(node=SchedulerNode(name='buf1'), can_inplace=True, is_weak=False)]
buf0.node.kernel = torch.ops.mkl._mkl_linear

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep('buf1', c0, {c0: 64})]
buf1.unmet_dependencies = [MemoryDep('buf0', c0, {c0: 64})]
buf1.met_dependencies = []
buf1.users = [NodeUser(node=OUTPUT, can_inplace=False, is_weak=False)]
buf1.group.device = cpu
buf1.group.iteration = ((64,), ())
buf1.sizes = ([64], [])
class buf1_loop_body:
    var_ranges = {z0: 64}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf0', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store
```

and the cpp backend-generated code is:
```
cpp_fused_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(float* in_out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + static_cast<long>(i0));
                tmp0.store(in_out_ptr0 + static_cast<long>(i0));
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (4, 16), (16, 1))
    buf0 = torch.ops.mkl._mkl_linear(arg1_1, constant1, constant0, None, 4)
    del arg1_1
    buf1 = reinterpret_tensor(buf0, (4, 4, 4), (16, 4, 1)); del buf0  # reuse
    cpp_fused_view_0(c_void_p(buf1.data_ptr()))
    return (buf1, )
```

For the ExternKernelAlloc buffer, we can do a real view, rather than a memory copy.
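
After this change, the generated wrapper should presumably reduce to a real view, along these lines (a sketch of the expected output in the same style as the snippet above, not literal generated code; `reinterpret_tensor` and the constants come from inductor's generated-code environment):

```python
def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (4, 16), (16, 1))
    buf0 = torch.ops.mkl._mkl_linear(arg1_1, constant1, constant0, None, 4)
    del arg1_1
    # A real view: no cpp_fused_view kernel, no extra memory traffic.
    return (reinterpret_tensor(buf0, (4, 4, 4), (16, 4, 1)), )
```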

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108635
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #108560
2023-09-11 01:19:37 +00:00
a6ada463ec inductor: make onednn linear inputs are always real contiguous (#108560)
For OneDNN linear, if the packed linear inputs are not in the default contiguous layout, it always falls back to the reference path and gets worse performance. This PR forces its inputs to the actual default contiguous layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108560
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-09-11 01:11:36 +00:00
e716505345 Graph break within check_kwargs for higher order ops #108597 #108730 (#108821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108821
Approved by: https://github.com/anijain2305
2023-09-10 21:09:02 +00:00
1a3a07ac2c [inductor] Enable Mypy Checking for torch/_inductor/codegen/triton_utils.py (#108951)
Summary: Used monkeytype to generate the typehints and enabled mypy checking

Test Plan: `lintrunner torch/_inductor/codegen/*.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108951
Approved by: https://github.com/Skylion007
2023-09-10 19:18:51 +00:00
4a98c898e2 Refactor ios-build-test workflow to support binary release (#108322)
This refactors the logic from CircleCI iOS [build](https://github.com/pytorch/pytorch/blob/main/.circleci/config.yml#L1323-L1344) and [upload](https://github.com/pytorch/pytorch/blob/main/.circleci/config.yml#L1369-L1377) jobs to GHA.

* Nightly artifacts will be available again on `ossci-ios-build` S3 bucket, for example `libtorch_lite_ios_nightly_2.1.0.20230517.zip`.  The last one there was s3://ossci-ios-build/libtorch_lite_ios_nightly_2.1.0.20230517.zip from May 17th
  * [LibTorch-Lite-Nightly](https://github.com/CocoaPods/Specs/blob/master/Specs/c/3/1/LibTorch-Lite-Nightly/1.14.0.20221109/LibTorch-Lite-Nightly.podspec.json) on cocoapods
* Release artifacts will be on `ossci-ios` S3 bucket, for example `s3://ossci-ios/libtorch_lite_ios_1.13.0.zip` from Nov 3rd 2022
  * [LibTorch-Lite](https://github.com/CocoaPods/Specs/blob/master/Specs/c/c/3/LibTorch-Lite/1.13.0.1/LibTorch-Lite.podspec.json) on cocoapods
  * [LibTorch](https://github.com/CocoaPods/Specs/blob/master/Specs/1/3/c/LibTorch/1.13.0.1/LibTorch.podspec.json) on cocoapods

I will clean up Circle CI code in another PR.

### Testing

Generate new release artifacts for testing from main branch.  Simulator testing have all passed.

* With lite interpreter https://github.com/pytorch/pytorch/actions/runs/6093860118
  * https://ossci-ios.s3.amazonaws.com/libtorch_lite_ios_2.1.0.zip
  * https://ossci-ios.s3.amazonaws.com/LibTorch-Lite-2.1.0.podspec

* LibTorch binary can be built without lite interpreter https://github.com/pytorch/pytorch/actions/runs/6103616035 and uses TorchScript, but it has been long dead from my understanding.  The binary can still be built and tested though.
  * https://ossci-ios.s3.amazonaws.com/libtorch_ios_2.1.0.zip
  * https://ossci-ios.s3.amazonaws.com/LibTorch-2.1.0.podspec

### Next step for release

* Once the PR is committed.  I plan to use the workflow dispatch to build the binaries manually on `release/2.1` branch.  Once they looks good, we can publish them on cocoapods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108322
Approved by: https://github.com/atalman
2023-09-10 19:08:15 +00:00
63ae1051e1 MAINT: do not test numpy funcs in torch._numpy (#108807)
Remove testing of numpy functions which torch._numpy does not implement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108807
Approved by: https://github.com/lezcano
2023-09-10 17:18:30 +00:00
cyy
59254c75a1 [Reland] fix c10:TempFile APIs on Windows (#108508)
PR #106656 was reverted due to IOS failures. It seems that IOS builds don't have full support of std::filesystem. This PR discards std::filesystem changes and add temp file creation on Windows. It also moves the platform syscalls into a separate cpp file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108508
Approved by: https://github.com/ezyang
2023-09-10 16:58:41 +00:00
f81eacd30c typo fix strategy_comb in basic_strategy.py (#108972)
Typo fix `startegy_comb` -> `strategy_comb` in `basic_strategy.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108972
Approved by: https://github.com/Skylion007
2023-09-10 15:58:15 +00:00
2c61313ff3 [inductor] Parallelize Max Autotune step 2: Use all GPUs (#107983)
Summary: Step 2 in revamping subprocess autotune to support multiple GPUs: use a pool of subprocesses and distribute benchmark calls across them.

Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107983
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #107982
2023-09-10 15:43:03 +00:00
d685668003 [inductor] Parallelize Max Autotune step 1: Use Popen (#107982)
Summary: Step 1 in revamping subprocess autotune to support multiple GPUs: use Popen to create a new process with an entry point we control so we don't reinterpret the toplevel script.

Test Plan: `python test/inductor/test_max_autotune.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107982
Approved by: https://github.com/eellison, https://github.com/shunting314
2023-09-10 15:43:03 +00:00
89eb7a75a2 CI Sev - pin docker images for A100 workers (#108871)
Pinning docker images, trying to address SEV : https://github.com/pytorch/pytorch/issues/108862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108871
Approved by: https://github.com/huydhn
2023-09-10 13:30:00 +00:00
2b9ad3d5c4 Fix setitem with SymInt (#108873)
Fixes https://github.com/pytorch/pytorch/issues/101939

Several fixes bundled together (a repro sketch follows the list):

1. When we valueToTensor, we only handled non-symbolic inputs and not symbolic inputs. We support symbolic Scalar, so also handle symbolic values.
2. In the symbolic case, we MUST NOT lift_fresh, as you're not going to inline a constant into the graph, it's going to be from a `scalar_tensor` call (so no need to clone it to avoid mutations)
3. In indexing scalarToTensor, we must not use the static "directly read out the scalar contents" logic when the scalar is symbolic
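
A repro-style sketch of the kind of program these fixes enable (hypothetical example, not the original report from the linked issue):

```python
import torch

@torch.compile(dynamic=True)
def f(x, y):
    # y.shape[0] is a SymInt under dynamic shapes; assigning it into a
    # tensor element exercises the symbolic valueToTensor path.
    x[0] = y.shape[0]
    return x

f(torch.zeros(3), torch.randn(5))
```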

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108873
Approved by: https://github.com/jansel
2023-09-10 06:44:22 +00:00
9b12a28d89 [MPS] Implement mul operation for complex types (#108395)
Using existing BinaryKernel template

Add `mul` as well as `kron` and `outer` to list of MPS ops that support complex types

This should add all the missing ops mentioned in https://github.com/pytorch/pytorch/issues/105665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108395
Approved by: https://github.com/albanD
ghstack dependencies: #108393, #108394
2023-09-10 05:39:12 +00:00
c7bb842d35 [MPS] Add complex add/sub (#108394)
Using `view_as_real` and running elementwise ops in resulted tensors
Add `add` and `sub` to list of complex ops that should work on MPS
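
In Python terms, the trick amounts to the following identity (a sketch; the actual change is in the MPS backend):

```python
import torch

def complex_add_via_real(a, b):
    # Complex add/sub is elementwise on the real and imaginary parts, so
    # view each tensor as (..., 2) real, add, and view back as complex.
    return torch.view_as_complex(torch.view_as_real(a) + torch.view_as_real(b))

a, b = torch.randn(4, dtype=torch.cfloat), torch.randn(4, dtype=torch.cfloat)
assert torch.equal(complex_add_via_real(a, b), a + b)
```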
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108394
Approved by: https://github.com/albanD
ghstack dependencies: #108393
2023-09-10 05:39:12 +00:00
92de1d2d02 Revert "[Dynamo][Test]Add a testcase for module with training state (#108750)"
This reverts commit f90444cf0b979ba434391fdd42fb1e2afb98ac34.

Reverted https://github.com/pytorch/pytorch/pull/108750 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it starts failing this test https://github.com/pytorch/pytorch/issues/108838 without https://github.com/pytorch/pytorch/pull/108883 and the latter has been reverted ([comment](https://github.com/pytorch/pytorch/pull/108750#issuecomment-1712708800))
2023-09-10 04:45:00 +00:00
56c2386157 Revert "reland [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108883)"
This reverts commit d4230e55748c66c72e7a17b1cd08540b742b20a5.

Reverted https://github.com/pytorch/pytorch/pull/108883 on behalf of https://github.com/huydhn due to Per the discussion thread on D49122208, reverting this change ([comment](https://github.com/pytorch/pytorch/pull/108883#issuecomment-1712707853))
2023-09-10 04:40:02 +00:00
53a4ca4b58 [MPS][BE] Add dispatch_sync_with_rethrow (#108393)
And enable testing for match_output for complex types.
Most of them should throw an "unsupported XYZ" error, rather than crash.
This fixed several crashes when linalg ops were invoked with complex inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108393
Approved by: https://github.com/kit1980, https://github.com/kulinseth
2023-09-10 02:07:12 +00:00
2b138e4f7d [export] torch.export landing page (#108783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108783
Approved by: https://github.com/avikchaudhuri, https://github.com/gmagogsfm
2023-09-10 01:40:42 +00:00
7abeb92796 make sure all torch._numpy tests run on CI (#108762)
- Add `if __name__ == "__main__": run_tests()` stanzas to test files in `torch_np` folder so that these tests run on CI
- Skip / xfail things smoked out by this change
- remove a stray python file which should not have been added to tests in the first place.
- fix einsum if opt_einsum is present
- add skips for older numpies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108762
Approved by: https://github.com/lezcano
2023-09-09 20:05:27 +00:00
003c5bb156 Add checks to num_layers for RNN, LSTM, GRU (#108853)
Fixes #108223

As the title says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108853
Approved by: https://github.com/mikaylagawarecki
2023-09-09 19:33:52 +00:00
4c503f2451 [Inductor] Extend Pattern Matcher to Match Equivalent Function Invocation (#107832)
Fixes #104391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107832
Approved by: https://github.com/jansel
2023-09-09 19:19:11 +00:00
e4350d6d4e Functools partial support in dynamo (#108846)
The strategy for supporting functools partials is relatively straightforward.

There are 2 cases we need to support:

**1) Functools partials as input**
In this case, we are first seeing the functools partial and it is guaranteed to have a source. As such, the args, keywords, and func of the functools partial are passed through VariableBuilder. As this is the first time we are seeing these objects (as it is an input), we re-enter VariableBuilder with a source referencing the args, keywords, and func as attributes of the input to produce:

- func: A callable VariableTracker (UDF, TorchVariable, etc) depending on the value of `func`
- args: List[VariableTracker] - note, not ListVariableTracker!
- keywords: Dict[str, VariableTracker]

A major benefit of this structure is that it very elegantly matches the args to `call_function`.

We then compose a FunctoolsPartialVariable from the VariableTrackers made above.

**2) Functools partials created within compile**
In this case, we already have all the args as known VTs, and thus just compose a FunctoolsPartialVariable as we do for case (1).

For both (1) and (2) - we propagate all guards from the func, args, and keyword VTs to the FunctoolsPartialVariable
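
A minimal sketch of case (1), assuming an eager backend; the function names here are illustrative, not from the PR:

```python
import functools
import torch

def add_scaled(x, y, *, scale):
    return x + y * scale

partial_fn = functools.partial(add_scaled, scale=2.0)

@torch.compile(backend="eager", fullgraph=True)
def f(fn, x, y):
    # fn arrives as a FunctoolsPartialVariable; its func/args/keywords
    # were rebuilt as individual VariableTrackers (case 1 above).
    return fn(x, y)

out = f(partial_fn, torch.ones(3), torch.ones(3))
```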

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108846
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-09-09 17:25:02 +00:00
8ff00360a4 [ROCm] Add ROCm AMDGPU support for inductor cpp codegen (#105141)
Follows from previous enablement attempt: https://github.com/pytorch/pytorch/pull/101797

Adds support for hsaco binaries in inductor's cpp_wrapper codegen and enables the CUDA tests in test_cpp_wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105141
Approved by: https://github.com/jansel
2023-09-09 16:28:56 +00:00
0f88d93b10 decomposition spectral ops fixes (#108360)
Fixes https://github.com/pytorch/pytorch/issues/105986, https://github.com/pytorch/pytorch/issues/108204, https://github.com/pytorch/pytorch/issues/108205

Fix all issues flagged when making changes for https://github.com/pytorch/pytorch/pull/107421

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108360
Approved by: https://github.com/ezyang
2023-09-09 04:48:09 +00:00
d4230e5574 reland [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108883
Approved by: https://github.com/voznesenskym, https://github.com/huydhn
2023-09-09 03:12:31 +00:00
ed7f9cac91 [inductor] Add CPU-side profiler event names for templates and foreach kernels (#108449)
This passes in the descriptive kernel name as part of the triton_meta dict that gets passed to the CachingAutotuner, for foreach kernels and templates.

Before:
<img width="684" alt="Screenshot 2023-09-01 at 11 56 02 AM" src="https://github.com/pytorch/pytorch/assets/5067123/c14e13fc-0d9e-425a-a08b-613ef42aa264">

After:
<img width="562" alt="Screenshot 2023-09-01 at 2 13 00 PM" src="https://github.com/pytorch/pytorch/assets/5067123/551bb9a9-865b-401e-b6e0-8ebbe5431565">

This PR also refactors the "magic strings" (KERNEL_NAME and DESCRIPTIVE_KRNL_NAME) into an enum in utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108449
Approved by: https://github.com/jansel
2023-09-09 02:11:13 +00:00
311fbe43e6 [DeviceMesh] Fix __getitem__ docstring typo (#108837)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108837
Approved by: https://github.com/wanchaol
2023-09-09 01:46:14 +00:00
7b3efeaf42 Follow-up #108379 (#108905)
Fixes #108379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108905
Approved by: https://github.com/abock
2023-09-09 01:38:36 +00:00
2c3febb273 [dynamo] disable flaky test_unhandled_exception_in_dynamo2 (#108906)
Fix https://github.com/pytorch/pytorch/issues/106028.

The test `test_unhandled_exception_in_dynamo` should cover most cases. The disabled test `test_unhandled_exception_in_dynamo2` covered some weird case that I found when implementing dynamo 3.11.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108906
Approved by: https://github.com/yanboliang
2023-09-09 01:10:09 +00:00
324b23f337 MAINT: torch/_numpy: remove stubs raising NIError (#108902)
Remove remaining stubs. There is no point raising NotImplementedError, now that a missing function triggers a graph break just by being missing in `torch._numpy` namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108902
Approved by: https://github.com/lezcano
2023-09-09 00:11:14 +00:00
b41b189b71 Un-skip the linalg_ldl_solve tests (#108842)
There's a comment that says it segfaults, but it doesn't appear to do so
any more

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108842
Approved by: https://github.com/eellison
2023-09-08 23:34:16 +00:00
a5e1d38025 add check for torch_arg (#108397)
Fixes https://github.com/pytorch/pytorch/issues/108219
Add a check to the torch_arg macro: in_channels/out_channels/groups should be greater than 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108397
Approved by: https://github.com/mikaylagawarecki
2023-09-08 23:18:27 +00:00
af8b04d5f6 Add create_graph_input debug log (#108836)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108836
Approved by: https://github.com/mlazos, https://github.com/voznesenskym
2023-09-08 23:00:57 +00:00
66f67d9a25 Print restart attempt as part of Dynamo log context (#108864)
Now looks like:

```
[2023-09-08 06:04:48,532] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_ATTR foo [ConstantVariable(int), NNModule
Variable()]
[2023-09-08 06:04:48,532] [0/0] torch._dynamo.convert_frame: [INFO] Restarting analysis due to _dynamo/variables/nn_module.py:138 in convert_to_unspecialized
[2023-09-08 06:04:48,533] [0/0_1] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing f /data/users/ezyang/c/pytorch/a.py:6
[2023-09-08 06:04:48,533] [0/0_1] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line f /data/users/ezyang/c/pytorch/a.py:6
```

I'm happy to bikeshed the exact formatting of the attempt number if you
want.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108864
Approved by: https://github.com/mlazos, https://github.com/voznesenskym
2023-09-08 23:00:19 +00:00
703cdd711f Revert "[export] Lift constant tensors as buffers (#108592)" (#108893)
This reverts commit e3407238f6be0583fe6dac7e2c4897f6c4480ed4.

I gave up trying to revert the original PR in the usual way https://github.com/pytorch/pytorch/pull/108592#issuecomment-1712135536, so let's manually revert it then.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108893
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2023-09-08 22:25:10 +00:00
f30f9fec87 Fix the issue described by #106769 (#108340)
Fixes #106769

Align the behavior of the C++ interface with the Python interface

1. Remove some checks in the C++ frontend API, which duplicate the ones below:
50fa5880e8/aten/src/ATen/native/RNN.cpp (L676-L690)
2. Add some checks
3. Support 1D input
4. Add tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108340
Approved by: https://github.com/mikaylagawarecki
2023-09-08 22:22:09 +00:00
8caaa4f4cd Revert "Re-land: Break graph on manual_seed. (#108647)"
This reverts commit c887309437817f39ea3ef484732af427b393899f.

Reverted https://github.com/pytorch/pytorch/pull/108647 on behalf of https://github.com/huydhn due to Ouch, we are hit again by another internal import error from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py#L205-L206 ([comment](https://github.com/pytorch/pytorch/pull/108647#issuecomment-1712230103))
2023-09-08 21:18:00 +00:00
296f015f42 [Dev Container]Add readme for devcontainer (#108848)
Following the PR https://github.com/pytorch/pytorch/pull/108766 , add a README to guide users through the usage of devcontainers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108848
Approved by: https://github.com/drisspg
2023-09-08 21:03:27 +00:00
137afe74e0 Don't fastpath conj copy when conj/neg bit mismatch (#108881)
Fixes https://github.com/pytorch/pytorch/issues/106051

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108881
Approved by: https://github.com/soulitzer
2023-09-08 20:44:43 +00:00
bd1229477d [ONNX] Add initial support for FP8 ONNX export (#107962)
This PR resurrects @tcherckez-nvidia's #106379 with changes to resolve conflicts against newer `main` and defines our own constants for the new ONNX types to [avoid breaking Meta's internal usage of an old ONNX](https://github.com/pytorch/pytorch/pull/106379#issuecomment-1675189340).

- `::torch::onnx::TensorProto_DataType_FLOAT8E4M3FN=17`
- `::torch::onnx::TensorProto_DataType_FLOAT8E5M2=19`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107962
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2023-09-08 20:40:39 +00:00
fa542cc4bb update triton pin (#108104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108104
Approved by: https://github.com/desertfire
ghstack dependencies: #107722
2023-09-08 20:01:57 +00:00
39ff80125f Add support for an operator level thread local observer (#108822)
Summary: Add support for an operator level thread local observer

Test Plan: Verified the interception as part of a pytorch model evaluation with static runtime.

Differential Revision: D49082250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108822
Approved by: https://github.com/davidberard98
2023-09-08 19:32:03 +00:00
68238606f3 Revert "Reland: Add PyObject preservation for UntypedStorage (#103907)"
This reverts commit 56b848157c259b4e53225e2516d603e9c8cfab79.

Reverted https://github.com/pytorch/pytorch/pull/103907 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing torchdistx build which uses check_pyobj here 9c1b9f5cb2/src/python/torchdistx/_C/deferred_init.cc (L87) ([comment](https://github.com/pytorch/pytorch/pull/103907#issuecomment-1712121158))
2023-09-08 19:27:07 +00:00
8d863560bd Allow adding extra dispatch keys to wrapper tensor subclass (#108808)
Updated version of https://github.com/pytorch/pytorch/pull/108313 which has more review comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108808
Approved by: https://github.com/bdhirsh
2023-09-08 18:46:09 +00:00
aa3355da8a Refactor torch.onnx documentation (#108379)
* Distinguish both TorchScript-based exporter (`torch.onnx.export`) and the TorchDynamo-based exporter (`torch.onnx.dynamo_export`) exporters
* Merge ONNX diagnostics page with the exporter page
* Add initial version of a quick overview on the new exporter
* Updates `torch.compiler.html` with the right page for the ONNX Runtime backend for `torch.compile`
* Renamed doc files to clearly identify files belonging to the legacy and newer onnx exporters

Fixes #108274

https://docs-preview.pytorch.org/pytorch/pytorch/108379/index.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108379
Approved by: https://github.com/justinchuby, https://github.com/wschin, https://github.com/malfet
2023-09-08 18:23:48 +00:00
e91f66471c [reland][inductor] Switch to use the runtime interface for AOTInductor testing (#108878)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/108663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108878
Approved by: https://github.com/muchulee8
2023-09-08 17:58:35 +00:00
a81290ccb9 Add DLPack bool support (#108486)
Fixes #94463
Fixes https://github.com/pytorch/pytorch/issues/67081

- [X] Update DLPack header file
- [X] Add testing for DLPack boolean
- [X] Add boolean support to PyTorch's DLPack integration (round trip sketched below)
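
A minimal round-trip sketch of the new bool support:

```python
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

t = torch.tensor([True, False, True])
t2 = from_dlpack(to_dlpack(t))   # works once bool is in the DLPack header
assert t2.dtype == torch.bool and torch.equal(t, t2)
```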

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108486
Approved by: https://github.com/ezyang
2023-09-08 17:55:33 +00:00
b0de6a8002 [quant][executorch] Support inception_v4 in examples (#108382)
Summary: Verified that pt2e quant flow matches the fx flow with executorch backend config

Test Plan:
with-proxy buck2 run executorch/examples/quantization:example -- -m=ic4 --verify

```
[INFO 2023-08-31 16:08:06,923 example.py:77] prepare sqnr: inf
[INFO 2023-08-31 16:08:06,932 example.py:81] quant diff max: 0.0
[INFO 2023-08-31 16:08:06,936 example.py:85] quant sqnr: inf
```

full output: https://www.internalfb.com/intern/paste/P818520579/

Differential Revision: D48889075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108382
Approved by: https://github.com/kimishpatel
2023-09-08 17:39:31 +00:00
25d657c701 Fix possible naming collision issue (#107743)
Summary: As pointed out in https://github.com/pytorch/pytorch/pull/107479, using a set prevents collisions like "a" => "a", "a" => "a_1", "a_1" => "a_1" (but should go to "a_1_1"). We can combine using counters and a set to avoid this problem. Still gets us the performance benefit in the case of collisions with a very minor penalty in a case with no collision.

Test Plan:
Extract this code and run:
```
# New version
from typing import Dict, Set

class Net:
    _net_names_used_counters: Dict[str, int] = {}
    _net_names_used: Set[str] = set()

    @staticmethod
    def current_prefix():
        return "test_prefix"

    @staticmethod
    def _get_next_net_name(basename):
        basename = "/".join(x for x in [Net.current_prefix(), basename] if x)
        idx = Net._net_names_used_counters.get(basename, 0)
        while (name := basename if idx == 0 else f"{basename}_{idx}") in Net._net_names_used:
            idx += 1
        Net._net_names_used_counters[basename] = idx + 1
        Net._net_names_used.add(name)
        return name

print(Net._get_next_net_name("basename"))
print(Net._get_next_net_name("x_basename"))
print(Net._get_next_net_name("basename"))
print(Net._get_next_net_name("basename"))
print(Net._get_next_net_name("x_basename"))
print(Net._get_next_net_name("basename_1"))

> test_prefix/basename
> test_prefix/x_basename
> test_prefix/basename_1
> test_prefix/basename_2
> test_prefix/x_basename_1
> test_prefix/basename_1_1
```

Differential Revision: D48576516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107743
Approved by: https://github.com/zdevito
2023-09-08 17:39:27 +00:00
8990174676 [Dynamo] Should inline __new__ function rather than skipping frame (#108549)
Fixes #107460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108549
Approved by: https://github.com/jansel
2023-09-08 16:51:47 +00:00
9b83402666 Add support for symbolic repeat_interleave (#108763)
Fixes https://github.com/pytorch/pytorch/issues/108195

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108763
Approved by: https://github.com/Chillee
2023-09-08 16:48:32 +00:00
ef2bbe1ae1 Dynamo support for autograd.Function w/ once_differentiable (#108686)
Fixes #106893

There are two main changes (a usage sketch follows this list):
- Before this PR, the function returned by once_differentiable was
included in skipfiles (because its .co_code is
torch/autograd/function.py). This PR adds a mechanism to tell Dynamo
to inline a function, no matter if it is included in skipfiles.
- A bugfix: when we are introspecting the backward, we need to turn the
grad mode off. This is to accurately model the eager-mode semantics:
In eager-mode PyTorch, if second-order gradients were not requested, then
the grad mode is off. torch.compile does not work with higher-order
gradients and just assumes we do first-order gradients, so this is OK.
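
A minimal sketch of the pattern this PR makes traceable; the concrete function is illustrative:

```python
import torch
from torch.autograd.function import once_differentiable

class MySin(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.sin()

    @staticmethod
    @once_differentiable          # previously forced a skip; now inlined
    def backward(ctx, grad):
        (x,) = ctx.saved_tensors
        return grad * x.cos()     # introspected with grad mode off (see above)

@torch.compile(backend="eager", fullgraph=True)
def f(x):
    return MySin.apply(x)

f(torch.randn(3, requires_grad=True)).sum().backward()
```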

Test Plan:
- new test

Differential Revision: [D49064185](https://our.internmc.facebook.com/intern/diff/D49064185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108686
Approved by: https://github.com/voznesenskym
2023-09-08 16:10:32 +00:00
cyy
16c2fb702b fix a CMake syntax warning (#108849)
Fix the CMake Warning
```
(dev) at torch/CMakeLists.txt:389:
Syntax Warning in cmake code at column 115
Argument not separated from preceding token by whitespace.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108849
Approved by: https://github.com/Skylion007
2023-09-08 16:10:12 +00:00
fa8bfe5ca2 Revert "increase clang-tidy coverage in torch/csrc (#103058)"
This reverts commit cdf7f3e78032a17600f701e9153e9bb49fad8ce7.

Reverted https://github.com/pytorch/pytorch/pull/103058 on behalf of https://github.com/atalman due to Sorry for reverting your change, breaks lint ([comment](https://github.com/pytorch/pytorch/pull/103058#issuecomment-1711906915))
2023-09-08 16:07:41 +00:00
cyy
cdf7f3e780 increase clang-tidy coverage in torch/csrc (#103058)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103058
Approved by: https://github.com/Skylion007
2023-09-08 15:07:32 +00:00
2028987bf7 Fix finding Intel MKL on Windows, as well as LAPACK, cuDNN and cuSPARSELt (#108040)
Fixes #108039

Intel MKL is now found correctly:

-- MKL libraries: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/intel64/mkl_intel_lp64.lib;C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/intel64/mkl_sequential.lib;C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/intel64/mkl_core.lib
-- MKL include directory: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/include

and LAPACK too (excerpt from build.ninja):

LINK_LIBRARIES = lib\c10.lib  lib\pthreadpool.lib  lib\cpuinfo.lib  lib\XNNPACK.lib  lib\fbgemm.lib  lib\libittnotify.lib  lib\gloo.lib  lib\foxi_loader.lib  lib\kineto.lib  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\mkl_intel_lp64.lib"  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\mkl_sequential.lib"  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\mkl_core.lib"  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\**mkl_lapack95_lp64.lib**"  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\mkl_intel_lp64.lib"  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\mkl_sequential.lib"  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\mkl_core.lib"  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\mkl_intel_lp64.lib"  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\mkl_sequential.lib"  "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\lib\intel64\mkl_core.lib"

cuSPARSELt is also found correctly:

-- Found CUSPARSELT: C:/Program Files/NVIDIA cuSPARSELt/v0.4/lib/cusparseLt.lib

Also cuDNN include directory is properly added for the test target cuda_cudnn_test:

build caffe2\CMakeFiles\cuda_cudnn_test.dir\__\aten\src\ATen\test\cuda_cudnn_test.cpp.obj: CXX_COMPILER__cuda_cudnn_test_RelWithDebInfo C$:\work\Repos\pytorch\aten\src\ATen\test\cuda_cudnn_test.cpp || cmake_object_order_depends_target_cuda_cudnn_test
  DEFINES = ....
  FLAGS = ....
  INCLUDES = -IC:\work\Repos\pytorch\build\aten\src -IC:\work\Repos\pytorch\aten\src ........... -external:IC:\work\Repos\pytorch\third_party\ittapi\include -external:IC:\work\Repos\pytorch\cmake\..\third_party\eigen -external:I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" -external:IC:\work\Repos\pytorch\torch\include -external:IC:\work\Repos\pytorch\third_party\ideep\include -external:IC:\work\Repos\pytorch\third_party\googletest\googletest\include -external:IC:\work\Repos\pytorch\third_party\googletest\googletest **-external:I"C:\Program Files\NVIDIA cuDNN\include"** -external:IC:\work\Repos\pytorch\cmake\..\third_party\cudnn_frontend\include -external:W0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108040
Approved by: https://github.com/ezyang
2023-09-08 14:41:00 +00:00
366baf690b Back out "[Dynamo x FSDP] Add support for params, buffers, submodules on FSDPManagedNNModuleVariable (#107923)" (#108823)
Summary:
Original commit changeset: 33650f7cb0fb

Original Phabricator Diff: D48833682

Test Plan: See T162942232 for how we figured out that this diff caused significant numeric difference.

Reviewed By: voznesenskym

Differential Revision: D49082219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108823
Approved by: https://github.com/xw285cornell
2023-09-08 14:39:43 +00:00
39180a8414 Comment about prune_dead_locals in dynamo (#107787)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107787
Approved by: https://github.com/mlazos
2023-09-08 14:37:28 +00:00
51c2b587c9 Back out "[PyPer][BE] Fix test_scripted_module in StatCollector" (#108588)
Differential Revision: D48908507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108588
Approved by: https://github.com/jerryzh168
2023-09-08 14:33:58 +00:00
ddbaad6d74 updated pad_sequence type hint (#108765)
Fixes #89623

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108765
Approved by: https://github.com/malfet, https://github.com/zou3519, https://github.com/ezyang
2023-09-08 13:06:03 +00:00
09f7cb0eaf fix typo of mkldnn linear dynamic shape path (#108330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108330
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #108220
2023-09-08 08:47:57 +00:00
a9c663c269 Revert "Flash Attention v2 (#105602)" (#108827)
This reverts commit add45aea1cc8048fd0b43445b28fec7d93281f00.

There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually.

The diff has been reverted internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827
Approved by: https://github.com/kit1980
2023-09-08 07:43:04 +00:00
e40d6ae0a7 Improve torch.cuda.amp type hints (#108630)
Fixes #108629

1. Add the following to their modules' `__all__` so that pyright considers them to be publicly exported:
* [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast)
* [`torch.cuda.amp.GradScaler`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler)
* [`torch.cuda.amp.autocast`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.autocast)
* [`torch.cuda.amp.custom_fwd`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.custom_fwd)
* [`torch.cuda.amp.custom_bwd`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.custom_bwd)
2. Add `overload`s for `torch.cuda.amp.GradScaler.scale` to differentiate when a `torch.Tensor` is returned vs. an `Iterable[torch.Tensor]` is returned based on the type of the `outputs` parameter (sketched below).
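
A small sketch of the two overloads; `enabled=False` lets this run without CUDA:

```python
import torch

scaler = torch.cuda.amp.GradScaler(enabled=False)
loss = torch.tensor(2.0)
scaled: torch.Tensor = scaler.scale(loss)    # Tensor in -> Tensor out
scaled_list = scaler.scale([loss, loss])     # Iterable in -> Iterable out
```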

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108630
Approved by: https://github.com/ezyang
2023-09-08 06:06:25 +00:00
6c7260407b Back out "Horizontally fuse input concatenation (#108115)" (#108793)
Summary:
Original commit changeset: f15956d96311

Original Phabricator Diff: D48996091

Test Plan: Reverting to Unbreak test

Differential Revision: D49065517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108793
Approved by: https://github.com/Chillee
2023-09-08 05:14:57 +00:00
428f5f9e7e Revert "[inductor] Switch to use the runtime interface for AOTInductor testing (#108663)"
This reverts commit 366ce589d0b6bdde8f9ca2087f224b6925841a05.

Reverted https://github.com/pytorch/pytorch/pull/108663 on behalf of https://github.com/Chillee due to Sorry :'( Need to revert to resolve merge conflict for another revert ([comment](https://github.com/pytorch/pytorch/pull/108663#issuecomment-1711076411))
2023-09-08 05:01:27 +00:00
4965fffeda [dynamo] Move global state guards to C++ (#108624)
This combines a bunch of python global state guards into a single C++ guard and switches to checking them 100% of the time.  It also adds a few new guards for things that change inductor's behavior.   Even though we are checking more things, I expect this to be much faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108624
Approved by: https://github.com/anijain2305
2023-09-08 04:07:08 +00:00
258bc2d845 [vision hash update] update the pinned vision hash (#108818)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108818
Approved by: https://github.com/pytorchbot
2023-09-08 04:04:06 +00:00
30a33b76b9 [AOTInductor] Include constants in AOTInductor .so file. (#108473)
Summary:
Include constants in AOTInductor .so file.
Notable differences:
1) Serialize with ctypes instead of torch.storage's native serialization.
2) Use the underlying for_blob instead of from_blob to construct the Tensor.

Test Plan:
Unit tests:
```
test/inductor/test_aot_inductor.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108473
Approved by: https://github.com/angelayi
2023-09-08 03:49:53 +00:00
72f24d0001 Revert "[dynamo][finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108528)"
This reverts commit 34bb74c4cf963f1939b4988b7e76b2cea5e2a914.

Reverted https://github.com/pytorch/pytorch/pull/108528 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it has some nasty merge conflicts after the revert of D48910794. I need to revert this so the conflict could be resolved. Please help rebase this tomorrow and reland the change ([comment](https://github.com/pytorch/pytorch/pull/108528#issuecomment-1711034781))
2023-09-08 03:49:41 +00:00
e45b290127 Revert "Revert "Flash Attention v2 (#105602)" (#108827)"
This reverts commit 24e9bbe22af296048f8242c6112d13cff726c588.

Reverted https://github.com/pytorch/pytorch/pull/108827 on behalf of https://github.com/huydhn due to I need to land this revert properly as there are new failures showing up on trunk ([comment](https://github.com/pytorch/pytorch/pull/108827#issuecomment-1711020924))
2023-09-08 03:25:45 +00:00
24e9bbe22a Revert "Flash Attention v2 (#105602)" (#108827)
This reverts commit add45aea1cc8048fd0b43445b28fec7d93281f00.

There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually.

The diff has been reverted internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827
Approved by: https://github.com/kit1980
2023-09-08 02:54:20 +00:00
8391e3fba4 fixed nn.Module.to type hint (#108767)
Fixes #108675

- [x] adds `str` as option for `device`
- [x] use `typing_extensions.Self` instead of `T`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108767
Approved by: https://github.com/ezyang
2023-09-08 02:40:53 +00:00
f90444cf0b [Dynamo][Test]Add a testcase for module with training state (#108750)
Add a test case for the problem mentioned in https://github.com/pytorch/pytorch/issues/105653. This issue has been addressed by https://github.com/pytorch/pytorch/pull/108528 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108750
Approved by: https://github.com/anijain2305
2023-09-08 02:39:42 +00:00
dec2b267d4 [dynamo] Add "Torch-Compiled Region" profiler event (#108462)
**Motivation**: We already have a `CompiledFunction` event that comes from the autograd.Function added by aot_autograd. However, this doesn't appear during inference, or if none of the inputs to a graph require grad. It also doesn't appear if your backend doesn't use aot_autograd.

This adds a profiler event that will always appear.

<img width="615" alt="Screenshot 2023-09-01 at 4 46 28 PM" src="https://github.com/pytorch/pytorch/assets/5067123/fed90ca9-a8e7-458c-80eb-b4160de55218">

Perf - increase in latency (with profiler turned off) was within noise when I measured a simple cpu-only torch-compiled function that returned `x.view_as(x)`.
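
A minimal sketch of how the event can be observed; the model function is illustrative:

```python
import torch

@torch.compile
def f(x):
    return x.view_as(x)

x = torch.randn(4)
f(x)  # compile on the first call
with torch.profiler.profile() as prof:
    f(x)
# With this PR, a "Torch-Compiled Region" row appears in the trace.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```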

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108462
Approved by: https://github.com/anijain2305
2023-09-08 02:10:17 +00:00
38fcf77a1b Revert "[dynamo] Add BACKEND_MATCH guard to detect and recompile when backend changes (#107337)"
This reverts commit 1a64ec7dd48408d6839a5c2cceb55b0c4be2243b.

Reverted https://github.com/pytorch/pytorch/pull/107337 on behalf of https://github.com/huydhn due to Sorry for reverting your change but inductor perf smoke test starts to regress after this ([comment](https://github.com/pytorch/pytorch/pull/107337#issuecomment-1710974588))
2023-09-08 02:03:48 +00:00
cyy
e3280a7c88 fix returning in void function (#108774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108774
Approved by: https://github.com/Skylion007
2023-09-08 01:51:14 +00:00
a6dab86259 [C10d] Fix TCPSTore::wait to be robust to interruptions. (#108425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108425
Approved by: https://github.com/daulet-askarov, https://github.com/fegin
2023-09-08 00:12:20 +00:00
fc2b980000 [Lint] Auto format graph_module.py (#108594)
Summary: Auto format the `graph_module.py` file

Test Plan: lint

Differential Revision: D48983066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108594
Approved by: https://github.com/jiayisuse
2023-09-08 00:04:21 +00:00
c458fa0d35 Decompose/add reference for view_as_complex (#108005)
Aten source: d4a99631dd/aten/src/ATen/native/ComplexHelper.h (L78)

Documentation reference:
https://pytorch.org/docs/stable/generated/torch.view_as_complex.html

Note: this adds a new primitive `view_of_dtype`, which is trivially implemented, as its meta function is already implemented elsewhere.

Finally, this is not registered as a decomposition (yet), because TorchInductor does not yet support complex types. It should be added once we do.

Closes https://github.com/pytorch/pytorch/issues/108020 as well.
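
For reference, the documented semantics being decomposed:

```python
import torch

x = torch.randn(4, 2)             # last dim of size 2 -> (real, imag) pairs
z = torch.view_as_complex(x)      # shape (4,), dtype complex64
assert torch.equal(torch.view_as_real(z), x)
```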

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108005
Approved by: https://github.com/peterbell10, https://github.com/ezyang
2023-09-07 23:49:20 +00:00
366ce589d0 [inductor] Switch to use the runtime interface for AOTInductor testing (#108663)
Summary: Switch AOTInductor unit tests and integration tests to invoke the same runtime interface. This is only an effort to unify the usage of the runtime. The interface scrutiny will come in later PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108663
Approved by: https://github.com/ezyang
ghstack dependencies: #108653
2023-09-07 23:38:11 +00:00
c55cb29bb2 enforce equalities (#108429)
Sometimes one might want to impose equalities that are not required by guards, e.g. say that you only want square images when rectangular images would suffice.

Curiously we never checked that the concrete values passed in example shapes actually satisfy such equality constraints. So, e.g., you could multiply two tensors of shapes MxK and KxN, specify that M and N must be equal, and then pass examples where they are not equal.

Relatedly, the symbolic shape dimensions for inputs in the exported graph were not forced to be equal.

However, runtime assertions still fire because they take into account all equality constraints. This would result in the strange situation where export would succeed but the exported program with the same example inputs would fail.

This PR fixes these issues.
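
A hypothetical sketch of the failure mode, using the dynamic_dim constraint API as it existed around this PR:

```python
import torch
from torch._export import dynamic_dim, export

def mm(a, b):
    return a @ b

a, b = torch.randn(4, 3), torch.randn(3, 5)
constraints = [dynamic_dim(a, 0) == dynamic_dim(b, 1)]  # demand M == N
# With this PR, export rejects these examples up front (M=4 != N=5)
# instead of succeeding and only failing via a runtime assertion.
exported = export(mm, (a, b), constraints=constraints)
```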

Differential Revision: [D48910918](https://our.internmc.facebook.com/intern/diff/D48910918/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108429
Approved by: https://github.com/zhxchen17
2023-09-07 23:21:35 +00:00
247c603da9 Run mm decomposition tests for CPU and GPU (#108620)
Summary: Run mm decomposition tests for CPU and GPU

One nit - this will suppress CPU tests on hosts that have CUDA (i.e., TEST_CUDA is True) but don't have Triton, because we don't have access to whether the test is actually for CPU or CUDA (which would require reading the device argument).
(This is a general limitation of torch.compile tests because on CUDA they require Triton in the standard config.)

Test Plan: sandcastle, github

Differential Revision: D48998215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108620
Approved by: https://github.com/bertmaher
2023-09-07 23:00:26 +00:00
1a64ec7dd4 [dynamo] Add BACKEND_MATCH guard to detect and recompile when backend changes (#107337)
**Motivation:**
We try to make torch.cond use torch.compile automatically so that we could error out when there is side-effects in the branches and correctly handle the closures.

Before this PR, we have a warning if we don't turn on a config raise_on_backend_change (turning it on gives us an error) for the following code:
```python
def foo()

# Inside torch.cond, we'd like to do something like
torch.compile(foo, backend="eager", fullgraph=True)(...)
...
# Users may then call torch.compile somewhere else.
# Dynamo will use the cached code of foo for "eager" backend
# but we expect dynamo to recompile with "inductor" backend.
torch.compile(foo, backend="inductor")(...)
```

This PR adds a BACKEND_MATCH guard. Effectively, it implements a per-backend cache. In the above example, the cached code for "eager" won't work for "inductor" due to guard check failures and the second torch.compile will do a re-compilation. In the future, it might be useful to have something like a configuration guard that guards against dynamo configuration changes across different compiles (e.g. compile a function with fullgraph=False then compile it again with fullgraph=True).

**Implementation:**
1. We add a guarded_backend_cache and check the most_recent_backend against the backend associated with cached code. We also remove the raise_on_backend_change flag.

2. The newly added context manager and guard add more lines to the debug log, so we raise the upper limit from 50 to 55.

**Test Plan:**
Removed original tests that raise on different backend and add a new test to test whether the BACKEND_MATCH guard can guard against backend change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107337
Approved by: https://github.com/jansel
2023-09-07 22:45:54 +00:00
b26af5d5ac [c10d] Add TCPSTore libuv backend support to c10d rendezvous. (#108284)
This enables libuv for the env:// and tcp:// rendezvous URLs.

Under env, either set the environment variable USE_LIBUV=1
or pass the URL parameter use_libuv=1.

Under tcp, pass the URL parameter use_libuv=1.
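
A minimal single-process sketch, assuming a gloo backend; the use_libuv URL parameter is the one this PR adds for the tcp:// rendezvous:

```python
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500?use_libuv=1",
    rank=0,
    world_size=1,
)
dist.destroy_process_group()
```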
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108284
Approved by: https://github.com/H-Huang, https://github.com/XilunWu
2023-09-07 21:39:58 +00:00
96d269eab1 [Dev Container][CUDA]Fix linker path (#108766)
Building with CUDA in dev container leads to error: `cannot find -lcudart_static`. This is because the libraries are under a custom CUDA_HOME, and `ld` cannot find it.

Updating the `LDFLAGS` environment variable works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108766
Approved by: https://github.com/drisspg
2023-09-07 21:32:39 +00:00
09a17c512d Add better error messaging to scaled_mm (#108454)
Fixes #108411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108454
Approved by: https://github.com/vkuzo
2023-09-07 21:26:47 +00:00
1f20531939 fall back to eager on NotImplementedError (#107863)
Follow-up to https://github.com/pytorch/pytorch/pull/107710:

Help  dynamo fall back to eager when compiling unimplemented numpy constructs:

- arrays of strings
- (arg){min, max} for complex types
- various arguments typed as NotImplemented (`np.ones(4, order="F")` etc)
- numpy functions which torch._numpy does not implement

To test, run (we do not implement arrays of strings)

```
import torch
import numpy as np

@torch.compile(fullgraph=False)
def fn():
    return np.asarray(["L", "U"])
```

and observe it compiles with fullgraph=False and fails with fullgraph=True

Fixes https://github.com/pytorch/pytorch/issues/107970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107863
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-09-07 21:22:20 +00:00
8ba23e48fa Revert "[inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)"
This reverts commit 53a27021c59f1df640cba88bb48f67ee977e07f8.

Reverted https://github.com/pytorch/pytorch/pull/106581 on behalf of https://github.com/atalman due to Sorry for reverting your change, but it broke rocm CI ([comment](https://github.com/pytorch/pytorch/pull/106581#issuecomment-1710776610))
2023-09-07 21:13:42 +00:00
774c822979 Fix expected test failures for predispatch export nested cond and out_dtype (#108715)
Before this PR, we used get_fake_value to get the fake_sub_args and then called op(*fake_sub_args) to get the example value for out_dtype.

This causes a problem when the input proxy's op type is `get_attr`: get_fake_value for a `get_attr` node will actually look at the original param/buffer and **return a real tensor** instead of a fake tensor. This is OK for export, since export's fake_mode allows non_fake_inputs (see [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/output_graph.py#L278)). But it causes a problem when nesting cond with out_dtype, where cond will use torch.compile(full_graph=True) to inspect out_dtype and find that the inputs to op are a mix of FakeTensors and real tensors.

This PR changes how we get the example values from proxies by directly looking at node.meta["example_value"]. This metadata is guaranteed to exist for all proxies during dynamo tracing, so it's safe to use (it's also used by get_fake_value to get fake tensors from args for general ops; see [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/utils.py#L1318)).

Test Plan:
existing tests + remove expected failure for a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108715
Approved by: https://github.com/zou3519
2023-09-07 18:13:00 +00:00
53a27021c5 [inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano
2023-09-07 17:40:45 +00:00
ab9fb03d6f Remove fixed skips (#108674)
These no longer fail with TEST_WITH_TORCHINDUCTOR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108674
Approved by: https://github.com/desertfire
2023-09-07 17:36:56 +00:00
77691e8bc3 Revert "[dynamo][activation checkpointing] Trace through ActivationWrapper (#108599)"
This reverts commit 9efe0f7bf2b397f5ba7ea778fe155e415f54ea67.

Reverted https://github.com/pytorch/pytorch/pull/108599 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but test_ddp_activation_checkpointing is failing distributed ROCm test in trunk ([comment](https://github.com/pytorch/pytorch/pull/108599#issuecomment-1710479387))
2023-09-07 16:47:40 +00:00
c75aec90d3 [dynamo] Record nn_module_stack also for unspecialized nn modules. (#108281)
Summary: Currently the node metadata "nn_module_stack" is only used by export. For some exported models, we still want to retain nn_module_stack for unspecialized modules for various purposes. This diff adds a path to also record nn_module_stack when an unspecialized module has a source available.

Test Plan: test_export_nn_module_stack_patched_module

Differential Revision: D48841193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108281
Approved by: https://github.com/yanboliang, https://github.com/tugsbayasgalan
2023-09-07 15:38:46 +00:00
121cfb60c0 fix the issue described by #108380 (#108759)
Fixes #108380

As the title states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108759
Approved by: https://github.com/awgu
2023-09-07 13:53:36 +00:00
b928e08f3d Initial vmap + NT support with unbind fallback (#106786)
PoC demonstrating vmap + NT based on the [design doc](https://docs.google.com/document/d/1dVVk6TOqz93PLTIneU2T3xaxCs9qZ0MaJyCvOAp_bC0). This PR:
* Allows `BatchedTensorImpl`s to contain NTs
* Introduces a `BatchedNestedTensor` dispatch key for NT-specific batching rules
* Provides a batching rule fallback that unbinds the NTs -> performs computation on constituent -> rebinds results into NT

Restrictions:
* Only supports one level of vmap
* Only supports vmapping over dim=0 for NTs
    * For operations with mixed NT / dense inputs, support is also limited to dim=0 for the dense inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106786
Approved by: https://github.com/zou3519
2023-09-07 13:53:20 +00:00
cyy
e4f3e5434f [Reland] Elimates c10::guts::to_string (#108748)
Reland of PR #108480, after relanding another blocking PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108748
Approved by: https://github.com/huydhn
2023-09-07 13:35:17 +00:00
c887309437 Re-land: Break graph on manual_seed. (#108647)
Trying to re-land #107594.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108647
Approved by: https://github.com/eellison
2023-09-07 12:52:38 +00:00
9f37aec964 Add torch._check_is_size (#108685)
Check the comments for what it does. The key distinction is that if
you feed it an unbacked SymInt, we will also apply the >= 2 assumption
at compile time.
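
A minimal sketch, assuming scalar-output capture is enabled so that `.item()` produces an unbacked SymInt:

```python
import torch
import torch._dynamo.config

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(lengths):
    n = lengths.sum().item()    # unbacked SymInt
    torch._check_is_size(n)     # asserts n is a size; >= 2 assumed at compile time
    return torch.zeros(n)

print(f(torch.tensor([2, 3])).shape)
```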

This will get exercised when I reland
https://github.com/pytorch/pytorch/pull/107788

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108685
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-09-07 12:48:39 +00:00
e1aba2c8c3 [CI] Update the pinned timm version (#108076)
Summary: Unify the pinned timm version and install timm at Docker build time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108076
Approved by: https://github.com/ezyang
2023-09-07 11:38:13 +00:00
b193f295b6 Add capturable ASGD impl (#107857)
Add capturable ASGD impl + test
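
A short sketch, assuming a CUDA device is available; `capturable` is the kwarg this PR adds to ASGD:

```python
import torch

model = torch.nn.Linear(4, 4, device="cuda")
# capturable=True keeps optimizer state on-device so the step can be
# captured in a CUDA graph.
opt = torch.optim.ASGD(model.parameters(), lr=1e-2, capturable=True)
model(torch.randn(2, 4, device="cuda")).sum().backward()
opt.step()
```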

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107857
Approved by: https://github.com/janeyx99
2023-09-07 06:30:30 +00:00
cyy
4fa283e0a4 [Reland] Simplify c10::string_view implementation (#108622)
PR #108479 was reverted because
```
In file included from xplat/caffe2/c10/util/Exception.h:5:
In file included from xplat/caffe2/c10/util/StringUtil.h:6:
xplat/caffe2/c10/util/string_view.h:576:31: error: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Werror,-Wdeprecated]
    basic_string_view<CharT>::npos;
```
Now this is fixed and Wdeprecated generated no warnings on my host.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108622
Approved by: https://github.com/Skylion007
2023-09-07 06:24:22 +00:00
fae9547cb7 [inductor] Refactor wrapper.py (#108653)
Summary: Cherry-pick refactoring from https://github.com/pytorch/pytorch/pull/105331 to make the code review easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108653
Approved by: https://github.com/ezyang, https://github.com/khabinov
2023-09-07 05:27:59 +00:00
6a448816f5 [fx][split] Copy node metadata for placeholders (#107981)
- Follow-up to #107248 which copies metadata for placeholder nodes in the top-level FX graph
- Currently, top-level placeholders do not have their metadata copied over, causing loss of `TensorMetadata` in some `torch.compile` backends

Fixes https://github.com/pytorch/TensorRT/issues/2258
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107981
Approved by: https://github.com/angelayi
2023-09-07 04:44:17 +00:00
56b848157c Reland: Add PyObject preservation for UntypedStorage (#103907)
This relands #97470 after #102553 reverted it. This PR attempts to fix the internal failure by avoiding an unnecessary intermediate storage buffer allocation in `c10::newStorageImplFromRefcountedDataPtr`.

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103907
Approved by: https://github.com/ezyang
2023-09-07 04:24:11 +00:00
35974234c4 [inductor] simplify time_and_log fallback (#108489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108489
Approved by: https://github.com/eellison
ghstack dependencies: #108468
2023-09-07 04:23:36 +00:00
96dd173fa0 [inductor] simplify cudagraph_fail_reasons printing (#108468)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108468
Approved by: https://github.com/eellison
2023-09-07 04:23:36 +00:00
7bc25e38c0 [HSDP] Raise error when HSDP device_mesh has a parent_mesh (#108603)
Since we don't support HSDP + TP yet, raise an error during HSDP initialization if the device_mesh passed in has a parent mesh.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108603
Approved by: https://github.com/awgu
2023-09-07 04:17:10 +00:00
275e71c562 [inductor][easy] Enable Mypy Checking in torch/_inductor/kernel/ (#108678)
Summary: Looks like these already pass (and torch/_inductor/kernel/mm_plus_mm_new.py does not exist)

Test Plan: `lintrunner torch/_inductor/kernel/mm.py torch/_inductor/kernel/bmm.py torch/_inductor/kernel/__init__.py torch/_inductor/kernel/mm_plus_mm.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108678
Approved by: https://github.com/eellison
2023-09-07 03:29:22 +00:00
3f88e3105f Reland: Remove remaining global set_default_dtype calls from tests (#108088)
Fixes #68972

Relands #107246

To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assert is only done if `TestCase._default_dtype_check_enabled == True`. `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that have required changes for this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
2023-09-07 03:04:34 +00:00
54e73271c7 When patching dynamic shapes test class, don't run the original tests (#108681)
redo of https://github.com/pytorch/pytorch/pull/103523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108681
Approved by: https://github.com/ezyang
2023-09-07 02:13:59 +00:00
027e3b7910 [Forward-fix] check if source is None when using tensor out variants (#108700)
Summary: As title

Test Plan: Sandcastle

Reviewed By: JacobSzwejbka

Differential Revision: D49029357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108700
Approved by: https://github.com/angelayi
2023-09-07 01:51:02 +00:00
34bb74c4cf [dynamo][finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108528)
**This PR is a 99% copy paste of Sam Gross** (@colesbury) work at https://github.com/pytorch/pytorch/pull/100642. Copied from there

--------
The NN_MODULE guard now subsumes guards on Module attributes. The check_fn will fail if the module attributes are changed (such as Module.training), parameters, submodules, and buffers are added or removed, and if fields are changed on the type itself.

This gives up specificity in the guard check -- if any field is changed the check_fn fails -- for faster overall checks.
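
A minimal sketch of the behavior the guard covers; the module and backend are illustrative:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2 if self.training else x + 1

m = M()
cm = torch.compile(m, backend="eager")
cm(torch.randn(2))   # compiles; NN_MODULE guard captures m's state
m.eval()             # mutating .training makes the check_fn fail
cm(torch.randn(2))   # recompiles, now taking the eval branch
```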

-----

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108528
Approved by: https://github.com/ezyang
2023-09-07 01:45:47 +00:00
d830e4658a [export] Fix unlifting pass param name handling. (#108659)
Summary: Fixing an internal test.

Differential Revision: D49014757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108659
Approved by: https://github.com/huydhn
2023-09-07 01:39:07 +00:00
d301fb4022 Fix broken doc tests after #108482 (#108725)
Tiny fix so I don't wanna revert the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108725
Approved by: https://github.com/kit1980
2023-09-07 01:24:53 +00:00
e3407238f6 [export] Lift constant tensors as buffers (#108592)
When we retrace the graph containing constant tensors, they get lifted as buffer inputs.
AotInductor also wants to lift all the constants as inputs.
Treating constants as a separate category adds complexity: we would then have to keep track of three kinds of inputs (params, buffers, constants).

Cons: people might care about which buffers are or are not constants.

If people want to know specifically which buffers are constants, we can add an additional field in the graph signature to mark this.

Differential Revision: [D49017872](https://our.internmc.facebook.com/intern/diff/D49017872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108592
Approved by: https://github.com/zhxchen17
2023-09-07 01:14:30 +00:00
43527d41a2 Revert "Remove fixed skips (#108674)"
This reverts commit 518cfda2dd0e940603c74717b4cb33493a9ec908.

Reverted https://github.com/pytorch/pytorch/pull/108674 on behalf of https://github.com/huydhn due to Sorry for reverting this, but one test is failing on inductor 518cfda2dd, and it seems easier to revert this than disabling the test ([comment](https://github.com/pytorch/pytorch/pull/108674#issuecomment-1709310192))
2023-09-07 00:56:46 +00:00
27fe45eaf6 [inductor][easy] Enable Mypy Checking for torch/_inductor/decomposition.py (#108682)
Summary: Looks like one simple type mismatch between `get_decompositions()` and `remove_decompositions()`

Test Plan: `lintrunner torch/_inductor/decomposition.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108682
Approved by: https://github.com/eellison
2023-09-07 00:48:55 +00:00
9efe0f7bf2 [dynamo][activation checkpointing] Trace through ActivationWrapper (#108599)
Fixes https://github.com/pytorch/pytorch/issues/108269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108599
Approved by: https://github.com/rohan-varma
2023-09-07 00:32:18 +00:00
c1877e99c5 [Quant] Move to BFS instead of DFS to check for connectedness (#108572)
Summary:
Using DFS to check whether two nodes are connected is very slow.
Using BFS instead makes it much faster.

Test Plan:
https://gist.github.com/leslie-fang-intel/9cd828623f567a3afbf41564d3546398

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D48971710](https://our.internmc.facebook.com/intern/diff/D48971710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108572
Approved by: https://github.com/jerryzh168, https://github.com/osalpekar
2023-09-07 00:26:28 +00:00
2a40fe2dbf [experimental] use EXCEPT_FOR env to suppress CPU tests from GPU RE (#108672)
Summary:
[experimental] use EXCEPT_FOR env to suppress CPU tests from GPU RE -- alternative implementation to D48997976 using preexisting PYTORCH_TESTING_DEVICE_EXCEPT_FOR facility and building remaining logic (for assert-positive listers like test_transformers)  on top of that.

Goal: save ~100 GPU (10% of capacity), enables us to fund more aggressive PyPer unit testing on GPU RE

Test Plan: sandcastle, github

Differential Revision: D48998582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108672
Approved by: https://github.com/bertmaher
2023-09-06 23:33:18 +00:00
6a304ed1f2 Revert "Skip ROCm jobs on PR (for now) (#108083)"
This reverts commit 9fdb5ef26b5c560d80d002ca9fea9632a523b94b.

Reverted https://github.com/pytorch/pytorch/pull/108083 on behalf of https://github.com/huydhn due to ROCm queue looks better now, reverting this to see if the queue looks ok before picking up https://github.com/pytorch/test-infra/issues/4516 ([comment](https://github.com/pytorch/pytorch/pull/108083#issuecomment-1709222748))
2023-09-06 22:47:25 +00:00
e73ec92ad2 Minor fixs to make torchbench runable on torch/xla (#107919)
`import torch_xla.core.xla_model as xm` no longer triggers the xla runtime to init, hence we explicitly create the device here. This is a workaround for https://github.com/pytorch/xla/issues/4174.

The `is_correct` reference has been deleted; I think it was dead code.

After this patch, I am able to run

```
python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=openxla --only resnet50
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107919
Approved by: https://github.com/shunting314, https://github.com/wconstab
2023-09-06 22:35:53 +00:00
518cfda2dd Remove fixed skips (#108674)
These no longer fail with TEST_WITH_TORCHINDUCTOR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108674
Approved by: https://github.com/desertfire
2023-09-06 22:33:43 +00:00
e6042db0f1 Try to use linux.arm64.2xlarge runners (#107672)
Try to use linux.arm64.2xlarge runners.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107672
Approved by: https://github.com/atalman
2023-09-06 22:06:57 +00:00
cd6a332bc5 Use linux.24xlarge for conda linux nightly builds (#108695)
Fixes https://github.com/pytorch/pytorch/issues/108607
CI test: https://github.com/pytorch/pytorch/pull/108666
Lowering memory limit did not worked: https://github.com/pytorch/pytorch/pull/108669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108695
Approved by: https://github.com/drisspg, https://github.com/seemethere, https://github.com/huydhn
2023-09-06 21:39:35 +00:00
d856f3b47d [export] Change _generate_new_graph_signature (#108571)
Summary:
Previously `_generate_new_graph_signature` had the assumption that all transformations were not in place. However, this is an incorrect assumption leading to mysterious failures when running passes doing in-place modifications.

This function is technically only needed in the case where the user output node or user input node name is changed. For example, if the user output node was "add" but a pass changes all the "add"s to "mul"s, then the output node will now be named "mul", which we have to update.

For cases where users change the number of user inputs/outputs, number of parameters/buffers, or the names of parameters/buffers it will require extra work on the user's side to update the graph signature, since there is no automatic way for us to detect where to put what.

Note: this doesn't actually change the names for the buffers_to_mutate part of the graph signature, but we're going to assume this is rare, because implementing auto-fixing for that is a little hard...

Test Plan: Running `buck test fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:` on top of D48710125, https://www.internalfb.com/intern/testinfra/testrun/5066549776877081

Differential Revision: D48917505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108571
Approved by: https://github.com/zhxchen17
2023-09-06 21:39:26 +00:00
089950b83a Fix inductor sub with symbolic integers. (#108518)
Fix: #108159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108518
Approved by: https://github.com/peterbell10
2023-09-06 21:01:34 +00:00
3f74e57e34 add packaging to requirements.txt (#108554)
As stated in https://github.com/pytorch/pytorch/pull/107207#issuecomment-1700674065, packaging is not a built-in Python module and needs to be added to requirements.txt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108554
Approved by: https://github.com/albanD
2023-09-06 20:29:46 +00:00
8a76f8e6fe Enable mypy checking in torch/_inductor/sizevars.py (#107862)
Fixes #105230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107862
Approved by: https://github.com/eellison
2023-09-06 19:43:07 +00:00
32a16d4999 [quant][pt2e] Support int16 quantization (#108453)
Summary:
Previously we could only use native PyTorch int dtypes that have corresponding quantized dtypes (e.g. quint8, qint8). This
PR removes that assumption in observers/fake_quants so that users can use all PyTorch native dtypes (except int64, which we can add later if needed);
the main addition here is int16.
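
A minimal sketch, assuming MinMaxObserver accepts the newly allowed dtype:

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

obs = MinMaxObserver(dtype=torch.int16,
                     quant_min=-(2**15), quant_max=2**15 - 1)
obs(torch.randn(16))
scale, zero_point = obs.calculate_qparams()
```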

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108453
Approved by: https://github.com/kimishpatel
2023-09-06 19:31:20 +00:00
bee7e78130 [PT2 Inference] Prototype of Inference Runtime (#108482)
Summary:
This diff demonstrates a simplified E2E workflow for PT2 Inference stack:
1. Model author with `torch.export()`
2. Model processing with `aot_inductor.compile()`
3. Model served with a new Inference Runtime API, named `ModelRunner`

`torch.export()` and `aot_inductor.compile()` produces a zip file using `PyTorchStreamWriter`.
Runtime reads the zip file with `PyTorchStreamReader`.
The zip file contains
 {F1080328179}
More discussion on packaging can be found in https://docs.google.com/document/d/1C-4DP5yu7ZhX1aB1p9JcVZ5TultDKObM10AqEtmZ-nU/edit?usp=sharing

Runtime can now switch between two Execution modes:
1. Graph Interpreter mode, implemented based on Sigmoid's Executor
2. AOTInductor mode, implemented based on FBAOTInductorModel

Test Plan:
buck2 run  mode/dev-nosan mode/inplace -c fbcode.enable_gpu_sections=True //sigmoid/inference/test:e2e_test

Export and Lower with AOTInductor
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/inference:export_package

Run with GraphInterpreter and AOTInducotr
buck2 run mode/dev-nosan //sigmoid/inference:main

Reviewed By: suo

Differential Revision: D47781098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108482
Approved by: https://github.com/zhxchen17
2023-09-06 19:28:58 +00:00
5a4fe05a15 Revert "Force synced KJT to trace unbacked SymInt (#107788)" (#108684)
This reverts commit 3b92ef814de4571a125294f2aa95843d7d2e2aea.  So let's manually revert it instead.

(Not sure why the bot doesn't work on https://github.com/pytorch/pytorch/pull/107788)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108684
Approved by: https://github.com/ezyang
2023-09-06 19:15:45 +00:00
1aacbaed8b Revert "[export] Fix dict.get() to dict.setdefault() for param lookup. (#108587)"
This reverts commit c99a70c8dfc98dd5e6905990c41194c2ceb1318b.

Reverted https://github.com/pytorch/pytorch/pull/108587 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing one internal test.  Please take a look at the diff D48995555 for more details ([comment](https://github.com/pytorch/pytorch/pull/108587#issuecomment-1708933010))
2023-09-06 19:05:01 +00:00
27d5dcf589 Revert "Use global variables to register the return_types namedtuples (#107000)"
This reverts commit ae8eb7a3f9aee106affca3b27c1f4031bd216730.

Reverted https://github.com/pytorch/pytorch/pull/107000 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing internal build ([comment](https://github.com/pytorch/pytorch/pull/107000#issuecomment-1708862325))
2023-09-06 18:13:23 +00:00
e5e653a660 Revert "docs: Match open bracket with close bracket in unsqueeze (#95215)"
This reverts commit 9d04d376d81be2f01e5ea6b68943390346f2494c.

Reverted https://github.com/pytorch/pytorch/pull/95215 on behalf of https://github.com/kit1980 due to Incorrect assumptions ([comment](https://github.com/pytorch/pytorch/pull/95215#issuecomment-1708852420))
2023-09-06 18:04:10 +00:00
20812d69e5 Fix extension rebuilding on Linux (#108613)
On Linux, CUDA header dependencies are not correctly tracked. After you modify a CUDA header, affected CUDA files won't be rebuilt. This PR will fix this problem.

```console
$ ninja -t deps
rep_penalty.o: #deps 2, deps mtime 1693956351892493247 (VALID)
    /home/qc/Workspace/NotMe/exllama/exllama_ext/cpu_func/rep_penalty.cpp
    /home/qc/Workspace/NotMe/exllama/exllama_ext/cpu_func/rep_penalty.h

rms_norm.cuda.o: #deps 0, deps mtime 1693961188871054130 (VALID)

rope.cuda.o: #deps 0, deps mtime 1693961188954388632 (VALID)

cuda_buffers.cuda.o: #deps 0, deps mtime 1693961188797719768 (VALID)

...
```

Historically, this line of code has been changed twice. It was first implemented in #49344 without `if IS_WINDOWS`, just like now. Then in #56015 someone added `if IS_WINDOWS` for an unknown reason. That PR has no description, so I don't know what bug was encountered. I don't think there's any bug with these flags on Linux, at least today; CMake generates exactly the same flags for CUDA.

```ninja
#############################################
# Rule for compiling CUDA files.

rule CUDA_COMPILER__cpp_cuda_unscanned_Debug
  depfile = $DEP_FILE
  deps = gcc
  command = ${LAUNCHER}${CODE_CHECK}/opt/cuda/bin/nvcc -forward-unknown-to-host-compiler $DEFINES $INCLUDES $FLAGS -MD -MT $out -MF $DEP_FILE -x cu -c $in -o $out
  description = Building CUDA object $out
```

where `-MD` is short for `--generate-dependencies-with-compile` and `-MF` is short for `--dependency-output`. This can be verified with `nvcc --help`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108613
Approved by: https://github.com/ezyang
2023-09-06 17:58:21 +00:00
4e042cfed5 Improve triton bsr_dense_mm performance on column-major ordered inputs with float32 dtype (#108512)
As in the title.

The bsr_dense_mm performance on inputs using column-major storage order is relevant for the `linear(x, W)` operation, which for BSR weights is defined as `bsr_dense_mm(W, x.transpose(-2, -1)).transpose(-2, -1)`, so the second argument to `bsr_dense_mm` is a strided tensor using column-major storage order when `x` is C-contiguous.

For large inputs (size > 1000) and moderate sparsity in the BSR input, the speed up can be more than 3 times, as illustrated in the following figure (raw data: [bench_bsr_dense_mm_1_results.txt](https://github.com/pytorch/pytorch/files/12512245/bench_bsr_dense_mm_1_results.txt)):

![bench_bsr_dense_mm_1](https://github.com/pytorch/pytorch/assets/402156/c6372008-dfae-4d26-b119-2c3c944a74ae)

For small inputs (size=512), there exists a slight degradation of performance.

For row-major ordered inputs, there is no change in performance (see raw data above).

For inputs with float16 dtype, there is no considerable change in performance (see blue marks in the figure).
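A rough sketch of the pattern described above, assuming a CUDA device and the private `torch.sparse._triton_ops` entry point (sizes are arbitrary, and the API may change since it is internal):

```python
import torch
from torch.sparse._triton_ops import bsr_dense_mm  # private API, subject to change

W = torch.randn(1024, 1024, device="cuda").to_sparse_bsr(blocksize=(32, 32))
x = torch.randn(512, 1024, device="cuda")

# x.transpose(-2, -1) is column-major ordered when x is C-contiguous,
# which is exactly the case this PR speeds up.
y = bsr_dense_mm(W, x.transpose(-2, -1)).transpose(-2, -1)
```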

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108512
Approved by: https://github.com/cpuhrsch
2023-09-06 17:30:06 +00:00
1dabfb68e7 Add TORCH_API to expose RPC module functions for RPC module device extension (#108553)
At present, following the existing TensorPipe RPC backend, we implement our own RPC communication backend in our extension package. During development we found that these functions are not exposed, and using them directly causes undefined-symbol problems in our extension package.

This PR adds the TORCH_API macro to the functions required to implement a custom TensorPipe agent in the RPC module, exposing them to developers. At the same time, we think this risk is very controllable and hope it can be merged into version 2.1.

cc
@albanD, @kumpera
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108553
Approved by: https://github.com/kumpera, https://github.com/albanD
2023-09-06 17:24:46 +00:00
e471c12a01 Enable mypy checking in torch/_inductor/__init__.py (#108336)
Fixes #105230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108336
Approved by: https://github.com/ezyang
2023-09-06 17:14:54 +00:00
738106c1f7 Torchbench model tolerance changes (#108598)
Move detectron2_fcos_r_50_fpn to amp. The minifier showed the following snippet as causing the divergence, where inductor has better numerics than eager:

```
import torch

def foo(x):
    return x > .2

inp = torch.tensor([.2002], device="cuda", dtype=torch.bfloat16)
print(foo(inp))

print(torch.compile(foo)(inp))
```

doctr_reco_predictor had very minimal divergence (.002 vs .001 required), bumping tolerance here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108598
Approved by: https://github.com/shunting314
2023-09-06 16:52:29 +00:00
aa89f0a1fd [Doc] Move Dynamo IPEX backend to training/inference category (#108643)
As title.
Since the Dynamo IPEX backend supports training, move it to the category above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108643
Approved by: https://github.com/msaroufim
2023-09-06 15:57:12 +00:00
79bc4eeb2b Fix empty vector segfault during version parsing in quantized serialization (#108418)
Hi!

I've been fuzzing different pytorch modules with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch), and found a SEGV that occurs during data parsing for quantized conv deserialization. The crash occurs because of an empty `optional` vector.

Docker to reproduce found error: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).

### PoC:
[crash-aaa72b1c1431ac556118e34099ba163052dc0f96.txt](https://github.com/pytorch/pytorch/files/12499249/crash-aaa72b1c1431ac556118e34099ba163052dc0f96.txt)

### ASAN report
```
==1003193==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000cbd1b1 bp 0x7fffffff8490 sp 0x7fffffff7a30 T0)
==1003193==The signal is caused by a READ memory access.
==1003193==Hint: address points to the zero page.
    #0 0xcbd1b1 in c10::optional_base<at::Tensor>::optional_base(c10::optional_base<at::Tensor> const&) /pytorch/c10/util/Optional.h:222:17
    #1 0x2b32336 in c10::optional<at::Tensor>::optional(c10::optional<at::Tensor> const&) /pytorch/c10/util/Optional.h:631:3
    #2 0x2b32336 in std::tuple<long, std::vector<long, std::allocator<long> >, std::vector<c10::optional<at::Tensor>, std::allocator<c10::optional<at::Tensor> > > > parse_conv_serialized_state<2u>(c10::IValue) /pytorch/aten/src/ATen/native/quantized/cpu/conv_serialization.h:183:17
    #3 0x2b30276 in int register_conv_params<2>()::'lambda'(c10::IValue)::operator()(c10::IValue) const /pytorch/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp:410:49
    #4 0x2b30014 in std::enable_if<!(std::is_member_pointer<std::decay<int register_conv_params<2>()::'lambda'(c10::IValue) const&>::type>::value), std::invoke_result<int register_conv_params<2>()::'lambda'(c10::IValue) const&, c10::IValue>::type>::type c10::guts::invoke<int register_conv_params<2>()::'lambda'(c10::IValue) const&, c10::IValue>(int register_conv_params<2>()::'lambda'(c10::IValue) const&, c10::IValue&&) /pytorch/c10/util/C++17.h:203:10
    #5 0x2b2f7e7 in torch::class_<ConvPackedParamsBase<2> >& torch::class_<ConvPackedParamsBase<2> >::def_pickle<int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&), int register_conv_params<2>()::'lambda'(c10::IValue)>(int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)&&, int register_conv_params<2>()::'lambda'(c10::IValue)&&)::'lambda'(c10::tagged_capsule<ConvPackedParamsBase<2> >, c10::IValue&&)::operator()(c10::tagged_capsule<ConvPackedParamsBase<2> >, c10::IValue&&) const /pytorch/torch/custom_class.h:328:11
    #6 0x2b2f570 in c10::guts::infer_function_traits<int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)>::type::return_type torch::detail::call_torchbind_method_from_stack<torch::class_<ConvPackedParamsBase<2> >& torch::class_<ConvPackedParamsBase<2> >::def_pickle<int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&), int register_conv_params<2>()::'lambda'(c10::IValue)>(int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)&&, int register_conv_params<2>()::'lambda'(c10::IValue)&&)::'lambda'(c10::tagged_capsule<ConvPackedParamsBase<2> >, c10::IValue&&), false, 0ul, 1ul>(int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)&, std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::integer_sequence<unsigned long, 0ul, 1ul>) /pytorch/torch/custom_class_detail.h:139:10
    #7 0x2b2f408 in c10::guts::infer_function_traits<int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)>::type::return_type torch::detail::call_torchbind_method_from_stack<torch::class_<ConvPackedParamsBase<2> >& torch::class_<ConvPackedParamsBase<2> >::def_pickle<int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&), int register_conv_params<2>()::'lambda'(c10::IValue)>(int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)&&, int register_conv_params<2>()::'lambda'(c10::IValue)&&)::'lambda'(c10::tagged_capsule<ConvPackedParamsBase<2> >, c10::IValue&&), false>(int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)&, std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/custom_class_detail.h:153:10
    #8 0x2b2f408 in torch::detail::BoxedProxy<void, torch::class_<ConvPackedParamsBase<2> >& torch::class_<ConvPackedParamsBase<2> >::def_pickle<int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&), int register_conv_params<2>()::'lambda'(c10::IValue)>(int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)&&, int register_conv_params<2>()::'lambda'(c10::IValue)&&)::'lambda'(c10::tagged_capsule<ConvPackedParamsBase<2> >, c10::IValue&&)>::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >&, torch::class_<ConvPackedParamsBase<2> >& torch::class_<ConvPackedParamsBase<2> >::def_pickle<int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&), int register_conv_params<2>()::'lambda'(c10::IValue)>(int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)&&, int register_conv_params<2>()::'lambda'(c10::IValue)&&)::'lambda'(c10::tagged_capsule<ConvPackedParamsBase<2> >, c10::IValue&&)&) /pytorch/torch/custom_class_detail.h:174:5
    #9 0x2b2f38d in torch::jit::Function* torch::class_<ConvPackedParamsBase<2> >::defineMethod<torch::class_<ConvPackedParamsBase<2> >& torch::class_<ConvPackedParamsBase<2> >::def_pickle<int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&), int register_conv_params<2>()::'lambda'(c10::IValue)>(int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)&&, int register_conv_params<2>()::'lambda'(c10::IValue)&&)::'lambda'(c10::tagged_capsule<ConvPackedParamsBase<2> >, c10::IValue&&)>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::'lambda'(std::vector<c10::IValue, std::allocator<c10::IValue> >&)::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/custom_class.h:407:7
    #10 0x2b2f38d in int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&) std::__invoke_impl<void, torch::jit::Function* torch::class_<ConvPackedParamsBase<2> >::defineMethod<torch::class_<ConvPackedParamsBase<2> >& torch::class_<ConvPackedParamsBase<2> >::def_pickle<int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&), int register_conv_params<2>()::'lambda'(c10::IValue)>(int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&)&&, int register_conv_params<2>()::'lambda'(c10::IValue)&&)::'lambda'(c10::tagged_capsule<ConvPackedParamsBase<2> >, c10::IValue&&)>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int register_conv_params<2>()::'lambda'(c10::intrusive_ptr<ConvPackedParamsBase<2>, c10::detail::intrusive_target_default_null_type<ConvPackedParamsBase<2> > > const&), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::initializer_list<torch::arg>)::'lambda'(std::vector<c10::IValue, std::allocator<c10::IValue> >&)&, std::vector<c10::IValue, std::allocator<c10::IValue> >&>(std::__invoke_other, int register_conv_params<2>()::'lambda'(c10::IValue)&&, std::vector<c10::IValue, std::allocator<c10::IValue> >&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
    #11 0x125654e in torch::jit::Function::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) /pytorch/aten/src/ATen/core/function.h:62:5
    #12 0xec2c1c6 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_1::operator()(c10::StrongTypePtr const&, c10::IValue) const /pytorch/torch/csrc/jit/serialization/import.cpp:172:7
    #13 0xec2c1c6 in c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > std::__invoke_impl<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> >, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_1&, c10::StrongTypePtr, c10::IValue>(std::__invoke_other, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_1&, c10::StrongTypePtr&&, c10::IValue&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
    #14 0xec2b9a0 in std::enable_if<is_invocable_r_v<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> >, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_1&, c10::StrongTypePtr, c10::IValue>, c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > >::type std::__invoke_r<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> >, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_1&, c10::StrongTypePtr, c10::IValue>(torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_1&, c10::StrongTypePtr&&, c10::IValue&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:113:9
    #15 0xec2b8ae in std::_Function_handler<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue), torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_1>::_M_invoke(std::_Any_data const&, c10::StrongTypePtr&&, c10::IValue&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:291:9
    #16 0xeda0c63 in std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)>::operator()(c10::StrongTypePtr, c10::IValue) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:622:14
    #17 0xed8062d in torch::jit::Unpickler::readGlobal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_9::operator()() const /pytorch/torch/csrc/jit/serialization/unpickler.cpp:863:20
    #18 0xed8062d in void std::__invoke_impl<void, torch::jit::Unpickler::readGlobal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_9&>(std::__invoke_other, torch::jit::Unpickler::readGlobal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_9&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
    #19 0xed877c6 in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:545:7
    #20 0xed85b27 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:253:27
    #21 0xed85781 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:206:3
    #22 0xec9c7be in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) /pytorch/torch/csrc/jit/serialization/import_read.cpp:53:20
    #23 0xec2b168 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import.cpp:184:10
    #24 0xec27235 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize(c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:287:19
    #25 0xec25644 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:389:25
    #26 0xec2dcbe in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:325:10
    #27 0xec30659 in torch::jit::load(std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:485:10
    #28 0x8d8636 in LLVMFuzzerTestOneInput /load.cc:42:14
    #29 0x8d835d in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
    #30 0x8d8168 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
    #31 0x8d7d28 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
    #32 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #33 0x817add in _start (/load_afl+0x817add)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /pytorch/c10/util/Optional.h:222:17 in c10::optional_base<at::Tensor>::optional_base(c10::optional_base<at::Tensor> const&)
==1003193==ABORTING

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108418
Approved by: https://github.com/Skylion007
2023-09-06 15:45:50 +00:00
ebed490c2f [sdpa decomp] change sdpa decomp to be consistent with flash attention (#108608)
Summary: See the comment in code for the reasons of the change

Test Plan:
buck2 test executorch/examples/export/test:test_export --
test_vit_export_to_executorch

Differential Revision: D48992180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108608
Approved by: https://github.com/larryliu0820
2023-09-06 15:34:03 +00:00
6edd06441a Fix copy=True behavior for torch.asarray when device is not None/cpu (#108511)
Fixes #108408

See issue for details
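A rough illustration of the fixed behavior (assumes a CUDA device is available):

```python
import torch

x = torch.ones(3, device="cuda")
y = torch.asarray(x, device="cuda", copy=True)  # must return a real copy
y.zero_()
print(x)  # with the fix, x is unchanged even though device is not None/cpu
```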

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108511
Approved by: https://github.com/ysiraichi, https://github.com/rgommers, https://github.com/ezyang
2023-09-06 15:16:30 +00:00
aebb86fef7 Back out "Faster gc_count update for CUDACachingAllocator" (#108632)
Summary:
Original commit changeset: 1d04ae368fd8

Original Phabricator Diff: D48481557

block.pool is not guaranteed to be non-null

Test Plan: CI

Differential Revision: D49003756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108632
Approved by: https://github.com/houseroad
2023-09-06 14:57:41 +00:00
ca9f4222e1 Inductor cpp wrapper: fix codegen of positional args with default value (#108552)
Fixes https://github.com/pytorch/pytorch/issues/108323.
The cpp wrapper had a functionality regression on `llama` and `tnt_s_patch16_224` due to the recent support of scaled dot product flash attention in Inductor.

The schema of this OP is as follows:
```
- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
```

For `llama` and `tnt_s_patch16_224`, the OP is called in the below way, where the three positional args with default values are not passed (`float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False`).
```python
y = torch.ops.aten._scaled_dot_product_flash_attention.default(x0, x1, x2, scale = 0.125)
```

This PR fixes the cpp wrapper support for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108552
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-09-06 13:15:12 +00:00
60bd30ee0b [inductor] Move AOTInductor runtime headers (#108564)
Summary: Move AOTInductor runtime header files into their own subdirectory, to separate them from the to-be-added libtorch C interface.

Reviewed By: frank-wei

Differential Revision: D48905038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108564
Approved by: https://github.com/frank-wei
2023-09-06 11:50:41 +00:00
b60273b88a [MPS] Pixel shuffle unshuffle support (#99306)
Fixes #83196

Now, the MPS implementation is blazingly fast.

I have several questions on improving this PR, though:

1. I copied code from `test_nn.py`. Is there a better way to test this?
2. I decided to use `usePixelShuffleOrder:YES`. Am I right performance-wise? According to the docs:
```
`usePixelShuffleOrder` can be
used to control how the data within spatial blocks is ordered in the
`depthAxis` dimension: with `usePixelShuffleOrder=YES` the values within the
spatial blocks are stored contiguously within the `depthAxis` dimension whereas
otherwise they are stored interleaved with existing values in the `depthAxis` dimension.
```
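For reference, a minimal round-trip check of the newly supported ops (assumes a machine where the MPS backend is available):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 8, 8, device="mps")
y = F.pixel_shuffle(x, upscale_factor=2)      # (1, 16, 8, 8) -> (1, 4, 16, 16)
z = F.pixel_unshuffle(y, downscale_factor=2)  # back to (1, 16, 8, 8)
assert torch.equal(x.cpu(), z.cpu())          # pure permutation, exact round trip
```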

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99306
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-09-06 09:11:39 +00:00
ca2cdb3009 [DeviceMesh] Minor docstring update for init_device_mesh and rename variables (#108391)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108391
Approved by: https://github.com/wanchaol
2023-09-06 08:27:11 +00:00
3fe8417643 [PyTorch] Add the lazy init call for p2p access function (#1991) (#108589)
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1991

Test Plan: sandcastle

Reviewed By: zdevito

Differential Revision: D48939723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108589
Approved by: https://github.com/zdevito
2023-09-06 05:52:56 +00:00
49aa8d19dd [DTensor] Replace usage of compute_local_offset by compute_local_shape_and_global_offset (#108547)
This PR removes four usages of compute_local_offset() in the PyTorch repo and replaces them with the new API compute_local_shape_and_global_offset().

We will remove the compute_local_offset() API in the next diff, once the remaining internal usages are migrated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108547
Approved by: https://github.com/wanchaol
2023-09-06 04:53:44 +00:00
ce4967ad18 [vision hash update] update the pinned vision hash (#108611)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108611
Approved by: https://github.com/pytorchbot
2023-09-06 03:55:17 +00:00
3b92ef814d Force synced KJT to trace unbacked SymInt (#107788)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107788
Approved by: https://github.com/voznesenskym
2023-09-06 03:18:26 +00:00
c8e72a4a5c Improve mem efficiency of constant folding (#108421)
A couple of changes to make it more efficient:

- Because we are replacing nodes that only have a single value, store only that single value instead of the whole tensor for node replacement.
- torch.fx.Interpreter keeps a Tensor alive in the env as long as it has remaining uses. That applies even to output uses, but we are not going to constant-fold those. Instead of using the last use for garbage collection, use the last non-output use (see the sketch below).

If reviewers would prefer I ghstack this because of the code movement, let me know.

Fix for https://github.com/pytorch/pytorch/issues/108388
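A minimal sketch of the last-non-output-use idea (not the actual Inductor code; the model and names here are illustrative):

```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x):
        a = torch.full((4,), 2.0)  # constant-producing op, foldable
        b = a + 1                  # also foldable: all inputs are constants
        return x * b, b            # b is used both by an op and as an output

gm = fx.symbolic_trace(M())

# Map each node to its last *non-output* user, so a folded value can be
# garbage-collected after that user runs instead of being kept alive by
# the graph output.
last_non_output_use = {}
for node in gm.graph.nodes:
    if node.op == "output":
        continue
    for inp in node.all_input_nodes:
        last_non_output_use[inp] = node
```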

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108421
Approved by: https://github.com/jansel
2023-09-06 02:19:30 +00:00
28c5b62210 [inductor] Use empty_strided to create output tensors when testing AOTInductor (#108364)
Summary: This will fix 3 fail_accuracy failures in HF.

Test Plan:
```
python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export-aot-inductor --only  T5Small
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108364
Approved by: https://github.com/angelayi
ghstack dependencies: #108412
2023-09-06 02:04:32 +00:00
d494b923a9 [pytorch-vulkan] aten::rand_like (#108086)
Summary: Before implementing `aten::randn_like` as requested (T152843033), I think it is worth extending `aten::rand_like` from the existing `aten::uniform`, since they are so similar.

Test Plan:
```
[ttingchulin@6945.od /data/sandcastle/boxes/fbsource (rand_like)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*<test>*" eg.  -- --gtest_filter="*rand_like*"
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN      ] VulkanAPITest.rand_like
[       OK ] VulkanAPITest.rand_like (136 ms)
[----------] 1 test from VulkanAPITest (136 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (136 ms total)
[  PASSED  ] 1 test.

[ttingchulin@6945.od /data/sandcastle/boxes/fbsource (rand_like)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*<test>*" eg.  -- --gtest_filter="*uniform*"
Building: finished in 0.1 sec (100%) 329/329 jobs, 0/329 updated
  Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *uniform*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform
[       OK ] VulkanAPITest.uniform (131 ms)
[----------] 1 test from VulkanAPITest (131 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (131 ms total)
[  PASSED  ] 1 test.

[ttingchulin@6945.od /data/sandcastle/boxes/fbsource (rand_like)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin
ALL PASS
```

Differential Revision: D48710273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108086
Approved by: https://github.com/yipjustin
2023-09-06 01:25:03 +00:00
d471eaeb1d fix inline_container.cc inplace loading (#108573)
Summary:
bypass-github-pytorch-ci-checks
bypass-github-export-checks
force-merge-on-github

Differential Revision: D48971847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108573
Approved by: https://github.com/wqfish
2023-09-06 00:02:42 +00:00
ff28b4b908 Fix dynamo benchmark config --print-graph-breaks (#108584)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108584
Approved by: https://github.com/anijain2305
2023-09-05 23:31:43 +00:00
cyy
bae14b3d9f Update clang7 CI jobs to clang9 (#108339)
This PR updates the remaining clang7 CI job to clang9. However, I have no permission to push the new Docker image, so Android CI tests would fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108339
Approved by: https://github.com/ezyang
2023-09-05 22:46:47 +00:00
c99a70c8df [export] Fix dict.get() to dict.setdefault() for param lookup. (#108587)
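The behavioral difference behind the fix, as a minimal illustration:

```python
params = {}

lst = params.get("weight", [])         # returns the default but does not store it
lst.append(1)
print(params)                          # {}  -- the result of the lookup is lost

lst = params.setdefault("weight", [])  # stores the default on first lookup
lst.append(1)
print(params)                          # {'weight': [1]}
```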

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108587
Approved by: https://github.com/angelayi
2023-09-05 22:08:51 +00:00
eab57145ab fix matrix_power documentation bug (#108585)
The torch.linalg.matrix_power documentation suggests using the formula
`matrix_power(torch.linalg.solve(A, B), n) == matrix_power(A, -n)  @ B`
to avoid negative matrix powers. But the ordering of the left side is not correct. This patch fixes it to:
`torch.linalg.solve(matrix_power(A, n), B) == matrix_power(A, -n)  @ B`
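A quick numerical check of the corrected identity:

```python
import torch

A = torch.randn(3, 3) + 3 * torch.eye(3)  # well-conditioned, invertible
B = torch.randn(3, 2)
n = 2

lhs = torch.linalg.solve(torch.linalg.matrix_power(A, n), B)
rhs = torch.linalg.matrix_power(A, -n) @ B
assert torch.allclose(lhs, rhs, atol=1e-5)
```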

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108585
Approved by: https://github.com/lezcano
2023-09-05 22:08:46 +00:00
208fd1cb84 [RFC] Somewhat BC breaking: make checkpoint_wrapper default to NO_REENTRANT (#108435)
We should use NO_REENTRANT. There are a lot of users of this API, but
it is in a prototype state, so it should be fine to change.
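A minimal sketch of the affected API (import path as of this change; passing checkpoint_impl explicitly pins the behavior regardless of the default):

```python
import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper,
)

layer = torch.nn.Linear(16, 16)
wrapped = checkpoint_wrapper(layer, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
loss = wrapped(torch.randn(2, 16, requires_grad=True)).sum()
loss.backward()  # recomputes the forward under the no-reentrant implementation
```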

Differential Revision: [D48898148](https://our.internmc.facebook.com/intern/diff/D48898148/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108435
Approved by: https://github.com/awgu
ghstack dependencies: #108032, #108033
2023-09-05 21:43:41 +00:00
db6d09c086 [RFC][FSDP] Don't move ignored params / buffers to device (#108033)
Since these are ignored by FSDP, don't move them.

Differential Revision: [D48727044](https://our.internmc.facebook.com/intern/diff/D48727044/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108033
Approved by: https://github.com/awgu
ghstack dependencies: #108032
2023-09-05 21:43:41 +00:00
3334ec3a00 [RFC] Don't materialize ignored modules for FSDP (#108032)
Per title. This seems needed for cases where I have a large embedding
I want to separately manage, but FSDP would initialize it and thus consume the
memory.

Currently the interaction with torchdistX materialize_module is not tested;
this can be done as follow-up work.
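A minimal sketch of the scenario (assumes torch.distributed is already initialized; the model is illustrative):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(1_000_000, 128)  # separately managed
        self.proj = torch.nn.Linear(128, 128)

model = Model()
# With this change, FSDP leaves the ignored embedding alone: it is neither
# sharded nor materialized/moved to device by FSDP.
fsdp_model = FSDP(model, ignored_modules=[model.emb])
```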

Differential Revision: [D48722046](https://our.internmc.facebook.com/intern/diff/D48722046/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108032
Approved by: https://github.com/awgu
2023-09-05 21:43:41 +00:00
fee9fc1df0 [pytorch] Update docstring for FSDP.set_state_dict_type (#103864)
Summary: I noticed optim_state_dict_config was missing from the Args section

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D46670165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103864
Approved by: https://github.com/rohan-varma, https://github.com/fegin, https://github.com/fduwjj
2023-09-05 21:43:31 +00:00
64ad16a5e1 [XNNPACK] Enable XX kernels (#108440)
Summary: Enables copy, pad, fill, etc. kernels in the XNNPACK library. This shouldn't have much of a size implication.

Test Plan: CI

Differential Revision: D48915384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108440
Approved by: https://github.com/Skylion007
2023-09-05 21:37:24 +00:00
66af4f6ec7 [HSDP] Add device_mesh to FSDP kwarg and add dtensor state_dict support for HSDP (#107533)
This PR:
1) Add a device_mesh kwarg to FSDP. Remove init_device_mesh() from _runtime_utils.py, as the device_mesh is now passed in by the user as a kwarg.
2) Make the use_dtensor flag for state_dict_config and optim_state_dict_config private. If a device_mesh is used with a sharded model/optim state dict, the _use_dtensor flag is set to True and the model/optim state dict returns a DTensor state_dict. Otherwise, the _use_dtensor flag is set to False and the model/optim state dict returns a sharded_tensor state_dict.
3) Update _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP returning a 2D DTensor state_dict (a rough usage sketch follows below).
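A rough usage sketch, assuming an already-initialized 8-rank job (at the time of this PR, init_device_mesh lived under torch.distributed._tensor):

```python
import torch
from torch.distributed._tensor import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# 2-D mesh for HSDP: replicate across the first dim, shard across the second.
mesh = init_device_mesh("cuda", (2, 4))
model = torch.nn.Linear(16, 16).cuda()
fsdp_model = FSDP(
    model,
    device_mesh=mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```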

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol
2023-09-05 21:21:21 +00:00
b1729d8bbe Fix doc preview page url at CONTRIBUTING.md (#108580)
The URL for previewing documentation directly on a PR has changed, and CONTRIBUTING.md became outdated. There is also a minor fix to a non-existent document URL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108580
Approved by: https://github.com/svekars, https://github.com/kit1980
2023-09-05 20:17:55 +00:00
fac7a1f730 fix issue with lift_fresh_copy when using export + compile (#108243)
Fixes https://github.com/pytorch/pytorch/issues/105327. The problem is that `lift_fresh_copy()`'s functionalization implementation currently assumes that the input is never functional. This is apparently too limiting: consider "user" code like this (which can potentially come from exporting a model and then running compile on the resulting graph):
```
tensor_constant0 = torch.tensor(2)
lift_fresh = torch.ops.aten.lift_fresh_copy.default(tensor_constant0)
```

When we run this through AOTAutograd, the first call (torch.tensor(2)) will **already** be lifted into a functional tensor wrapper - so the `lift_fresh_copy` call doesn't need to do any "lifting" anymore - it just needs to do a clone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108243
Approved by: https://github.com/albanD
ghstack dependencies: #108081, #108235
2023-09-05 20:02:35 +00:00
da914aed21 error when using _dynamo.optimize_ddp=True and _inductor.keep_output_stride=False together (#108235)
From talking to @wconstab, we agreed that because of the way DDPOptimizer is written, it is (sort of) incompatible with inductor's `keep_output_stride=False` optimizations (and will cause silent correctness problems if you use them together). Added an assertion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108235
Approved by: https://github.com/wconstab
ghstack dependencies: #108081
2023-09-05 20:02:35 +00:00
def33d4d7a Fix inductor <> ddp_optimizer issue (#108081)
@wconstab pointed out that inductor found a graph with 6 input mutations and only 1 output, and seemed to be (incorrectly) chopping off the first "6" outputs from the graph (even though there is only 1). It looks like this is because:

(1) AOTAutograd has special handling for input mutations in inference vs. training graphs. In a training graph, whenever AOTAutograd sees an input mutation, it will add an **extra** output to the graph, corresponding to the updated input (and then at runtime, it will grab the updated input, and perform the actual mutation outside of the graph).

In inference, AOTAutograd is smarter and can leave the input mutations directly in the graph for inductor to optimize (doing this in training is harder). In inference, AOTAutograd will **not** add any extra graph outputs for input mutations.

It looks like inductor was unconditionally assuming that input mutations counted as extra outputs in the graph, which is wrong for the inference case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108081
Approved by: https://github.com/wconstab
2023-09-05 20:02:35 +00:00
ae8eb7a3f9 Use global variables to register the return_types namedtuples (#107000)
Fixes #69221

@pytorchbot label "topic: not user facing"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107000
Approved by: https://github.com/zou3519
2023-09-05 20:00:29 +00:00
d64e1c5f9d Fix error message concatenation (#108581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108581
Approved by: https://github.com/mikaylagawarecki
2023-09-05 19:46:52 +00:00
7cdfc38433 [inductor] Update how AOTInductor resizes output tensors (#108412)
Summary: Improve https://github.com/pytorch/pytorch/pull/107848 so that no resize_ is needed for output tensors when exiting the main function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108412
Approved by: https://github.com/jansel
2023-09-05 19:33:26 +00:00
1b76a5c24b Revert "Use std::filesystem in c10 tempfile and tempdir (#106656)"
This reverts commit 7b91f762b65ea250b87aaa2e2b67e429a9d29f16.

Reverted https://github.com/pytorch/pytorch/pull/106656 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing internal iOS build.  This was missed by the periodic mobile build, I think ([comment](https://github.com/pytorch/pytorch/pull/106656#issuecomment-1707187814))
2023-09-05 19:22:56 +00:00
a9a6423261 Revert "[export] Copy gm before calling PassManager" for test or build failures (#108441)
Test Plan: CI

Differential Revision: D48916322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108441
Approved by: https://github.com/cccclai
2023-09-05 19:21:01 +00:00
0b44fdfaec fix use_deterministic_algorithms docstring (#108551)
I fixed an error in the example.
`k` in `torch.Tensor.kthvalue(k)` is 1-indexed, so `torch.randn(10, device='cuda').kthvalue(0)` should be `torch.randn(10, device='cuda').kthvalue(1)`.
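The corrected example in context (assumes a CUDA device; kthvalue has no deterministic CUDA implementation, so this raises, which is what the docstring demonstrates):

```python
import torch

torch.use_deterministic_algorithms(True)
# k is 1-indexed: k=1 selects the smallest element.
torch.randn(10, device="cuda").kthvalue(1)  # raises RuntimeError
```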
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108551
Approved by: https://github.com/mikaylagawarecki
2023-09-05 18:44:23 +00:00
23e8a11fef [c10d] Introduce TCPStore client metrics collection. (#108348)
We collect timing and counts for every operation.
They are accessible from Python via TCPStore::collect_client_counters.
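A rough sketch (the counter-collection call is named after the C++ method in this commit; the exact Python binding may differ):

```python
import torch.distributed as dist

# single-process TCPStore acting as both server and client
store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)
store.set("key", "value")
_ = store.get("key")
print(store.collect_client_counters())  # per-operation timings and counts
```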
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108348
Approved by: https://github.com/XilunWu
2023-09-05 18:36:27 +00:00
4a472d9e95 [jit] Verify stack size and index to prevent off-by-one error (#108413)
Hi!

I've been fuzzing different pytorch modules with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch), and found a heap-buffer-overflow error caused by an incorrect loop condition in torch::jit::unpickler.cpp. This bug can be triggered by the `torch::distributed::rpc::deserializeRequest()` method in the RPC module.

Docker to reproduce found error: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).

### PoC for deserializeRequest():
[crash-001e49dcd3a3c439e2b1273d580049309e052bdd.txt](https://github.com/pytorch/pytorch/files/12498999/crash-001e49dcd3a3c439e2b1273d580049309e052bdd.txt)

### ASAN report
```
==339982==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x619000086a88 at pc 0x000000996fa4 bp 0x7fffffff9c50 sp 0x7fffffff9c48
READ of size 4 at 0x619000086a88 thread T0
    #0 0x996fa3 in c10::IValue::IValue(c10::IValue const&) /pytorch/aten/src/ATen/core/ivalue.h:226:33
    #1 0xdf99a38 in std::pair<c10::impl::DictIterator<c10::IValue, c10::IValue, ska_ordered::detailv3::sherwood_v3_table<std::pair<c10::IValue, c10::IValue>, c10::IValue, c10::detail::DictKeyHash, ska_ordered::detailv3::KeyOrValueHasher<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyHash>, c10::detail::DictKeyEqualTo, ska_ordered::detailv3::KeyOrValueEquality<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyEqualTo>, std::allocator<std::pair<c10::IValue, c10::IValue> >, std::allocator<ska_ordered::detailv3::sherwood_v3_entry<std::pair<c10::IValue, c10::IValue> > > >::templated_iterator<std::pair<c10::IValue, c10::IValue> > >, bool> c10::Dict<c10::IValue, c10::IValue>::insert_or_assign<c10::IValue&, c10::IValue&>(c10::IValue&, c10::IValue&) const /pytorch/aten/src/ATen/core/Dict_inl.h:136:5
    #2 0xed966c7 in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:490:14
    #3 0xed94377 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:253:27
    #4 0xed93fd1 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:206:3
    #5 0xece09ee in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20
    #6 0xece0dac in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10
    #7 0x1006a4e7 in torch::distributed::rpc::PythonRemoteCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/python_remote_call.cpp:40:16
    #8 0x101d02e1 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:111:14
    #9 0x8db738 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27
    #10 0x8d84cd in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
    #11 0x8d82d8 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
    #12 0x8d7e98 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
    #13 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #14 0x817c4d in _start (/message_deserialize_afl+0x817c4d)

0x619000086a88 is located 8 bytes to the right of 1024-byte region [0x619000086680,0x619000086a80)
allocated by thread T0 here:
    #0 0x8d54ca in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3

SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:226:33 in c10::IValue::IValue(c10::IValue const&)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108413
Approved by: https://github.com/ezyang
2023-09-05 18:28:17 +00:00
a74f50d524 torch.compile-functorch interaction: update docs (#108130)
Doc Preview: https://docs-preview.pytorch.org/pytorch/pytorch/108130/torch.compiler_faq.html#torch-func-works-with-torch-compile-for-grad-and-vmap-transforms

Will also cherry-pick this for release branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108130
Approved by: https://github.com/zou3519
2023-09-05 18:24:08 +00:00
42f94d7e9f add Half support for maxpool on CPU (#98819)
### Testing
Single socket (28 cores):

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 4.12895 | 6.9669 | 5.30297 | 0.55775 | 1.98917 | 0.72233
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 0.85093 | 1.88813 | 1.38063 | 5.5742 | 36.5086 | 10.58552
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 22.37212 | 37.90383 | 30.94482 | 6.85868 | 10.6116 | 3.9993
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 5.41658 | 4.71098 | 4.66578 | 6.69875 | 14.7171 | 5.1167
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 10.69831 | 18.0468 | 13.71657 | 2.61192 | 4.96172 | 1.68635
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.52637 | 2.0096 | 2.0055 | 2.60314 | 7.2093 | 2.49843
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.47605 | 0.88398 | 0.65326 | 0.06525 | 0.115489 | 0.0674
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.10902 | 0.25293 | 0.157475 | 0.11386 | 0.53319 | 0.17836

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 90.9809 | 163.473 | 126.1276 | 6.57721 | 41.40833 | 11.82505
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 9.88405 | 38.39137 | 29.62069 | 7.10636 | 36.97535 | 11.0525
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 476.782 | 855.4769 | 648.2248 | 46.6488 | 219.2586 | 67.10599
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 80.29271 | 91.33854 | 87.80345 | 48.81692 | 203.9974 | 63.39004
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 235.2113 | 419.0799 | 315.4284 | 20.6049 | 107.1524 | 32.39169
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 29.47653 | 33.54905 | 32.82823 | 22.59674 | 98.5586 | 30.05763
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 7.90684 | 13.9208 | 10.03272 | 0.23725 | 1.35269 | 0.41728
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.33638 | 3.36894 | 2.64635 | 0.26535 | 1.244 | 0.38895
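A minimal example of what this enables (shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 32, 32, dtype=torch.half, requires_grad=True)  # CPU fp16
y = F.max_pool2d(x, kernel_size=3, stride=1)
y.sum().backward()  # fp16 backward on CPU is covered as well
```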

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98819
Approved by: https://github.com/mingfeima, https://github.com/mikaylagawarecki
2023-09-05 18:23:41 +00:00
1e0e55c504 [xplat][buck2][typing] Fix typechecker issue (#108525)
Test Plan: CI

Reviewed By: JakobDegen

Differential Revision: D48817210

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108525
Approved by: https://github.com/osalpekar
2023-09-05 18:18:45 +00:00
8da04e023e Revert "Eliminate c10::guts::to_string (#108480)"
This reverts commit 4146be192ead477360a2763c5005e46a9485c3bf.

Reverted https://github.com/pytorch/pytorch/pull/108480 on behalf of https://github.com/huydhn due to Sorry for reverting this, but this is needed to keep trunk green after https://github.com/pytorch/pytorch/pull/108479 was reverted.  Both will need to be relanded ([comment](https://github.com/pytorch/pytorch/pull/108480#issuecomment-1707067595))
2023-09-05 18:04:53 +00:00
5b31a41841 Revert "[NCCL][CUDA][CUDA Graphs] Flush enqueued work before starting a graph capture (#104487)"
This reverts commit db63bf3d7e5eef320dde9c2d4b7976eb5fcddbd6.

Reverted https://github.com/pytorch/pytorch/pull/104487 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it is failing internal build ([comment](https://github.com/pytorch/pytorch/pull/104487#issuecomment-1707055346))
2023-09-05 17:57:19 +00:00
29f1097891 [dynamo] Reduce cache size limit to 8 (#108526)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108526
Approved by: https://github.com/ezyang
2023-09-05 17:56:26 +00:00
03aac0bff6 add input check at the beginning for C++ API interpolate (#108506)
Fixes https://github.com/pytorch/pytorch/issues/108346
Add an input check at the beginning of the C++ API `interpolate`, raising an error when an invalid input is given.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108506
Approved by: https://github.com/ezyang
2023-09-05 17:56:17 +00:00
9f71a4ebd4 Revert "Simplify c10::string_view implementation (#108479)"
This reverts commit ce03b78a8f463139c87a4bf42e8f37ebabca5b0f.

Reverted https://github.com/pytorch/pytorch/pull/108479 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it is failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/108479#issuecomment-1707033082))
2023-09-05 17:39:54 +00:00
e8005781be Softmax in functorch example fixed (#107988)
The output of softmax was overwritten by the output of fc2 on the following line, so the output of the softmax was never used. Now the final output of the model includes the softmax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107988
Approved by: https://github.com/zou3519
2023-09-05 17:18:48 +00:00
e787708ad7 [jit] Validate statement parsing during class deserialization (#108417)
Hi!

I've been fuzzing different pytorch modules with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch), and found a SEGV that occurs during class deserialization in the JIT module.

Docker to reproduce found error: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).

### PoC:
[crash-bfbab61bf86755aa712bb978e26057ae76d75fe4.txt](https://github.com/pytorch/pytorch/files/12499228/crash-bfbab61bf86755aa712bb978e26057ae76d75fe4.txt)

### ASAN report
```
==1003115==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x00000db61680 bp 0x7fffffff5e30 sp 0x7fffffff5a60 T0)
==1003115==The signal is caused by a READ memory access.
==1003115==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0xdb61680 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::retain_() /pytorch/c10/util/intrusive_ptr.h:265:54
    #1 0xdb6721c in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::intrusive_ptr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/c10/util/intrusive_ptr.h:354:5
    #2 0xdb6721c in torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/torch/csrc/jit/frontend/tree_views.h:270:49
    #3 0xdbf73b9 in torch::jit::Maybe<torch::jit::Expr>::get() const /pytorch/torch/csrc/jit/frontend/tree_views.h:212:12
    #4 0xecac171 in torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool) /pytorch/torch/csrc/jit/serialization/import_source.cpp:454:64
    #5 0xeca0ada in torch::jit::SourceImporterImpl::importNamedType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::ClassDef const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:288:5
    #6 0xeca7422 in torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:140:5
    #7 0xeca295c in torch::jit::SourceImporterImpl::resolveType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:261:10
    #8 0xdd03bc8 in torch::jit::ScriptTypeParser::parseTypeFromExpr(torch::jit::Expr const&) const /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:238:24
    #9 0xdcfc9b6 in torch::jit::ScriptTypeParser::parseType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:312:10
    #10 0xecbac43 in torch::jit::SourceImporter::loadType(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import_source.cpp:786:27
    #11 0xec2b5d3 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0::operator()(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import.cpp:146:33
    #12 0xec2b5d3 in c10::StrongTypePtr std::__invoke_impl<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(std::__invoke_other, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
    #13 0xec2b4a0 in std::enable_if<is_invocable_r_v<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>, c10::StrongTypePtr>::type std::__invoke_r<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:113:9
    #14 0xec2b3a0 in std::_Function_handler<c10::StrongTypePtr (c10::QualifiedName const&), torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0>::_M_invoke(std::_Any_data const&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:291:9
    #15 0xec95f7c in std::function<c10::StrongTypePtr (c10::QualifiedName const&)>::operator()(c10::QualifiedName const&) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:622:14
    #16 0xed78721 in torch::jit::Unpickler::readGlobal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/unpickler.cpp:844:9
    #17 0xed87821 in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:520:7
    #18 0xed85b27 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:253:27
    #19 0xed85781 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:206:3
    #20 0xec9c7be in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) /pytorch/torch/csrc/jit/serialization/import_read.cpp:53:20
    #21 0xec2b168 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import.cpp:184:10
    #22 0xec27235 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize(c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:287:19
    #23 0xec25644 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:389:25
    #24 0xec2dcbe in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:325:10
    #25 0xec30659 in torch::jit::load(std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:485:10
    #26 0x8d8636 in LLVMFuzzerTestOneInput /load.cc:42:14
    #27 0x8d835d in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
    #28 0x8d8168 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
    #29 0x8d7d28 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
    #30 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #31 0x817add in _start (/load_afl+0x817add)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /pytorch/c10/util/intrusive_ptr.h:265:54 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::retain_()
==1003115==ABORTING

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108417
Approved by: https://github.com/ezyang
2023-09-05 17:09:25 +00:00
96d74073f8 Horizontally fuse input concatenation (#108115)
Fixes https://github.com/pytorch/pytorch/issues/106688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108115
Approved by: https://github.com/jansel
2023-09-05 16:55:32 +00:00
6a1a893f8f Bump version 2.1.0 -> 2.2.0 (#108156)
Same as: https://github.com/pytorch/pytorch/pull/95790

### <samp>🤖 Generated by Copilot at 50063bb</samp>

> _`PyTorch` version up_
> _Nightly and release builds change_
> _Autumn of progress_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108156
Approved by: https://github.com/osalpekar, https://github.com/albanD
2023-09-05 15:56:23 +00:00
a16b0aa26a [dynamo] Fix return type of Tensor.shape (#108240)
This should be `torch.Size` but was returning a plain tuple under dynamo.
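
A minimal illustration of the difference, relying on `numel()`, a method `torch.Size` has but a plain tuple lacks:

```python
import torch

def fn(x):
    # torch.Size is a tuple subclass; code like this breaks if dynamo
    # returns a plain tuple instead.
    return x.shape.numel()

compiled = torch.compile(fn)
assert compiled(torch.ones(2, 3)) == 6
```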

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108240
Approved by: https://github.com/ezyang
ghstack dependencies: #108239
2023-09-05 14:58:39 +00:00
7c931f2491 [dynamo] Add dynamic shapes support to torch.Size.numel (#108239)
Currently numel only supports static shapes, but this expands it to support
generating symbolic arithmetic into the graph. e.g.
```
# x.size().numel with x.size() = [s0, 1, s1]
size = l_x_.size()
getitem = size[0]
getitem_2 = size[2];  size = None
mul = getitem * getitem_2;  getitem = getitem_2 = None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108239
Approved by: https://github.com/ezyang
2023-09-05 14:58:39 +00:00
b2c6383f44 [pytorch] Small fix to docstring of FSDP.optim_state_dict_to_load (#108383)
Summary: Fix ordering of args in docstring

Test Plan: N/A

Differential Revision: D48889668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108383
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wz337
2023-09-05 14:56:56 +00:00
0ef2556351 Update sparse_funcs to include primtorch types (#107421)
Fixes #107335.

A few issues have been identified while enabling this test and filed:
https://github.com/pytorch/pytorch/issues/105986
https://github.com/pytorch/pytorch/issues/108204
https://github.com/pytorch/pytorch/issues/108205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107421
Approved by: https://github.com/ezyang
2023-09-05 14:34:48 +00:00
e27ddd2cee s390x SIMD: update abs() function for complex numbers (#108515)
It propagated NaNs when it should not have.
It is also replaced with std::abs due to a precision mismatch.

This change fixes test_python_ref__refs_abs_cpu_complex32 test in test/test_ops.py.
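
One case of the kind this fixes, assuming C99 hypot semantics where |inf + NaN·j| is inf rather than NaN:

```python
import torch

z = torch.tensor([complex(float("inf"), float("nan"))], dtype=torch.complex64)
# std::abs yields inf here; a NaN-propagating implementation wrongly returns nan.
print(torch.abs(z))  # tensor([inf])
```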

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108515
Approved by: https://github.com/ezyang
2023-09-05 14:20:00 +00:00
0a8296da7d ReduceLROnPlateau: inherit LRScheduler (#108464)
Fixes #106767
Fixes #104687
Fixes #49369
Fixes #63143
Fixes #50715
Fixes #21981
Fixes #2829

Hoping this is just a simple fix, but we'll see.
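
The user-visible effect, sketched under the assumption that the inheritance change is all that generic scheduler code needs:

```python
import torch

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt)
# With this change, ReduceLROnPlateau is a real LRScheduler subclass,
# so generic code that type-checks schedulers accepts it.
print(isinstance(sched, torch.optim.lr_scheduler.LRScheduler))  # True
```
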
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108464
Approved by: https://github.com/ezyang
2023-09-05 13:48:54 +00:00
cyy
efc7c366f4 Remove auto_gil.h (#108492)
auto_gil.h has been deprecated for a long time. We can switch to pybind11.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108492
Approved by: https://github.com/Skylion007
2023-09-05 08:26:13 +00:00
cyy
a9d9803bfd Enable MKLDNN ASAN tests (#108478)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108478
Approved by: https://github.com/ezyang
2023-09-05 08:22:13 +00:00
cyy
468660d03e use std::initializer_list for vector literals (#108504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108504
Approved by: https://github.com/Skylion007
2023-09-05 08:17:13 +00:00
3d2938b1fc [inductor] Add an aot_inductor class in inductor config (#108369)
Summary: Introduce an aot_inductor class to group AOTInductor specific configs

Differential Revision: D48880684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108369
Approved by: https://github.com/frank-wei
2023-09-05 07:11:19 +00:00
ff38c0e2f9 [Inductor] Make aot-inductor work with pip installed torch (#108319)
It seems pip-installed torch is built with `-D_GLIBCXX_USE_CXX11_ABI=0`, and it fails inductor/test_aot_inductor.py with:
```
ERROR: test_with_offset (__main__.AotInductorTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2388, in wrapper
    method(*args, **kwargs)
  File "/home/ubuntu/src/pytorch/test/inductor/test_aot_inductor.py", line 112, in test_with_offset
    actual = AOTInductorModelRunner.run(model, example_inputs, expected)
  File "/home/ubuntu/src/pytorch/test/inductor/test_aot_inductor.py", line 63, in run
    optimized, exported, output_tensors, output_spec = AOTInductorModelRunner.load(
  File "/home/ubuntu/src/pytorch/test/inductor/test_aot_inductor.py", line 50, in load
    optimized = torch.utils.cpp_extension.load_inline(
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1635, in load_inline
    return _jit_compile(
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1736, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2136, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 565, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1173, in create_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
ImportError: /tmp/torchinductor_ubuntu/cqrzlw3yizrsx2us5bnjosr4tzct24h6qwb6xbbx654fxvdupoub/cr6ndwlgeorw34etxhwvs547kbnftyxtwwrsmbdraa4hjeevsvji.so: undefined symbol: _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
```
I'm not sure how to test this in CI, maybe run tests with prebuilt wheels?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108319
Approved by: https://github.com/ezyang
2023-09-04 19:57:38 +00:00
159ce22694 [rpc] Fix assertion on vector length during message parsing (#108414)
Hi!

I've been fuzzing different pytorch modules with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch), and found a heap buffer overflow that occurs during the Python object deserialization routine. The vector of `IValues` is verified to contain at least 3 elements, which are subsequently removed from it. The rest of the vector is passed further, where it is expected to contain at least one more element. The crash occurs on an empty vector.

Docker to reproduce found error: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).

### PoC:
[crash-6d634f38a76bfeaa1fffc9472e8ea7b88ee8e776.txt](https://github.com/pytorch/pytorch/files/12499089/crash-6d634f38a76bfeaa1fffc9472e8ea7b88ee8e776.txt)

### ASAN report
```
==339647==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x604000105388 at pc 0x000000c2b3bc bp 0x7fffffffb8d0 sp 0x7fffffffb8c8
READ of size 4 at 0x604000105388 thread T0
    #0 0xc2b3bb in c10::IValue::isString() const /pytorch/aten/src/ATen/core/ivalue.h:685:27
    #1 0xc2b3bb in c10::IValue::toStringRef[abi:cxx11]() const /pytorch/aten/src/ATen/core/ivalue_inl.h:2308:3
    #2 0x101ce65f in torch::distributed::rpc::SerializedPyObj::fromIValues(std::vector<c10::IValue, std::allocator<c10::IValue> >) /pytorch/torch/csrc/distributed/rpc/types.cpp:103:39
    #3 0x1006a7a0 in torch::distributed::rpc::PythonRemoteCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/python_remote_call.cpp:58:26
    #4 0x101d02e1 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:111:14
    #5 0x8db738 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27
    #6 0x8d84cd in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
    #7 0x8d82d8 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
    #8 0x8d7e98 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
    #9 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #10 0x817c4d in _start (/message_deserialize_afl+0x817c4d)

0x604000105388 is located 8 bytes to the left of 48-byte region [0x604000105390,0x6040001053c0)
allocated by thread T0 here:
    #0 0x8d54ca in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3

SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:685:27 in c10::IValue::isString() const
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108414
Approved by: https://github.com/ezyang
2023-09-04 19:32:15 +00:00
48286d34a4 Revert "Break graph on manual_seed. (#107594)"
This reverts commit 6ad5568cbc7122356b58789a1d3bcd16d5faf775.

Reverted https://github.com/pytorch/pytorch/pull/107594 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it has an import issue that breaks internal code ([comment](https://github.com/pytorch/pytorch/pull/107594#issuecomment-1705584405))
2023-09-04 18:00:37 +00:00
e08577aec5 Spelling fix (#108490)
Fixes spelling mistake: non-deterinistic -> non-deterministic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108490
Approved by: https://github.com/ezyang
2023-09-04 16:59:35 +00:00
51c2e22e94 When byteorder record is missing load as little endian by default (#108343)
Fixes #101688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108343
Approved by: https://github.com/mikaylagawarecki
2023-09-04 15:20:22 +00:00
7e878c9d10 Add decomposition for aten.take_along_dim (#108185)
xref #107875
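
A minimal sketch of the idea behind such a decomposition (not the exact code added here): broadcast everything except `dim`, then defer to `gather`. Assumes a non-negative `dim`.

```python
import torch

def take_along_dim_sketch(x, indices, dim):
    # Broadcast all dimensions except `dim`, then gather along `dim`.
    x_sizes, i_sizes = list(x.shape), list(indices.shape)
    x_sizes[dim] = i_sizes[dim] = 1
    common = list(torch.broadcast_shapes(tuple(x_sizes), tuple(i_sizes)))
    x_bc = x.broadcast_to(common[:dim] + [x.shape[dim]] + common[dim + 1:])
    idx_bc = indices.broadcast_to(common[:dim] + [indices.shape[dim]] + common[dim + 1:])
    return torch.gather(x_bc, dim, idx_bc)

x = torch.randn(3, 4)
idx = x.argsort(dim=1)
assert torch.equal(take_along_dim_sketch(x, idx, 1), torch.take_along_dim(x, idx, 1))
```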

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108185
Approved by: https://github.com/lezcano
2023-09-04 13:49:53 +00:00
cyy
4146be192e Eliminate c10::guts::to_string (#108480)
This PR replaces c10::guts::to_string with std::to_string. The major part of the changes is using void* as the optimizer state key, since the string is used only for serialization and pointers are more efficient hashing keys than strings.
Some other guts functions in the affected source files are also replaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108480
Approved by: https://github.com/Skylion007
2023-09-04 08:12:53 +00:00
06b173780d [dynamo] "TorchDynamo Cache Lookup" event: use C++ api (#108436)
**Background**: "TorchDynamo Cache Lookup" events appear in traces to indicate a dynamo cache lookup; it's useful to check when cache lookups are taking a long time. To add a profiler event, one can use the `torch.profiler.record_function` context manager, or the C++ equivalent. Previously, the python version was used; first, when the profiler was enabled, callbacks for record_function_enter and record_function_exit were registered; then those would be called before and after every cache lookup.

**This PR**: Instead of calling the python bindings for `torch.profiler.record_function`, directly call the C++ implementation. This simplifies a lot of the code for binding C/C++. It also improves performance; previously there was a lot of overhead in the "TorchDynamo Cache Lookup" event, making the event artificially take a long time. After this change the events now appear shorter, because there's less overhead in starting/stopping the event: in other words, the profiler no longer distorts the results as much.

**Performance results**:
I ran using the script below on a cpu-only 1.6GHz machine. I report the median time (from 100 measurements) of a "TorchDynamo Cache Lookup" event before and after this PR. I think it is reasonable to consider the difference to be due to a reduction in overhead.

<details>

<summary>Benchmarking script</summary>

```python
def fn(x, y):
    return (x * y).relu()

a, b = [torch.rand((4, 4), requires_grad=True) for _ in range(2)]

opt_fn = torch.compile(fn)

opt_fn(a, b)
opt_fn(a, b)

with torch.profiler.profile() as prof:
    opt_fn(a, b)
```

</details>

Median before PR: 198-228 us (median of 100, measured 5 times)
Median after PR: 27us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108436
Approved by: https://github.com/anijain2305, https://github.com/jansel
2023-09-04 04:37:26 +00:00
cyy
621463a3e6 Update libfmt submodule to 10.1.1 (#108431)
This PR updates libfmt to version 10.1.1. We also set the utf-8 source encoding earlier, before including third-party libraries on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108431
Approved by: https://github.com/Skylion007
2023-09-03 23:44:39 +00:00
cyy
ce03b78a8f Simplify c10::string_view implementation (#108479)
Remove unnecessary code in C++17

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108479
Approved by: https://github.com/Skylion007
2023-09-03 17:45:12 +00:00
cyy
aff7fdcb4c Add a missing argument (#108477)
Fix a tiny bug in string formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108477
Approved by: https://github.com/Skylion007
2023-09-03 16:42:27 +00:00
cc50e654d4 [aten decomp] Update sdpa decom (#108371)
Summary:
The earlier decomp routed the _flash* variant to the _math variant, and this
resulted in a failure during torch.export, for some reason that I
couldn't trace.

However, it seems that we should really have a decomp for
scaled_dot_product_attention, instead of
scaled_dot_product_flash_attention. Right?

This diff adds that. It also adds a test to check that a model exported
via two-stage export has the op decomposed. This test needs improvement
to figure out what the core aten opset is and check for anything that is
not inside it.

Test Plan:
test_model_exports_to_core_aten

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D48917461](https://our.internmc.facebook.com/intern/diff/D48917461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108371
Approved by: https://github.com/larryliu0820
2023-09-03 15:17:08 +00:00
ba9acbebfc [Doc] Update the dynamo deepdive doc (#108147)
With the new tool `depyf` to decompile bytecode into human-readable source code, understanding dynamo becomes much easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108147
Approved by: https://github.com/jansel
2023-09-03 13:08:13 +00:00
cyy
7b91f762b6 Use std::filesystem in c10 tempfile and tempdir (#106656)
This PR simplifies c10::TempFile and c10::TempDir. It also deletes Windows temp files in c10::~TempFile; this behavior is absent in the current version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106656
Approved by: https://github.com/ezyang
2023-09-03 13:03:10 +00:00
1b3dc05c3e Use contiguous() to handle noncontiguous outputs during elementwise decomposition (#108140)
Fixes https://github.com/pytorch/pytorch/issues/108218

Use the contiguous() API to handle noncontiguous outputs during elementwise decomposition.

With this change, the op decomposes properly (testcase from the bug):
```
graph():
    %arg0_1 : [#users=3] = placeholder[target=arg0_1]
    %abs_1 : [#users=1] = call_function[target=torch.ops.aten.abs.default](args = (%arg0_1,), kwargs = {})
    %floor : [#users=1] = call_function[target=torch.ops.aten.floor.default](args = (%abs_1,), kwargs = {})
    %sign : [#users=1] = call_function[target=torch.ops.aten.sign.default](args = (%arg0_1,), kwargs = {})
    %mul : [#users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%floor, %sign), kwargs = {})
    %sub : [#users=1] = call_function[target=torch.ops.aten.sub.Tensor](args = (%arg0_1, %mul), kwargs = {})
    return (sub,)
```
Output:
```
tensor([[ 0.2871,  0.7189,  0.7297],
        [ 0.8782, -0.4899,  0.7055]], device='hpu:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108140
Approved by: https://github.com/ezyang
2023-09-03 04:32:22 +00:00
e5548f8195 NT support for cat with dim > 0 when representable as jagged (#108428)
Used in SAM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108428
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #108361, #108370, #108362
2023-09-03 01:50:32 +00:00
76ccf6c770 NT support for narrow() on dim=0 (#108362)
Satisfies request here: https://github.com/pytorch/pytorch/issues/105913#issuecomment-1652249934
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108362
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #108361, #108370
2023-09-02 23:48:37 +00:00
01b662bafe [gen_operators_yaml] add arguments to control include_all_overloads (#108396)
Summary:
In SelectiveBuildOperator, we can specify the argument `include_all_overloads`. If True, all overloaded operators (for example, `aten::to.dtype_layout` and `aten::to.prim_Device` are considered overloads of `aten::to`) will be built and linked into the final binary. This can significantly increase the final binary size, which can be a deal breaker for on-device deployment.

In this diff, we make backward-compatible changes by adding the new arguments `--not-include-all-overloads-static-root-ops` and `--not-include-all-overloads-closure-ops`. When they are set, we set the `include_all_overloads` flag to False for static root ops and closure ops, and rely on the code analyzer to decide the actually-used overloaded operators.

Test Plan:
- unit test
```
buck test //xplat/caffe2/tools:gen_operators_yaml_test
```
- See test plan in D48771544 where we reduce the shared lib file `libmrengine.lib` from 16653072 bytes to 13686032 bytes.
- See detailed document: https://fburl.com/gdoc/mc93h6kb

Reviewed By: larryliu0820

Differential Revision: D48772302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108396
Approved by: https://github.com/larryliu0820
2023-09-02 17:37:36 +00:00
b9dfdc091b [AOTInductor][Reland] Proxy Executor for Extern Fallback kernels (#107279) (#108350)
Summary:

This is a prototype for running extern fallback kernels with a host side proxy executor.

Sample of generated cpp wrapper call:
```
        at::Tensor buf0;  // output buffer
        void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0};
        int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81};
        proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0);
```

- In my current implementation, the proxy executor interprets the raw pointers according to the op's schema.
This assumes that a custom op MUST have a valid schema registered with the Dispatcher. (I would like to validate this assumption.)
- I am using the callBoxed() API of the custom kernels. This is inevitable, as we wish to have a single call_function API for all possible custom kernels.

- These are all the input argument types I support so far:
       union Argument {
         # Bool value does not matter
         1: bool asNone;
         2: TensorArgument asTensor;
         3: list<TensorArgument> asTensors;
         5: i64 asInt;
         7: list<i64> asInts;
         8: double asFloat;
         9: list<double> asFloats;
         10: string asString;
         10.5: list<string> asStrings;
         11: SymIntArgument asSymInt;
         12: list<SymIntArgument> asSymInts;
         13: ScalarType asScalarType;
         14: MemoryFormat asMemoryFormat;
         15: Layout asLayout;
         16: Device asDevice;
         17: bool asBool;
         18: list<bool> asBools;
       }

- Need a policy for handling unpopulated arguments with default values. Here are the options, and the choice has BC implications:
1. Require the exported fx graph to explicitly populate default values if the user doesn't specify them.
2. Require the cpp wrapper to explicitly populate default values if the fx graph doesn't specify them.
3. Have the proxy executor look up default values from the op schema.

For fixing T162112344

Test Plan:
frontend:
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main

test:
 buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops

backend:
buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main

buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)'

Reviewed By: suo

Differential Revision: D48747417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350
Approved by: https://github.com/izaitsevfb
2023-09-02 17:14:10 +00:00
b9fc6d7ded [Dynamo] Update the implementation of _debug_get_cache_entry_list (#108335)
In https://github.com/pytorch/pytorch/pull/106673 , I created a private API `_debug_get_cache_entry_list` to help pull out cache entries from compiled functions.

Recently, I find that @anijain2305 commented in the code that this API should be revisited, and so I created this PR.

First, this API cannot be removed even if the cache entry becomes a first-class python class `torch._C._dynamo.eval_frame._CacheEntry`. The facts that `extra_index` is static and `get_extra_state` is inline static make them inaccessible elsewhere, so `_debug_get_cache_entry_list` is the only way for users to get all the cache entries from a code object.

Second, since the `torch._C._dynamo.eval_frame._CacheEntry` class is a python class, I simplified the C-side code and removed the need to create a namedtuple for this in the python code.

Third, I also added a small improvement: if the argument is a function, we automatically pass its `__code__` to the API.

The above change will slightly change the output, from a list of named tuples to a list of `torch._C._dynamo.eval_frame._CacheEntry` objects. I will update the corresponding docs that use this API.
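
A small usage sketch, assuming the API stays at its private location in `torch._dynamo.eval_frame`:

```python
import torch
from torch._dynamo.eval_frame import _debug_get_cache_entry_list

def fn(x):
    return x + 1

opt_fn = torch.compile(fn)
opt_fn(torch.ones(3))

# The function itself can now be passed; its __code__ is extracted
# automatically, and the entries are _CacheEntry objects.
entries = _debug_get_cache_entry_list(fn)
print(len(entries))  # typically 1 after a single compilation
```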

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108335
Approved by: https://github.com/jansel, https://github.com/anijain2305
2023-09-02 16:38:59 +00:00
de58600126 Improve docs for torch.unique dim argument (#108292)
Fixes #103142
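
For context, the semantics the docs now spell out: with `dim`, entire sub-tensors along that dimension are compared as single elements.

```python
import torch

x = torch.tensor([[1, 2], [1, 2], [3, 4]])
# dim=0 deduplicates whole rows, not individual values
print(torch.unique(x, dim=0))
# tensor([[1, 2],
#         [3, 4]])
```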

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108292
Approved by: https://github.com/albanD
2023-09-02 11:09:09 +00:00
cyy
0cc2f06aec [Reland] Improve MKL related logic in FindOpenMP.cmake (#104224)
Reland of PR #94924. The purpose of this PR is to deal with the complicated interactions between MKL and OpenMP.
There are two improvements:
1. It uses a flag to avoid infinite mutual recursion in calling find_package(MKL) and find_package(OpenMP) in some cases.
2. The logic of finding iomp5 is improved and now we can test  MKLDNN under ASAN.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104224
Approved by: https://github.com/malfet
2023-09-02 07:55:11 +00:00
ffc0c46092 [Quantization] Add metadata porting for nodes added by quantization (#107107)
Summary:
This diff adds metadata to Q-DQ nodes by inferring the
quantization intent from node annotations. Annotations on a node are
the way for the user to specify how a node or subgraph is supposed to be
quantized. We continue to use that information to copy metadata onto Q/DQ
nodes from the appropriate nodes.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D48488416](https://our.internmc.facebook.com/intern/diff/D48488416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107107
Approved by: https://github.com/jerryzh168
ghstack dependencies: #107105, #107106, #107899, #107900
2023-09-02 06:38:14 +00:00
cyy
d6a9c2b4b5 [BC BREAKING] Remove outdated python submodules (#108236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108236
Approved by: https://github.com/malfet
2023-09-02 06:24:20 +00:00
eb67c452c8 [Quant] Add DQ duplication pass (#107900)
Summary:
During the convert step, observers are first replaced by a Q-DQ pair. In some
scenarios, like the following, the output DQ has a fan-out:

                 ---> OP2 -> Q -> DQ
                /
OP -> Q -> DQ -
                \
                 ---> OP3 -> Q -> DQ

If either OP2 or OP3 is configured to be quantized, then its input
is expected to be quantized. In that case the quantized equivalent of a
pattern that the quantizer asked to be quantized should look like
[DQ -> {pattern} -> Q]. However, in a scenario like the above, where the DQ node
is shared between multiple "quantized" patterns, the boundary of a "quantized"
pattern is not clear because the DQ now belongs to multiple quantized
patterns.

This poses challenges for:
- Porting metadata: it is unclear which "quantized" partition this DQ node belongs to.
- The quantized representation, which equivalently needs to identify a
self-contained quantized pattern that can be replaced by its equivalent pattern
capturing the compute in the quantized precision.

Test Plan:
test_duplicate_dq_pass

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D48663147](https://our.internmc.facebook.com/intern/diff/D48663147)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107900
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14, https://github.com/leslie-fang-intel
ghstack dependencies: #107105, #107106, #107899
2023-09-02 06:20:03 +00:00
f8d1ca9835 [Quant] Bug fix (#107899)
Summary:
When two layers are quantized differently, the observer map update uses
the key (observed_node, node), whereas it should really be
(original_input, node).

Test Plan:
The next diff adds a test that fails without this fix.

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D48663145](https://our.internmc.facebook.com/intern/diff/D48663145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107899
Approved by: https://github.com/jerryzh168
ghstack dependencies: #107105, #107106
2023-09-02 06:20:03 +00:00
37b0d76e35 [Quantization] Make annotation util functions return annotated nodes (#107106)
Summary:
Having annotation functions return nodes that are annotated is useful
specifically for adding "quantization_tag" to those nodes

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D48488415](https://our.internmc.facebook.com/intern/diff/D48488415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107106
Approved by: https://github.com/jerryzh168
ghstack dependencies: #107105
2023-09-02 06:19:55 +00:00
99168c1fa9 [Quant] Use input_qspec_map for weight quantization of linear (#107105)
Summary:
In preparation for the metadata porting diff, it is required that weight
quant annotation happens via edge quantization, i.e. input_qspec_map.

Reason: metadata is ported by associating a DQ node's metadata with its
consumer and a Q node's metadata with its producer.
Furthermore, such porting must be qualified via user intent, i.e. whether
the consumer of the DQ, or the producer of the Q, actually specified an
intent to quantize.

By making the quantization annotation on the linear node's weight via
input_qspec_map, we can associate the DQ of [weight -> Q -> DQ]
with the linear module.

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D48488414](https://our.internmc.facebook.com/intern/diff/D48488414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107105
Approved by: https://github.com/jerryzh168
2023-09-02 06:19:50 +00:00
ab6a86dccd [vision hash update] update the pinned vision hash (#108460)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108460
Approved by: https://github.com/pytorchbot
2023-09-02 03:52:25 +00:00
ed92d9345e Refactorings for constant folding (#108450)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108450
Approved by: https://github.com/jansel
2023-09-02 03:49:05 +00:00
5f5caed25a do not cast all inputs in benchmarks (#108456)
Fixes why stable diffusion is not showing up in the inference dashboard even though it shows up in the training dashboard.

The reason is that stable diffusion in torchbench has a line like `input_tensor = input_tensor.long().to(self.device)`, and if you cast this to bfloat16, inference will fail.

<img width="1705" alt="Screenshot 2023-09-01 at 4 37 49 PM" src="https://github.com/pytorch/pytorch/assets/3282513/ada0d381-1af0-4378-8e8b-2375b39c3713">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108456
Approved by: https://github.com/cpuhrsch
2023-09-02 03:13:17 +00:00
b8af8ac784 [CUDACaching Allocator] Release the allocator lock on the slow path (#108367)
Summary: This diff releases the global allocator lock on the slow path when we make a synchronous cudaMalloc call.

Differential Revision: D48750077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108367
Approved by: https://github.com/zdevito
2023-09-02 02:52:25 +00:00
4084d039b7 Only add triton dependency to CUDA and ROCm binaries if it hasn't been set as an installation requirement yet (#108424)
The dependency was added twice before in CUDA and ROCm binaries, once as an installation dependency from builder and again as an extra dependency for dynamo, for example:

```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8)
Provides-Extra: dynamo
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8) ; extra == 'dynamo'
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```

In the previous release, we needed to remove this part from `setup.py` to build release binaries https://github.com/pytorch/pytorch/pull/96010.  With this, that step isn't needed anymore because the dependency will come from builder.

### Testing

Using the draft https://github.com/pytorch/pytorch/pull/108374 for testing and manually inspect the wheels artifact at https://github.com/pytorch/pytorch/actions/runs/6045878399 (don't want to go through all `ciflow/binaries` again)

* torch-2.1.0.dev20230901+cu121-cp39-cp39-linux_x86_64
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8) <-- This will be 2.1.0 on the release branch after https://github.com/pytorch/builder/pull/1515
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```

* torch-2.1.0.dev20230901+cu121.with.pypi.cudnn-cp39-cp39-linux_x86_64
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8)
Requires-Dist: nvidia-cuda-nvrtc-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu12 (==8.9.2.26) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu12 (==12.1.3.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu12 (==11.0.2.54) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu12 (==10.3.2.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu12 (==11.4.5.107) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu12 (==12.1.0.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu12 (==2.18.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: triton (==2.1.0) ; platform_system == "Linux" and platform_machine == "x86_64" <--This is 2.1.0 because it already has https://github.com/pytorch/pytorch/pull/108423, but the package doesn't exist yet atm
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```

* torch-2.1.0.dev20230901+rocm5.6-cp38-cp38-linux_x86_64
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton-rocm (==2.1.0+34f8189eae) <-- This will be 2.1.0 on the release branch after https://github.com/pytorch/builder/pull/1515
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108424
Approved by: https://github.com/atalman
2023-09-02 01:16:18 +00:00
2e3fce5450 Add dynamo support for rdiv dunder method. (#108422)
Fix: #106646
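
A minimal illustration of the pattern this enables (the reflected division dunder, `__rtruediv__` in Python 3, fires when the tensor is on the right-hand side):

```python
import torch

def fn(x):
    # 2 / x dispatches to the tensor's reflected division dunder,
    # which previously caused a graph break under dynamo.
    return 2 / x

compiled = torch.compile(fn)
print(compiled(torch.full((3,), 4.0)))  # tensor([0.5000, 0.5000, 0.5000])
```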

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108422
Approved by: https://github.com/eellison
2023-09-02 00:59:22 +00:00
fa8edd93b7 [inductor] Handle aten.full's dtype in the decomposition (#108443)
In the lowering we don't have `SymFloat` and `SymInt`, we just have `sympy.Expr`,
so it is impossible to accurately determine the expected dtype of a `full` call.
For example, `sym_float(int_expr)` has `is_integer=True` but should be treated
as a float. In the decomposition, though, we can get this right.
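
A sketch of the kind of case this targets (the exact repro depends on tracing internals):

```python
import torch

@torch.compile(dynamic=True)
def fn(x):
    n = x.shape[0]  # SymInt under dynamic shapes
    # float(n) is a SymFloat whose underlying sympy expression can report
    # is_integer=True; the decomposition must still pick a float dtype.
    return torch.full((2,), float(n))

print(fn(torch.ones(3)).dtype)  # torch.float32
```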

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108443
Approved by: https://github.com/lezcano
2023-09-02 00:53:04 +00:00
2c1f0772d5 Revert "Horizontally fuse input concatenation (#108115)"
This reverts commit 5911faeb8fc3f625f9c3a42e58d45f7b7578ab8a.

Reverted https://github.com/pytorch/pytorch/pull/108115 on behalf of https://github.com/osalpekar due to Broke internal benchmarking job. See [D48890838](https://www.internalfb.com/diff/D48890838) ([comment](https://github.com/pytorch/pytorch/pull/108115#issuecomment-1703546520))
2023-09-02 00:19:00 +00:00
a27f01083d [S362716] turn off constant folding (#108389)
Summary: Constant folding is using a lot of memory and is causing OOMs. Turn it off in fbcode. Also filed an issue https://github.com/pytorch/pytorch/issues/108388

Test Plan: Cloned a failed job and it's working now

Differential Revision: D48871102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108389
Approved by: https://github.com/eellison
2023-09-01 23:36:48 +00:00
e3933609d4 Make make_fx cond preserve node meta (#108356)
**Motivation:**
Currently, for the following code that exports cond operator:
```python
import torch
from functorch.experimental.control_flow import cond

class MySubModule(torch.nn.Module):
    def foo(self, x):
        return x.cos()

    def forward(self, x):
        return self.foo(x)

class CondBranchClassMethod(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.subm = MySubModule()

    def bar(self, x):
        return x.sin()

    def forward(self, x):
        return cond(x.shape[0] <= 2, self.subm.forward, self.bar, [x])

from torch._export import capture_pre_autograd_graph

example_inputs = (torch.randn(1, 3, 3, 3),)
m = CondBranchClassMethod()
m.eval()
gm = capture_pre_autograd_graph(m, example_inputs)
print(gm)

# source_fn for original cond op, getattr submodule op are all cond op
for n in gm.graph.nodes:
    print("n:", n.format_node(), n.meta)

print("\n\n\n")
# source_fn for submodule nodes are all cond op
# Expected: ideally this should be the real ops, e.g. torch.sin, aten.cos, etc
for n in gm.submodule_0.graph.nodes:
    print("n:", n.format_node(), n.meta)
```

Output is like below:
```
GraphModule(
  (submodule_0): GraphModule()
  (submodule_1): GraphModule()
)

def forward(self, arg_0):
    arg0_1, = fx_pytree.tree_flatten_spec([arg_0], self._in_spec)
    submodule_0 = self.submodule_0
    submodule_1 = self.submodule_1
    cond = torch.ops.higher_order.cond(True, submodule_0, submodule_1, [arg0_1]);  submodule_0 = submodule_1 = arg0_1 = None
    return pytree.tree_unflatten((cond,), self._out_spec)

# To see more debug info, please use `graph_module.print_readable()`
n: %arg0_1 : [num_users=1] = placeholder[target=arg0_1] {'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None, 'is_torch_exported': True, 'stack_trace': 'NoneType: None\n'}
n: %submodule_0 : [num_users=1] = get_attr[target=submodule_0] {'stack_trace': 'NoneType: None\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('conditional', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>)], 'seq_nr': -1}
n: %submodule_1 : [num_users=1] = get_attr[target=submodule_1] {'stack_trace': 'NoneType: None\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('conditional', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>)], 'seq_nr': -1}
n: %cond : [num_users=1] = call_function[target=torch.ops.higher_order.cond](args = (True, %submodule_0, %submodule_1, [%arg0_1]), kwargs = {}) {'stack_trace': 'NoneType: None\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('conditional', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>)], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None, 'is_torch_exported': True}
n: return (cond,) {'stack_trace': 'NoneType: None\n', 'from_node': [('output', 'output')], 'seq_nr': -1, 'is_torch_exported': True, 'val': (FakeTensor(..., size=(1, 3, 3, 3)),), 'tensor_meta': (None,)}

n: %arg0_1 : [num_users=1] = placeholder[target=arg0_1] {'stack_trace': '  File "<ipython-input-9-2a8c7c0498ed>", line 36, in forward\n    return cond(x.shape[0] <= 2, self.subm.forward, self.bar, [x])\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('arg0_1', 'arg0_1')], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None}
n: %cos_default : [num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%arg0_1,), kwargs = {}) {'stack_trace': '  File "<ipython-input-9-2a8c7c0498ed>", line 36, in forward\n    return cond(x.shape[0] <= 2, self.subm.forward, self.bar, [x])\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': <OpOverload(op='aten.cos', overload='default')>, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('cos', <OpOverload(op='aten.cos', overload='default')>), ('cos_default', <OpOverload(op='aten.cos', overload='default')>)], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None}
n: return cos_default {'stack_trace': '  File "<ipython-input-9-2a8c7c0498ed>", line 36, in forward\n    return cond(x.shape[0] <= 2, self.subm.forward, self.bar, [x])\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('output', 'output')], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None}
```

As we can see, the meta of the nodes in the subgraphs is overridden with the cond node's metadata. This is because the function _set_current_meta is only invoked at the top-level graph module in the interpreter. When we call into cond and deal with the submodules here, we don't set current_meta to the meta of the subgraph's nodes properly.

**Implementation:**
This PR fixes it as follows: in trace_cond, we optionally use an fx.Interpreter to interpret the subgraphs so that the metadata is preserved, but only when the following conditions are satisfied:
- The subgraphs are GraphModules: this is necessary for using the fx.Interpreter.
- The current make_fx has preserve_node_meta turned on (as is the case for capture_pre_autograd_graph).

**Test Plan**
See added tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108356
Approved by: https://github.com/SherlockNoMad
2023-09-01 22:43:55 +00:00
ac42b4ea4d [pt2] Turn on cudagraph tree in fbcode (#108416)
Summary:
cudagraph tree will significantly reduce the memory usage.
Memory consumption wise: {F1081833757}

with cudagraph tree: 65GB
w/o cudagraph tree: 83GB

Differential Revision: D48907239

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108416
Approved by: https://github.com/eellison
2023-09-01 22:39:43 +00:00
ad032a76f3 print equalities (#108427)
Differential Revision: [D48910802](https://our.internmc.facebook.com/intern/diff/D48910802/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108427
Approved by: https://github.com/angelayi
2023-09-01 22:37:22 +00:00
add45aea1c Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFLOPs reported here are on an A100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-09-01 22:14:44 +00:00
234f00e1cd [PyTorch][Vulkan] Add a matrix multiplication performance test binary and fix GPU latency measurement (#108266)
Summary:
- Added a new matmul perf test binary as target `pt_vulkan_mm_perf_test_bin`
- Also renamed the existing `vulkan_perf_test_bin` to `vulkan_conv_arithmetic_perf_test_bin` with associated source file name change
- **Fixed the manual time benchmark measurement for both performance binaries, which was not tracking the correct opnames (e.g. checked for runtime of nonexistent "mm" instead of "vulkan.mm")**

Test Plan:
# pt_vulkan_mm_perf_test_bin

- build the matrix multiplication performance test binary
```
~/fbsource »  buck2 build  -c ndk.debug_info_level=0  -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_mm_perf_test_binAndroid  --show-output  -c pt.vulkan_full_precision=1
[...]
BUILD SUCCEEDED
fbsource//xplat/caffe2:pt_vulkan_mm_perf_test_binAndroid buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_mm_perf_test_binAndroid__/pt_vulkan_mm_perf_test_binAndroid
```
- test on arm32 android device
```
~/fbsource » adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_mm_perf_test_binAndroid__/pt_vulkan_mm_perf_test_binAndroid /data/local/tmp/
~/fbsource » adb shell /data/local/tmp/pt_vulkan_mm_perf_test_binAndroid
```
- output P817269023, excerpt below
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4336072
vulkan.nchw_to_image     {250, 250, 1}                    1106716
vulkan.nchw_to_image     {1, 1, 1}                           7228
vulkan.mm                {250, 250, 1}                  132570256

[...]

vulkan.mm                {250, 250, 1}                   80492152
vulkan.image_to_nchw     {500, 500, 1}                    1420328
-------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                     Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------------
mm_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1                         91047 ms          143 ms            5

```

# pt_vulkan_conv_arithmetic_perf_test_bin
- build the convolution and arithmetic performance test binary
```
~/fbsource »  buck2 build  -c ndk.debug_info_level=0  -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_conv_arithmetic_perf_test_binAndroid  --show-output  -c pt.vulkan_full_precision=1
[...]
BUILD SUCCEEDED
fbsource//xplat/caffe2:pt_vulkan_conv_arithmetic_perf_test_binAndroid buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_conv_arithmetic_perf_test_binAndroid__/pt_vulkan_conv_arithmetic_perf_test_binAndroid
```
- test on arm32 android device
```
~/fbsource » adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_conv_arithmetic_perf_test_binAndroid__/pt_vulkan_conv_arithmetic_perf_test_binAndroid /data/local/tmp/
~/fbsource » adb shell /data/local/tmp/pt_vulkan_conv_arithmetic_perf_test_binAndroid
2023-07-20T20:23:26+00:00

```
- output P817267332, excerpt below

```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.add               {193, 221, 30}                  39475696
vulkan.image_to_nchw     {193, 221, 30}                  13463424
vulkan.add               {193, 221, 30}                  72950176
vulkan.image_to_nchw     {193, 221, 30}                  17792684

[...]

vulkan.add               {193, 221, 30}                  72986368
vulkan.image_to_nchw     {193, 221, 30}                  15921672
----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                  Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------
add_op_benchmark/N:3/C:40/H:221/W:193/iterations:100/manual_time/threads:1             73242 ms          602 ms          100
libc++abi: terminating due to uncaught exception of type c10::Error: Copy of vulkan quantized tensors to cpu is currently disabled!
```

Reviewed By: yipjustin

Differential Revision: D48798710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108266
Approved by: https://github.com/manuelcandales
2023-09-01 22:11:35 +00:00
8f02884569 add Half support for GroupNorm on CPU (#100234)
### Testing
Single socket (28cores):

* Contiguous:

shape | forward fp32 / s | forward mixed fp32/fp16 / s | backward fp32 / s | backward mixed fp32/fp16 / s
-- | -- | -- | -- | --
[10,   128, 10, 10] | 2.45E-05 | 3.26E-05 | 6.87E-05 | 7.40E-05
[10,   128, 80, 80] | 0.000726 | 0.000606 | 0.002183 | 0.001112

* Channels Last:

shape | forward fp32 / s | forward mixed fp32/fp16 / s | backward fp32 / s | backward mixed fp32/fp16 / s
-- | -- | -- | -- | --
[10,   128, 10, 10] | 2.88E-05 | 2.72E-05 | 6.56E-05 | 6.63E-05
[10,   128, 80, 80] | 0.00076 | 0.000256 | 0.002385 | 0.000735

Single core:

* Contiguous:

shape | forward fp32 / s | forward mixed fp32/fp16 / s | backward fp32 / s | backward mixed fp32/fp16 / s
-- | -- | -- | -- | --
[10,   128, 10, 10] | 9.47E-05 | 1.90E-04 | 2.03E-04 | 3.10E-04
[10,   128, 80, 80] | 6.25E-03 | 8.98E-03 | 0.016485 | 0.01369

* Channels Last:

shape | forward fp32 / s | forward mixed fp32/fp16 / s | backward fp32 / s | backward mixed fp32/fp16 / s
-- | -- | -- | -- | --
[10,   128, 10, 10] | 8.66E-05 | 7.89E-05 | 1.95E-04 | 1.43E-04
[10,   128, 80, 80] | 5.97E-03 | 3.13E-03 | 0.01626 | 8.70E-03

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100234
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-01 21:25:24 +00:00
54dcb0ea61 NT support for matmul of (B, *, C, D) NT with dense (D, E) (#108370)
Used in SAM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108370
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #108361
2023-09-01 20:45:44 +00:00
a78b78cd76 [DTensor][random] add DTensor constructor: randn (#108285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108285
Approved by: https://github.com/wanchaol
2023-09-01 20:28:41 +00:00
c67ebae344 Put logging in run_tests (#107987)
Logging about which tests run serially vs. in parallel, and which tests actually get run on each shard, was removed; it can be pretty helpful, so this adds it back in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107987
Approved by: https://github.com/huydhn, https://github.com/Neilblaze
2023-09-01 20:23:30 +00:00
29f17e1f14 Fix full on symbolic value. (#108166)
Fix: #108067

This PR adds checks for `sympy.Expr` when extracting the dtype from a value inside the
`full` lowering.
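
A minimal sketch of the kind of check this adds, with a hypothetical helper `infer_fill_dtype` (the real lowering lives in torch/_inductor and differs in detail):

```python
# Hypothetical helper: symbolic fill values need an explicit branch,
# since their Python type is sympy.Expr rather than int/float/bool.
import sympy
import torch

def infer_fill_dtype(value):
    if isinstance(value, sympy.Expr):
        # Pick the dtype from the symbolic kind, not from type(value).
        return torch.int64 if value.is_integer else torch.float32
    if isinstance(value, bool):
        return torch.bool
    if isinstance(value, int):
        return torch.int64
    return torch.float32
```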

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108166
Approved by: https://github.com/lezcano
2023-09-01 20:16:40 +00:00
fc1c862e62 [export] Properly handle duplicated params. (#108415)
Test Plan:
python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export --only BertForMaskedLM

python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export-aot-inductor --only BertForMaskedLM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108415
Approved by: https://github.com/angelayi
2023-09-01 19:44:32 +00:00
2d9a828900 enabled AT_USE_JITERATOR() for tan and tanh kernels. (#102427)
This PR fixes the test failures for the `jiterator` implementation of the `tan` and `tanh` unary kernels, as mentioned in #100842.

The failures were fixed by adjusting tolerances, but some failures in `test_unary_ufuncs.py` required adjusting input values as well. Since the jiterator kernels use libstdc++, the supported input range is smaller than in the thrust implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102427
Approved by: https://github.com/malfet
2023-09-01 19:16:10 +00:00
6ba2b6e147 [ONNX] Show sarif_report_path (#108398)
`sarif_report_path` was not formatted correctly in the error message

@BowenBao

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108398
Approved by: https://github.com/thiagocrepaldi
2023-09-01 19:11:52 +00:00
e58d3ed81d [inductor] Generalize pointless_cumsum_replacement pattern (#108373)
The current pattern transforms:
```
ones([x, y]).cumsum(1) -> arange(1, 1 + y).expand([x, y])
```
but this generalizes it to
```
full(shape, fill_value).cumsum(d) ->
    (fill_value * arange(1, 1 + shape[d])).view([1..., shape[d], 1...]).expand(shape)
```

So we handle any fill value, any number of dimensions, and broadcasting to any dimension.
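
A quick, self-contained check of the generalized rewrite in plain PyTorch (an illustration of the equivalence, not the inductor pattern code):

```python
import torch

shape, fill_value, d = (2, 3, 4), 2.5, 1
lhs = torch.full(shape, fill_value).cumsum(d)

# Build the broadcastable view shape: 1 everywhere except dim d.
view_shape = [1] * len(shape)
view_shape[d] = shape[d]
rhs = (fill_value * torch.arange(1, 1 + shape[d], dtype=lhs.dtype))
rhs = rhs.view(view_shape).expand(shape)

assert torch.allclose(lhs, rhs)
```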

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108373
Approved by: https://github.com/lezcano
2023-09-01 17:12:09 +00:00
0f1a225f33 [CI] Enable max-autotune for Sunday dashboard run (#108386)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108386
Approved by: https://github.com/huydhn
2023-09-01 14:55:24 +00:00
2a6ef9b04d [dynamo] Avoid recompilation when the PyTorch function accepts scalars (#108162)
Before, it would create a 0D tensor from the input, which would incur a guard and specialisation.

It's not clear whether the guard and specialisation are the right behaviour when we create 0D tensors, but that's a story for another day.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108162
Approved by: https://github.com/ev-br, https://github.com/peterbell10
2023-09-01 14:35:42 +00:00
591cb776af [FSDP][state_dict][optim_state_dict] Log slow optim and model state_dict paths (#108290)
This PR adds SimpleProfiler for FSDP state_dict/load_state_dict logging purpose. SimpleProfiler use class variables to record profiling results and it does everything in the Python which can be slow. So it is only suitable for logging slow actions such as initialization and state_dict/load_state_dict.

This PR uses SimpleProfiler to log some critical/slow paths of the model and optimizer state_dict/load_state_dict.
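
A minimal sketch of the pattern described above (illustrative only; FSDP's actual SimpleProfiler differs in detail): class-level state plus a context manager, cheap enough for rare, slow paths like state_dict.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class SimpleProfiler:
    # Class variables record the results, as described above.
    results = defaultdict(float)

    @classmethod
    @contextmanager
    def profile(cls, key):
        start = time.monotonic()
        try:
            yield
        finally:
            cls.results[key] += time.monotonic() - start

# e.g. with SimpleProfiler.profile("load_state_dict"): model.load_state_dict(sd)
```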

Differential Revision: [D48774406](https://our.internmc.facebook.com/intern/diff/D48774406/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108290
Approved by: https://github.com/wz337
2023-09-01 06:57:59 +00:00
db63bf3d7e [NCCL][CUDA][CUDA Graphs] Flush enqueued work before starting a graph capture (#104487)
#103503 addresses the situation where additional work is enqueued for the NCCL watchdog to poll during a graph capture---something we want to avoid as the subsequent polling will query an event and crash the capture.

However, there is currently no check that there was not work _already_ enqueued for the watchdog to poll. If there was already work that was enqueued and not cleaned up before the start of a graph capture, then we run into a similar problem where the event query will crash the capture. We've observed this causing crashes on several workloads, although it is somewhat flaky (if the watchdog happens to poll just before the graph capture and cleanup, then we dodge the crash).

This is a bit of a tricky issue as it involves making sure that no process group has enqueued work at the start of a capture, and as such the simplest solution is to add a bit of global state to track all process groups. This PR forces the start of the graph capture to wait until all enqueued work is completed and cleaned up or times out.

I did consider the alternative of simply having the watchdog skip cleanup if we detect that we are in the middle of a graph capture, but I think deferring the cleanup until later could result in false positive timeouts (e.g., if we defer cleanup on work that has completed long ago, checking timers after the graph capture could yield a "timeout").

CC @Aidyn-A
@bottler @kwen2501 @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104487
Approved by: https://github.com/kwen2501
2023-09-01 05:42:08 +00:00
4a9c6f1b73 [PyPer][BE] Fix test_scripted_module in StatCollector (#108232)
Summary: D41985889 removed the cast to int for the inputs to torch.histc below, allowing the inputs to still be tensors. These tensors still have require_grad_ set to True, causing issues with the call to torch.histc.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//dper3/dper3/modules/low_level_modules/tests:stat_collector_test -- --exact 'dper3/dper3/modules/low_level_modules/tests:stat_collector_test - test_scripted_module (dper3.dper3.modules.low_level_modules.tests.stat_collector_test.StatCollectorTest_1)'

Differential Revision: D48800879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108232
Approved by: https://github.com/jerryzh168
2023-09-01 04:23:57 +00:00
d96446b9c2 [export] Fix duplicated params for AOTInductor. (#108354)
Test Plan:
python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export --only BertForMaskedLM

python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export-aot-inductor --only  BertForMaskedLM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108354
Approved by: https://github.com/angelayi, https://github.com/desertfire
2023-09-01 03:18:49 +00:00
e18f512b81 Update accuracy checking for nan, floats (#108202)
Fixes inference accuracy for `doctr_reco_predictor` and `pyhpc_turbulent_kinetic_energy`.

For the `same(float, float)` comparison we weren't going through the more rigorous tensor comparison path, which takes the fp64 baseline results into account.

Also return True when the fp64 baseline result is not well formed (NaN).
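
A minimal sketch of the two fixes with a hypothetical `same_scalar` helper (the real `same` in the benchmark harness handles many more cases):

```python
import math
import torch

def same_scalar(res, fp64_ref, tol=1e-4):
    # Fix 2: an ill-formed (NaN) fp64 baseline cannot arbitrate, so pass.
    if isinstance(fp64_ref, float) and math.isnan(fp64_ref):
        return True
    # Fix 1: route float/float through the rigorous tensor comparison
    # path so the fp64 baseline is taken into account.
    return torch.allclose(
        torch.tensor(res, dtype=torch.float64),
        torch.tensor(fp64_ref, dtype=torch.float64),
        atol=tol, rtol=tol,
    )
```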

I debugged these models and the sources of divergence were innocuous:
`doctr_reco_predictor` - can be fixed by turning off layout optimization and the decomp for batch norm

`pyhpc_turbulent_kinetic_energy` - divergence caused by the fused kernel keeping precision in fp32 instead of casting back and forth between fp32 and bf16. The fused kernel has better precision anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108202
Approved by: https://github.com/jansel
2023-09-01 02:54:01 +00:00
90ef3b82d1 [DeviceMesh] Add unique mesh_dim_name check in init_device_mesh() (#108326)
Each mesh_dim_name in mesh_dim_names needs to be unique. This PR adds that check when calling init_device_mesh().
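
A minimal sketch of the check (illustrative; the exact init_device_mesh code differs):

```python
def _check_unique(mesh_dim_names):
    # Duplicate names would make dims ambiguous when looked up by name.
    if len(set(mesh_dim_names)) != len(mesh_dim_names):
        raise RuntimeError(
            f"Each mesh_dim_name must be unique, got: {mesh_dim_names}"
        )
```
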
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108326
Approved by: https://github.com/wanchaol
2023-09-01 02:14:18 +00:00
3702980717 dynamo: trace autograd.Function with tensor subclass input (#108093)
Summary:

Enables dynamo eager mode tracing for the following situation:
1. we have a torch.autograd.Function
2. the input to that function is a tensor subclass which is an intermediary

This is useful for float8 training UX.

Test Plan:

```
python test/dynamo/test_autograd_function.py -k intermediary_input
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108093
Approved by: https://github.com/bdhirsh, https://github.com/wanchaol
2023-09-01 02:12:38 +00:00
414cb26ded NT support for cat with dim=0 (#108361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108361
Approved by: https://github.com/cpuhrsch
2023-09-01 02:02:53 +00:00
a9fe0b5b74 [quant][pt2e] Move propagate_annotation from quant flow to quantizer (#108320)
Summary:
Previously we ran propagate_annotation by default in the quantization flow to propagate annotations for ops like reshape, view, etc.

Not all quantizers need this, so we moved it to xnnpack_quantizer_utils for now.

Next Step:
* make propagate_annotation function configurable with a custom list of ops
* remove unneeded ops in `_is_share_obs_or_fq_op`

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Differential Revision: [D48856985](https://our.internmc.facebook.com/intern/diff/D48856985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108320
Approved by: https://github.com/kimishpatel
2023-09-01 01:49:19 +00:00
ab5b4c4419 Revert "[HSDP] Add device_mesh to FSDP and add dtensor state_dict support for HSDP (#107533)"
This reverts commit cc220e45a80d7c01a4a58b0f386ca07236a6927a.

Reverted https://github.com/pytorch/pytorch/pull/107533 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it is failing in trunk with the same failure on test_dynamo_distributed cc220e45a8 ([comment](https://github.com/pytorch/pytorch/pull/107533#issuecomment-1701983247))
2023-09-01 01:26:30 +00:00
8289ad8e5e Support is_mtia attribute. (#108307) (#108310)
Summary:

FBGEMM uses `self.iter.is_cuda` to check if the tensor is for CUDA. This diff enables a similar feature, `self.iter.is_mtia`, for tensors with the MTIA device key.

Test Plan: See diff D48693225

Reviewed By: jackm321

Differential Revision: D48809191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108310
Approved by: https://github.com/albanD
2023-09-01 01:25:40 +00:00
d569e506ab Revert "Flash Attention v2 (#105602)"
This reverts commit 9df3d882c8fe1e57914315aa250664ad5003d4fd.

Reverted https://github.com/pytorch/pytorch/pull/105602 on behalf of https://github.com/huydhn due to I think we miss a case here for sm80 build on inductor workflow as it is now OOM on trunk https://github.com/pytorch/pytorch/actions/runs/6042843139 ([comment](https://github.com/pytorch/pytorch/pull/105602#issuecomment-1701974862))
2023-09-01 01:15:01 +00:00
ee0e04ac48 Allow float dtype when Autocast CPU Disabled (#107348)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/100565 by allowing the float32 data type when Autocast CPU is disabled. The current behavior is:
- When autocast is disabled and the user passes in the float data type, it works well.
- When autocast is enabled and the user passes in the float data type, a warning is thrown (`UserWarning: In CPU autocast, but the target dtype is not supported. Disabling autocast.`) and autocast is disabled automatically. A usage sketch follows below.
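
With this fix, a sketch like the following runs without the warning (assuming a CPU build; the op is just a placeholder):

```python
import torch

x = torch.randn(4, 4)
# Autocast disabled + float32 target: now allowed, no warning raised.
with torch.autocast(device_type="cpu", dtype=torch.float32, enabled=False):
    y = torch.mm(x, x)
```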

**TestPlan**

```
python -u -m pytest -s -v test_autocast.py -k test_autocast_disabled_with_fp32_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107348
Approved by: https://github.com/jgong5, https://github.com/Neilblaze, https://github.com/albanD
2023-09-01 00:49:44 +00:00
6c342ec368 Revert PR-107951 to only support new graph capture API in Quantization (#108317)
**Summary**
Revert the changes in https://github.com/pytorch/pytorch/pull/107951 to make the utils function only support graphs captured by `capture_pre_autograd_graph`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108317
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #108214
2023-09-01 00:47:10 +00:00
fb808c30c7 x86_inductor_quantizer switches to new graph capture API (#108214)
**Summary**
Update `X86InductorQuantizer` and the related test cases to the new graph capture API `capture_pre_autograd_graph`.
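
A minimal usage sketch of the new capture API with the quantizer (import paths as of this change; the model and inputs are placeholders):

```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq

m = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 16, 16),)

gm = capture_pre_autograd_graph(m, example_inputs)  # new capture API
quantizer = xiq.X86InductorQuantizer().set_global(
    xiq.get_default_x86_inductor_quantization_config()
)
gm = prepare_pt2e(gm, quantizer)
gm(*example_inputs)  # calibration
gm = convert_pt2e(gm)
```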

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108214
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-09-01 00:43:45 +00:00
aadd86b1e8 [DCP]Add unit test for tp checkpoint (#108286)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108286
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-09-01 00:30:13 +00:00
63eee52ba7 Add Drq to BF16 Higher Tolernace (#108368)
This passes for me on aws gpu but not devgpu, and was already in the `REQUIRE_HIGHER_FP16_TOLERANCE` set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108368
Approved by: https://github.com/shunting314
2023-09-01 00:29:27 +00:00
9178deedff removing some redundant str splits (#106089)
Drop some redundant string splits; no functional changes, just cleaning up the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106089
Approved by: https://github.com/albanD, https://github.com/malfet
2023-09-01 00:22:58 +00:00
cc220e45a8 [HSDP] Add device_mesh to FSDP and add dtensor state_dict support for HSDP (#107533)
This PR:
1) Adds a device_mesh kwarg to FSDP and removes init_device_mesh() from _runtime_utils.py, as the device_mesh is now passed in by the user as a kwarg.
2) Changes the use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If device_mesh is used with a sharded model/optim state dict, the _use_dtensor flag is set to True and the model/optim state dict returns a DTensor state_dict; otherwise, the _use_dtensor flag is set to False and a sharded_tensor state_dict is returned.
3) Updates _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return a 2D DTensor state_dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol
2023-09-01 00:15:00 +00:00
a29b9101fa [dynamo] fix dynamo + DTensor to work with 2d (#108329)
Pair-debugged with @wconstab; we found issues on both the dynamo side and the TP's FSDP extension side. This PR fixes the dynamo + DTensor integration so that the current graph-break FSDP can work with tensor parallel by moving the torch.compile call to after FSDP wrapping.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108329
Approved by: https://github.com/Skylion007, https://github.com/wconstab
2023-08-31 22:46:26 +00:00
eafc05887f [dtensor] fix two more requires_grad callsite (#108358)
redistribute returns a new DTensor, and the returned DTensor should follow the input DTensor's requires_grad instead of the input's local tensor's requires_grad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108358
Approved by: https://github.com/fduwjj
2023-08-31 22:25:40 +00:00
3e75fd06e2 Pin pandas version for inductor Docker image (#108355)
Building docker in trunk is failing atm https://github.com/pytorch/pytorch/actions/runs/6033657019/job/16370683676 with the following error:

```
+ conda_reinstall numpy=1.24.4
+ as_jenkins conda install -q -n py_3.10 -y --force-reinstall numpy=1.24.4
+ sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 conda install -q -n py_3.10 -y --force-reinstall numpy=1.24.4
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... unsuccessful initial attempt using frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... unsuccessful initial attempt using frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - numpy=1.24.4

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch
```

This was pulled in by pandas 2.1.0 released yesterday https://pypi.org/project/pandas/2.1.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108355
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/malfet
2023-08-31 21:58:40 +00:00
bae409388c [MPS] Fix .item() for multi-dim scalar (#107913)
By refactoring `_local_scalar_dense_mps` to use `_empty_like` to allocate CPU tensor.
Also, print a more reasonable error message when dst dim is less than src in mps_copy_

This fixes regression introduced by https://github.com/pytorch/pytorch/pull/105617 and adds regression test.

Fixes https://github.com/pytorch/pytorch/issues/107867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107913
Approved by: https://github.com/albanD
2023-08-31 21:08:29 +00:00
5b6ba4110b Fallback to eager for float8 ops in inductor (#108293)
# Summary
As a stopgap to supporting the FP8 dtype within inductor, we would like to fall back to eager. Currently there are 3 ops that are needed for this:
`_scaled_mm` ( matmul for fp8 types)
`clone` (for creating new copies of fp8 tensors)
`to` ( for converting to and from fp8 types).

This PR registers a fallback for `_scaled_mm` and adds fp8 to the dtypes that trigger `unsupported_input_tensor`.

Prior to these changes this was failing with:
``` Shell
  File "/home/drisspg/meta/pytorch/torch/_inductor/codegen/triton_utils.py", line 11, in signature_of
    tye = JITFunction._type_of(arg.dtype)
  File "/home/drisspg/miniconda3/envs/dev/lib/python3.10/site-packages/triton/runtime/jit.py", line 229, in _type_of
    return key if isinstance(key, str) else f"*{tys[dtype_str]}"
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
KeyError: 'float8_e4m3fn'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108293
Approved by: https://github.com/peterbell10
2023-08-31 20:48:18 +00:00
49df1de383 Cudagraphs support for compiled optimizers (#107504)
Marks all params/optimizer state as static addresses and adds a finalizer which cleans up the graph attributes when the optimizer goes out of scope.

**Note:** this does not mark grads as static because that would increase memory usage significantly.

There are two cases:
1. The upstream graph is cudagraphed - this case will work fine OOTB
2. The upstream graph is not cudagraphed - in this case, there will be a lot of copies introduced from the upstream (to copy the grads) into cudagraphed-owned memory, unless the user explicitly marks the grads as static. If the user does this, this will also require not deallocating the grads in zero_grad() (either the mod or optimizer version) by setting them to zero vs None. There is a PR (https://github.com/pytorch/pytorch/pull/107853) in flight to throw an error if zero_grad attempts to set static grads to None.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107504
Approved by: https://github.com/eellison
2023-08-31 20:47:18 +00:00
d5ff8ca4ef Relax divisibility by 16 for leading dimension of mat1 in scaled_gemm (#108308)
# Summary
cuBLASLt requires that the matrices be 16-byte aligned. If mat1.size(-1) % 16 == 0 and the matrix is row major, then the leading dimension can be any value. See this comment: https://github.com/pytorch/pytorch/pull/107341#discussion_r1310934737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108308
Approved by: https://github.com/eqy, https://github.com/vkuzo
2023-08-31 20:31:47 +00:00
aeb4d6d5c5 Fix constant folding of arithmetic operations with symbolic values. (#108160)
Partial fix: #108067

This PR fixes an inductor bug where it assumed the types of arithmetic nodes' arguments were all `Tensor`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108160
Approved by: https://github.com/lezcano
2023-08-31 20:26:35 +00:00
eb8659fe81 pass inference accuracy check for detectron2_fcos_r_50_fpn (#108328)
We need a higher tolerance to pass the inference accuracy check for detectron2_fcos_r_50_fpn.

Command:
```
python benchmarks/dynamo/torchbench.py --backend inductor --bfloat16 --accuracy --only detectron2_fcos_r_50_fpn --disable-cudagraphs --inference
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108328
Approved by: https://github.com/jansel
2023-08-31 20:21:20 +00:00
95f268e426 Add examples for nn.CosineEmbeddingLoss (#108215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108215
Approved by: https://github.com/mikaylagawarecki
2023-08-31 20:01:24 +00:00
f8c93df2d1 Fix boolean tensor for map (#108289)
torch.empty_strided is able to create a new tensor based on the metadata. For boolean tensors, we call clone directly; however, if the input is a functional tensor we'll get a functional tensor back, and that functional tensor won't be tracked by the tracer's tensor_tracker after dispatching, so it becomes a tensor_constant in the graph in create_arg. So we manually unwrap the functional tensor before calling clone.

Test Plan:
See added test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108289
Approved by: https://github.com/angelayi
2023-08-31 19:17:28 +00:00
46f0d17498 Change to torch.ops.higher_order.cond in verifier (#108302)
We need to match against torch.ops.higher_order.cond in verifier.

Test Plan:
 added test to guard against change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108302
Approved by: https://github.com/angelayi
2023-08-31 19:12:07 +00:00
74ff028839 [dtensor] fix new_empty_strided op (#107835)
This PR fixes the new_empty_strided op to become replicate from sharding when necessary; this is a quick fix to resolve https://github.com/pytorch/pytorch/issues/107661

We'll need to think more about the behavior of this op when it comes to sharding. One possibility is to follow the input sharding, but given that the output shape of this op might not be the same as the input's, it's hard to say we should follow the input sharding; further improvement is needed once we figure out the op syntax.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107835
Approved by: https://github.com/fduwjj
2023-08-31 18:27:35 +00:00
46cd2fef3f Create empty host tensor for MTIA device type. (#108198)
Summary: Before copying a tensor from CPU memory to device memory, the MTIA device doesn't need the host memory to be pinned first.

Test Plan: See diff D48761820

Reviewed By: jackm321

Differential Revision: D48456471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108198
Approved by: https://github.com/cx-yin, https://github.com/fduwjj
2023-08-31 18:12:59 +00:00
dabdb97087 [Dynamo] Graph break on functions using tensor out variants (#108182)
Fixes #108021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108182
Approved by: https://github.com/eellison
2023-08-31 17:49:14 +00:00
877561f388 Enable Mypy Checking in torch/_inductor/dependencies.py (#107675)
Fixes #105230

Summary:

As suggested in https://github.com/pytorch/pytorch/issues/105230 mypy checking is enabled in torch/_inductor/dependencies.py

After Fix:

mypy --follow-imports=skip torch/_inductor/dependencies.py Success: no issues found in 1 source file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107675
Approved by: https://github.com/jansel
2023-08-31 17:36:43 +00:00
2e1e7ed610 Revert "Fallback to eager for float8 ops in inductor (#108293)"
This reverts commit 98aa3745c258827cde8d081d0713ba2cd67c864e.

Reverted https://github.com/pytorch/pytorch/pull/108293 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it is failing on ROCM 98aa3745c2 ([comment](https://github.com/pytorch/pytorch/pull/108293#issuecomment-1701446105))
2023-08-31 17:21:20 +00:00
335767e7da Raise an error for unsupported ctx managers (#108272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108272
Approved by: https://github.com/anijain2305
2023-08-31 17:20:36 +00:00
5727b07ac6 TD: logging bugfix (#108288)
Fix bug where logging metrics don't get emitted unless the 'keep-going' label is specified on the PR

Also adds some extra logging to make debugging easier
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108288
Approved by: https://github.com/Skylion007
2023-08-31 16:51:49 +00:00
06d74e6b24 Revert "[AOTInductor] Include constants in AOTInductor .so file. (#10… (#108349)
This reverts commit c3239442a3dd1040b251ff33bef40589cba40e1c due to internal test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108349
Approved by: https://github.com/aakhundov, https://github.com/zhxchen17
2023-08-31 16:26:02 +00:00
01dfa7620d MAINT: np.unique works with f16 directly (#108228)
(follow up on gh-107768)

Remove a f16->f32 workaround from np.unique, since torch.unique and np.unique seem to just work with float16 tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108228
Approved by: https://github.com/lezcano
2023-08-31 16:21:13 +00:00
cbf7c91883 inductor: make fallback for cpu scatter_add (#108220)
For the inductor CPU backend, scatter_add uses `atomic_add`, which gives worse performance. For now, we fall back to eager for it to avoid a performance regression compared with eager mode (single socket of SKX):
```
basic_gnn_gin 1.16x(after) Vs 0.509x(before)

basic_gnn_sage  1.064x(after) Vs 0.496x (before)

basic_gnn_gcn 1.373x(aftre) Vs 0.720x(before)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108220
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-08-31 16:11:07 +00:00
9df3d882c8 Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( and this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFlops are reported here are on a100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-08-31 16:02:20 +00:00
239ee76177 Add refs/decomps for dot/vdot (#108194)
Follow-up on https://github.com/pytorch/pytorch/issues/108127#issuecomment-1698142427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108194
Approved by: https://github.com/peterbell10
ghstack dependencies: #108188
2023-08-31 15:30:23 +00:00
239fed7e1e Add reference for linalg.vecdot (#108188)
Was addressing https://github.com/pytorch/pytorch/issues/108127, but
then I realised that vecdot is already CompositeImplicit. Pushing anyway
as a short-and-sweet PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108188
Approved by: https://github.com/peterbell10
2023-08-31 15:30:23 +00:00
150088a9cd Revert "Use ctypes to serialize raw content for tensors. (#108287)"
This reverts commit 43f28beffc572474e8c5f6ba6c33115e9dc69be9.

Reverted https://github.com/pytorch/pytorch/pull/108287 on behalf of https://github.com/desertfire due to Internal test failure from https://github.com/pytorch/pytorch/pull/107718. Revert this one first and then revert 107718. ([comment](https://github.com/pytorch/pytorch/pull/108287#issuecomment-1701138632))
2023-08-31 14:17:04 +00:00
691e0e9799 [export] Copy gm before calling PassManager (#108321)
Test Plan: CI

Reviewed By: angelayi, cccclai

Differential Revision: D48801487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108321
Approved by: https://github.com/kimishpatel, https://github.com/mcr229
2023-08-31 13:34:08 +00:00
31ef33871d [vmap][dynamo] run vmap under python dispatcher (#107947)
Before this PR, `test_op_has_batch_rule_cholesky_solve_cpu_float32` failed:
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/functorch/test_vmap.py -k test_op_has_batch_rule_cholesky_solve_cpu_float32
test/functorch/test_vmap.py terminate called after throwing an instance of 'pybind11::error_already_set'
 what():  RuntimeError: /home/kshiteej/Pytorch/pytorch_functorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd.cpp:2214: SymIntArrayRef expected to contain only concrete integers
```

After this PR, the test cases pass.

NOTE: We can't run 100% of the tests on CI till we figure out https://github.com/pytorch/pytorch/issues/107444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107947
Approved by: https://github.com/zou3519
2023-08-31 13:16:44 +00:00
58268137f1 [pytree] Allow register_pytree_node to take in 5 inputs (#108256)
Summary: This is currently breaking internal old torch.packages from when _register_pytree_node took 5 inputs.

Test Plan: CI

Differential Revision: D48834742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108256
Approved by: https://github.com/zhxchen17, https://github.com/zou3519
2023-08-31 11:06:17 +00:00
50fa5880e8 [vmap] symintify alias and squeeze (#107577)
Following tests now pass (both ops call into `alias` on certain paths)

```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/functorch/test_vmap.py -k test_squeeze -v
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/functorch/test_vmap.py -k test_conj -v
```

NOTE: Ideally, this symint version should work with non symint version as well but that would mean changes at multiple places. Wanted to get a review for this fix before-hand.

Other sites which use the `IntArrayRef` overload.
5f56c4fb32/aten/src/ATen/native/TensorShape.cpp (L1707-L1713)

`view_impl` is called from `view` and `_unsafe_view`.
5f56c4fb32/aten/src/ATen/native/TensorShape.cpp (L3295-L3306)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107577
Approved by: https://github.com/zou3519
2023-08-31 08:08:33 +00:00
138fafe72d [export] Fix torch.export() issues for server use cases. (#108275)
Test Plan: In D48788843

Differential Revision: D48811793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108275
Approved by: https://github.com/tugsbayasgalan
2023-08-31 07:19:18 +00:00
43f28beffc Use ctypes to serialize raw content for tensors. (#108287)
Summary:
There's a deadlock in the current storage implementation if the size of the tensor is too large. Use ctypes to do the serialization instead.

Test Plan:
python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export-aot-inductor --only MT5ForConditionalGeneration

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108287
Approved by: https://github.com/desertfire, https://github.com/malfet
2023-08-31 06:59:18 +00:00
cyy
c24d0d3163 clang8=>clang9 in jobs (#107144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107144
Approved by: https://github.com/malfet
2023-08-31 06:51:58 +00:00
cyy
a20fac89c8 [4/N] fix clang-tidy warnings in torch/csrc (#108305)
Fixes clang-tidy warnings in torch/csrc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108305
Approved by: https://github.com/Skylion007
2023-08-31 06:47:42 +00:00
d72b990bab [ONNX] Move large scale models without non-persistent buffers to runtime test (#108084)
Fixes https://github.com/pytorch/pytorch/issues/107715

Update models with their config to save CI running time and memories. Move some of models that doesn't need non-persistent buffers to runtime test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108084
Approved by: https://github.com/thiagocrepaldi
2023-08-31 06:05:19 +00:00
9ed0b3fcd9 [release_note_tool] Update test and skip commits that errors out (#108252)
Summary:
As titled.

Test Plan:
python test_release_notes.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108252
Approved by: https://github.com/drisspg
2023-08-31 04:38:53 +00:00
9862c7196b [Dynamo] SetVariable supports contains (#108189)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108189
Approved by: https://github.com/voznesenskym
2023-08-31 04:28:49 +00:00
98aa3745c2 Fallback to eager for float8 ops in inductor (#108293)
# Summary
As a stopgap to supporting the FP8 dtype within inductor, we would like to fall back to eager. Currently there are 3 ops that are needed for this:
`_scaled_mm` ( matmul for fp8 types)
`clone` (for creating new copies of fp8 tensors)
`to` ( for converting to and from fp8 types).

This PR registers a fallback for `_scaled_mm` and adds fp8 to the dtypes that trigger `unsupported_input_tensor`.

Prior to these changes this was failing with:
``` Shell
  File "/home/drisspg/meta/pytorch/torch/_inductor/codegen/triton_utils.py", line 11, in signature_of
    tye = JITFunction._type_of(arg.dtype)
  File "/home/drisspg/miniconda3/envs/dev/lib/python3.10/site-packages/triton/runtime/jit.py", line 229, in _type_of
    return key if isinstance(key, str) else f"*{tys[dtype_str]}"
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
KeyError: 'float8_e4m3fn'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108293
Approved by: https://github.com/peterbell10
2023-08-31 04:09:01 +00:00
0e4752bafc Allow registering decomps for HigherOrderOp; add decomp for out_dtype (#108080)
We allow registering decomps for HigherOrderOp via the existing decomp
mechanisms:
- I refactored those APIs to accept torch._ops.OperatorBase, which is the base
  class for torch.ops.HigherOrderOperator and torch.ops.OpOverload
- HigherOrderOps must directly call maybe_handle_decomp in their
  ProxyTorchDispatchMode handling in order to resolve decompositions. We
  can change this in the future so that they do not need to do this.
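
For context, a minimal sketch of the existing decomp registration these APIs extend, using a private registry so the global table is untouched (illustrative only; the formula is the standard hardswish definition):

```python
import torch
from torch._decomp import register_decomposition

my_decomps = {}  # private registry passed to the decorator

@register_decomposition(torch.ops.aten.hardswish, my_decomps)
def hardswish(x):
    return x * torch.clamp(x + 3, 0, 6) / 6
```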

Next, we add an inductor decomp for out_dtype. This decomp shouldn't be
generally available because we want to preserve out_dtype to the backend
for other use cases (i.e. executorch).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108080
Approved by: https://github.com/HDCharles
2023-08-31 03:15:38 +00:00
95e3126370 Revert "[BE] Pin scipy to 1.10.1 (#108270)"
This reverts commit cd3860cf160838a997ecbbed1ff58823c252e5b3.

Reverted https://github.com/pytorch/pytorch/pull/108270 on behalf of https://github.com/huydhn due to Some inductor tests start failing after this change. The failure comes from numba so I suspect that updating Docker pulls in an unwanted dependency update again ([comment](https://github.com/pytorch/pytorch/pull/108270#issuecomment-1700302953))
2023-08-31 03:06:13 +00:00
11860d9d41 Added info for each artifact option, added a help option to TORCH_LOGS, and changed the error message (#107758)
New message when invalid option is provided
<img width="1551" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/8b61534a-ee55-431e-94fe-2ffa25b7fd5c">

TORCH_LOGS="help"
<img width="1558" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/72e8939c-92fa-4141-8114-79db71451d42">

TORCH_LOGS="+help"
<img width="1551" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/2cdc94ac-505a-478c-aa58-0175526075d2">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107758
Approved by: https://github.com/ezyang, https://github.com/mlazos
ghstack dependencies: #106192
2023-08-31 02:12:35 +00:00
cd3860cf16 [BE] Pin scipy to 1.10.1 (#108270)
As the older version leaked memory, and there is no good reason to still test Python 3.8 against scipy 1.8.3.

Fixes https://github.com/pytorch/pytorch/security/dependabot/7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108270
Approved by: https://github.com/kit1980
2023-08-31 01:44:47 +00:00
b535ed2c1a Update to RNN documentation (issue #106085) (#106222)
Addresses [issue #106085](https://github.com/pytorch/pytorch/issues/106085).

In `torch/nn/modules/rnn.py`:
- Adds documentation string to RNNBase class.
- Adds parameters to __init__ methods for the RNN, LSTM, and GRU classes.
- Adds type annotations to __init__ methods for RNN, LSTM, and GRU.

In `torch/ao/nn/quantized/dynamic/modules/rnn.py`:
- Adds type specifications to `_FLOAT_MODULE` attributes in RNNBase, RNN, LSTM, and GRU classes.
> This resolves a `mypy` assignment error `Incompatible types in assignment (expression has type "Type[LSTM]", base class "RNNBase" defined the type as "Type[RNNBase]")` that seemed to be a result of fully specified type annotations in `torch/nn/modules/rnn.py`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106222
Approved by: https://github.com/mikaylagawarecki
2023-08-31 00:50:32 +00:00
23a6706c7d Fix triton upload channel detection (#108291)
This should be nightly for nightly and test for release candidates.  There are 2 bugs:

* The shell needs to be set to `bash` explicitly; otherwise, GHA uses `sh`, which doesn't recognize `[[`, as shown in https://github.com/pytorch/pytorch/actions/runs/6030476858/job/16362717792#step:6:10
* `${GITHUB_REF_NAME}` is un-quoted. This is basically https://www.shellcheck.net/wiki/SC2248, but it wasn't captured by actionlint, and shellcheck doesn't work with workflow YAML files. I will think about how to add a lint rule for this later.

### Testing

https://github.com/pytorch/pytorch/actions/runs/6031330411 to confirm that setting the channel is performed correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108291
Approved by: https://github.com/osalpekar, https://github.com/atalman
2023-08-31 00:40:06 +00:00
7cb4bf675b [inductor] no-side-effect codegen (#107617)
Inductor kernel codegen previously had the following side effects:
- in `Kernel.__exit__`, we add locally used buffers to graph.removed_buffers
- during codegen, we do memory allocation/free.

These made doing multiple versions of codegen for the same kernel hard. This PR refactors the code so that kernel codegen does not change graph-level state. After codegening a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
2023-08-31 00:25:17 +00:00
3817de5d84 Fix layernorm cpu precision issues (#108089)
#108072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108089
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-08-30 23:55:10 +00:00
8a089f632e [inductor] Fix MKL issue with test_indirect_device_assert (#108172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108172
Approved by: https://github.com/peterbell10
2023-08-30 23:13:22 +00:00
b2fe5eb710 [inductor] Ignore sympy.PolynomialError while simplifying (#108280)
Fixes #108276
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108280
Approved by: https://github.com/Skylion007
2023-08-30 22:57:15 +00:00
6830480999 [inductor] Move test_inductor_sequence_nr out of test_aot_inductor (#108237)
Summary: The initial PR that added test_inductor_sequence_nr (https://github.com/pytorch/pytorch/pull/103129) seems to think test_aot_inductor is to test aot_autograd + inductor. Move the test to test_torchinductor instead, which can simplify the logic for test_aot_inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108237
Approved by: https://github.com/davidberard98, https://github.com/Neilblaze
2023-08-30 22:42:41 +00:00
7fb131043c [memory snapshots] _record_memory_history_legacy bug fix (#108260)
The argument order for the legacy path got swapped in a recent patch. Because there is still a blog post documenting the legacy interface, people are hitting this pathway.

This patch fixes #108208
I will also update the blog post to the new API so that people are
more likely to use the newer `_record_memory_history` API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108260
Approved by: https://github.com/awgu
2023-08-30 22:33:04 +00:00
5911faeb8f Horizontally fuse input concatenation (#108115)
Fixes https://github.com/pytorch/pytorch/issues/106688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108115
Approved by: https://github.com/jansel
2023-08-30 21:57:11 +00:00
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've added these error types (a handling sketch follows the list):

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
6dacf52f88 [submodule] [C10] Update gloo. (#107236)
This brings in an improved rendezvous for the TCP backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107236
Approved by: https://github.com/H-Huang, https://github.com/XilunWu
2023-08-30 20:59:13 +00:00
39130c7433 Add reinplacing pass for scatters + incremental fake tensor updating (#106192)
mutation for params)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106192
Approved by: https://github.com/jansel, https://github.com/eellison
2023-08-30 20:41:37 +00:00
d0b725ea8a reduce overhead in split and chunk for NestedTensor (#108213)
GH first copy of #108207

Uses raw pointers to reduce construction overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108213
Approved by: https://github.com/dracifer, https://github.com/jbschlosser
2023-08-30 20:40:44 +00:00
071f9ccd8b [inductor] Add input generation fn option for autotuning (#108242)
Summary: In certain cases, the content of some inputs is important for consistent behavior (and performance signals) of an operator. One example is fbgemm jagged tensor operators, where the offsets Tensor's content must be consistent with the shape of the values Tensor (i.e. `values.size(0) == offsets[-1]` + monotonicity).

This is particularly important in the context of autotuning, where the inputs are currently generated as random (for float types) or all-zero (for int types) `torch.Tensors`. Even if the extern kernel and Triton template are robust enough to tolerate improper input content, the performance signals would likely be useless.

In this PR, we add an option to pass input-generating functions for a subset of inputs of the autotuned op (to the `AlgorithmSelectorCache.__call__`).
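
A minimal sketch of the motivating case (illustrative; the exact hook on `AlgorithmSelectorCache.__call__` is in the PR): generate an offsets tensor that is actually consistent with the values tensor, instead of the all-zeros default.

```python
import torch

def gen_offsets(values_len, num_rows):
    # Monotonic offsets with offsets[-1] == values_len, as jagged
    # tensor ops require.
    cuts = torch.sort(torch.randint(0, values_len + 1, (num_rows - 1,))).values
    return torch.cat([torch.zeros(1, dtype=torch.long), cuts,
                      torch.tensor([values_len])])

offsets = gen_offsets(values_len=100, num_rows=8)
assert offsets[-1] == 100 and bool((offsets[1:] >= offsets[:-1]).all())
```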

Test Plan:

```
$ python test/inductor/test_max_autotune.py

...

----------------------------------------------------------------------
Ran 17 tests in 80.146s

OK
```

Differential Revision: [D48831225](https://our.internmc.facebook.com/intern/diff/D48831225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108242
Approved by: https://github.com/jansel
2023-08-30 20:19:26 +00:00
ad17e5ec4e Faster gc_count update for CUDACachingAllocator (#108071)
Summary: Modify the way we update gc_count in CUDACachingAlloctor to make it faster.

Reviewed By: jaewonlee-fb

Differential Revision: D48481557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108071
Approved by: https://github.com/zdevito
2023-08-30 18:51:44 +00:00
238cc84af9 [TD] Emit metrics to compare heuristic quality (#108192)
When a test fails, we will now emit fine grained details about how accurately heuristics predicted the relevance of that test.

## Context
Why only look at failing tests? Our only signal that a PR is most likely relevant to a test is whether or not a test fails on it. Green tests don't tell us if the success was due to the code being good vs being irrelevant.  This isn't a perfect measure, since it can miscategorize unstable and flaky failures as having been "missed" by the heuristics, but it's a reasonable approximation.

## What's measured?
The metrics this PR collects are designed to answer the following questions

### How comprehensive are the heuristics?
- What's the false negative rate, the % of failures that ideally should have been prioritized but weren't? (Both at an aggregate level and at a per heuristic level)

### How precise are the heuristics?
- What % of failed tests were prioritized by a given heuristic? What % was prioritized overall?
- How relevant was a failed test was considered to be? (Both a aggregate level and at a per heuristic level)
- What % of time was a given heuristic prioritizing a failing test higher than any other heuristic?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108192
Approved by: https://github.com/huydhn
ghstack dependencies: #108117
2023-08-30 18:28:18 +00:00
d695486f69 [Vulkan] Fix addmm & linear when bias needs to broadcast (#108199)
Test Plan:
On Devserver
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run //xplat/caffe2:pt_vulkan_api_test_bin
```

Reviewed By: shubhraprakash1

Differential Revision: D48806899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108199
Approved by: https://github.com/digantdesai
2023-08-30 17:50:58 +00:00
5683ab74f4 [export] Fix autogenerated stacktrace (#108217)
Summary: Existing code incorrectly overwrites the stacktrace to be None: since there is no exception being handled, `traceback.format_exc` yields nothing useful. Also, we should only populate the stack trace if it is not there in the first place.

Test Plan: CI

Differential Revision: D48818478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108217
Approved by: https://github.com/zhxchen17
2023-08-30 17:44:06 +00:00
6ad5568cbc Break graph on manual_seed. (#107594)
Fix: #107187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107594
Approved by: https://github.com/eellison
2023-08-30 17:24:11 +00:00
b1b9a3646a Increased logging threshold for profiler matching (#108010)
Fixed test_memory_profiler::TestMemoryProfilerE2E::test_memory_timeline by changing the (arbitrary) threshold for logging. With larger allocation sizes on newer AMD GPUs, a threshold above 1024 is needed to account for those differences and still satisfy the test requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108010
Approved by: https://github.com/colesbury, https://github.com/pruthvistony
2023-08-30 17:16:36 +00:00
cyy
01fc6466d1 [Reland] [1/N] fix clang-tidy warnings in torch/csrc (#108114)
Reland of PR #107648 with auto replaced with Py_ssize_t in eval_frame.c. This PR applies fixes to some found issues by clang-tidy in torch/csrc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108114
Approved by: https://github.com/Skylion007
2023-08-30 17:11:16 +00:00
7be233f3a5 Remove commit hash when building triton wheel and conda in release mode (#108203)
This is the follow-up of https://github.com/pytorch/pytorch/pull/108187 to set the correct release version without commit hash for triton wheel and conda binaries when building them in release mode.

### Testing

* With commit hash (nightly): https://github.com/pytorch/pytorch/actions/runs/6019021716
* Without commit hash https://github.com/pytorch/pytorch/actions/runs/6019378616 (by adding `--release` into the PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108203
Approved by: https://github.com/atalman
2023-08-30 16:49:21 +00:00
057b807178 [quant] Move dropout replacement to move_model_to_eval (#108184)
Summary: This commit adds a public facing
`torch.ao.quantization.move_model_to_eval` util function
for QAT users. Instead of calling model.eval() on an exported
model (which doesn't work, see
https://github.com/pytorch/pytorch/issues/103681), the user
would call this new util function instead. This ensures special
ops such as dropout and batchnorm (not supported yet) will have
the right behavior when the graph is later used for inference.

Note: Support for an equivalent `move_model_to_train` will be
added in the future. This is difficult to do for dropout
currently because the eval pattern of dropout is simply a clone
op, which we cannot just match and replace with a dropout op.
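
A minimal usage sketch per the text above (the model and inputs are placeholders; capture import path as of this change):

```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization import move_model_to_eval

m = MyQATModel()  # placeholder: a model going through the QAT flow
example_inputs = (torch.randn(1, 3, 224, 224),)
exported = capture_pre_autograd_graph(m, example_inputs)

# Instead of exported.eval(), which doesn't work on exported models:
move_model_to_eval(exported)
```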

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_move_model_to_eval

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar

Differential Revision: [D48814735](https://our.internmc.facebook.com/intern/diff/D48814735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108184
Approved by: https://github.com/jerryzh168
2023-08-30 16:33:17 +00:00
0fb1c05c5a [pytorch] Add decomp rule for scaled_dot_product_attention (#108180)
`scaled_dot_product_attention` used to be decomposed in pre-autograd, given that it calls `_scaled_dot_product_attention_math` and `_scaled_dot_product_attention_math` only has a `CompositeImplicitAutograd` kernel. As a result it's decomposed into ops with finer granularity.

However recent PRs (#103826 #105131) added new logic in `scaled_dot_product_attention` and now it calls `_scaled_dot_product_flash_attention` which contains a CPU kernel. This results in `_scaled_dot_product_flash_attention` showing up in `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still being decomposed the same way as before, i.e., going through `_scaled_dot_product_attention_math`. Notice that this decomp rule should be excluded by inductor.

Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108180
Approved by: https://github.com/SherlockNoMad
2023-08-30 15:52:08 +00:00
e31038d574 Check results dtype in index_out (#108167)
This logic exists for index_put and index_add, but for some reason not for `index.out`
Testing is skipped, as this function is not technically exposed at the Python level.

Fixes https://github.com/pytorch/pytorch/issues/107698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108167
Approved by: https://github.com/albanD
2023-08-30 14:55:18 +00:00
fe1f26af8a Add support for PickleOpCode::APPEND in torch unpickler (#104027)
Reviewed By: qiminglu

Differential Revision: D46760650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104027
Approved by: https://github.com/ezyang
2023-08-30 14:24:50 +00:00
0297232053 Fix operator precedence (#108196)
Summary: Ensure that modules are only installed if they are not fsdp modules.
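
A hypothetical illustration of this bug class (not the actual code from this diff): in Python, `not` binds tighter than `and`, so parenthesization changes the condition.

```python
is_fsdp_module = True
has_hooks = False

# These two conditions parse differently:
a = not is_fsdp_module and has_hooks    # (not is_fsdp_module) and has_hooks
b = not (is_fsdp_module and has_hooks)  # a different condition entirely

print(a, b)  # False True
```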

Differential Revision: D48810186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108196
Approved by: https://github.com/shunting314, https://github.com/anijain2305
2023-08-30 14:00:33 +00:00
813246c554 Add scalar conversion using avx instructions for half (#102140)
### Motivation

Scalar conversion between Half and Float on CPU is more time consuming than BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX intrinsics for Half to speed it up.

### Testing
Tested maxpool and compared with the results of #98819 (a minimal timing sketch follows the tables).
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061
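
A minimal timing sketch in the spirit of the tables above (shape taken from the first row; this is not the benchmark script that produced these numbers):

```python
import torch
import torch.nn.functional as F
from torch.utils.benchmark import Timer

x = torch.randn(1, 56, 264, 264)
for dtype in (torch.half, torch.bfloat16):
    xd = x.to(dtype)
    t = Timer(
        stmt="F.max_pool2d(xd, kernel_size=3, stride=1)",
        globals={"F": F, "xd": xd},
    )
    print(dtype, t.blocked_autorange().median)
```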

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
2023-08-30 13:26:53 +00:00
ca7249b80a Remove duplicate sentences in description of torch.linalg.eig (#108230)
This removes nearly identical sentences in the description of `torch.linalg.eig` describing the checks in the backward pass by @lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108230
Approved by: https://github.com/lezcano
2023-08-30 13:16:04 +00:00
3a79621c9d [Inductor] Add fused_attention pattern matcher with additional clone (#108141)
A previous PR https://github.com/pytorch/pytorch/pull/106274 decomposes `aten.dropout` and creates a `clone()` in `eval()` mode or when `p=0`. This made many SDPA-related models fail to match the fused_attention pattern matchers.

This PR adds new fused_attention pattern matchers with an additional clone to re-enable the SDPA op matching.
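
An illustrative sketch of the graph shape involved (not Inductor's actual pattern definition): with `p=0` or in `eval()` mode, the decomposed dropout degenerates into a `clone()` that the original patterns did not expect.

```python
import math
import torch

def sdpa_like(q, k, v):
    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
    attn = attn.clone()  # what aten.dropout(p=0) decomposes to in eval mode
    return attn @ v

q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
out = sdpa_like(q, k, v)
```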

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108141
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-08-30 05:07:35 +00:00
e45b391ebd Enable Mypy Checking in mkldnn_fusion.py and quantization.py (#108131)
**Summary**
As in issue: https://github.com/pytorch/pytorch/issues/105230, enable Mypy Checking in `torch/_inductor/fx_passes/mkldnn_fusion.py` and `torch/_inductor/fx_passes/quantization.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108131
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #108125
2023-08-30 05:03:35 +00:00
1a5fdc2458 Re-enable some Quantization UTs after Quantization flow updates (#108125)
**Summary**
This diff mainly does 2 things:

- Re-enable the testcases skipped in commit: 9ae3d7ca90 due to the quantization flow update.
- Break down the original testcases into small testcases to make each testcase simpler.

**TestPlans**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_relu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_add
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_add_relu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_dequant_promotion
python -m pytest test_mkldnn_pattern_matcher.py -k test_qlinear
python -m pytest test_mkldnn_pattern_matcher.py -k test_qlinear_relu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108125
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2023-08-30 04:41:54 +00:00
620d267ef3 Refactor TestPrioritizations to support more priorities and reduce risk of accidental mutations (#108117)
Refactor TD code to make it easier to add additional categories later and also support the changes required to enable the metrics needed for TD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108117
Approved by: https://github.com/huydhn
2023-08-30 04:14:28 +00:00
5e0ec03a71 [inductor][easy] reuse a single is_aligned function (#108135)
Resolve comment: https://github.com/pytorch/pytorch/pull/107722#discussion_r1308117422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108135
Approved by: https://github.com/jansel
ghstack dependencies: #107722
2023-08-30 03:33:30 +00:00
bf517f4092 [vision hash update] update the pinned vision hash (#108201)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108201
Approved by: https://github.com/pytorchbot
2023-08-30 03:23:59 +00:00
283ce12aa9 Add channels_last3d support for mkldnn conv and mkldnn deconv (#95271)
### Motivation

- Add channels_last3d support for mkldnn conv and mkldnn deconv.
- Use `ideep::convolution_transpose_forward::compute_v3` instead of `ideep::convolution_transpose_forward::compute`. compute_v3 takes an `is_channels_last` flag that tells ideep whether to use the channels-last path, aligning with PyTorch's memory format check (a usage sketch follows).
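
A small usage sketch of the two memory formats benchmarked below (shapes reduced for brevity):

```python
import torch

x = torch.randn(8, 16, 4, 32, 32)
conv = torch.nn.Conv3d(16, 16, kernel_size=3)

y_contig = conv(x)  # torch.contiguous_format (NCDHW)

x_cl = x.to(memory_format=torch.channels_last_3d)
conv_cl = conv.to(memory_format=torch.channels_last_3d)
y_cl = conv_cl(x_cl)  # channels_last_3d (NDHWC layout)

print(y_cl.is_contiguous(memory_format=torch.channels_last_3d))  # True
```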

### Testing
1 socket (28 cores):

- memory format: torch.contiguous_format

module | shape | forward / ms | backward / ms
-- | -- | -- | --
conv3d | input size: (32, 32, 10, 100, 100), weight size: (32, 32, 3, 3, 3) | 64.56885 | 150.1796
conv3d | input size: (32, 16, 10, 200, 200), weight size: (16, 16, 3, 3, 3) | 100.6754 | 231.8883
conv3d | input size: (16, 4, 5, 300, 300), weight size: (4, 4, 3, 3, 3) | 19.31751 | 68.31131

module | shape | forward / ms | backward / ms
-- | -- | -- | --
ConvTranspose3d | input size: (32, 32, 10, 100, 100), weight size: (32, 32, 3, 3, 3) | 122.7646 | 207.5125
ConvTranspose3d | input size: (32, 16, 10, 200, 200), weight size: (16, 16, 3, 3, 3) | 202.4542 | 368.5492
ConvTranspose3d | input size: (16, 4, 5, 300, 300), weight size: (4, 4, 3, 3, 3) | 122.959 | 84.62577

- memory format: torch.channels_last_3d

module | shape | forward / ms | backward / ms
-- | -- | -- | --
conv3d | input size: (32, 32, 10, 100, 100), weight size: (32, 32, 3, 3, 3) | 40.06993 | 114.317
conv3d | input size: (32, 16, 10, 200, 200), weight size: (16, 16, 3, 3, 3) | 49.08249 | 133.4079
conv3d | input size: (16, 4, 5, 300, 300), weight size: (4, 4, 3, 3, 3) | 5.873911 | 17.58647

module | shape | forward / ms | backward / ms
-- | -- | -- | --
ConvTranspose3d | input size: (32, 32, 10, 100, 100), weight size: (32, 32, 3, 3, 3) | 88.4246 | 208.2269
ConvTranspose3d | input size: (32, 16, 10, 200, 200), weight size: (16, 16, 3, 3, 3) | 140.0725 | 270.4172
ConvTranspose3d | input size: (16, 4, 5, 300, 300), weight size: (4, 4, 3, 3, 3) | 23.0223 | 37.16972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95271
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-08-30 02:53:30 +00:00
13e4cce83c [DTensor] Add util API to compute_local_shape_and_global_offset for checkpointing purpose (#107996)
The compute_local_shape_and_global_offset API does the following:
1) Calculate both local_shape and global_offset in one API to replace two API calls (compute_local_size and compute_local_shape).
2) Generate the correct global_offset for checkpointing purposes. We are currently using compute_local_offset for downstream checkpoint components, which could lead to incorrect results. For checkpointing, we need global_offset instead of local_offset. In some cases, global_offset does not equal local_offset, e.g. when a dimension is sharded multiple times on different mesh dimensions (placements = [Shard(0), Shard(0)]); see the hand-worked sketch below.
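
A hand-worked sketch of that case in plain Python (no DTensor APIs): dim 0 of a size-16 tensor sharded twice over a 2x2 mesh with placements [Shard(0), Shard(0)].

```python
global_size = 16
mesh = (2, 2)

rows_level1 = global_size // mesh[0]   # 8 rows per first-level shard
rows_level2 = rows_level1 // mesh[1]   # 4 rows per second-level shard

for i in range(mesh[0]):
    for j in range(mesh[1]):
        local_offset = j * rows_level2                   # offset within the parent shard
        global_offset = i * rows_level1 + local_offset   # offset in the full tensor
        print(f"coord ({i},{j}): local={local_offset}, global={global_offset}")
# e.g. coord (1,1): local=4, global=12 -- the two disagree,
# which is why checkpointing needs the global offset.
```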

Follow-up PRs:
1) Replace related downstream components to use compute_local_shape_and_global_offset instead of compute_local_size and compute_local_offset.
2) Audit the existing code base to see whether compute_local_size and compute_local_offset can eventually be removed, since they are still in use today.

cc. @wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107996
Approved by: https://github.com/wanchaol
2023-08-30 02:46:50 +00:00
556bfe7cb5 [inductor] let codegen not rely on node order (#107320)
We'd like to benchmark fusion (either for autotuning or for gathering data to find patterns that can guide optimizations). There is a deadlock here that prevents us from doing this: to benchmark fusion, we need to do codegen before all the fusions are done. However, codegen currently relies on xSchedulerNode.last_usage information to decide which buffers are not needed at all and thus don't even need to be allocated/written (Scheduler.removed_buffers tracks this). xSchedulerNode.last_usage can only be computed once the order of all the nodes has been decided, but each fusion pass (`fuse_nodes_once`) can also change node order. So we know the final node order only after all the fusions have completed, which blocks us from doing codegen during fusion (before all fusions are done).

Here I just show the above with a chain of dependencies to make it easier to understand (a -> b means a depends on b, or b has to happen before a):
```
  benchmark one fusion decision -> codegen -> xSchedulerNode.last_usage -> node order -> all fusions have completed
```

Actually we only need to decide whether a buffer has only local usages (if yes, it's a candidate for removal). This can be decided once we know all the users of each buffer, so we can avoid using xSchedulerNode.last_usage in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107320
Approved by: https://github.com/peterbell10, https://github.com/jansel
2023-08-30 02:34:20 +00:00
7264b75763 Remove Anaconda Prune (#108111)
Anaconda Prune job has been migrated to test-infra, so this job in CCI is defunct. We can remove it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108111
Approved by: https://github.com/huydhn
2023-08-30 00:47:25 +00:00
cd07214a41 Fix various issues on build-triton-wheel workflow (#108187)
There are more issues than I expected at the beginning:

* Triton was uploaded on `main` instead of `nightly` and release branch
* The environment `conda-aws-upload` wasn't used correctly in both wheel and conda upload
* Conda update wasn't run in a separate ephemeral runner
* Duplicated upload logic; should have just used `bash .circleci/scripts/binary_upload.sh` instead
* Handle `CONDA_PYTORCHBOT_TOKEN` and `CONDA_PYTORCHBOT_TOKEN_TEST` tokens in a similar way as https://github.com/pytorch/test-infra/pull/4530

Part of https://github.com/pytorch/pytorch/issues/108154
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108187
Approved by: https://github.com/atalman
2023-08-30 00:02:32 +00:00
2c87ef3dbf [inductor] Fix inputs with existing offsets (#108168)
This cherrypicks the reinterpret_tensor change from #102625 in order to fix a subtle correctness bug when the graph inputs already have a storage_offset set.

The view change also fixes some issues with quantized models in torchbench.
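
For context, a minimal illustration of the input property involved: an ordinary view already carries a nonzero storage_offset.

```python
import torch

base = torch.arange(10)
view = base[2:]                  # shares storage with base
print(view.storage_offset())     # 2, not 0
print(view.data_ptr() == base.data_ptr())  # False: the pointer is shifted by the offset
```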

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108168
Approved by: https://github.com/desertfire
2023-08-29 23:47:03 +00:00
c3239442a3 [AOTInductor] Include constants in AOTInductor .so file. (#107718)
Summary:
Include the constants in the AOTInductor .so file.
We do not modify existing API signatures; instead we create the necessary format with the weights lifted out.

Test Plan:
test/inductor/test_aot_inductor.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718
Approved by: https://github.com/angelayi, https://github.com/eellison
2023-08-29 22:37:30 +00:00
fa49be2a49 [docs] Properly link register_post_accumulate_grad_hook docs (#108157)
The hook's docs entry now shows up properly:

![image](https://github.com/pytorch/pytorch/assets/31798555/0aa86839-b9c5-4b4b-b1b1-aa1c0c0abbab)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108157
Approved by: https://github.com/soulitzer, https://github.com/albanD
2023-08-29 22:13:33 +00:00
525b593954 Fix focus builds of macOS apps on apple silicon. (#96966) (#107816)
Summary:

Focus2 builds of some apps on apple silicon Macs are failing. We've determined that removing the `user.focus_enabled=true` config option allows the build to succeed.

Reviewed By: milend

Differential Revision: D44076509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107816
Approved by: https://github.com/kit1980
2023-08-29 21:57:01 +00:00
86bc50ae60 Add AMP support to linalg.vecdot. (#108165)
We follow the same rules as matmul.
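
A sketch of the resulting behavior (requires a CUDA device; under the matmul policy the op runs in the autocast dtype):

```python
import torch

a = torch.randn(4, 8, device="cuda")
b = torch.randn(4, 8, device="cuda")

with torch.autocast("cuda"):
    out = torch.linalg.vecdot(a, b)

print(out.dtype)  # torch.float16 with the default CUDA autocast dtype
```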

Fixes https://github.com/pytorch/pytorch/issues/108127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108165
Approved by: https://github.com/albanD
2023-08-29 21:33:52 +00:00
75884f4e1d Error when someone calls train/eval on pre_autograd graph (#108143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108143
Approved by: https://github.com/andrewor14
2023-08-29 21:03:48 +00:00
cadd97feef Remove case for RecursionError on try_solve. (#108144)
This PR removes an `except` clause for `RecursionError`. It used to be there because
`sympy.solve` was being used at the time. Since we are using the simpler `try_solve`, it's
not needed anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108144
Approved by: https://github.com/Skylion007
2023-08-29 19:22:20 +00:00
68b518c13e Add check for out of range pointer. (#107510)
### Summary

Hi! We've been fuzzing pytorch with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz) and found an error where an arbitrary address can be accessed while parsing the flatbuffer format via the `torch::load` function.

pytorch version: 18bcf62bbcf7ffd47e3bcf2596f72aa07a07d65f (the last commit at the moment of reporting the issue)

### Details
The vulnerability appears while loading arbitrary user input using the `torch::load` function. To trigger the error the input must correspond to FlatbufferFileFormat, so the flatbuffer-parsing path in the `import_ir_module` function must be executed.

Firstly, the error can occur in `GetMutableRoot` in `module.h`, where we add to the input data buffer pointer a value obtained by dereferencing that pointer (whose contents fully depend on the user input and can be arbitrary), so the resulting `flatbuffer_module` address can be corrupted.

Moreover, we can get an arbitrary address later at `flatbuffer_loader.cpp:305`, when we get the `ival` pointer with the `Get` method.
There, in the `IndirectHelper::Read` function, we add to the pointer an offset obtained by dereferencing it, so the address can be corrupted again.

The corrupted `ival` pointer is dereferenced at `table.h` in the flatbuffers project, where it is used to get another address, which is later dereferenced again at `table.h`. The resulting corrupted address is written to the `func` pointer at `flatbuffer_loader.cpp:274`, which is then used in `parseFunction`, where a write access to the address occurs.

To fix the problem we can compute the end of memory area in `parse_and_initialize_mobile_module` function like this:
```
auto* end = static_cast<char*>(data) + size;
```
And then pass it to all the callees and insert corresponding checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107510
Approved by: https://github.com/albanD
2023-08-29 18:25:11 +00:00
78810d78e8 Fix the coredump described by #106702 (#108002)
Fixes #106702 and add some tests

As shown by [maxUnpool1d](https://pytorch.org/docs/master/generated/torch.nn.MaxUnpool1d) (`MaxUnpool2d`, `MaxUnpool3d` also), `Input` and `Output` support `(N,C,*)` or `(C,*)`, but the C++ API currently supports only the former; the latter causes a coredump.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108002
Approved by: https://github.com/albanD
2023-08-29 17:14:16 +00:00
fa885baf04 [ROCm] Update ROCm pin to fix triton wheel lib issue (#108137)
ROCm nightly wheels are facing an issue where the .so files required for the triton wheel are not being copied to third_party after https://github.com/pytorch/pytorch/pull/107600

This PR updates our pinned commit to bring in a revert of a change that searched for non-numbered .so files, which seems to be causing the issue.

Before PR 107600
https://github.com/pytorch/pytorch/actions/runs/5763938339/job/15626824908
```
2023-08-04T15:35:34.8049152Z Copying DRM Libraries
2023-08-04T15:35:34.8239451Z + ROCM_SO=("libhsa-runtime64.so.1" "libamdhip64.so.5" "libamd_comgr.so.2")
2023-08-04T15:35:34.8239816Z + mkdir -p python/triton/third_party/rocm/lib
2023-08-04T15:35:34.8295087Z + for lib in '"${ROCM_SO[@]}"'
2023-08-04T15:35:34.8295443Z + filepath=/tmp/vanilla_extract/opt/rocm-5.4.2/lib/libamdhip64.so.5
2023-08-04T15:35:34.8295891Z + cp /tmp/vanilla_extract/opt/rocm-5.4.2/lib/libamdhip64.so.5 python/triton/third_party/rocm/lib/
2023-08-04T15:35:34.8421258Z ++ echo libamdhip64.so.5
2023-08-04T15:35:34.8421621Z ++ sed -e 's/\.so.*/.so/g'
2023-08-04T15:35:34.8432022Z + LINKNAME=libamdhip64.so
2023-08-04T15:35:34.8432686Z + ln -sf libamdhip64.so.5 python/triton/third_party/rocm/lib/libamdhip64.so
...
2023-08-04T15:35:35.7473664Z copying triton/third_party/rocm/lib/libamdhip64.so -> build/lib.linux-x86_64-cpython-38/triton/third_party/rocm/lib
2023-08-04T15:35:35.7617341Z copying triton/third_party/rocm/lib/libamdhip64.so.5 -> build/lib.linux-x86_64-cpython-38/triton/third_party/rocm/lib
...
2023-08-04T15:40:10.5063779Z copying build/lib.linux-x86_64-cpython-38/triton/third_party/rocm/lib/libamdhip64.so -> build/bdist.linux-x86_64/wheel/triton/third_party/rocm/lib
2023-08-04T15:40:10.6144654Z copying build/lib.linux-x86_64-cpython-38/triton/third_party/rocm/lib/libamdhip64.so.5 -> build/bdist.linux-x86_64/wheel/triton/third_party/rocm/lib
...
2023-08-04T15:40:37.3571973Z adding 'triton/third_party/rocm/lib/libamdhip64.so'
2023-08-04T15:40:38.4553988Z adding 'triton/third_party/rocm/lib/libamdhip64.so.5'
...
2023-08-04T15:40:53.3747917Z Setting rpath of triton/third_party/rocm/lib/libamdhip64.so to ''
2023-08-04T15:40:53.4602326Z $ORIGIN
2023-08-04T15:40:53.4620034Z Setting rpath of triton/third_party/rocm/lib/libamdhip64.so.5 to ''
2023-08-04T15:40:53.6337419Z $ORIGIN
...
2023-08-04T15:40:53.7152828Z Copied triton/third_party/rocm/lib/libamdhip64.so to triton/third_party/rocm/lib/libamdhip64.so
2023-08-04T15:40:53.7215480Z Copied triton/third_party/rocm/lib/libamdhip64.so.5 to triton/third_party/rocm/lib/libamdhip64.so
...
```

After PR 107600
https://github.com/pytorch/pytorch/actions/runs/5967761429/job/16190110783
```
2023-08-24T18:59:14.3976234Z + ROCM_SO=("libhsa-runtime64.so" "libamdhip64.so" "libamd_comgr.so" "libdrm.so" "libdrm_amdgpu.so")
2023-08-24T18:59:14.3977489Z + for lib in '"${ROCM_SO[@]}"'
2023-08-24T18:59:14.4216575Z + file_path=($(find $ROCM_HOME/lib/ -name "$lib"))
2023-08-24T18:59:14.4219237Z ++ find /opt/rocm/lib/ -name libamdhip64.so
2023-08-24T18:59:14.4254361Z + [[ -z /opt/rocm/lib/libamdhip64.so ]]
2023-08-24T18:59:14.4254677Z + [[ -z /opt/rocm/lib/libamdhip64.so ]]
2023-08-24T18:59:14.4254965Z + [[ -z /opt/rocm/lib/libamdhip64.so ]]
2023-08-24T18:59:14.4255246Z + [[ -z /opt/rocm/lib/libamdhip64.so ]]
2023-08-24T18:59:14.4255538Z + cp /opt/rocm/lib/libamdhip64.so python/triton/third_party/rocm/lib
2023-08-24T18:59:14.5085155Z ++ echo libamdhip64.so
2023-08-24T18:59:14.5085818Z ++ sed -e 's/\.so.*/.so/g'
2023-08-24T18:59:14.5097094Z + LINKNAME=libamdhip64.so
2023-08-24T18:59:14.5097572Z + ln -sf libamdhip64.so python/triton/third_party/rocm/lib/libamdhip64.so
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108137
Approved by: https://github.com/sunway513, https://github.com/huydhn
2023-08-29 16:47:56 +00:00
4e47ea5131 Revert "Break graph on manual_seed. (#107594)"
This reverts commit 6c28de24374db3b2a58aabf62985d18ce899c91f.

Reverted https://github.com/pytorch/pytorch/pull/107594 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause failures in trunk on inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_uniform_cuda_float, likely a landrace ([comment](https://github.com/pytorch/pytorch/pull/107594#issuecomment-1697783965))
2023-08-29 16:38:01 +00:00
fe2cda64dc [C10D] Implement new libuv backend for TCPStore. (#108066)
The new backend is currently gated behind a 'use_libuv' flag in the TCPStore constructor
to reduce the impact on existing users as we test it.

This is a reland of #105870 with a fix for a bad test.
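
A usage sketch based on the flag described above (the flag is experimental at this point, so its spelling and default may change in later releases):

```python
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore(
    "localhost", 29500, world_size=1, is_master=True,
    timeout=timedelta(seconds=30), use_libuv=True,
)
store.set("key", "value")
print(store.get("key"))  # b'value'
```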

Differential Revision: [D48742554](https://our.internmc.facebook.com/intern/diff/D48742554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108066
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
2023-08-29 14:55:14 +00:00
80c7fdf49f wrapper subclasses: support non-cpu device for dynamic shape overload (#107926)
This is tested by AOTAutograd later in the stack, but I can add direct tests if anyone wants them.

Previously, the second overload of `_make_wrapper_subclass` (which supports dynamic shapes) would always return a wrapper tensor that reported itself as being on `cpu`. This updates it to properly respect the `device` arg that was passed in.

At first I thought about doing this the same way that FakeTensor does it (override device to do a custom impl), but that seemed overly complicated. Since the subclass is a wrapper, we can just bake in the value on the wrapper.
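
A minimal sketch of the fixed behavior (private API; this uses the plain overload for illustration, while the PR itself fixes the dynamic-shape overload):

```python
import torch

class DeviceWrapper(torch.Tensor):
    @staticmethod
    def __new__(cls, elem, device):
        # The reported device is baked into the wrapper at construction.
        return torch.Tensor._make_wrapper_subclass(
            cls, elem.shape, dtype=elem.dtype, device=device
        )

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        raise NotImplementedError(f"{func} is not supported in this sketch")

w = DeviceWrapper(torch.randn(2, 3), device="meta")
print(w.device)  # meta, not cpu
```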

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107926
Approved by: https://github.com/ezyang
ghstack dependencies: #107915, #107916
2023-08-29 14:27:21 +00:00
c6e3adaf54 add dynamic shapes support for subclasses that override size/stride (#107916)
This is mostly a minor fix on top of @soulitzer's PR https://github.com/pytorch/pytorch/pull/107839.

(1) `strides` wasn't going through the new `set_tensor_attr_with_capsule` flow
(2) The dynamic shapes overload for `_make_wrapper_subclass` currently errors when you try to use custom sizes - I removed the error
(3) added a test

I need this later because I'm adding a `__torch_dispatch__` `FunctionalTensor` wrapper subclass, that needs to support dynamic shapes, and also plumb metadata calls to its inner tensor later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107916
Approved by: https://github.com/ezyang, https://github.com/soulitzer
ghstack dependencies: #107915
2023-08-29 14:27:21 +00:00
4f34caf164 add return_and_correct_aliasing() util for wrapper subclasses (#107915)
This PR adds a `return_and_correct_aliasing()` utility, that wrapper subclasses can use to get correct aliasing. I updated `TwoTensor` to use it, and added some testing that the aliasing of my `TwoTensor` subclass now matches the aliasing behavior of normal tensors.

Right now my test just uses a few hand-picked opinfos (that have varying aliasing behavior). I thought all op infos might be overkill (does that take a while to run?), but I'm happy to add them all if people prefer.

One more general question about this PR: eventually, proper aliasing will be a **requirement** in order for AOTAutograd to handle aliasing/mutations on subclasses properly during compilation. How can we make sure that wrapper subclasses use this API? A few options (from talking to Richard):

(1) Yolo require subclasses to use the API and hope users do as well (what this PR does)

(2) Yolo require subclasses to use the API, but add a kwarg to `_make_wrapper_subclass`, e.g. `manual_aliasing=True`, that torch.compile checks for before allowing the subclass to be used in compilation

(3) Automatically run this API in our python fallback, for **every** tensor subclass that currently implements `__tensor_flatten__` (aka only the "traceable" subclasses)

(4) Automatically run this API in our python fallback, for **every** tensor subclass. This would be a bit higher blast radius, since it would change the existing aliasing behavior of wrapper subclasses. Maybe.. this is the right thing to do though?

Either way, my tentative plan is to do (1) to unblock, and revisit this later once we want to come up with public docs + a more general "tensor subclass in PT2 requirements" plan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107915
Approved by: https://github.com/ezyang
2023-08-29 14:27:19 +00:00
6c28de2437 Break graph on manual_seed. (#107594)
Fix: #107187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107594
Approved by: https://github.com/eellison
2023-08-29 12:59:57 +00:00
b7624fc91e Cleaned up test_mps.py::test_output*_match (#108092)
Description:
- cleaned up test_mps.py::test_output_match and test_mps.py::test_output_grad_match tests
  - removed unused variables and useless brackets
  - simplified atol/rtol setup if/else code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108092
Approved by: https://github.com/kulinseth
2023-08-29 10:46:02 +00:00
f3a8d57aea [Dynamo x FSDP] Add support for params, buffers, submodules on FSDPManagedNNModuleVariable (#107923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107923
Approved by: https://github.com/wconstab
2023-08-29 08:54:13 +00:00
977e4302ab skip dynamic shape test for test_conv_bn_fuse (#108113)
For the test_conv_bn_fuse dynamic case, we always fuse bn with convolution, so there is only an external convolution call and no loops; the dynamic loop-vars check therefore fails. This PR skips this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108113
Approved by: https://github.com/huydhn
2023-08-29 08:39:52 +00:00
147b3495e2 [quant][pt2e] Add reference representation for dynamic quantized linear (#108073)
Summary: as titled.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_dynamic_linear
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_pt2e -- 'test_representation_dynamic_linear'

Reviewed By: kimishpatel

Differential Revision: D48703076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108073
Approved by: https://github.com/andrewor14
2023-08-29 07:12:55 +00:00
0cfc5899f9 [inductor] Improved grid_sampler_2d decomposition for cuda (#104710)
Description:
- Improved grid_sampler_2d decomposition code to generate a single cuda kernel instead of two

Related to https://github.com/pytorch/pytorch/issues/104296

Perfs:
- speed-up on cuda (~x5) and cpu (~x2) for bicubic mode (a minimal repro sketch follows the benchmark tables)

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git52598e9) PR" and "Compiled (2.1.0a0+gitcf76938) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         38.010 (+-0.118)        |          51.466 (+-1.257)          |             47.867 (+-0.124)            |     0.930 (+-0.000)      |           33.654 (+-0.411)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         35.532 (+-0.236)        |          52.189 (+-0.093)          |             58.979 (+-0.206)            |     1.130 (+-0.000)      |           32.543 (+-0.198)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         38.187 (+-0.112)        |          47.892 (+-0.117)          |             45.833 (+-0.081)            |     0.957 (+-0.000)      |           33.752 (+-0.116)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         36.708 (+-0.244)        |          51.680 (+-0.104)          |             58.360 (+-0.108)            |     1.129 (+-0.000)      |           32.576 (+-0.751)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         24.201 (+-0.088)        |          27.451 (+-0.059)          |             27.937 (+-0.081)            |     1.018 (+-0.000)      |           24.367 (+-0.074)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         19.266 (+-0.105)        |          26.070 (+-0.085)          |             26.092 (+-0.054)            |     1.001 (+-0.000)      |           20.144 (+-0.064)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         24.293 (+-0.125)        |          26.085 (+-0.064)          |             26.575 (+-0.061)            |     1.019 (+-0.000)      |           24.515 (+-0.095)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         19.440 (+-0.075)        |          25.252 (+-0.059)          |             25.259 (+-0.051)            |     1.000 (+-0.000)      |           19.770 (+-0.070)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        114.900 (+-0.508)        |         113.416 (+-1.271)          |            248.679 (+-1.431)            |     2.193 (+-0.000)      |          114.609 (+-0.515)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        115.973 (+-0.555)        |         124.711 (+-1.596)          |            282.187 (+-2.418)            |     2.263 (+-0.000)      |          115.368 (+-0.652)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        111.730 (+-0.562)        |         110.914 (+-0.865)          |            253.899 (+-2.226)            |     2.289 (+-0.000)      |          111.285 (+-1.226)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.859 (+-0.487)        |         131.696 (+-1.298)          |            294.124 (+-1.963)            |     2.233 (+-0.000)      |          110.910 (+-0.969)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |        228.811 (+-0.037)        |          92.990 (+-0.446)          |             92.648 (+-0.286)            |     0.996 (+-0.000)      |          228.274 (+-0.067)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |        222.107 (+-0.076)        |          93.247 (+-0.387)          |             92.528 (+-0.423)            |     0.992 (+-0.000)      |          221.922 (+-0.297)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        235.654 (+-0.055)        |          75.781 (+-0.566)          |            115.865 (+-0.419)            |     1.529 (+-0.000)      |          236.032 (+-0.111)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        226.752 (+-0.088)        |          76.312 (+-0.328)          |            116.468 (+-0.477)            |     1.526 (+-0.000)      |          226.950 (+-0.027)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |        225.540 (+-0.013)        |          75.638 (+-0.341)          |             72.621 (+-0.292)            |     0.960 (+-0.000)      |          225.937 (+-0.017)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |        217.425 (+-0.024)        |          75.484 (+-0.545)          |             73.518 (+-0.296)            |     0.974 (+-0.000)      |          217.793 (+-0.008)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        231.474 (+-0.020)        |          75.972 (+-0.339)          |             73.030 (+-0.387)            |     0.961 (+-0.000)      |          231.991 (+-0.184)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        223.408 (+-0.016)        |          75.622 (+-0.279)          |             73.542 (+-0.336)            |     0.973 (+-0.000)      |          223.893 (+-0.021)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        319.382 (+-0.023)        |         149.060 (+-0.190)          |            772.116 (+-0.266)            |     5.180 (+-0.000)      |          320.549 (+-0.387)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        319.987 (+-0.134)        |         154.443 (+-0.014)          |            797.651 (+-0.232)            |     5.165 (+-0.000)      |          320.665 (+-0.397)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        326.138 (+-0.439)        |         149.092 (+-0.036)          |            772.508 (+-0.259)            |     5.181 (+-0.000)      |          325.751 (+-0.398)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        326.024 (+-0.118)        |         154.452 (+-0.209)          |            797.756 (+-0.229)            |     5.165 (+-0.000)      |          326.870 (+-0.372)

Times are in microseconds (us).

```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230828-134459-affine-grid-sampler-PR-vs-Nightly-speedup.md)
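
A minimal repro sketch for the op being decomposed (bicubic path, matching the benchmarked configs):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 3, 345, 456)
theta = torch.eye(2, 3).unsqueeze(0).expand(8, -1, -1)  # identity affine transform

grid = F.affine_grid(theta, list(x.shape), align_corners=False)
y = F.grid_sample(x, grid, mode="bicubic", align_corners=False)
print(y.shape)  # torch.Size([8, 3, 345, 456])
```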

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104710
Approved by: https://github.com/lezcano
2023-08-29 05:54:24 +00:00
d040d5b9ee Fix multi output layout error in indexing dtype calculation (#108085)
Differential Revision: [D48757829](https://our.internmc.facebook.com/intern/diff/D48757829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108085
Approved by: https://github.com/yanboliang, https://github.com/davidberard98, https://github.com/jansel, https://github.com/peterbell10
2023-08-29 05:43:44 +00:00
e68b3ad14f update triton pin with needed inductor change (#107722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107722
Approved by: https://github.com/jansel, https://github.com/cpuhrsch
2023-08-29 04:31:44 +00:00
00eed6f367 Better Error Message for invalid Out_dtype + Bias for scaled_mm (#108097)
# Summary
Fixes an error case that was directly surfacing a raw cuBLASLt error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108097
Approved by: https://github.com/vkuzo
2023-08-29 04:10:17 +00:00
1b2eac00cb [vision hash update] update the pinned vision hash (#108112)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108112
Approved by: https://github.com/pytorchbot
2023-08-29 04:08:05 +00:00
6648880aca Revert "Remove Array.h (#106810)"
This reverts commit 39297eb22f13a92da40ddc79eca5f0fc937bfee1.

Reverted https://github.com/pytorch/pytorch/pull/106810 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but the build is failing precompiled header build in trunk due to a landrace with the revert of https://github.com/pytorch/pytorch/pull/106915 ([comment](https://github.com/pytorch/pytorch/pull/106810#issuecomment-1696702323))
2023-08-29 03:11:13 +00:00
de5ffa8a3a [inductor] Add aten.multinomial to disallowed cudagraphs ops (#108105)
Fixes:
```python
CUDA_LAUNCH_BLOCKING=1 ./benchmarks/dynamo/torchbench.py --inference --performance --no-skip --inductor --freezing --only nanogpt_generate
loading model: 0it [00:00, ?it/s]number of parameters: 123.69M
loading model: 0it [00:07, ?it/s]
cuda eval  nanogpt_generate
ERROR:common:Backend dynamo failed in warmup()
Traceback (most recent call last):
  File "/data/users/jansel/pytorch/torch/_inductor/cudagraph_trees.py", line 1084, in _record
    static_outputs = model(inputs)
  File "/data/users/jansel/pytorch/torch/_inductor/codecache.py", line 401, in _run_from_cache
    return compiled_graph.compiled_artifact(inputs)
  File "/tmp/torchinductor_jansel/db/cdbk4ip3fucyoccnbnoik2crjpdkliwxll653l7l3wwsxiygmade.py", line 18375, in call
    buf239 = aten.multinomial.default(buf238, 1)
  File "/data/users/jansel/pytorch/torch/_ops.py", line 448, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: operation not permitted when stream is capturing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108105
Approved by: https://github.com/eellison
ghstack dependencies: #108096, #108087, #108098
2023-08-29 02:58:48 +00:00
6d61d74545 [dynamo] Fix setattr nn.Module with new attribute (#108098)
This fixes one (but not all) of the issues in DALLE2_pytorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108098
Approved by: https://github.com/eellison
ghstack dependencies: #108096, #108087
2023-08-29 02:58:48 +00:00
39297eb22f Remove Array.h (#106810)
Summary: Replaced by std::array

Test Plan: Sandcastle

Differential Revision: D48160261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106810
Approved by: https://github.com/peterbell10
2023-08-29 02:49:08 +00:00
da54f3c519 reorder proxy / fake modes so they always run last (#104482)
**Update:** Made refactor of the original PR. See the original description below, but here I'll describe the updates:

(1) TLS changes in `TorchDispatchModeTLS.h/cpp`.

I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now **also** contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack).

`TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key, which if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack, before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped if there was one - that way if we push that same mode back onto the TLS later, we know where it goes.

`TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes".

`TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes).

There were also some mild codegen changes to support the new enum

(2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py`

The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. `TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py` now check if the current mode has a mode key, and if so they plumb it into any `push_onto_stack()` calls (which eventually instructs `TorchDispatchModeTLS` where to put the mode). Same thing for `ProxyTorchDispatchMode`.

I also had to change both of these mode's enter/exit, to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack, and save it in `enter_stack`, so that on exit we can reset the state properly.

(2) dispatching logic in `python_arg_parser.cpp`

This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now:
```
(a) dispatch_on_mode()  # try user modes first (where the mode stack automatically considers infra modes last)
(b) dispatch_on_subclass() # try user subclasses next (skipping infra subclasses)
(c) dispatch_on_subclass() # try infra subclasses next (skipping user subclasses)
```

Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this will work today: if proxy/fake modes run in step (a) and see a user subclass, they'll return NotImplemented, allowing us to redispatch to the user subclass.

How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter via arguments that do/don't have the _mode_key.

(3) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD  pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).

----- original description below -----

The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run **after** any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit

Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that.

**(1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode**

This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes.

I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, *including* fake or proxy, if one is set.

This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler.

**(2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser()**

This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs.

**(3) new python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` API's**

Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new API's to directly check if either proxy or fake modes are active.

**(4) Allow traceable tensor subclasses to access storages from python**
This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, where those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to python if the tensor subclass that you are passing in is "traceable"

**(5) Fixed subclass fakeification**

@wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct.

**(6) make tensor subclasses resizeable**

Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`.

**(7) Added a basic DoubleTensor subclass for testing**

I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482
Approved by: https://github.com/ezyang
ghstack dependencies: #107415
2023-08-29 02:36:48 +00:00
5efd63b1b8 better support for fakeifying and dynamoing through torch_dispatch subclasses (with dynamic shapes) (#107415)
There is already some support for plumbing `__torch_dispatch__` tensor subclasses through dynamo, but this PR beefs it up a bit and adds a test. In particular:

(1) Fakeifying tensor subclasses didn't properly set autograd metadata (requires_grad, is_leaf) on the newly fakeified wrapper subclass. I don't actually have a test for this in this PR, but it's tested pretty heavily later in my aot autograd tests

(2) Fakeifying tensor subclasses didn't properly track source information for dynamic shapes on the inner tensors. I added a new `WrapperSubclassFieldSource` subclass, that represents a source coming from a tensor field on a wrapper subclass, which I use in the fakeifying logic, and again in symbolic_shapes.py to generate proper guards.

(3) `_make_wrapper_subclass()` marginally updated this code to work better with dynamic shapes. One thing that's a bit weird about `_make_wrapper_subclass`: it has two overloads, and the first explicitly does not support dynamic shapes (and the second.. does not support kwargs). I think that later we probably want to consolidate / at least make the first overload work with dynamic shapes, but I didn't want to handle that in this PR (so these smaller changes seemed like a strict improvement).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107415
Approved by: https://github.com/ezyang
2023-08-29 02:36:48 +00:00
378ffde8c1 Revert "Remove some unnecessary <iostream> includes from headers (#106914)"
This reverts commit a6c29b722772816804d54eed070fbb38450d3e6f.

Reverted https://github.com/pytorch/pytorch/pull/106914 on behalf of https://github.com/izaitsevfb due to Causing metal breakage internally, see D48709279 ([comment](https://github.com/pytorch/pytorch/pull/106914#issuecomment-1696670027))
2023-08-29 02:22:33 +00:00
2f226804a0 Revert "Minor fixs to make torchbench runable on torch/xla (#107919)"
This reverts commit ed8f21282fca07621836a14f7d517148e1b944c3.

Reverted https://github.com/pytorch/pytorch/pull/107919 on behalf of https://github.com/izaitsevfb due to Conflicts with the revert of 106914 ([comment](https://github.com/pytorch/pytorch/pull/107919#issuecomment-1696662453))
2023-08-29 02:18:07 +00:00
de972529dc [logging] Add more flags to default logs (#107912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107912
Approved by: https://github.com/mlazos
2023-08-29 01:01:02 +00:00
5251ae6fb7 Explicitly include iostream (#108103)
Summary: Similar to D48568760

Test Plan: Sandcastle

Reviewed By: osalpekar

Differential Revision: D48758708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108103
Approved by: https://github.com/osalpekar
2023-08-29 00:10:34 +00:00
2d54d4c913 [inductor] Add constant_to_device for ir.Constant (#108087)
Fixes error with:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 ./benchmarks/dynamo/torchbench.py --inference --performance --no-skip --inductor --freezing --only pyhpc_turbulent_kinetic_energy
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108087
Approved by: https://github.com/eellison
ghstack dependencies: #108096
2023-08-29 00:08:11 +00:00
73235d08c3 [dynamo] Graph break on pack_padded_sequence (#108096)
This is to work around #93501.

Fixes errors in:
```
./benchmarks/dynamo/torchbench.py --inference --performance --no-skip --inductor --freezing --only tacotron2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108096
Approved by: https://github.com/davidberard98
2023-08-29 00:08:11 +00:00
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b3cb987e1f3230876654f156bd11a09.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
cd4f74fb2e [PT2] - Add check for stack (#108012)
Summary:
Add a check for `guard.stack`, whose absence was causing exceptions like:

```
torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'format'
```

Test Plan: contbuild & OSS CI

Differential Revision: D48709458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108012
Approved by: https://github.com/anijain2305
2023-08-28 23:30:34 +00:00
3488837ec1 Update ruff to v0.0.286 (#108058)
Updates ruff to v0.0.286 and fixes one false negative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108058
Approved by: https://github.com/albanD
2023-08-28 22:55:56 +00:00
8caa89917b Revert "[ATen] Update pre-compiled header (#106915)"
This reverts commit c68d0a7042e850cebc4cbe7f717fc11aedf6b9d7.

Reverted https://github.com/pytorch/pytorch/pull/106915 on behalf of https://github.com/osalpekar due to Unfortunately there is still a breaking Metal job due to the bottom PR. @kit1980 will help fix this and get this merged ([comment](https://github.com/pytorch/pytorch/pull/106915#issuecomment-1696530828))
2023-08-28 22:51:19 +00:00
9d2ffc5dfa [reland][Dynamo] cache_size policy #107496 (#108069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108069
Approved by: https://github.com/yanboliang
2023-08-28 22:06:54 +00:00
cd20a89ccc [ROCM] Add ROCm support to debug_dump and enable_debug_mode (#107845)
enable_debug_mode and debug_dump are enabled in ROCM releases.  Add ROCM flags to #if defines so they can be accessed by PyTorch users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107845
Approved by: https://github.com/pruthvistony, https://github.com/huydhn
2023-08-28 22:03:34 +00:00
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've added these error types (see the sketch after the list):

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
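
A sketch of what error handling can look like with this hierarchy (class names from the list above; availability depends on the installed version, and note the revert entry further up this log):

```python
from datetime import timedelta
import torch.distributed as dist

try:
    # A client that cannot reach its master times out quickly:
    dist.TCPStore("localhost", 29400, world_size=2, is_master=False,
                  timeout=timedelta(seconds=1))
except dist.DistNetworkError as e:
    print("network error:", e)
except dist.DistStoreError as e:
    print("store error:", e)
except dist.DistError as e:
    print("other distributed error:", e)
```
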
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
9fdb5ef26b Skip ROCm jobs on PR (for now) (#108083)
Follow AMD's suggestion to relieve the ROCm queue by letting these jobs run only on trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108083
Approved by: https://github.com/pruthvistony, https://github.com/seemethere
2023-08-28 21:42:31 +00:00
199e23bc3a [quant][be] Clean up QAT tests in test_quantize_pt2e.py (#107991)
Summary: This commit does 4 main things:

1. When verifying QAT numerics, automatically check both the
per tensor and the per channel cases, and automatically verify
convert numerics

2. When verifying the QAT graph, automatically check both the
per tensor and the per channel cases

3. Merge verify graph and verify numerics tests for conv-bn

4. Fix `test_prepare_qat_conv_bn_fusion_getitem_placeholder`,
which was no longer testing the right thing after recent capture
changes, since the maxpool op is no longer followed by a
getitem node. However, we do still need this test for other
ops that *are* followed by getitem nodes (e.g. standalone BN).

Items (1) - (3) make the QAT tests significantly less verbose
and easier to read.

Test Plan:
python test/test_quantization.py TestQuantizePT2E
python test/test_quantization.py TestQuantizePT2EModels

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107991
Approved by: https://github.com/jerryzh168
2023-08-28 21:12:00 +00:00
18a58f0bd6 Implement "RAdamW" optimizer (#107507)
Fixes #107282

## Overview

- The basic design decisions follow those made in #103881 (tensor operations, test cases, order and position of arguments, etc.)
- For the decoupled weight decay algorithm, I referred to [1, 2]

## Backwards-incompatible changes

- positional argument `decoupled_weight_decay` is added to:
    - `torch.optim.radam`

Existing code that refers to this API may be affected.

Note: `decoupled_weight_decay` is also added to `torch.optim.RAdam`; however, since it was added in the last position with a default value, existing callers are not affected.
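
A usage sketch (assuming the keyword lands on `torch.optim.RAdam` as described):

```python
import torch

model = torch.nn.Linear(10, 10)
# decoupled_weight_decay=True selects AdamW-style decoupled decay ("RAdamW");
# the default (False) keeps the original L2-regularization behavior.
opt = torch.optim.RAdam(model.parameters(), lr=1e-3, weight_decay=1e-2,
                        decoupled_weight_decay=True)
```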

## Reference

- [1] [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)
- [2] https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam/radam.py#L5-L94

## TODO

- [x] implement tensor operation
- [x] implement test cases
- [x] modify doc-string
- [x] pass unit test code locally `python test/test_optim.py -k test_radam`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107507
Approved by: https://github.com/janeyx99
2023-08-28 20:50:25 +00:00
8cbf77585d Revert "[1/N] fix clang-tidy warnings in torch/csrc (#107648)"
This reverts commit 49eeca00d1e76dd0158758f2c29da6b1d06bf54a.

Reverted https://github.com/pytorch/pytorch/pull/107648 on behalf of https://github.com/osalpekar due to This causes breakages due to underspecified type ([comment](https://github.com/pytorch/pytorch/pull/107648#issuecomment-1696372588))
2023-08-28 20:35:12 +00:00
b0d109f29f [ONNX] Bump onnx submodule to 1.14.1; ONNX Runtime 1.16 (#106984)
Bump dependencies:

- ort-nightly 1.16.0.dev20230824005
- onnx 1.14.1rc2
- onnxscript 0.1.0.dev20230825
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106984
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-08-28 20:11:29 +00:00
bcda859e34 fix typos (#108006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108006
Approved by: https://github.com/Skylion007
2023-08-28 19:49:09 +00:00
5d85d897e0 Torchrec Enablement Fixes - Re-PR 107910 (#108018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108018
Approved by: https://github.com/wconstab
2023-08-28 19:47:53 +00:00
73cbe95005 [pt2][autotuning] add logging for failed autotunings (#108034)
Summary: Log autotunings that fail due to CUDA misaligned address errors.

Test Plan: https://www.internalfb.com/intern/daiquery/?queryid=1587758145084896

Differential Revision: D48663354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108034
Approved by: https://github.com/jansel
2023-08-28 19:44:38 +00:00
182a9cf366 Add Independent Memory Efficient and Flash Attention Build Flags (#107985)
# Summary
In an effort to simplify https://github.com/pytorch/pytorch/pull/105602, this PR pulls out independent chunks of code that can be landed prior to FlashV2 landing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107985
Approved by: https://github.com/cpuhrsch
2023-08-28 18:39:18 +00:00
f0c6e5c91f Fix the use of inputs.build_environment in #107868 (#108075)
It should be `${{ inputs.build_environment }}`, although I wonder why we don't just clean up the artifacts directory for all builds instead of just `aarch64`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108075
Approved by: https://github.com/atalman, https://github.com/seemethere
2023-08-28 18:29:19 +00:00
584a01b650 Fix LayerNorm(bias=False) error (#108060)
Fixes #108048

- [ ] Cherry pick this [here](https://github.com/pytorch/pytorch/issues/108055)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108060
Approved by: https://github.com/jbschlosser, https://github.com/albanD, https://github.com/malfet
2023-08-28 18:23:13 +00:00
cyy
054f3f1d8f [3/N] fix clang-tidy warnings in torch/csrc (#108024)
Apply fixes to some issues found by clang-tidy in torch/csrc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108024
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
2023-08-28 18:00:00 +00:00
356b8f6339 [dynamo]bugfix:implement numel() for SizeVariable (#107944)
Fix the issue that `SizeVariable` does not support the `numel()` method.
Fixes #106407
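
A minimal repro sketch of the pattern this enables under Dynamo:

```python
import torch

@torch.compile
def f(x):
    # x.size() is traced as a SizeVariable; calling numel() on it
    # previously failed instead of returning the product of the dims.
    return x.sum() / x.size().numel()

f(torch.randn(2, 3))
```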

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107944
Approved by: https://github.com/Skylion007
2023-08-28 17:54:57 +00:00
7349e8c1a1 Don't use np.random for TorchDynamo (#108009)
Part of https://github.com/pytorch/pytorch/issues/107970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108009
Approved by: https://github.com/lezcano
2023-08-28 17:18:40 +00:00
a1d8132210 Enable mypy check in torch/_inductor/optimize_indexing.py (#107943)
Fixes #105230

```shell
$ lintrunner init && lintrunner -a torch/_inductor/optimize_indexing.py
...
ok No lint issues.
Successfully applied all patches.
```

```shell
$ mypy torch/_inductor/optimize_indexing.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107943
Approved by: https://github.com/Skylion007
2023-08-28 17:08:13 +00:00
20f3808aa2 Implement decomposition for aten.tensor_split.tensor_indices_or_sections (#107251)
Summary: Before this change, the tensor_indices_or_sections variant of aten.tensor_split caused a `RuntimeError: The tensor has a non-zero number of elements` because that overload needs to introspect data. Decomposing into one of the other two tensor_split variants fixes the problem.
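
For reference, the two call forms agree, which is what makes the decomposition safe (a sketch):

```python
import torch

x = torch.arange(10)
# tensor_indices_or_sections variant: split points given as a 1-D tensor
a = torch.tensor_split(x, torch.tensor([2, 5]))
# list-of-ints variant, one of the forms the op decomposes into
b = torch.tensor_split(x, [2, 5])
assert all(torch.equal(u, v) for u, v in zip(a, b))
```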

Test Plan:
Enabled tensor_split tests in test/inductor/test_torchinductor_opinfo.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107251
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-08-28 17:01:23 +00:00
010064159b Fix the issue described by #106532 (#108036)
Fixes #106532

As the title says
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108036
Approved by: https://github.com/albanD
2023-08-28 16:23:47 +00:00
c8f7f2659b Two small mem_eff bug fixes (#103201)
# Summary
Upstream two small bug fixes:
* https://github.com/fairinternal/xformers/pull/679
* https://github.com/fairinternal/xformers/pull/681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103201
Approved by: https://github.com/cpuhrsch
2023-08-28 16:21:47 +00:00
67371c7431 Binary op support for (B, C, *, *) NT with (C, 1, 1) dense (#107890)
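
A sketch of the newly supported broadcast pattern (shapes chosen for illustration):

```python
import torch

# A (B, C, *, *) nested tensor: two samples with C=3 channels and ragged H, W
nt = torch.nested.nested_tensor([torch.randn(3, 4, 5), torch.randn(3, 2, 2)])
scale = torch.randn(3, 1, 1)  # dense (C, 1, 1) operand
out = nt * scale  # broadcasts across each constituent (C, H_i, W_i) tensor
```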
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107890
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #107891, #107892
2023-08-28 15:19:39 +00:00
33d70be95f Binary out-of-place ge.Scalar / eq.Scalar support for NT (#107892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107892
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #107891
2023-08-28 15:18:37 +00:00
e917d2749a Unary out-of-place sin / cos support for NT (#107891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107891
Approved by: https://github.com/cpuhrsch
2023-08-28 15:17:34 +00:00
264df88a2d [C10D][Logger]Add more info to c10d logger (#107331)
This PR adds pg_name and world_size to c10d logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107331
Approved by: https://github.com/kumpera
2023-08-28 15:10:56 +00:00
dcc674de8e remove step invocation warning (#107216)
Fixes #99734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107216
Approved by: https://github.com/davidberard98, https://github.com/aaronenyeshi
2023-08-28 14:35:25 +00:00
60bb02a907 Fix fallback FBGEMM implementation for Big Endian systems. (#96422)
This change fixes multiple tests in
test/test_quantization.py::TestQuantizedEmbeddingOps.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96422
Approved by: https://github.com/huydhn
2023-08-28 12:44:12 +00:00
49e964cad6 Automatically turn on dynamo in cond (#108028)
A replacement of https://github.com/pytorch/pytorch/pull/107932.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108028
Approved by: https://github.com/zou3519
ghstack dependencies: #108025, #108026, #108027
2023-08-28 10:16:41 +00:00
6f8eecfb10 Add UncapturedHigherOrderOpError to always raise exceptions for cond. (#108027)
We want cond to always throw errors regardless of the user's torch.compile mode.

The current implementation is to:
1. Catch the UserError.GRAPH_BREAK_IN_CONTROL_FLOW and, once seen, raise directly: once in [break_graph_if_unsupported](bad3f2db40/torch/_dynamo/symbolic_convert.py (L1250)), which catches and raises for call_function (the entry point of higher-order operators) and a few others.
2. The raised exception is caught and raised again in [step](bad3f2db40/torch/_dynamo/symbolic_convert.py (L691)), where all instructions' exceptions are handled.
3. At the top level, we treat it as a hard error and do not suppress it (see the sketch below).
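
A sketch of the intended user-facing behavior (assuming the `cond` entry point from `functorch.experimental.control_flow`, where it lived at the time):

```python
import torch
from functorch.experimental import control_flow

def true_fn(x):
    print("side effect")  # a graph break inside a cond branch
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile  # even without fullgraph=True ...
def f(pred, x):
    return control_flow.cond(pred, true_fn, false_fn, [x])

# ... this is expected to raise UncapturedHigherOrderOpError
# rather than silently fall back to eager.
f(torch.tensor(True), torch.randn(3))
```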

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108027
Approved by: https://github.com/zou3519
ghstack dependencies: #108025, #108026
2023-08-28 07:23:03 +00:00
3821 changed files with 241740 additions and 429315 deletions

View File

@ -71,6 +71,9 @@ if [[ "$image" == *cuda* && "$UBUNTU_VERSION" != "22.04" ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
elif [[ "$image" == *cuda*linter* ]]; then
# Use a separate Dockerfile for linter to keep a small image size
DOCKERFILE="linter-cuda/Dockerfile"
elif [[ "$image" == *linter* ]]; then
# Use a separate Dockerfile for linter to keep a small image size
DOCKERFILE="linter/Dockerfile"
@ -129,35 +132,6 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc7)
CUDA_VERSION=11.8.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc7-inductor-benchmarks)
CUDA_VERSION=11.8.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
@ -181,13 +155,13 @@ case "$image" in
CONDA_CMAKE=yes
ONNX=yes
;;
pytorch-linux-focal-py3-clang7-android-ndk-r19c)
pytorch-linux-focal-py3-clang9-android-ndk-r21e)
ANACONDA_PYTHON_VERSION=3.8
CLANG_VERSION=7
CLANG_VERSION=9
LLVMDEV=yes
PROTOBUF=yes
ANDROID=yes
ANDROID_NDK_VERSION=r19c
ANDROID_NDK_VERSION=r21e
GRADLE_VERSION=6.8.3
NINJA_VERSION=1.9.0
;;
@ -228,7 +202,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.4.2
ROCM_VERSION=5.6
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -239,22 +213,11 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.6
ROCM_VERSION=5.7
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.8-gcc7)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
;;
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=11
@ -286,6 +249,12 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang15-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=15
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.8-gcc11)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=11
@ -297,6 +266,12 @@ case "$image" in
TRITON=yes
DOCS=yes
;;
pytorch-linux-jammy-py3-clang12-executorch)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=12
CONDA_CMAKE=yes
EXECUTORCH=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
@ -304,6 +279,11 @@ case "$image" in
ANACONDA_PYTHON_VERSION=3.9
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CONDA_CMAKE=yes
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
@ -321,6 +301,9 @@ case "$image" in
extract_version_from_image_name rocm ROCM_VERSION
NINJA_VERSION=1.9.0
TRITON=yes
# To ensure that any ROCm config will build using conda cmake
# and thus have LAPACK/MKL enabled
CONDA_CMAKE=yes
fi
if [[ "$image" == *centos7* ]]; then
NINJA_VERSION=1.10.2
@ -354,14 +337,11 @@ if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
fi
# Build image
# TODO: build-arg THRIFT is not turned on for any image, remove it once we confirm
# it's no longer needed.
docker build \
--no-cache \
--progress=plain \
--build-arg "BUILD_ENVIRONMENT=${image}" \
--build-arg "PROTOBUF=${PROTOBUF:-}" \
--build-arg "THRIFT=${THRIFT:-}" \
--build-arg "LLVMDEV=${LLVMDEV:-}" \
--build-arg "DB=${DB:-}" \
--build-arg "VISION=${VISION:-}" \
@ -393,6 +373,7 @@ docker build \
--build-arg "ONNX=${ONNX}" \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \

View File

@ -98,6 +98,18 @@ COPY ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access to
ENV CMAKE_C_COMPILER cc
ENV CMAKE_CXX_COMPILER c++
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -0,0 +1 @@
ca6322dcfc51b209a06b76d160bd95d81d58f15c

View File

@ -1 +1 @@
4.27.4
6c26faa159b79a42d7fa46cb66e2d21523351987

View File

@ -1 +1 @@
b9d43c7dcac1fe05e851dd7be7187b108af593d2
730b907b4d45a4713cbc425cbf224c46089fd514

View File

@ -1 +1 @@
05d67b9418cacda0d356c2102d7c1a887948b013
dafe1459823b9549417ed95e9720f1b594fab329

View File

@ -1 +1 @@
e6216047b8b0aef1fe8da6ca8667a3ad0a016411
bcad9dabe15021c53b6a88296e9d7a210044f108

View File

@ -9,10 +9,7 @@ install_ubuntu() {
# "$UBUNTU_VERSION" == "18.04"*
# instead of
# "$UBUNTU_VERSION" == "18.04"
if [[ "$UBUNTU_VERSION" == "18.04"* ]]; then
cmake3="cmake=3.10*"
maybe_libiomp_dev="libiomp-dev"
elif [[ "$UBUNTU_VERSION" == "20.04"* ]]; then
if [[ "$UBUNTU_VERSION" == "20.04"* ]]; then
cmake3="cmake=3.16*"
maybe_libiomp_dev=""
elif [[ "$UBUNTU_VERSION" == "22.04"* ]]; then
@ -23,7 +20,9 @@ install_ubuntu() {
maybe_libiomp_dev="libiomp-dev"
fi
if [[ "$CLANG_VERSION" == 12 ]]; then
if [[ "$CLANG_VERSION" == 15 ]]; then
maybe_libomp_dev="libomp-15-dev"
elif [[ "$CLANG_VERSION" == 12 ]]; then
maybe_libomp_dev="libomp-12-dev"
elif [[ "$CLANG_VERSION" == 10 ]]; then
maybe_libomp_dev="libomp-10-dev"

View File

@ -54,23 +54,13 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ]; then
conda_install numpy=1.23.5 ${CONDA_COMMON_DEPS}
elif [ "$ANACONDA_PYTHON_VERSION" = "3.10" ]; then
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
elif [ "$ANACONDA_PYTHON_VERSION" = "3.9" ]; then
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
elif [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
else
# Install `typing-extensions` for 3.7
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS} typing-extensions
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
fi
# This is only supported in 3.8 upward
if [ "$MINOR_PYTHON_VERSION" -gt "7" ]; then
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# and libpython-static for torch deploy
conda_install llvmdev=8.0.0 "libpython-static=${ANACONDA_PYTHON_VERSION}"
fi
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# and libpython-static for torch deploy
conda_install llvmdev=8.0.0 "libpython-static=${ANACONDA_PYTHON_VERSION}"
# Use conda cmake in some cases. Conda cmake will be newer than our supported
# min version (3.5 for xenial and 3.10 for bionic), so we only do it in those
@ -89,13 +79,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt
# Update scikit-learn to a python-3.8 compatible version
if [[ $(python -c "import sys; print(int(sys.version_info >= (3, 8)))") == "1" ]]; then
pip_install -U scikit-learn
else
# Pinned scikit-learn due to https://github.com/scikit-learn/scikit-learn/issues/14485 (affects gcc 5.5 only)
pip_install scikit-learn==0.20.3
fi
pip_install -U scikit-learn
if [ -n "$DOCS" ]; then
apt-get update

View File

@ -0,0 +1,62 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
clone_executorch() {
EXECUTORCH_PINNED_COMMIT=$(get_pinned_commit executorch)
# Clone the Executorch
git clone https://github.com/pytorch/executorch.git
# and fetch the target commit
pushd executorch
git checkout "${EXECUTORCH_PINNED_COMMIT}"
git submodule update --init
popd
chown -R jenkins executorch
}
install_buck2() {
pushd executorch/.ci/docker
BUCK2_VERSION=$(cat ci_commit_pins/buck2.txt)
source common/install_buck.sh
popd
}
install_conda_dependencies() {
pushd executorch/.ci/docker
# Install conda dependencies like flatbuffer
conda_install --file conda-env-ci.txt
popd
}
install_pip_dependencies() {
pushd executorch/.ci/docker
# Install all Python dependencies
pip_install -r requirements-ci.txt
popd
}
setup_executorch() {
pushd executorch
source .ci/scripts/utils.sh
install_flatc_from_source
pip_install .
build_executorch_runner "cmake"
# Make sure that all the newly generate files are owned by Jenkins
chown -R jenkins .
popd
}
clone_executorch
install_buck2
install_conda_dependencies
install_pip_dependencies
setup_executorch

View File

@ -6,23 +6,21 @@ source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
function install_huggingface() {
local version
version=$(get_pinned_commit huggingface)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "transformers==${version}"
commit=$(get_pinned_commit huggingface)
pip_install pandas==2.0.3
pip_install "git+https://github.com/huggingface/transformers@${commit}"
}
function install_timm() {
local commit
commit=$(get_pinned_commit timm)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "git+https://github.com/rwightman/pytorch-image-models@${commit}"
pip_install pandas==2.0.3
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y cmake torch torchvision triton
}
# Pango is needed for weasyprint which is needed for doctr
conda_install pango
install_huggingface
# install_timm
install_timm

.ci/docker/common/install_onnx.sh (23 lines changed) Normal file → Executable file
View File

@ -4,36 +4,35 @@ set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
retry () {
"$@" || (sleep 10 && "$@") || (sleep 20 && "$@") || (sleep 40 && "$@")
}
# A bunch of custom pip dependencies for ONNX
pip_install \
beartype==0.10.4 \
beartype==0.15.0 \
filelock==3.9.0 \
flatbuffers==2.0 \
mock==5.0.1 \
ninja==1.10.2 \
networkx==2.0 \
numpy==1.22.4
numpy==1.24.2
# ONNXRuntime should be installed before installing
# onnx-weekly. Otherwise, onnx-weekly could be
# overwritten by onnx.
pip_install \
onnxruntime==1.15.1 \
parameterized==0.8.1 \
pytest-cov==4.0.0 \
pytest-subtests==0.10.0 \
tabulate==0.9.0 \
transformers==4.31.0
transformers==4.32.1
# Using 1.15dev branch for the following not yet released features and fixes.
# - Segfault fix for shape inference.
# - Inliner to workaround ORT segfault.
pip_install onnx-weekly==1.15.0.dev20230717
pip_install coloredlogs packaging
retry pip_install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ --no-cache-dir --no-input ort-nightly==1.17.0.dev20231005006
# TODO: change this when onnx-script is on testPypi
# pip_install onnxscript-preview==0.1.0.dev20230809 --no-deps
# NOTE: temp change for CI to run on unpublished onnxscript PR.
pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@f69be19ebd3f2e0d7efe64b0c7be3329cbab3822" --no-deps
pip_install -i https://test.pypi.org/simple/ onnx==1.15.0rc2
pip_install onnxscript==0.1.0.dev20231128 --no-deps
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

View File

@ -5,8 +5,10 @@ set -ex
# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git
pushd magma
# Fixes memory leaks of magma found while executing linalg UTs
git checkout 28592a7170e4b3707ed92644bf4a689ed600c27f
# Version 2.7.2 + ROCm related updates
git checkout 823531632140d0edcb7e77c3edc0e837421471c5
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc

View File

@ -1,14 +0,0 @@
apt-get update
apt-get install -y sudo wget libboost-dev libboost-test-dev libboost-program-options-dev libboost-filesystem-dev libboost-thread-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev
wget https://www-us.apache.org/dist/thrift/0.12.0/thrift-0.12.0.tar.gz
tar -xvf thrift-0.12.0.tar.gz
cd thrift-0.12.0
for file in ./compiler/cpp/Makefile*; do
sed -i 's/\-Werror//' $file
done
./bootstrap.sh
./configure --without-php --without-java --without-python --without-nodejs --without-go --without-ruby
sudo make
sudo make install
cd ..
rm thrift-0.12.0.tar.gz

View File

@ -23,8 +23,10 @@ fi
# The logic here is copied from .ci/pytorch/common_utils.sh
TRITON_PINNED_COMMIT=$(get_pinned_commit ${TRITON_TEXT_FILE})
apt update
apt-get install -y gpg-agent
if [ -n "${UBUNTU_VERSION}" ];then
apt update
apt-get install -y gpg-agent
fi
if [ -n "${CONDA_CMAKE}" ]; then
# Keep the current cmake and numpy version here, so we can reinstall them later
@ -36,12 +38,12 @@ if [ -z "${MAX_JOBS}" ]; then
export MAX_JOBS=$(nproc)
fi
if [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}" == "7" ]]; then
if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}" == "7" ]]; then
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
elif [ -n "${CLANG_VERSION}" ]; then
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9

View File

@ -0,0 +1,44 @@
ARG UBUNTU_VERSION
FROM ubuntu:${UBUNTU_VERSION}
ARG UBUNTU_VERSION
ENV DEBIAN_FRONTEND noninteractive
# Install common dependencies (so that this step can be cached separately)
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install missing libomp-dev
RUN apt-get update && apt-get install -y --no-install-recommends libomp-dev && apt-get autoclean && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install cuda and cudnn
ARG CUDA_VERSION
RUN wget -q https://raw.githubusercontent.com/pytorch/builder/main/common/install_cuda.sh -O install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH
# Note that Docker build forbids copying files from outside the build context
COPY ./common/install_linter.sh install_linter.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_linter.sh
RUN rm install_linter.sh common_utils.sh
USER jenkins
CMD ["bash"]

View File

@ -75,10 +75,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.4.1
mypy==1.7.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.4.1
#Pinned versions: 1.7.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -124,10 +124,22 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
pillow==9.3.0 ; python_version <= "3.8"
pillow==9.5.0 ; python_version > "3.8"
optree==0.9.1
#Description: A library for tree manipulation
#Pinned versions: 0.9.1
#test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
#test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
#common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
#test_expanded_weights.py, test_decomp.py, test_overrides.py, test_masked.py,
#test_ops.py, test_prims.py, test_subclass.py, test_functionalization.py,
#test_schema_check.py, test_profiler_tree.py, test_meta.py, test_torchxla_num_output.py,
#test_utils.py, test_proxy_tensor.py, test_memory_profiler.py, test_view_ops.py,
#test_pointwise_ops.py, test_dtensor_ops.py, test_torchinductor.py, test_fx.py,
#test_fake_tensor.py, test_mps.py
pillow==10.0.1
#Description: Python Imaging Library fork
#Pinned versions:
#Pinned versions: 10.0.1
#test that import:
protobuf==3.20.2
@ -271,7 +283,18 @@ pytest-cpp==2.3.0
#Pinned versions: 2.3.0
#test that import:
z3-solver
z3-solver==4.12.2.0
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:
tensorboard==2.13.0
#Description: Also included in .ci/docker/requirements-docs.txt
#Pinned versions:
#test that import: test_tensorboard
pywavelets==1.4.1
#Description: This is a requirement of scikit-image, we need to pin
# it here because 1.5.0 conflicts with numpy 1.21.2 used in CI
#Pinned versions: 1.4.1
#test that import:

View File

@ -79,12 +79,6 @@ ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
@ -93,6 +87,12 @@ COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access to

View File

@ -17,13 +17,6 @@ ARG LLVMDEV
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install thrift.
ARG THRIFT
COPY ./common/install_thrift.sh install_thrift.sh
RUN if [ -n "${THRIFT}" ]; then bash ./install_thrift.sh; fi
RUN rm install_thrift.sh
ENV INSTALLED_THRIFT ${THRIFT}
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
@ -153,6 +146,14 @@ COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
ARG EXECUTORCH
# Build and install executorch
COPY ./common/install_executorch.sh install_executorch.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/executorch.txt executorch.txt
RUN if [ -n "${EXECUTORCH}" ]; then bash ./install_executorch.sh; fi
RUN rm install_executorch.sh common_utils.sh executorch.txt
ARG ONNX
# Install ONNX dependencies
COPY ./common/install_onnx.sh ./common/common_utils.sh ./

View File

@ -3,11 +3,6 @@
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# Use to retry ONNX test, only retry it twice
retry () {
"$@" || (sleep 60 && "$@")
}
if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
# TODO: This can be removed later once vision is also part of the Docker image
pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"
@ -16,5 +11,5 @@ if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
# NB: ONNX test is fast (~15m) so it's ok to retry it few more times to avoid any flaky issue, we
# need to bring this to the standard PyTorch run_test eventually. The issue will be tracked in
# https://github.com/pytorch/pytorch/issues/98626
retry "$ROOT_DIR/scripts/onnx/test.sh"
"$ROOT_DIR/scripts/onnx/test.sh"
fi

View File

@ -63,6 +63,12 @@ else
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
fi
if [[ "$BUILD_ENVIRONMENT" == *executorch* ]]; then
# To build test_edge_op_registration
export BUILD_EXECUTORCH=ON
export USE_CUDA=0
fi
if ! which conda; then
# In ROCm CIs, we are doing cross compilation on build machines with
# intel cpu and later run tests on machines with amd cpu.
@ -159,6 +165,14 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* && -z "$TORCH_CUDA_ARCH_LIST" ]]; then
exit 1
fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ "$TORCH_CUDA_ARCH_LIST" == *"8.6"* || "$TORCH_CUDA_ARCH_LIST" == *"8.0"* ]]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
fi
if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
export CC=clang
export CXX=clang++
@ -168,7 +182,6 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
export LDSHARED="clang --shared"
export USE_CUDA=0
export USE_ASAN=1
export USE_MKLDNN=0
export UBSAN_FLAGS="-fno-sanitize-recover=all;-fno-sanitize=float-divide-by-zero;-fno-sanitize=float-cast-overflow"
unset USE_LLVM
fi

View File

@ -43,7 +43,7 @@ function assert_git_not_dirty() {
# TODO: we should add an option to `build_amd.py` that reverts the repo to
# an unmodified state.
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]] && [[ "$BUILD_ENVIRONMENT" != *xla* ]] ; then
git_status=$(git status --porcelain)
git_status=$(git status --porcelain | grep -v '?? third_party' || true)
if [[ $git_status ]]; then
echo "Build left local git repository checkout dirty"
echo "git status --porcelain:"
@ -171,13 +171,6 @@ function install_torchrec_and_fbgemm() {
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
}
function install_numpy_pytorch_interop() {
local commit
commit=$(get_pinned_commit numpy_pytorch_interop)
# TODO: --no-use-pep517 will result in failure.
pip_install --user "git+https://github.com/Quansight-Labs/numpy_pytorch_interop.git@${commit}"
}
function clone_pytorch_xla() {
if [[ ! -d ./xla ]]; then
git clone --recursive --quiet https://github.com/pytorch/xla.git
@ -212,15 +205,6 @@ function test_torch_deploy(){
popd
}
function install_timm() {
local commit
commit=$(get_pinned_commit timm)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "git+https://github.com/rwightman/pytorch-image-models@${commit}"
}
function checkout_install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)

View File

@ -43,7 +43,7 @@ cross_compile_arm64() {
compile_arm64() {
# Compilation for arm64
# TODO: Compile with OpenMP support (but this causes CI regressions as cross-compilation were done with OpenMP disabled)
USE_DISTRIBUTED=0 USE_OPENMP=0 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
}
compile_x86_64() {

View File

@ -36,10 +36,12 @@ time python test/run_test.py --verbose -i distributed/test_functional_api
# DTensor tests
time python test/run_test.py --verbose -i distributed/_tensor/test_device_mesh
time python test/run_test.py --verbose -i distributed/_tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compile
# DeviceMesh test
time python test/run_test.py --verbose -i distributed/test_device_mesh
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel

View File

@ -80,6 +80,11 @@ if [[ "$BUILD_ENVIRONMENT" != *bazel* ]]; then
CUSTOM_TEST_ARTIFACT_BUILD_DIR=$(realpath "${CUSTOM_TEST_ARTIFACT_BUILD_DIR:-"build/custom_test_artifacts"}")
fi
# Reduce set of tests to include when running run_test.py
if [[ -n $TESTS_TO_INCLUDE ]]; then
echo "Setting INCLUDE_CLAUSE"
INCLUDE_CLAUSE="--include $TESTS_TO_INCLUDE"
fi
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
@ -148,7 +153,7 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
export PYTORCH_TEST_WITH_ASAN=1
export PYTORCH_TEST_WITH_UBSAN=1
# TODO: Figure out how to avoid hard-coding these paths
export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-12/bin/llvm-symbolizer
export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-15/bin/llvm-symbolizer
export TORCH_USE_RTLD_GLOBAL=1
# NB: We load libtorch.so with RTLD_GLOBAL for UBSAN, unlike our
# default behavior.
@ -182,7 +187,7 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
# have, and it applies to child processes.
# TODO: get rid of the hardcoded path
export LD_PRELOAD=/usr/lib/llvm-12/lib/clang/12.0.1/lib/linux/libclang_rt.asan-x86_64.so
export LD_PRELOAD=/usr/lib/llvm-15/lib/clang/15.0.7/lib/linux/libclang_rt.asan-x86_64.so
# Disable valgrind for asan
export VALGRIND=OFF
# Increase stack size, because ASAN red zones use more stack
@ -228,13 +233,16 @@ test_python_shard() {
exit 1
fi
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "$1" "$NUM_TEST_SHARDS" --verbose
# Bare --include flag is not supported and quoting for lint ends up with flag not being interpreted correctly
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose
assert_git_not_dirty
}
test_python() {
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --verbose
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose
assert_git_not_dirty
}
@ -281,6 +289,10 @@ test_inductor_distributed() {
# Smuggle a few multi-gpu tests here so that we don't have to request another large node
echo "Testing multi_gpu tests in test_torchinductor"
pytest test/inductor/test_torchinductor.py -k test_multi_gpu
pytest test/inductor/test_aot_inductor.py -k test_non_default_cuda_device
pytest test/inductor/test_aot_inductor.py -k test_replicate_on_devices
pytest test/distributed/_tensor/test_dtensor_compile.py
pytest test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
# this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported
# with if required # gpus aren't available
@ -303,14 +315,17 @@ test_inductor() {
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
# For example 'dynamic_aot_eager_torchbench' TEST_CONFIG means we run
# the benchmark script with '--dynamic-shapes --backend aot_eager --device cuda'
# The matrix of test options is specified in .github/workflows/periodic.yml
# and .github/workflows/inductor.yml
# The matrix of test options is specified in .github/workflows/inductor.yml,
# .github/workflows/inductor-periodic.yml, and
# .github/workflows/inductor-perf-test-nightly.yml
DYNAMO_BENCHMARK_FLAGS=()
if [[ "${TEST_CONFIG}" == *dynamo_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend eager)
elif [[ "${TEST_CONFIG}" == *aot_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend aot_eager)
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--export-aot-inductor)
elif [[ "${TEST_CONFIG}" == *inductor* && "${TEST_CONFIG}" != *perf* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--inductor)
fi
@ -319,7 +334,7 @@ if [[ "${TEST_CONFIG}" == *dynamic* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--dynamic-shapes --dynamic-batch-only)
fi
if [[ "${TEST_CONFIG}" == *cpu_accuracy* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--device cpu)
else
DYNAMO_BENCHMARK_FLAGS+=(--device cuda)
@ -383,6 +398,11 @@ test_perf_for_dashboard() {
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freeze_autotune_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
@ -433,19 +453,12 @@ test_single_dynamo_benchmark() {
"${DYNAMO_BENCHMARK_FLAGS[@]}" \
"$@" "${partition_flags[@]}" \
--output "$TEST_REPORTS_DIR/${name}_${suite}.csv"
if [[ "${TEST_CONFIG}" == *inductor* ]] && [[ "${TEST_CONFIG}" != *cpu_accuracy* ]]; then
# other jobs (e.g. periodic, cpu-accuracy) may have different set of expected models.
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
python benchmarks/dynamo/check_graph_breaks.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
else
python benchmarks/dynamo/check_csv.py \
-f "$TEST_REPORTS_DIR/${name}_${suite}.csv"
fi
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
python benchmarks/dynamo/check_graph_breaks.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
fi
}
@ -463,8 +476,10 @@ test_dynamo_benchmark() {
elif [[ "${TEST_CONFIG}" == *perf* ]]; then
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
else
if [[ "${TEST_CONFIG}" == *cpu_accuracy* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
else
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
test_single_dynamo_benchmark "training" "$suite" "$shard_id" --training --amp "$@"
@ -479,9 +494,17 @@ test_inductor_torchbench_smoketest_perf() {
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
--output "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"
# the reference speedup value is hardcoded in check_hf_bert_perf_csv.py
# this value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_hf_bert_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
--export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
# The perf number of nanogpt seems not very stable, e.g.
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we switch to use some other model.
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
@ -544,6 +567,10 @@ test_without_numpy() {
python -c "import sys;sys.path.insert(0, 'fake_numpy');from unittest import TestCase;import torch;x=torch.randn(3,3);TestCase().assertRaises(RuntimeError, lambda: x.numpy())"
# Regression test for https://github.com/pytorch/pytorch/issues/66353
python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch;print(torch.tensor([torch.tensor(0.), torch.tensor(1.)]))"
# Regression test for https://github.com/pytorch/pytorch/issues/109387
if [[ "${TEST_CONFIG}" == *dynamo* ]]; then
python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch;torch.compile(lambda x:print(x))('Hello World')"
fi
popd
}
@ -601,7 +628,7 @@ test_libtorch_jit() {
# Run jit and lazy tensor cpp tests together to finish them faster
if [[ "$BUILD_ENVIRONMENT" == *cuda* && "$TEST_CONFIG" != *nogpu* ]]; then
LTC_TS_CUDA=1 python test/run_test.py --cpp --verbose -i cpp/test_jit cpp/nvfuser_tests cpp/test_lazy
LTC_TS_CUDA=1 python test/run_test.py --cpp --verbose -i cpp/test_jit cpp/test_lazy
else
# CUDA tests have already been skipped when CUDA is not available
python test/run_test.py --cpp --verbose -i cpp/test_jit cpp/test_lazy -k "not CUDA"
@ -662,7 +689,8 @@ test_vulkan() {
test_distributed() {
echo "Testing distributed python tests"
time python test/run_test.py --distributed-tests --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# shellcheck disable=SC2086
time python test/run_test.py --distributed-tests --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" $INCLUDE_CLAUSE --verbose
assert_git_not_dirty
if [[ ("$BUILD_ENVIRONMENT" == *cuda* || "$BUILD_ENVIRONMENT" == *rocm*) && "$SHARD_NUMBER" == 1 ]]; then
@ -971,9 +999,28 @@ test_docs_test() {
}
test_executorch() {
pushd /executorch
echo "Install torchvision and torchaudio"
# TODO(huydhn): Switch this to the pinned commits on ExecuTorch once they are
# there. These libraries need to be built here, and not part of the Docker
# image because they require the target version of torch to be installed first
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git"
echo "Run ExecuTorch regression tests for some models"
# NB: This is a sample model, more can be added here
export PYTHON_EXECUTABLE=python
# TODO(huydhn): Add more coverage here using ExecuTorch's gather models script
# shellcheck disable=SC1091
source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''
popd
# Test torchgen generated code for Executorch.
echo "Testing Executorch op registration"
echo "Testing ExecuTorch op registration"
"$BUILD_BIN_DIR"/test_edge_op_registration
assert_git_not_dirty
}
@ -988,6 +1035,8 @@ elif [[ "${TEST_CONFIG}" == *xla* ]]; then
install_torchvision
build_xla
test_xla
elif [[ "${TEST_CONFIG}" == *executorch* ]]; then
test_executorch
elif [[ "$TEST_CONFIG" == 'jit_legacy' ]]; then
test_python_legacy_jit
elif [[ "${BUILD_ENVIRONMENT}" == *libtorch* ]]; then
@ -1010,11 +1059,10 @@ elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
test_dynamo_benchmark huggingface "$id"
elif [[ "${TEST_CONFIG}" == *timm* ]]; then
install_torchvision
install_timm
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_accuracy* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
install_torchaudio cpu
else
install_torchaudio cuda
@ -1025,13 +1073,13 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
else
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
# nightlies that torchbench may pull in
if [[ "${TEST_CONFIG}" != *cpu_accuracy* ]]; then
if [[ "${TEST_CONFIG}" != *cpu_inductor* ]]; then
install_torchrec_and_fbgemm
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
@ -1043,12 +1091,10 @@ elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
install_numpy_pytorch_interop
test_dynamo_shard 1
test_aten
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
install_numpy_pytorch_interop
test_dynamo_shard 2
elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
@ -1076,6 +1122,10 @@ elif [[ "${BUILD_ENVIRONMENT}" == *-mobile-lightweight-dispatch* ]]; then
test_libtorch
elif [[ "${TEST_CONFIG}" = docs_test ]]; then
test_docs_test
elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
install_torchvision
test_python
test_aten
else
install_torchvision
install_monkeytype
@ -1088,5 +1138,4 @@ else
test_custom_backend
test_torch_function_benchmark
test_benchmarks
test_executorch
fi

View File

@ -127,8 +127,7 @@ python -c "import os, glob; os.system('python -mpip install --no-index --no-deps
:: export test times so that potential sharded tests that'll branch off this build will use consistent data
python tools/stats/export_test_times.py
copy /Y ".pytorch-test-times.json" "%PYTORCH_FINAL_PACKAGE_DIR%"
copy /Y ".pytorch-test-file-ratings.json" "%PYTORCH_FINAL_PACKAGE_DIR%"
robocopy /E ".additional_ci_files" "%PYTORCH_FINAL_PACKAGE_DIR%\.additional_ci_files"
:: Also save build/.ninja_log as an artifact
copy /Y "build\.ninja_log" "%PYTORCH_FINAL_PACKAGE_DIR%\"

View File

@ -2,6 +2,7 @@
import os
import subprocess
import sys
COMMON_TESTS = [
(
@ -53,4 +54,4 @@ if __name__ == "__main__":
print("Reruning with traceback enabled")
print("Command:", command_string)
subprocess.run(command_args, check=False)
exit(e.returncode)
sys.exit(e.returncode)

View File

@ -26,11 +26,6 @@ popd
python test_custom_ops.py -v
if ERRORLEVEL 1 exit /b 1
:: TODO: fix and re-enable this test
:: See https://github.com/pytorch/pytorch/issues/25155
:: python test_custom_classes.py -v
:: if ERRORLEVEL 1 exit /b 1
python model.py --export-script-module="build/model.pt"
if ERRORLEVEL 1 exit /b 1

View File

@ -1,7 +1,3 @@
:: Skip LibTorch tests when building a GPU binary and testing on a CPU machine
:: because LibTorch tests are not well designed for this use case.
if "%USE_CUDA%" == "0" IF NOT "%CUDA_VERSION%" == "cpu" exit /b 0
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
if errorlevel 1 exit /b 1
@ -21,7 +17,7 @@ if not errorlevel 0 exit /b 1
cd %TMP_DIR_WIN%\build\torch\test
for /r "." %%a in (*.exe) do (
call :libtorch_check "%%~na" "%%~fa"
if errorlevel 1 exit /b 1
if errorlevel 1 goto fail
)
goto :eof
@ -34,18 +30,6 @@ set CPP_TESTS_DIR=%TMP_DIR_WIN%\build\torch\test
:: Skip verify_api_visibility as it a compile level test
if "%~1" == "verify_api_visibility" goto :eof
:: See https://github.com/pytorch/pytorch/issues/25161
if "%~1" == "c10_metaprogramming_test" goto :eof
if "%~1" == "module_test" goto :eof
:: See https://github.com/pytorch/pytorch/issues/25312
if "%~1" == "converter_nomigraph_test" goto :eof
:: See https://github.com/pytorch/pytorch/issues/35636
if "%~1" == "generate_proposals_op_gpu_test" goto :eof
:: See https://github.com/pytorch/pytorch/issues/35648
if "%~1" == "reshape_op_gpu_test" goto :eof
:: See https://github.com/pytorch/pytorch/issues/35651
if "%~1" == "utility_ops_gpu_test" goto :eof
echo Running "%~2"
if "%~1" == "c10_intrusive_ptr_benchmark" (
:: NB: This is not a gtest executable file, thus couldn't be handled by pytest-cpp
@ -56,11 +40,15 @@ if "%~1" == "c10_intrusive_ptr_benchmark" (
python test\run_test.py --cpp --verbose -i "cpp/%~1"
if errorlevel 1 (
echo %1 failed with exit code %errorlevel%
exit /b 1
goto fail
)
if not errorlevel 0 (
echo %1 failed with exit code %errorlevel%
exit /b 1
goto fail
)
goto :eof
:eof
exit /b 0
:fail
exit /b 1

View File

@ -1,8 +1,7 @@
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
echo Copying over test times file
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times.json" "%PROJECT_DIR_WIN%"
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-file-ratings.json" "%PROJECT_DIR_WIN%"
robocopy /E "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.additional_ci_files" "%PROJECT_DIR_WIN%\.additional_ci_files"
pushd test

View File

@ -22,8 +22,7 @@ if "%SHARD_NUMBER%" == "1" (
)
echo Copying over test times file
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times.json" "%PROJECT_DIR_WIN%"
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-file-ratings.json" "%PROJECT_DIR_WIN%"
robocopy /E "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.additional_ci_files" "%PROJECT_DIR_WIN%\.additional_ci_files"
echo Run nn tests
python run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose

View File

@ -35,10 +35,10 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
# TODO: Move both of them to Windows AMI
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver
python -m pip install z3-solver==4.12.2.0
run_tests() {
# Run nvidia-smi if available

View File

@ -1,28 +0,0 @@
from collections import OrderedDict
from cimodel.data.simple.util.branch_filters import gen_filter_dict
from cimodel.lib.miniutils import quote
CHANNELS_TO_PRUNE = ["pytorch-nightly", "pytorch-test"]
PACKAGES_TO_PRUNE = "pytorch torchvision torchaudio torchtext ignite torchcsprng"
def gen_workflow_job(channel: str):
return OrderedDict(
{
"anaconda_prune": OrderedDict(
{
"name": f"anaconda-prune-{channel}",
"context": quote("org-member"),
"packages": quote(PACKAGES_TO_PRUNE),
"channel": channel,
"filters": gen_filter_dict(branches_list=["postnightly"]),
}
)
}
)
def get_workflow_jobs():
return [gen_workflow_job(channel) for channel in CHANNELS_TO_PRUNE]

View File

@ -32,4 +32,4 @@ def gen_mobile_docker(specifier):
DOCKER_IMAGE_ASAN, DOCKER_REQUIREMENT_ASAN = gen_mobile_docker("asan")
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK = gen_mobile_docker("android-ndk-r19c")
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK = gen_mobile_docker("android-ndk-r21e")

.circleci/config.yml (generated, 51 lines changed)
View File

@ -444,35 +444,6 @@ jobs:
script="/Users/distiller/project/.circleci/scripts/binary_ios_upload.sh"
cat "$script"
source "$script"
anaconda_prune:
parameters:
packages:
type: string
description: "What packages are we pruning? (quoted, space-separated string. eg. 'pytorch', 'torchvision torchaudio', etc.)"
default: "pytorch"
channel:
type: string
description: "What channel are we pruning? (eq. pytorch-nightly)"
default: "pytorch-nightly"
docker:
- image: continuumio/miniconda3
environment:
- PACKAGES: "<< parameters.packages >>"
- CHANNEL: "<< parameters.channel >>"
steps:
- checkout
- run:
name: Install dependencies
no_output_timeout: "1h"
command: |
conda install -yq anaconda-client
- run:
name: Prune packages
no_output_timeout: "1h"
command: |
ANACONDA_API_TOKEN="${CONDA_PYTORCHBOT_TOKEN}" \
scripts/release/anaconda-prune/run.sh
pytorch_doc_push:
resource_class: medium
machine:
@ -652,7 +623,7 @@ jobs:
- run:
name: Archive artifacts into zip
command: |
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json .pytorch-test-file-ratings.json
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .additional_ci_files
cp artifacts.zip /Users/distiller/workspace
- persist_to_workspace:
@ -686,8 +657,6 @@ jobs:
TEST_CONFIG: << parameters.test-config >>
SHARD_NUMBER: << parameters.shard-number >>
NUM_TEST_SHARDS: << parameters.num-test-shards >>
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
steps:
- checkout
- attach_workspace:
@ -1414,22 +1383,4 @@ workflows:
requires:
- pytorch_ios_full_jit_12_5_1_nightly_x86_64_build
- pytorch_ios_full_jit_12_5_1_nightly_arm64_build
- anaconda_prune:
name: anaconda-prune-pytorch-nightly
context: "org-member"
packages: "pytorch torchvision torchaudio torchtext ignite torchcsprng"
channel: pytorch-nightly
filters:
branches:
only:
- postnightly
- anaconda_prune:
name: anaconda-prune-pytorch-test
context: "org-member"
packages: "pytorch torchvision torchaudio torchtext ignite torchcsprng"
channel: pytorch-test
filters:
branches:
only:
- postnightly
when: << pipeline.parameters.run_build >>

View File

@ -10,8 +10,6 @@ import shutil
import sys
from collections import namedtuple
import cimodel.data.simple.anaconda_prune_defintions
import cimodel.data.simple.docker_definitions
import cimodel.data.simple.mobile_definitions
import cimodel.data.simple.nightly_ios
@ -144,7 +142,6 @@ def gen_build_workflows_tree():
build_workflows_functions = [
cimodel.data.simple.mobile_definitions.get_workflow_jobs,
cimodel.data.simple.nightly_ios.get_workflow_jobs,
cimodel.data.simple.anaconda_prune_defintions.get_workflow_jobs,
]
build_jobs = [f() for f in build_workflows_functions]
build_jobs.extend(

View File

@ -33,7 +33,7 @@ fi
cp ${PROJ_ROOT}/LICENSE ${ZIP_DIR}/
# zip the library
export DATE="$(date -u +%Y%m%d)"
export IOS_NIGHTLY_BUILD_VERSION="2.1.0.${DATE}"
export IOS_NIGHTLY_BUILD_VERSION="2.2.0.${DATE}"
if [ "${BUILD_LITE_INTERPRETER}" == "1" ]; then
# libtorch_lite_ios_nightly_1.11.0.20210810.zip
ZIPFILE="libtorch_lite_ios_nightly_${IOS_NIGHTLY_BUILD_VERSION}.zip"

View File

@ -54,7 +54,7 @@ fi
# Move debug wheels out of the the package dir so they don't get installed
# Move debug wheels out of the package dir so they don't get installed
mkdir -p /tmp/debug_final_pkgs
mv /final_pkgs/debug-*.zip /tmp/debug_final_pkgs || echo "no debug packages to move"
@ -66,6 +66,12 @@ mv /final_pkgs/debug-*.zip /tmp/debug_final_pkgs || echo "no debug packages to m
# conda build scripts themselves. These should really be consolidated
# Pick only one package of multiple available (which happens as result of workflow re-runs)
pkg="/final_pkgs/\$(ls -1 /final_pkgs|sort|tail -1)"
if [[ "\$PYTORCH_BUILD_VERSION" == *dev* ]]; then
CHANNEL="nightly"
else
CHANNEL="test"
fi
if [[ "$PACKAGE_TYPE" == conda ]]; then
(
# For some reason conda likes to re-activate the conda environment when attempting this install
@ -83,25 +89,14 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
retry conda install -c pytorch -y cpuonly
else
cu_ver="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4}"
CUDA_PACKAGE="pytorch-cuda"
PYTORCH_CHANNEL="pytorch"
if [[ "\${TORCH_CONDA_BUILD_FOLDER}" == "pytorch-nightly" ]]; then
PYTORCH_CHANNEL="pytorch-nightly"
fi
retry conda install \${EXTRA_CONDA_FLAGS} -yq -c nvidia -c "\${PYTORCH_CHANNEL}" "pytorch-cuda=\${cu_ver}"
retry conda install \${EXTRA_CONDA_FLAGS} -yq -c nvidia -c "pytorch-\${CHANNEL}" "pytorch-cuda=\${cu_ver}"
fi
conda install \${EXTRA_CONDA_FLAGS} -y "\$pkg" --offline
)
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
if [[ "$(uname -m)" == aarch64 ]]; then
# Using "extra-index-url" until all needed aarch64 dependencies are
# added to "https://download.pytorch.org/whl/nightly/"
pip install "\$pkg" --extra-index-url "https://download.pytorch.org/whl/nightly/${DESIRED_CUDA}"
else
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/nightly/${DESIRED_CUDA}"
fi
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"
retry pip install -q numpy protobuf typing-extensions
fi
if [[ "$PACKAGE_TYPE" == libtorch ]]; then


@ -59,7 +59,7 @@ PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="2.1.0.dev$DATE"
BASE_BUILD_VERSION="2.2.0.dev$DATE"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
@ -77,13 +77,8 @@ else
export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}+$DESIRED_CUDA"
fi
if [[ -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_BUILD_VERSION="${PYTORCH_BUILD_VERSION}-with-pypi-cudnn"
fi
export PYTORCH_BUILD_NUMBER=1
JAVA_HOME=
BUILD_JNI=OFF
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
@ -155,8 +150,8 @@ EOL
# nproc doesn't exist on darwin
if [[ "$(uname)" != Darwin ]]; then
# Because most Circle executors only have 20 CPUs, using more causes OOMs w/ Ninja and nvcc parallelization
MEMORY_LIMIT_MAX_JOBS=18
# This was lowered from 18 to 12 to avoid OOMs when compiling FlashAttentionV2
MEMORY_LIMIT_MAX_JOBS=12
NUM_CPUS=$(( $(nproc) - 2 ))
# Defaults here for **binary** linux builds so they can be changed in one place


@ -11,16 +11,11 @@ PKG_DIR=${PKG_DIR:-/tmp/workspace/final_pkgs}
# currently set within `designate_upload_channel`
UPLOAD_CHANNEL=${UPLOAD_CHANNEL:-nightly}
# Designates what subfolder to put packages into
UPLOAD_SUBFOLDER=${UPLOAD_SUBFOLDER:-cpu}
UPLOAD_SUBFOLDER=${UPLOAD_SUBFOLDER:-}
UPLOAD_BUCKET="s3://pytorch"
BACKUP_BUCKET="s3://pytorch-backup"
BUILD_NAME=${BUILD_NAME:-}
# this is temporary change to upload pypi-cudnn builds to separate folder
if [[ ${BUILD_NAME} == *with-pypi-cudnn* ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_cudnn"
fi
DRY_RUN=${DRY_RUN:-enabled}
# Don't actually do work unless explicit
ANACONDA="true anaconda"
@ -64,12 +59,17 @@ s3_upload() {
local pkg_type
extension="$1"
pkg_type="$2"
s3_dir="${UPLOAD_BUCKET}/${pkg_type}/${UPLOAD_CHANNEL}/${UPLOAD_SUBFOLDER}/"
s3_root_dir="${UPLOAD_BUCKET}/${pkg_type}/${UPLOAD_CHANNEL}"
if [[ -z ${UPLOAD_SUBFOLDER:-} ]]; then
s3_upload_dir="${s3_root_dir}/"
else
s3_upload_dir="${s3_root_dir}/${UPLOAD_SUBFOLDER}/"
fi
(
for pkg in ${PKG_DIR}/*.${extension}; do
(
set -x
${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_dir}"
${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_upload_dir}"
)
done
)
@ -82,15 +82,17 @@ pip install -q awscli
case "${PACKAGE_TYPE}" in
conda)
conda_upload
# Fetch platform (eg. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(\
tar -xOf ${PKG_DIR}/*.bz2 info/index.json \
| grep subdir \
| cut -d ':' -f2 \
| sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//' \
)
BACKUP_DIR="conda/${subdir}"
for conda_archive in ${PKG_DIR}/*.tar.bz2; do
# Fetch platform (eg. win-64, linux-64, etc.) from index file because
# there's no actual conda command to read this
subdir=$(\
tar -xOf "${conda_archive}" info/index.json \
| grep subdir \
| cut -d ':' -f2 \
| sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//' \
)
BACKUP_DIR="conda/${subdir}"
done
;;
libtorch)
s3_upload "zip" "libtorch"


@ -42,32 +42,3 @@ jobs:
script="/Users/distiller/project/.circleci/scripts/binary_ios_upload.sh"
cat "$script"
source "$script"
anaconda_prune:
parameters:
packages:
type: string
description: "What packages are we pruning? (quoted, space-separated string. eg. 'pytorch', 'torchvision torchaudio', etc.)"
default: "pytorch"
channel:
type: string
description: "What channel are we pruning? (eq. pytorch-nightly)"
default: "pytorch-nightly"
docker:
- image: continuumio/miniconda3
environment:
- PACKAGES: "<< parameters.packages >>"
- CHANNEL: "<< parameters.channel >>"
steps:
- checkout
- run:
name: Install dependencies
no_output_timeout: "1h"
command: |
conda install -yq anaconda-client
- run:
name: Prune packages
no_output_timeout: "1h"
command: |
ANACONDA_API_TOKEN="${CONDA_PYTORCHBOT_TOKEN}" \
scripts/release/anaconda-prune/run.sh


@ -177,7 +177,7 @@
- run:
name: Archive artifacts into zip
command: |
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json .pytorch-test-file-ratings.json
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .additional_ci_files
cp artifacts.zip /Users/distiller/workspace
- persist_to_workspace:
@ -211,8 +211,6 @@
TEST_CONFIG: << parameters.test-config >>
SHARD_NUMBER: << parameters.shard-number >>
NUM_TEST_SHARDS: << parameters.num-test-shards >>
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
steps:
- checkout
- attach_workspace:


@ -1,5 +1,8 @@
---
# NOTE there must be no spaces before the '-', so put the comma last.
# The check bugprone-unchecked-optional-access is also turned off atm
# because it causes clang-tidy to hang randomly. The tracking issue
# can be found at https://github.com/llvm/llvm-project/issues/69369.
InheritParentConfig: true
Checks: '
bugprone-*,
@ -9,6 +12,7 @@ bugprone-*,
-bugprone-lambda-function-name,
-bugprone-reserved-identifier,
-bugprone-swapped-arguments,
-bugprone-unchecked-optional-access,
clang-diagnostic-missing-prototypes,
cppcoreguidelines-*,
-cppcoreguidelines-avoid-do-while,
@ -30,8 +34,13 @@ cppcoreguidelines-*,
-facebook-hte-RelativeInclude,
hicpp-exception-baseclass,
hicpp-avoid-goto,
misc-unused-alias-decls,
misc-unused-using-decls,
misc-*,
-misc-const-correctness,
-misc-use-anonymous-namespace,
-misc-unused-parameters,
-misc-no-recursion,
-misc-non-private-member-variables-in-classes,
-misc-confusable-identifiers,
modernize-*,
-modernize-concat-nested-namespaces,
-modernize-macro-to-enum,
@ -44,7 +53,7 @@ modernize-*,
performance-*,
readability-container-size-empty,
'
HeaderFilterRegex: '^(c10/(?!test)|torch/csrc/(?!deploy/interpreter/cpython)).*$'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
AnalyzeTemporaryDtors: false
WarningsAsErrors: '*'
...

.devcontainer/README.md Normal file

@ -0,0 +1,72 @@
# Step by step guide on using PyTorch's DevContainer
PyTorch's DevContainer gives you an isolated, replicable development environment. The steps below walk you through the setup to make the process as smooth as possible:
## Step 1: Install VSCode
1. Navigate to the [Visual Studio Code website](https://code.visualstudio.com/).
2. Download the appropriate installer for your operating system (Windows, Linux, or macOS).
3. Run the installer and follow the on-screen instructions to install VSCode on your system.
4. After installation, launch VSCode.
## Step 2: Install DevContainer Extension
1. In VSCode, go to the Extensions view by clicking on the Extensions icon in the Activity Bar on the side of the window.
2. Search for "Dev Containers" in the Extensions view search bar.
3. Find the "Dev Containers" extension in the search results and click on the install button to install it.
You can also go to the extension's [homepage](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) and [documentation page](https://code.visualstudio.com/docs/devcontainers/containers) to find more details.
## Step 3: Install Docker and Add Current Login User to Docker Group
1. Follow the [official guide](https://docs.docker.com/get-docker/) to install Docker. Don't forget the [post installation steps](https://docs.docker.com/engine/install/linux-postinstall/).
If you are using [Visual Studio Code Remote - SSH](https://code.visualstudio.com/docs/remote/ssh), you only need to install Docker on the remote host, not on your local computer, and the following steps should likewise be run on the remote host.
## Step 4 (Optional): Install NVIDIA Container Toolkit for GPU Usage
1. If you intend to use GPU resources, first ensure you have NVIDIA drivers installed on your system. Check if `nvidia-smi` works to verify your GPU setup.
2. Follow the [official guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#docker) to install the NVIDIA Container Toolkit.
3. After installation, verify that the toolkit is installed correctly by running:
```
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```
## Step 5: Clone PyTorch
1. Open a terminal or command prompt.
2. Use the following command to clone the PyTorch repository:
```
git clone https://github.com/pytorch/pytorch
```
3. Navigate to the cloned directory:
```
cd pytorch
```
## Step 6: Open in DevContainer
1. In VSCode, use the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P` on macOS) to run the "Remote-Containers: Open Folder in Container..." command.
2. You will be prompted with two options: CPU dev container or CUDA dev container. Choose the one you want to run.
## Step 7: Wait for Building the Environment
1. After opening the folder in a DevContainer, VSCode will start building the container. This process can take some time as it involves downloading necessary images and setting up the environment.
2. You can monitor the progress in the VSCode terminal.
3. Once the build process completes, you'll have a fully configured PyTorch development environment in a container.
4. The next time you open the same dev container, it will be much faster, as it does not require building the image again.
You are now all set to start developing with PyTorch in a DevContainer environment. This setup ensures you have a consistent and isolated development environment for your PyTorch projects.
## Step 8: Build PyTorch
To build PyTorch from source, run:
```
python setup.py develop
```
The process compiles thousands of files and can take a long time. Fortunately, the compiled objects are reused across builds: when you modify some files, only the changed files need to be recompiled the next time.
Note that only the contents of the `pytorch` directory are saved to disk. This directory is mounted into the container, while everything else in the container is temporary and will be lost if Docker recreates the container or the server reboots.
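As a quick sanity check after the build (a common smoke test, not part of the official guide), confirm that the freshly built torch imports and, in the CUDA dev container, sees the GPU:
```
import torch

print(torch.__version__)          # should report the dev build you just compiled
print(torch.cuda.is_available())  # expect True in the CUDA dev container
```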
For an in-depth understanding of Dev Container and its caveats, please refer to [the full documentation](https://code.visualstudio.com/docs/devcontainers/containers).


@ -9,3 +9,5 @@ make setup_lint
# Add CMAKE_PREFIX_PATH to bashrc
echo 'export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}' >> ~/.bashrc
# Add linker path so that cuda-related libraries can be found
echo 'export LDFLAGS="-L${CONDA_PREFIX}/lib/ $LDFLAGS"' >> ~/.bashrc

.flake8

@ -2,7 +2,7 @@
# NOTE: **Mirror any changes** to this file the [tool.ruff] config in pyproject.toml
# before we can fully move to use ruff
enable-extensions = G
select = B,C,E,F,G,P,SIM1,T4,W,B9
select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2
max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
@ -14,15 +14,21 @@ ignore =
# to line this up with executable bit
EXE001,
# these ignores are from flake8-bugbear; please fix!
B007,B008,B017,B019,B020,B023,B024,B026,B028,B903,B904,B905,B906,B907
B007,B008,B017,B019,B023,B028,B903,B904,B905,B906,B907
# these ignores are from flake8-comprehensions; please fix!
C407,
# these ignores are from flake8-logging-format; please fix!
G100,G101,G200,G201,G202
G100,G101,G200
# these ignores are from flake8-simplify. please fix or ignore with commented reason
SIM105,SIM108,SIM110,SIM111,SIM113,SIM114,SIM115,SIM116,SIM117,SIM118,SIM119,SIM12,
# flake8-simplify code styles
SIM102,SIM103,SIM106,SIM112,
# TorchFix codes that don't make sense for PyTorch itself:
# removed and deprecated PyTorch functions.
TOR001,TOR101,
# TODO(kit1980): fix all TOR102 issues
# `torch.load` without `weights_only` parameter is unsafe
TOR102,
per-file-ignores =
__init__.py: F401
torch/utils/cpp_extension.py: B950
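
For context on the TOR102 suppression above: TorchFix flags `torch.load` calls that omit `weights_only`, because the default pickle-based loader can execute arbitrary code. A minimal illustration (the checkpoint path is hypothetical):

```
import torch

# Flagged by TOR102: full pickle deserialization can run arbitrary code.
state = torch.load("checkpoint.pt")

# Preferred: restrict deserialization to plain tensors and containers.
state = torch.load("checkpoint.pt", weights_only=True)
```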


@ -3,11 +3,12 @@ self-hosted-runner:
- linux.20_04.4x
- linux.20_04.16x
- linux.large
- linux.large.arc
- linux.2xlarge
- linux.4xlarge
- linux.12xlarge
- linux.24xlarge
- linux.t4g.2xlarge
- linux.arm64.2xlarge
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
@ -23,3 +24,5 @@ self-hosted-runner:
- macos-12-xl
- macos-12
- macos12.3-m1
- macos-latest-xlarge
- macos-13-xlarge


@ -13,6 +13,10 @@ inputs:
required: true
type: string
description: JSON description of what test configs to run.
job-name:
type: string
required: false
default: ""
outputs:
test-matrix:
@ -56,6 +60,7 @@ runs:
- name: Get the job name
id: get-job-name
if: inputs.job-name == ''
continue-on-error: true
shell: bash
run: |
@ -91,7 +96,7 @@ runs:
shell: bash
env:
GITHUB_TOKEN: ${{ inputs.github-token }}
JOB_NAME: ${{ steps.get-job-name.outputs.job-name }}
JOB_NAME: ${{ inputs.job-name == '' && steps.get-job-name.outputs.job-name || inputs.job-name }}
PR_NUMBER: ${{ github.event.pull_request.number }}
TAG: ${{ steps.parse-ref.outputs.tag }}
EVENT_NAME: ${{ github.event_name }}


@ -11,18 +11,20 @@ outputs:
job-id:
description: The retrieved workflow job id
value: ${{ steps.get-job-id.outputs.job-id }}
job-name:
description: The retrieved workflow job name
value: ${{ steps.get-job-id.outputs.job-name }}
runs:
using: composite
steps:
- name: Get jobid or fail
- name: Get job id and name or fail
# timeout-minutes is unsupported for composite workflows, see https://github.com/actions/runner/issues/1979
# timeout-minutes: 10
shell: bash
id: get-job-id
run: |
set -eux
GHA_WORKFLOW_JOB_ID=$(python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}")
echo "job-id=${GHA_WORKFLOW_JOB_ID}" >> "${GITHUB_OUTPUT}"
python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}"
env:
GITHUB_TOKEN: ${{ inputs.github-token }}


@ -10,6 +10,13 @@ inputs:
description: Shard number for the current job
required: false
default: "0"
sha:
description: SHA for the commit
required: true
test_config:
description: Name of the test config
required: false
default: "default"
job_identifier:
description: Text that uniquely identifies a given job type within a workflow. All shards of a job should share the same job identifier.
required: true
@ -33,6 +40,8 @@ runs:
env:
CACHE_DIR: ${{ inputs.cache_dir }}
JOB_IDENTIFIER: ${{ inputs.job_identifier }}
SHA: ${{ inputs.sha }}
TEST_CONFIG: ${{ inputs.test_config }}
SHARD: ${{ inputs.shard }}
REPO: ${{ github.repository }}
run: |
@ -41,6 +50,8 @@ runs:
--cache_dir $GITHUB_WORKSPACE/$CACHE_DIR \
--pr_identifier $GITHUB_REF \
--job_identifier $JOB_IDENTIFIER \
--sha $SHA \
--test_config $TEST_CONFIG \
--shard $SHARD \
--repo $REPO \
--temp_dir $RUNNER_TEMP \


@ -43,14 +43,14 @@ runs:
FILE_SUFFIX: ${{ inputs.file-suffix }}
run: |
# Remove any previous test reports if they exist
rm -f usage-log-*.zip
rm -f logs-*.zip
# this workflow is also run in bazel build test, but we dont generate usage reports for it
# so check to see if the file exists first
if [ -f 'usage_log.txt' ]; then
zip "usage-log-${FILE_SUFFIX}.zip" 'usage_log.txt'
zip "logs-${FILE_SUFFIX}.zip" 'usage_log.txt'
fi
if ls test/**/*.log 1> /dev/null 2>&1; then
zip -r "usage-log-${FILE_SUFFIX}.zip" test -i '*.log'
zip -r "logs-${FILE_SUFFIX}.zip" test -i '*.log'
fi
# Windows zip
@ -80,7 +80,7 @@ runs:
FILE_SUFFIX: ${{ inputs.file-suffix }}
run: |
# -ir => recursive include all files in pattern
7z a "usage-log-$Env:FILE_SUFFIX.zip" 'usage_log.txt' -ir'!test\*.log'
7z a "logs-$Env:FILE_SUFFIX.zip" 'usage_log.txt' -ir'!test\*.log'
# S3 upload
- name: Store Test Downloaded JSONs on S3
@ -112,7 +112,7 @@ runs:
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
if-no-files-found: ignore
path: usage-log-*.zip
path: logs-*.zip
# GHA upload
- name: Store Test Downloaded JSONs on Github
@ -146,7 +146,7 @@ runs:
continue-on-error: true
with:
# Add the run attempt, see [Artifact run attempt]
name: usage-log-runattempt${{ github.run_attempt }}-${{ inputs.file-suffix }}.zip
name: logs-runattempt${{ github.run_attempt }}-${{ inputs.file-suffix }}.zip
retention-days: 14
if-no-files-found: ignore
path: |


@ -12,7 +12,6 @@ reviewers:
symbolic-shapes:
- symbolic-shapes
- antoniojkim
- wconstab
- SherlockNoMad
Chillee:
- ezyang


@ -1 +1 @@
a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
6518fa9b2c74e84d7eb1fc6e3eb51e43213f0c05


@ -1 +1 @@
1b2746f642cc2c99fe9d1a0c34359c0de45341c2
de731af65b4f04696e85c729e3282450b51b95fd


@ -1 +0,0 @@
0c4e82511d349358d2c8c492dd833334e742f27f


@ -1 +0,0 @@
b9d43c7dcac1fe05e851dd7be7187b108af593d2


@ -1 +1 @@
9371b9e13c826f3930e54346b4d619cb59182f68
99944a2fb8624947f9c0e2edc898ff42a16124da


@ -1 +1 @@
47cd5ea8e21d7596a24907710411d6b4a43f628d
e12d200c97d7aab668b976e92b46513c9ca7a0d8


@ -1 +1 @@
e1ee592d9806216d7ac0bb711cae6307b0c5b68a
a80c1e7f958e7d8e8f92319db70876940e67ad9b

.github/labeler.yml vendored

@ -15,6 +15,7 @@
"ciflow/inductor":
- torch/_decomp/**
- torch/_dynamo/**
- torch/_export/**
- torch/_inductor/**
- benchmarks/dynamo/**
- torch/_subclasses/fake_tensor.py
@ -22,12 +23,17 @@
- torch/_subclasses/meta_utils.py
- test/distributed/test_dynamo_distributed.py
- test/distributed/test_inductor_collectives.py
- torch/_functorch/partitioners.py
- torch/_functorch/_aot_autograd/**
- torch/_functorch/aot_autograd.py
- torch/_functorch/partitioners.py
- .ci/docker/ci_commit_pins/**
- .github/ci_commit_pins/**
- c10/core/Sym*
- torch/fx/experimental/symbolic_shapes.py
- test/distributed/_tensor/test_dtensor_compile.py
- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
- torch/distributed/_tensor/**
- torch/distributed/fsdp/**
"module: cpu":
- aten/src/ATen/cpu/**
@ -66,3 +72,10 @@
"ciflow/trunk":
- .ci/docker/ci_commit_pins/triton.txt
"oncall: distributed":
- torch/csrc/distributed/**
- torch/distributed/**
- torch/nn/parallel/**
- test/distributed/**
- torch/testing/_internal/distributed/**


@ -4,15 +4,19 @@
- .ci/onnx/*
- .ci/docker/common/install_onnx.sh
- aten/src/ATen/core/interned_strings.h
- benchmarks/dynamo/**
- docs/source/onnx.rst
- docs/source/onnx*
- docs/source/scripts/onnx/**
- docs/source/_static/img/onnx/**
- scripts/onnx/**
- test/onnx/**
- test/onnx_caffe2/**
- tools/onnx/**
- torch/_dynamo/backends/onnxrt.py
- torch/_C/__init__.pyi.in
- torch/_C/_onnx.pyi
- torch/_logging/**
- torch/csrc/jit/passes/onnx.*
- torch/csrc/jit/passes/onnx/**
- torch/csrc/jit/serialization/export.*
@ -22,8 +26,6 @@
- torch/testing/_internal/common_methods_invocations.py
- third_party/onnx
- caffe2/python/onnx/**
- benchmarks/dynamo/_onnx/**
- torch/_logging/**
approved_by:
- BowenBao
- abock
@ -72,6 +74,7 @@
- name: OSS CI / pytorchbot
patterns:
- .github/ci_commit_pins/audio.txt
- .github/ci_commit_pins/vision.txt
- .github/ci_commit_pins/torchdynamo.txt
- .ci/docker/ci_commit_pins/triton.txt
@ -82,6 +85,19 @@
- EasyCLA
- Lint
- pull
- inductor
- name: OSS CI /pytorchbot / Executorch
patterns:
- .ci/docker/ci_commit_pins/executorch.txt
approved_by:
- pytorchbot
ignore_flaky_failures: false
mandatory_checks_name:
- EasyCLA
- Lint
- pull / linux-jammy-py3-clang12-executorch / build
- pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge)
- name: OSS CI / pytorchbot / XLA
patterns:
@ -92,8 +108,8 @@
mandatory_checks_name:
- EasyCLA
- Lint
- pull / linux-bionic-py3_8-clang8-xla / build
- pull / linux-bionic-py3_8-clang8-xla / test (xla, 1, 1, linux.12xlarge)
- pull / linux-focal-py3_8-clang9-xla / build
- pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge)
- name: Documentation
patterns:
@ -123,9 +139,6 @@
- name: PrimTorch
patterns:
- aten/src/ATen/native_functions.yaml
- aten/src/ATen/native/**
- test/**
- torch/_meta_registrations.py
- torch/_decomp/**
- torch/_refs/**
@ -319,6 +332,7 @@
- XiaobingSuper
- jgong5
- vfdev-5
- leslie-fang-intel
mandatory_checks_name:
- EasyCLA
- Lint
@ -337,6 +351,21 @@
- Lint
- pull
- name: x86 CPU quantization
patterns:
- torch/ao/quantization/quantizer/x86_inductor_quantizer.py
- torch/_inductor/fx_passes/quantization.py
- test/quantization/core/test_quantized_op.py
- test/inductor/test_mkldnn_pattern_matcher.py
- test/quantization/pt2e/test_x86inductor_quantizer.py
approved_by:
- leslie-fang-intel
- jgong5
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: Autocast
patterns:
- torch/amp/**


@ -10,6 +10,7 @@ ciflow_push_tags:
- ciflow/mps
- ciflow/nightly
- ciflow/periodic
- ciflow/rocm
- ciflow/slow
- ciflow/trunk
- ciflow/unstable


@ -1,7 +1,5 @@
blas=1.0
cmake=3.22.1
mkl=2022.1.0
mkl-include=2022.1.0
ninja=1.10.2
numpy=1.23.3
pyyaml=6.0


@ -5,7 +5,7 @@ cmake=3.22.*
typing-extensions=4.3.0
dataclasses=0.8
pip=22.2.2
pillow=9.2.0
pillow=10.0.1
pkg-config=0.29.2
wheel=0.37.1
# NB: This is intentionally held back because anaconda main doesn't


@ -7,7 +7,7 @@ cmake=3.22.*
typing-extensions=4.3.0
dataclasses=0.8
pip=22.2.2
pillow=9.2.0
pillow=10.0.1
libuv=1.40.0
pkg-config=0.29.2
wheel=0.37.1


@ -1,3 +1,4 @@
# iOS simulator requirements
coremltools==5.0b5
protobuf==3.20.2
optree==0.9.1


@ -10,6 +10,7 @@ numba<=0.49.1; platform_machine != "arm64"
opt-einsum>=3.3
psutil==5.9.1
nvidia-ml-py==11.525.84
packaging==23.1
pygments==2.15.0
pytest==7.3.2
pytest-xdist==3.3.1
@ -25,3 +26,5 @@ sympy==1.11.1
pytest-cpp==2.3.0
rockset==1.0.3
z3-solver==4.12.2.0
tensorboard==2.13.0
optree==0.9.1


@ -1,2 +1,2 @@
typing-extensions
typing-extensions>=4.8.0
jinja2


@ -60,12 +60,20 @@ def build_triton(
build_conda: bool = False,
build_rocm: bool = False,
py_version: Optional[str] = None,
release: bool = False,
) -> Path:
env = os.environ.copy()
if "MAX_JOBS" not in env:
max_jobs = os.cpu_count() or 1
env["MAX_JOBS"] = str(max_jobs)
version_suffix = ""
if not release:
# Nightly binaries include the triton commit hash, i.e. 2.1.0+e6216047b8
# while release build should only include the version, i.e. 2.1.0
version_suffix = f"+{commit_hash[:10]}"
version += version_suffix
with TemporaryDirectory() as tmpdir:
triton_basedir = Path(tmpdir) / "triton"
triton_pythondir = triton_basedir / "python"
@ -80,7 +88,7 @@ def build_triton(
if build_conda:
with open(triton_basedir / "meta.yaml", "w") as meta:
print(
f"package:\n name: torchtriton\n version: {version}+{commit_hash[:10]}\n",
f"package:\n name: torchtriton\n version: {version}\n",
file=meta,
)
print("source:\n path: .\n", file=meta)
@ -103,7 +111,7 @@ def build_triton(
patch_init_py(
triton_pythondir / "triton" / "__init__.py",
version=f"{version}+{commit_hash[:10]}",
version=f"{version}",
)
if py_version is None:
py_version = f"{sys.version_info.major}.{sys.version_info.minor}"
@ -122,21 +130,25 @@ def build_triton(
cwd=triton_basedir,
env=env,
)
conda_path = list(Path(tmpdir).glob("linux-64/torchtriton*.bz2"))[0]
conda_path = next(iter(Path(tmpdir).glob("linux-64/torchtriton*.bz2")))
shutil.copy(conda_path, Path.cwd())
return Path.cwd() / conda_path.name
patch_setup_py(
triton_pythondir / "setup.py",
name=triton_pkg_name,
version=f"{version}+{commit_hash[:10]}",
)
# change built wheel name and version
env["TRITON_WHEEL_NAME"] = triton_pkg_name
env["TRITON_WHEEL_VERSION_SUFFIX"] = version_suffix
patch_init_py(
triton_pythondir / "triton" / "__init__.py",
version=f"{version}+{commit_hash[:10]}",
version=f"{version}",
)
if build_rocm:
# TODO: Remove me when ROCM triton is updated
patch_setup_py(
triton_pythondir / "setup.py",
name=triton_pkg_name,
version=f"{version}",
)
check_call("scripts/amd/setup_rocm_libs.sh", cwd=triton_basedir, shell=True)
print("ROCm libraries setup for triton installation...")
@ -144,7 +156,7 @@ def build_triton(
[sys.executable, "setup.py", "bdist_wheel"], cwd=triton_pythondir, env=env
)
whl_path = list((triton_pythondir / "dist").glob("*.whl"))[0]
whl_path = next(iter((triton_pythondir / "dist").glob("*.whl")))
shutil.copy(whl_path, Path.cwd())
if build_rocm:
@ -157,12 +169,14 @@ def main() -> None:
from argparse import ArgumentParser
parser = ArgumentParser("Build Triton binaries")
parser.add_argument("--release", action="store_true")
parser.add_argument("--build-conda", action="store_true")
parser.add_argument("--build-rocm", action="store_true")
parser.add_argument("--py-version", type=str)
parser.add_argument("--commit-hash", type=str)
parser.add_argument("--triton-version", type=str, default=read_triton_version())
args = parser.parse_args()
build_triton(
build_rocm=args.build_rocm,
commit_hash=args.commit_hash
@ -171,6 +185,7 @@ def main() -> None:
version=args.triton_version,
build_conda=args.build_conda,
py_version=args.py_version,
release=args.release,
)
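
The new --release flag above controls whether the Triton package version carries a commit-hash suffix. A standalone sketch of the resulting strings (the helper name is illustrative):

```
def triton_version(base: str, commit_hash: str, release: bool) -> str:
    # Nightly: "2.1.0+e6216047b8"; release: plain "2.1.0".
    return base if release else f"{base}+{commit_hash[:10]}"

assert triton_version("2.1.0", "e6216047b8deadbeef", release=False) == "2.1.0+e6216047b8"
assert triton_version("2.1.0", "e6216047b8deadbeef", release=True) == "2.1.0"
```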


@ -1,6 +1,7 @@
#!/usr/bin/env python3
"""Check whether a PR has required labels."""
import sys
from typing import Any
from github_utils import gh_delete_comment, gh_post_pr_comment
@ -46,7 +47,7 @@ def main() -> None:
except Exception as e:
pass
exit(0)
sys.exit(0)
if __name__ == "__main__":

.github/scripts/drci_mocks.json.gz vendored Normal file

Binary file not shown.


@ -1,6 +1,5 @@
#!/usr/bin/env python3
import argparse
import sys
from pathlib import Path
@ -10,9 +9,11 @@ import yaml
REPO_ROOT = Path(__file__).resolve().parent.parent.parent
WORKFLOWS = REPO_ROOT / ".github" / "workflows"
EXPECTED_GROUP = (
EXPECTED_GROUP_PREFIX = (
"${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}"
"-${{ github.event_name == 'workflow_dispatch' }}"
)
EXPECTED_GROUP = (
EXPECTED_GROUP_PREFIX + "-${{ github.event_name == 'workflow_dispatch' }}"
)
@ -26,15 +27,8 @@ def should_check(filename: Path) -> bool:
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Ensure all relevant GitHub actions jobs will be cancelled based on a concurrency key"
)
args = parser.parse_args()
files = list(WORKFLOWS.glob("*.yml"))
errors_found = False
files = [f for f in files if should_check(f)]
files = [f for f in WORKFLOWS.glob("*.yml") if should_check(f)]
names = set()
for filename in files:
with open(filename) as f:
@ -46,7 +40,18 @@ if __name__ == "__main__":
errors_found = True
names.add(name)
actual = data.get("concurrency", {})
if not actual.get("group", "").startswith(EXPECTED_GROUP):
if filename.name == "create_release.yml":
if not actual.get("group", "").startswith(EXPECTED_GROUP_PREFIX):
print(
f"'concurrency' incorrect or not found in '{filename.relative_to(REPO_ROOT)}'",
file=sys.stderr,
)
print(
f"concurrency group should start with {EXPECTED_GROUP_PREFIX} but found {actual.get('group', None)}",
file=sys.stderr,
)
errors_found = True
elif not actual.get("group", "").startswith(EXPECTED_GROUP):
print(
f"'concurrency' incorrect or not found in '{filename.relative_to(REPO_ROOT)}'",
file=sys.stderr,


@ -410,16 +410,17 @@ def process_jobs(
if target_job in (TEST_JOB_NAME, BUILD_AND_TEST_JOB_NAME):
target_cfg = m.group("cfg")
return _filter_jobs(
# NB: There can be multiple unstable configurations, i.e. inductor, inductor_huggingface
test_matrix = _filter_jobs(
test_matrix=test_matrix,
issue_type=issue_type,
target_cfg=target_cfg,
)
warnings.warn(
f"Found a matching {issue_type.value} issue {target_url} for {workflow} / {job_name}, "
+ f"but the name {target_job_cfg} is invalid"
)
else:
warnings.warn(
f"Found a matching {issue_type.value} issue {target_url} for {workflow} / {job_name}, "
+ f"but the name {target_job_cfg} is invalid"
)
# Found no matching target, return the same input test matrix
return test_matrix


@ -10,13 +10,13 @@ architectures:
* Latest ROCM
"""
import os
from typing import Dict, List, Optional, Tuple
CUDA_ARCHES = ["11.8", "12.1"]
ROCM_ARCHES = ["5.5", "5.6"]
ROCM_ARCHES = ["5.6", "5.7"]
CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]
@ -24,6 +24,81 @@ CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]
CPU_AARCH64_ARCH = ["cpu-aarch64"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"11.8": (
"nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
"nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu11==2.19.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.1": (
"nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
"nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.19.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
}
def get_nccl_submodule_version() -> str:
from pathlib import Path
nccl_version_mk = (
Path(__file__).absolute().parent.parent.parent
/ "third_party"
/ "nccl"
/ "nccl"
/ "makefiles"
/ "version.mk"
)
if not nccl_version_mk.exists():
raise RuntimeError(
"Please make sure that nccl submodule is checked out when importing this script"
)
with nccl_version_mk.open("r") as f:
content = f.read()
d = {}
for l in content.split("\n"):
if not l.startswith("NCCL_"):
continue
(k, v) = l.split(":=")
d[k.strip()] = v.strip()
return f"{d['NCCL_MAJOR']}.{d['NCCL_MINOR']}.{d['NCCL_PATCH']}"
def get_nccl_wheel_version(arch_version: str) -> str:
import re
requirements = map(
str.strip, re.split("[;|]", PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version])
)
return next(x for x in requirements if x.startswith("nvidia-nccl-cu")).split("==")[
1
]
def validate_nccl_dep_consistency(arch_version: str) -> None:
wheel_ver = get_nccl_wheel_version(arch_version)
submodule_ver = get_nccl_submodule_version()
if wheel_ver != submodule_ver:
raise RuntimeError(
f"NCCL submodule version {submodule_ver} differs from wheel version {wheel_ver}"
)
def arch_type(arch_version: str) -> str:
if arch_version in CUDA_ARCHES:
@ -38,23 +113,29 @@ def arch_type(arch_version: str) -> str:
return "cpu"
# This can be updated to the release version when cutting release branch, i.e. 2.1
DEFAULT_TAG = os.getenv("RELEASE_VERSION_TAG", "main")
WHEEL_CONTAINER_IMAGES = {
**{
gpu_arch: f"pytorch/manylinux-builder:cuda{gpu_arch}"
gpu_arch: f"pytorch/manylinux-builder:cuda{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in CUDA_ARCHES
},
**{
gpu_arch: f"pytorch/manylinux-builder:rocm{gpu_arch}"
gpu_arch: f"pytorch/manylinux-builder:rocm{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in ROCM_ARCHES
},
"cpu": "pytorch/manylinux-builder:cpu",
"cpu-cxx11-abi": "pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi",
"cpu-aarch64": "pytorch/manylinuxaarch64-builder:cpu-aarch64",
"cpu": f"pytorch/manylinux-builder:cpu-{DEFAULT_TAG}",
"cpu-cxx11-abi": f"pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-{DEFAULT_TAG}",
"cpu-aarch64": f"pytorch/manylinuxaarch64-builder:cpu-aarch64-{DEFAULT_TAG}",
}
CONDA_CONTAINER_IMAGES = {
**{gpu_arch: f"pytorch/conda-builder:cuda{gpu_arch}" for gpu_arch in CUDA_ARCHES},
"cpu": "pytorch/conda-builder:cpu",
**{
gpu_arch: f"pytorch/conda-builder:cuda{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in CUDA_ARCHES
},
"cpu": f"pytorch/conda-builder:cpu-{DEFAULT_TAG}",
}
PRE_CXX11_ABI = "pre-cxx11"
@ -64,26 +145,38 @@ DEBUG = "debug"
LIBTORCH_CONTAINER_IMAGES: Dict[Tuple[str, str], str] = {
**{
(gpu_arch, PRE_CXX11_ABI): f"pytorch/manylinux-builder:cuda{gpu_arch}"
(
gpu_arch,
PRE_CXX11_ABI,
): f"pytorch/manylinux-builder:cuda{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in CUDA_ARCHES
},
**{
(gpu_arch, CXX11_ABI): f"pytorch/libtorch-cxx11-builder:cuda{gpu_arch}"
(
gpu_arch,
CXX11_ABI,
): f"pytorch/libtorch-cxx11-builder:cuda{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in CUDA_ARCHES
},
**{
(gpu_arch, PRE_CXX11_ABI): f"pytorch/manylinux-builder:rocm{gpu_arch}"
(
gpu_arch,
PRE_CXX11_ABI,
): f"pytorch/manylinux-builder:rocm{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in ROCM_ARCHES
},
**{
(gpu_arch, CXX11_ABI): f"pytorch/libtorch-cxx11-builder:rocm{gpu_arch}"
(
gpu_arch,
CXX11_ABI,
): f"pytorch/libtorch-cxx11-builder:rocm{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in ROCM_ARCHES
},
("cpu", PRE_CXX11_ABI): "pytorch/manylinux-builder:cpu",
("cpu", CXX11_ABI): "pytorch/libtorch-cxx11-builder:cpu",
("cpu", PRE_CXX11_ABI): f"pytorch/manylinux-builder:cpu-{DEFAULT_TAG}",
("cpu", CXX11_ABI): f"pytorch/libtorch-cxx11-builder:cpu-{DEFAULT_TAG}",
}
FULL_PYTHON_VERSIONS = ["3.8", "3.9", "3.10", "3.11"]
FULL_PYTHON_VERSIONS = ["3.8", "3.9", "3.10", "3.11", "3.12"]
def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
@ -190,7 +283,6 @@ def generate_wheels_matrix(
os: str,
arches: Optional[List[str]] = None,
python_versions: Optional[List[str]] = None,
gen_special_an_non_special_wheel: bool = True,
) -> List[Dict[str, str]]:
package_type = "wheel"
if os == "linux" or os == "linux-aarch64":
@ -224,9 +316,8 @@ def generate_wheels_matrix(
else arch_version
)
# special 12.1 wheels package without dependencies
# dependency downloaded via pip install
if arch_version == "12.1" and os == "linux":
# 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if arch_version in ["12.1", "11.8"] and os == "linux":
ret.append(
{
"python_version": python_version,
@ -238,41 +329,36 @@ def generate_wheels_matrix(
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": "nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
"nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.18.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'",
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-with-pypi-cudnn".replace( # noqa: B950
"pytorch_extra_install_requirements": PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version], # fmt: skip
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}".replace( # noqa: B950
".", "_"
),
}
)
if not gen_special_an_non_special_wheel:
continue
ret.append(
{
"python_version": python_version,
"gpu_arch_type": gpu_arch_type,
"gpu_arch_version": gpu_arch_version,
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"devtoolset": "cxx11-abi"
if arch_version == "cpu-cxx11-abi"
else "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}".replace(
".", "_"
),
}
)
else:
ret.append(
{
"python_version": python_version,
"gpu_arch_type": gpu_arch_type,
"gpu_arch_version": gpu_arch_version,
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"devtoolset": "cxx11-abi"
if arch_version == "cpu-cxx11-abi"
else "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}".replace(
".", "_"
),
"pytorch_extra_install_requirements":
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"] # fmt: skip
if os != "linux" else "",
}
)
return ret
validate_nccl_dep_consistency("12.1")
validate_nccl_dep_consistency("11.8")


@ -60,7 +60,7 @@ class BinaryBuildWorkflow:
branches: str = "nightly"
# Mainly for macos
cross_compile_arm64: bool = False
xcode_version: str = ""
macos_runner: str = "macos-12-xl"
def __post_init__(self) -> None:
if self.abi_version:
@ -125,7 +125,9 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
package_type="libtorch",
abi_version=generate_binary_build_matrix.CXX11_ABI,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.LINUX, generate_binary_build_matrix.CXX11_ABI
OperatingSystem.LINUX,
generate_binary_build_matrix.CXX11_ABI,
libtorch_variants=["shared-with-deps"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
@ -137,7 +139,9 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
package_type="libtorch",
abi_version=generate_binary_build_matrix.PRE_CXX11_ABI,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.LINUX, generate_binary_build_matrix.PRE_CXX11_ABI
OperatingSystem.LINUX,
generate_binary_build_matrix.PRE_CXX11_ABI,
libtorch_variants=["shared-with-deps"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
@ -154,7 +158,6 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
OperatingSystem.LINUX,
arches=["11.8", "12.1"],
python_versions=["3.8"],
gen_special_an_non_special_wheel=False,
),
branches="main",
),
@ -212,7 +215,9 @@ WINDOWS_BINARY_BUILD_WORKFLOWS = [
package_type="libtorch",
abi_version=generate_binary_build_matrix.RELEASE,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.WINDOWS, generate_binary_build_matrix.RELEASE
OperatingSystem.WINDOWS,
generate_binary_build_matrix.RELEASE,
libtorch_variants=["shared-with-deps"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
@ -224,7 +229,9 @@ WINDOWS_BINARY_BUILD_WORKFLOWS = [
package_type="libtorch",
abi_version=generate_binary_build_matrix.DEBUG,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.WINDOWS, generate_binary_build_matrix.DEBUG
OperatingSystem.WINDOWS,
generate_binary_build_matrix.DEBUG,
libtorch_variants=["shared-with-deps"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
@ -294,20 +301,39 @@ MACOS_BINARY_BUILD_WORKFLOWS = [
package_type="libtorch",
abi_version=generate_binary_build_matrix.CXX11_ABI,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.MACOS, generate_binary_build_matrix.CXX11_ABI
OperatingSystem.MACOS,
generate_binary_build_matrix.CXX11_ABI,
libtorch_variants=["shared-with-deps"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.MACOS_ARM64,
package_type="libtorch",
abi_version=generate_binary_build_matrix.CXX11_ABI,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.MACOS,
generate_binary_build_matrix.CXX11_ABI,
libtorch_variants=["shared-with-deps"],
),
cross_compile_arm64=False,
macos_runner="macos-13-xlarge",
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.MACOS_ARM64,
package_type="wheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.MACOS_ARM64
),
cross_compile_arm64=True,
cross_compile_arm64=False,
macos_runner="macos-13-xlarge",
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,


@ -111,7 +111,7 @@ def fetch_jobs(url: str, headers: Dict[str, str]) -> List[Dict[str, str]]:
# running.
def find_job_id(args: Any) -> str:
def find_job_id_name(args: Any) -> Tuple[str, str]:
# From https://docs.github.com/en/actions/learn-github-actions/environment-variables
PYTORCH_REPO = os.environ.get("GITHUB_REPOSITORY", "pytorch/pytorch")
PYTORCH_GITHUB_API = f"https://api.github.com/repos/{PYTORCH_REPO}"
@ -130,15 +130,28 @@ def find_job_id(args: Any) -> str:
for job in jobs:
if job["runner_name"] == args.runner_name:
return job["id"]
return (job["id"], job["name"])
raise RuntimeError(f"Can't find job id for runner {args.runner_name}")
def set_output(name: str, val: Any) -> None:
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print(f"{name}={val}", file=env)
print(f"setting {name}={val}")
else:
print(f"::set-output name={name}::{val}")
def main() -> None:
args = parse_args()
try:
print(find_job_id(args))
# Get both the job ID and job name because we have already spent a request
# here to get the job info
job_id, job_name = find_job_id_name(args)
set_output("job-id", job_id)
set_output("job-name", job_name)
except Exception as e:
print(repr(e), file=sys.stderr)
print(f"workflow-{args.workflow_run_id}")


@ -5,12 +5,15 @@ import os
import warnings
from dataclasses import dataclass
from typing import Any, Callable, cast, Dict, List, Optional, Tuple
from typing import Any, Callable, cast, Dict, List, Optional, Tuple, Union
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import Request, urlopen
GITHUB_API_URL = "https://api.github.com"
@dataclass
class GitHubComment:
body_text: str
@ -26,16 +29,20 @@ def gh_fetch_url_and_headers(
url: str,
*,
headers: Optional[Dict[str, str]] = None,
data: Optional[Dict[str, Any]] = None,
data: Union[Optional[Dict[str, Any]], str] = None,
method: Optional[str] = None,
reader: Callable[[Any], Any] = lambda x: x.read(),
) -> Tuple[Any, Any]:
if headers is None:
headers = {}
token = os.environ.get("GITHUB_TOKEN")
if token is not None and url.startswith("https://api.github.com/"):
if token is not None and url.startswith(f"{GITHUB_API_URL}/"):
headers["Authorization"] = f"token {token}"
data_ = json.dumps(data).encode() if data is not None else None
data_ = None
if data is not None:
data_ = data.encode() if isinstance(data, str) else json.dumps(data).encode()
try:
with urlopen(Request(url, headers=headers, data=data_, method=method)) as conn:
return conn.headers, reader(conn)
@ -57,7 +64,7 @@ def gh_fetch_url(
url: str,
*,
headers: Optional[Dict[str, str]] = None,
data: Optional[Dict[str, Any]] = None,
data: Union[Optional[Dict[str, Any]], str] = None,
method: Optional[str] = None,
reader: Callable[[Any], Any] = lambda x: x.read(),
) -> Any:
@ -125,7 +132,7 @@ def gh_post_pr_comment(
org: str, repo: str, pr_num: int, comment: str, dry_run: bool = False
) -> List[Dict[str, Any]]:
return _gh_post_comment(
f"https://api.github.com/repos/{org}/{repo}/issues/{pr_num}/comments",
f"{GITHUB_API_URL}/repos/{org}/{repo}/issues/{pr_num}/comments",
comment,
dry_run,
)
@ -135,14 +142,14 @@ def gh_post_commit_comment(
org: str, repo: str, sha: str, comment: str, dry_run: bool = False
) -> List[Dict[str, Any]]:
return _gh_post_comment(
f"https://api.github.com/repos/{org}/{repo}/commits/{sha}/comments",
f"{GITHUB_API_URL}/repos/{org}/{repo}/commits/{sha}/comments",
comment,
dry_run,
)
def gh_delete_comment(org: str, repo: str, comment_id: int) -> None:
url = f"https://api.github.com/repos/{org}/{repo}/issues/comments/{comment_id}"
url = f"{GITHUB_API_URL}/repos/{org}/{repo}/issues/comments/{comment_id}"
gh_fetch_url(url, method="DELETE")
@ -153,7 +160,7 @@ def gh_fetch_merge_base(org: str, repo: str, base: str, head: str) -> str:
# https://docs.github.com/en/rest/commits/commits?apiVersion=2022-11-28#compare-two-commits
try:
json_data = gh_fetch_url(
f"https://api.github.com/repos/{org}/{repo}/compare/{base}...{head}",
f"{GITHUB_API_URL}/repos/{org}/{repo}/compare/{base}...{head}",
headers={"Accept": "application/vnd.github.v3+json"},
reader=json.load,
)
@ -167,3 +174,18 @@ def gh_fetch_merge_base(org: str, repo: str, base: str, head: str) -> str:
warnings.warn(f"Failed to get merge base for {base}...{head}: {error}")
return merge_base
def gh_update_pr_state(org: str, repo: str, pr_num: int, state: str = "open") -> None:
url = f"{GITHUB_API_URL}/repos/{org}/{repo}/pulls/{pr_num}"
try:
gh_fetch_url(url, method="PATCH", data={"state": state})
except HTTPError as err:
# When trying to open the pull request, error 422 means that the branch
# has been deleted and the API couldn't re-open it
if err.code == 422 and state == "open":
warnings.warn(
f"Failed to open {pr_num} because its head branch has been deleted: {err}"
)
else:
raise
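
Typical usage of the new helper (the PR number is illustrative; the module name matches the file shown above):

```
from github_utils import gh_update_pr_state

# Reopen a PR; a 422 caused by a deleted head branch is downgraded to a warning.
gh_update_pr_state("pytorch", "pytorch", 12345, state="open")

# Any other HTTP error (or a 422 when closing) is re-raised.
gh_update_pr_state("pytorch", "pytorch", 12345, state="closed")
```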

.github/scripts/gql_mocks.json generated vendored

File diff suppressed because one or more lines are too long

.github/scripts/gql_mocks.json.gz vendored Normal file

Binary file not shown.


@ -38,6 +38,12 @@ def parse_args() -> argparse.Namespace:
required=True,
help="A unique job identifier that should be the same for all runs of job",
)
parser.add_argument(
"--sha", required="--upload" in sys.argv, help="SHA of the commit"
) # Only required for upload
parser.add_argument(
"--test_config", required="--upload" in sys.argv, help="The test config"
) # Only required for upload
parser.add_argument(
"--shard", required="--upload" in sys.argv, help="The shard id"
) # Only required for upload
@ -84,6 +90,8 @@ def main() -> None:
pr_identifier=pr_identifier,
repo=repo,
job_identifier=args.job_identifier,
sha=args.sha,
test_config=args.test_config,
shard=args.shard,
cache_dir=cache_dir,
bucket=args.bucket,


@ -56,6 +56,8 @@ def upload_pytest_cache(
pr_identifier: PRIdentifier,
repo: GithubRepo,
job_identifier: str,
sha: str,
test_config: str,
shard: str,
cache_dir: Path,
temp_dir: Path,
@ -79,25 +81,11 @@ def upload_pytest_cache(
if not bucket:
bucket = BUCKET
# Merge the current cache with any caches from previous runs before uploading
# We only need to merge it with the cache for the same shard (which will have already been downloaded if it exists)
# since the other shards will handle themselves
shard_cache_path = _get_temp_cache_dir_path(
temp_dir, pr_identifier, repo, job_identifier, shard
)
if shard_cache_path.is_dir():
_merge_pytest_caches(shard_cache_path, cache_dir)
#
# Upload the cache
#
obj_key_prefix = _get_s3_key_prefix(pr_identifier, repo, job_identifier, shard)
# This doesn't include the zip file extension. That'll get added later
zip_file_path = temp_dir / ZIP_UPLOAD / obj_key_prefix
zip_file_path = zip_folder(cache_dir, zip_file_path)
obj_key_prefix = _get_s3_key_prefix(
pr_identifier, repo, job_identifier, sha, test_config, shard
)
zip_file_path = zip_folder(cache_dir, temp_dir / ZIP_UPLOAD / obj_key_prefix)
obj_key = f"{obj_key_prefix}{os.path.splitext(zip_file_path)[1]}" # Keep the new file extension
upload_file_to_s3(zip_file_path, bucket, obj_key)
@ -136,38 +124,22 @@ def download_pytest_cache(
)
for downloaded_zip in downloads:
# the file name of the zip is the shard id
shard = os.path.splitext(os.path.basename(downloaded_zip))[0]
cache_dir_for_shard = _get_temp_cache_dir_path(
temp_dir, pr_identifier, repo, job_identifier, shard
# Unzip into random folder, then merge with the current cache
cache_dir_for_shard = (
temp_dir / UNZIPPED_CACHES / os.urandom(16).hex() / PYTEST_CACHE_DIR_NAME
)
unzip_folder(downloaded_zip, cache_dir_for_shard)
print(
f"Merging cache for job_identifier `{job_identifier}`, shard `{shard}` into `{dest_cache_dir}`"
)
print(f"Merging cache from {downloaded_zip}")
_merge_pytest_caches(cache_dir_for_shard, dest_cache_dir)
def _get_temp_cache_dir_path(
temp_dir: Path,
pr_identifier: PRIdentifier,
repo: GithubRepo,
job_identifier: str,
shard: str,
) -> Path:
return (
temp_dir
/ UNZIPPED_CACHES
/ _get_s3_key_prefix(pr_identifier, repo, job_identifier, shard)
/ PYTEST_CACHE_DIR_NAME
)
def _get_s3_key_prefix(
pr_identifier: PRIdentifier,
repo: GithubRepo,
job_identifier: str,
sha: str = "",
test_config: str = "",
shard: str = "",
) -> str:
"""
@ -176,6 +148,10 @@ def _get_s3_key_prefix(
"""
prefix = f"{PYTEST_CACHE_KEY_PREFIX}/{repo.owner}/{repo.name}/{pr_identifier}/{sanitize_for_s3(job_identifier)}"
if sha:
prefix += f"/{sha}"
if test_config:
prefix += f"/{sanitize_for_s3(test_config)}"
if shard:
prefix += f"/{shard}"

File diff suppressed because it is too large

.github/scripts/rockset_mocks.json.gz vendored Normal file

Binary file not shown.


@ -0,0 +1,64 @@
import argparse
import subprocess
from typing import Dict
import generate_binary_build_matrix
def tag_image(
image: str,
default_tag: str,
release_version: str,
dry_run: str,
tagged_images: Dict[str, bool],
) -> None:
if image in tagged_images:
return
release_image = image.replace(f"-{default_tag}", f"-{release_version}")
print(f"Tagging {image} to {release_image} , dry_run: {dry_run}")
if dry_run == "disabled":
subprocess.check_call(["docker", "pull", image])
subprocess.check_call(["docker", "tag", image, release_image])
subprocess.check_call(["docker", "push", release_image])
tagged_images[image] = True
def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument(
"--version",
help="Version to tag",
type=str,
default="2.2",
)
parser.add_argument(
"--dry-run",
help="No Runtime Error check",
type=str,
choices=["enabled", "disabled"],
default="enabled",
)
options = parser.parse_args()
tagged_images: Dict[str, bool] = dict()
platform_images = [
generate_binary_build_matrix.WHEEL_CONTAINER_IMAGES,
generate_binary_build_matrix.LIBTORCH_CONTAINER_IMAGES,
generate_binary_build_matrix.CONDA_CONTAINER_IMAGES,
]
default_tag = generate_binary_build_matrix.DEFAULT_TAG
for platform_image in platform_images: # type: ignore[attr-defined]
for arch in platform_image.keys(): # type: ignore[attr-defined]
tag_image(
platform_image[arch], # type: ignore[index]
default_tag,
options.version,
options.dry_run,
tagged_images,
)
if __name__ == "__main__":
main()
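
The retag performed by tag_image is a plain string substitution on the default tag. For example, with DEFAULT_TAG == "main" and --version 2.2:

```
image = "pytorch/conda-builder:cpu-main"
release_image = image.replace("-main", "-2.2")
assert release_image == "pytorch/conda-builder:cpu-2.2"
```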


@ -102,6 +102,30 @@ MOCKED_DISABLED_UNSTABLE_JOBS = {
"manywheel-py3_8-cuda11_8-build",
"",
],
"inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor)": [
"pytorchbot",
"107079",
"https://github.com/pytorch/pytorch/issues/107079",
"inductor",
"cuda12.1-py3.10-gcc9-sm86",
"test (inductor)",
],
"inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_huggingface)": [
"pytorchbot",
"109153",
"https://github.com/pytorch/pytorch/issues/109153",
"inductor",
"cuda12.1-py3.10-gcc9-sm86",
"test (inductor_huggingface)",
],
"inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_huggingface_dynamic)": [
"pytorchbot",
"109154",
"https://github.com/pytorch/pytorch/issues/109154",
"inductor",
"cuda12.1-py3.10-gcc9-sm86",
"test (inductor_huggingface_dynamic)",
],
}
MOCKED_PR_INFO = {
@ -569,6 +593,37 @@ class TestConfigFilter(TestCase):
"expected": '{"include": [{"config": "default", "unstable": "unstable"}]}',
"description": "Both binary build and test jobs are unstable",
},
{
"workflow": "inductor",
"job_name": "cuda12.1-py3.10-gcc9-sm86 / build",
"test_matrix": """
{ include: [
{ config: "inductor" },
{ config: "inductor_huggingface", shard: 1 },
{ config: "inductor_huggingface", shard: 2 },
{ config: "inductor_timm", shard: 1 },
{ config: "inductor_timm", shard: 2 },
{ config: "inductor_torchbench" },
{ config: "inductor_huggingface_dynamic" },
{ config: "inductor_torchbench_dynamic" },
{ config: "inductor_distributed" },
]}
""",
"expected": """
{ "include": [
{ "config": "inductor", "unstable": "unstable" },
{ "config": "inductor_huggingface", "shard": 1, "unstable": "unstable" },
{ "config": "inductor_huggingface", "shard": 2, "unstable": "unstable" },
{ "config": "inductor_timm", "shard": 1 },
{ "config": "inductor_timm", "shard": 2 },
{ "config": "inductor_torchbench" },
{ "config": "inductor_huggingface_dynamic", "unstable": "unstable" },
{ "config": "inductor_torchbench_dynamic" },
{ "config": "inductor_distributed" }
]}
""",
"description": "Marking multiple unstable configurations",
},
]
for case in testcases:
@ -577,7 +632,7 @@ class TestConfigFilter(TestCase):
test_matrix = yaml.safe_load(case["test_matrix"])
filtered_test_matrix = mark_unstable_jobs(workflow, job_name, test_matrix)
self.assertEqual(case["expected"], json.dumps(filtered_test_matrix))
self.assertEqual(json.loads(case["expected"]), filtered_test_matrix)
@mock.patch("subprocess.check_output")
def test_perform_misc_tasks(self, mocked_subprocess: Any) -> None:


@ -7,11 +7,12 @@
# GraphQL queries in trymerge.py, please make sure to delete `gql_mocks.json`
# And re-run the test locally with ones PAT
import gzip
import json
import os
import warnings
from hashlib import sha256
from typing import Any, cast, Dict, List, Optional
from typing import Any, Dict, List, Optional
from unittest import main, mock, skip, TestCase
from urllib.error import HTTPError
@ -19,18 +20,20 @@ from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo
from trymerge import (
categorize_checks,
DRCI_CHECKRUN_NAME,
find_matching_merge_rule,
FlakyRule,
get_classifications,
get_drci_classifications,
get_rockset_results,
gh_get_team_members,
gh_graphql,
GitHubPR,
is_broken_trunk,
JobCheckState,
main as trymerge_main,
MandatoryChecksMissingError,
MergeRule,
PostCommentError,
RE_GHSTACK_DESC,
read_merge_rules,
remove_job_name_suffix,
validate_revert,
@ -39,6 +42,10 @@ from trymerge import (
if "GIT_REMOTE_URL" not in os.environ:
os.environ["GIT_REMOTE_URL"] = "https://github.com/pytorch/pytorch"
GQL_MOCKS = "gql_mocks.json.gz"
ROCKSET_MOCKS = "rockset_mocks.json.gz"
DRCI_MOCKS = "drci_mocks.json.gz"
def mock_query(
fallback_function: Any,
@ -51,11 +58,11 @@ def mock_query(
def get_mocked_queries() -> Any:
if not os.path.exists(gql_db_fname):
return {}
with open(gql_db_fname, encoding="utf-8") as f:
with gzip.open(gql_db_fname, encoding="utf-8", mode="rt") as f:
return json.load(f)
def save_mocked_queries(obj: Any) -> None:
with open(gql_db_fname, encoding="utf-8", mode="w") as f:
with gzip.open(gql_db_fname, encoding="utf-8", mode="wt") as f:
json.dump(obj, f, indent=2)
f.write("\n")
@ -68,19 +75,20 @@ def mock_query(
try:
rc = fallback_function(*args)
except HTTPError as err:
if err.code == 401:
if err.code == 401 or err.code == 403:
err_msg = f"If you are seeing this message during workflow run, please make sure to update {file_name}"
err_msg += f" locally, by deleting it and running {os.path.basename(__file__)} with "
err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN environment variable"
err_msg += (
" the rockset api key passed via ROCKSET_API_KEY environment variable"
)
err_msg += f" locally, by deleting it and running {os.path.basename(__file__)} with"
err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN,"
err_msg += " the rockset api key passed via ROCKSET_API_KEY,"
err_msg += " and drci api key passed via DRCI_BOT_KEY environment variables"
if (
os.getenv("GITHUB_TOKEN") is None
or os.getenv("ROCKSET_API_KEY") is None
or os.getenv("DRCI_BOT_KEY") is None
):
err_msg = (
"Failed to update cached GraphQL queries as GITHUB_TOKEN or ROCKSET_API_KEY is not defined."
"Failed to update cached queries as GITHUB_TOKEN or ROCKSET_API_KEY or DRCI_BOT_KEY "
+ "is not defined. "
+ err_msg
)
raise RuntimeError(err_msg) from err
@ -100,19 +108,29 @@ def mocked_gh_graphql(query: str, **kwargs: Any) -> Any:
def gh_graphql_wrapper(query: str, kwargs: Any) -> Any:
return gh_graphql(query, **kwargs)
return mock_query(gh_graphql_wrapper, "gql_mocks.json", key_function, query, kwargs)
return mock_query(gh_graphql_wrapper, GQL_MOCKS, key_function, query, kwargs)
def mocked_rockset_results(head_sha: str, merge_base: str, num_retries: int = 3) -> Any:
return mock_query(
get_rockset_results,
"rockset_mocks.json",
ROCKSET_MOCKS,
lambda x, y: f"{x} {y}",
head_sha,
merge_base,
)
def mocked_drci_classifications(pr_num: int, project: str, num_retries: int = 3) -> Any:
return mock_query(
get_drci_classifications,
DRCI_MOCKS,
lambda x, y: f"{x} {y}",
pr_num,
project,
)
def mock_parse_args(revert: bool = False, force: bool = False) -> Any:
class Object:
def __init__(self) -> None:
@ -189,6 +207,18 @@ def mocked_read_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule
],
ignore_flaky_failures=True,
),
MergeRule(
name="xla",
patterns=[".github/ci_commit_pins/xla.txt"],
approved_by=["pytorchbot"],
mandatory_checks_name=[
"Lint",
"EasyCLA",
"pull / linux-focal-py3_8-clang9-xla / build",
"pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge)",
],
ignore_flaky_failures=True,
),
]
@ -196,16 +226,6 @@ def mocked_read_merge_rules_raise(repo: Any, org: str, project: str) -> List[Mer
raise RuntimeError("testing")
def empty_flaky_rules() -> List[FlakyRule]:
return []
def xla_is_flaky_rules() -> List[FlakyRule]:
return [
FlakyRule("xla", ["FAILED: Build did NOT complete successfully"]),
]
def xla_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule]:
return [
MergeRule(
@ -217,6 +237,7 @@ def xla_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule]:
"EasyCLA",
"pull / linux-bionic-py3_8-clang8-xla / build",
"pull / linux-bionic-py3_8-clang8-xla / test (xla, 1, 1, linux.4xlarge)",
"inductor / cuda11.8-py3.10-gcc7-sm86 / test (inductor_torchbench_dynamic, 1, 1, linux.g5.4xlarge.nvidia.gpu)",
],
ignore_flaky_failures=False,
),
@ -238,9 +259,11 @@ class DummyGitRepo(GitRepo):
return "super awsome commit message"
@mock.patch("trymerge.read_flaky_rules", side_effect=empty_flaky_rules)
@mock.patch("trymerge.get_rockset_results", side_effect=empty_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch(
"trymerge.get_drci_classifications", side_effect=mocked_drci_classifications
)
class TestTryMerge(TestCase):
def test_merge_rules_valid(self, *args: Any) -> None:
"Test that merge_rules.yaml can be parsed"
@ -251,7 +274,7 @@ class TestTryMerge(TestCase):
@mock.patch("trymerge.read_merge_rules", side_effect=mocked_read_merge_rules)
def test_match_rules(self, *args: Any) -> None:
"Tests that PR passes merge rules"
pr = GitHubPR("pytorch", "pytorch", 77700)
pr = GitHubPR("pytorch", "pytorch", 109999)
repo = DummyGitRepo()
self.assertTrue(find_matching_merge_rule(pr, repo) is not None)
@ -304,14 +327,9 @@ class TestTryMerge(TestCase):
def test_internal_changes(self, *args: Any) -> None:
"Tests that PR with internal changes is detected"
pr = GitHubPR("pytorch", "pytorch", 73969)
pr = GitHubPR("pytorch", "pytorch", 110140)
self.assertTrue(pr.has_internal_changes())
def test_checksuites_pagination(self, *args: Any) -> None:
"Tests that PR with lots of checksuits can be fetched"
pr = GitHubPR("pytorch", "pytorch", 73811)
self.assertEqual(len(pr.get_checkrun_conclusions()), 76)
def test_comments_pagination(self, *args: Any) -> None:
"Tests that PR with 50+ comments can be fetched"
pr = GitHubPR("pytorch", "pytorch", 31093)
@ -323,7 +341,9 @@ class TestTryMerge(TestCase):
# see https://gist.github.com/malfet/9b93bc7eeddeaf1d84546efc4f0c577f
pr = GitHubPR("pytorch", "pytorch", 68111)
self.assertGreater(len(pr.get_comments()), 20)
self.assertGreater(len(pr.get_checkrun_conclusions()), 3)
# NS(09/27/2023): GitHub seems to recycle older checkruns
# https://github.com/pytorch/pytorch/pull/68111/checks shows 0 runs
# self.assertGreater(len(pr.get_checkrun_conclusions()), 3)
self.assertGreater(pr.get_commit_count(), 60)
def test_gql_retrieve_checksuites(self, *args: Any) -> None:
@ -368,14 +388,16 @@ class TestTryMerge(TestCase):
def test_get_checkruns_many_runs(self, *args: Any) -> None:
"""Tests that all checkruns can be fetched"""
pr = GitHubPR("pytorch", "pytorch", 77700)
pr = GitHubPR("pytorch", "pytorch", 105260)
conclusions = pr.get_checkrun_conclusions()
self.assertEqual(len(conclusions), 79)
self.assertTrue("pull / linux-docs / build-docs (cpp)" in conclusions.keys())
self.assertEqual(len(conclusions), 221)
self.assertTrue(
"pull / linux-docs / build-docs-cpp-false" in conclusions.keys()
)
def test_cancelled_gets_ignored(self, *args: Any) -> None:
"""Tests that cancelled workflow does not override existing successfull status"""
pr = GitHubPR("pytorch", "pytorch", 82169)
pr = GitHubPR("pytorch", "pytorch", 110367)
conclusions = pr.get_checkrun_conclusions()
lint_checks = [name for name in conclusions.keys() if "Lint" in name]
self.assertTrue(len(lint_checks) > 0)
@ -523,108 +545,7 @@ class TestTryMerge(TestCase):
for case in test_cases:
self.assertEqual(case["expected"], remove_job_name_suffix(case["name"]))
def test_is_broken_trunk(self, *args: Any) -> None:
test_cases: List[Dict[str, Any]] = [
{
"head_job": None,
"base_jobs": {
"job_a": {
"conclusion": "success",
"failure_captures": ["a", "b"],
},
"job_b": {
"conclusion": "failure",
"failure_captures": ["a", "b"],
},
},
"expected": False,
"description": "Invalid input - head job",
},
{
"head_job": {
"conclusion": "failure",
"failure_captures": ["a", "b"],
},
"base_jobs": None,
"expected": False,
"description": "Invalid input - base jobs",
},
{
"head_job": {
"conclusion": "failure",
"failure_captures": ["a", "b"],
},
"base_jobs": {},
"expected": False,
"description": "Invalid input - empty base jobs",
},
{
"head_job": {
"conclusion": "failure",
"failure_captures": ["x", "y"],
},
"base_jobs": {
"job_a": {
"conclusion": "success",
"failure_captures": ["a", "b"],
},
"job_b": {
"conclusion": "failure",
"failure_captures": ["x", "y"],
},
},
"expected": True,
"description": "Found a match",
},
{
"head_job": {
"conclusion": "success",
"failure_captures": ["x", "y"],
},
"base_jobs": {
"job_a": {
"conclusion": "success",
"failure_captures": ["a", "b"],
},
"job_b": {
"conclusion": "failure",
"failure_captures": ["x", "y"],
},
},
"expected": False,
"description": "Not found - different conclusion",
},
{
"head_job": {
"conclusion": "failure",
"failure_captures": ["a", "b"],
},
"base_jobs": {
"job_a": {
"conclusion": "success",
"failure_captures": ["a", "b"],
},
"job_b": {
"conclusion": "failure",
"failure_captures": ["x", "y"],
},
},
"expected": False,
"description": "Not found - different captured failures",
},
]
for case in test_cases:
self.assertEqual(
case["expected"], is_broken_trunk(case["head_job"], case["base_jobs"])
)
def test_get_merge_base(
self,
mock_gh_graphql: Any,
mock_get_rockset_results: Any,
mock_read_flaky_rules: Any,
) -> None:
def test_get_merge_base(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 104121)
mock_merge_base = "mocked-sha"
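Dropping the individually named mock parameters in favor of *args works because stacked @mock.patch decorators inject their mocks bottom-up and this test never inspects them. A minimal illustration with arbitrary patch targets:

from unittest import mock

@mock.patch("os.getcwd", return_value="/tmp")
@mock.patch("os.getpid", return_value=1234)
def show_order(mock_getpid: mock.MagicMock, mock_getcwd: mock.MagicMock) -> None:
    # The decorator closest to the function is injected first.
    assert mock_getpid() == 1234
    assert mock_getcwd() == "/tmp"

show_order()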
@ -642,57 +563,130 @@ class TestTryMerge(TestCase):
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch(
"trymerge.get_drci_classifications", side_effect=mocked_drci_classifications
)
class TestBypassFailures(TestCase):
def test_get_classifications(self, *args: Any) -> None:
flaky_rules = [
# Try a regex rule
FlakyRule("distributed", ["##\\[error\\]The operation [wW]as .+"])
]
pr = GitHubPR("pytorch", "pytorch", 92863)
pr = GitHubPR("pytorch", "pytorch", 109584)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
checks, pr.last_commit()["oid"], pr.get_merge_base(), flaky_rules, []
pr.pr_num,
pr.project,
checks,
[],
)
self.assertTrue(
checks[
"pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.4xlarge)"
"pull / linux-focal-py3.11-clang10 / test (dynamo, 1, 2, linux.2xlarge)"
].classification
== "BROKEN_TRUNK"
)
self.assertTrue(
checks[
"pull / linux-focal-py3.7-gcc7 / test (distributed, 1, 2, linux.2xlarge)"
"trunk / win-vs2019-cpu-py3 / test (default, 2, 3, windows.4xlarge.nonephemeral)"
].classification
== "FLAKY"
)
self.assertTrue(
checks[
"pull / linux-jammy-py3.8-gcc11 / test (distributed, 1, 2, linux.2xlarge)"
].classification
== "FLAKY"
)
self.assertTrue(
checks[
"pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu)"
].classification
== "FLAKY"
)
# Set the threshold larger or equal to the number of ok failures
pending, failed, ignorable = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=2
checks, list(checks.keys()), ok_failed_checks_threshold=6
)
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["FLAKY"]) == 1)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 1)
self.assertTrue(len(ignorable["FLAKY"]) == 4)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 2)
# Not set any threshold, defaults to -1 to ignore all flaky and broken trunk failures
pending, failed, ignorable = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["FLAKY"]) == 1)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 1)
self.assertTrue(len(ignorable["FLAKY"]) == 4)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 2)
# Set the threshold lower than the number of ok failures
pending, failed, ignorable = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=1
)
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 2)
self.assertTrue(len(failed) == 6)
self.assertTrue(len(ignorable["FLAKY"]) == 4)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 2)
# Set the threshold to 0 like when ignore_flaky_failures is on
pending, failed, ignorable = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=1
)
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 6)
self.assertTrue(len(ignorable["FLAKY"]) == 4)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 2)
def test_get_classifications_flaky_fullname(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 110362)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, ignorable = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["FLAKY"]) == 1)
def test_get_classifications_invalid_cancel(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 110367)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, ignorable = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["FLAKY"]) == 0)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 0)
self.assertTrue(len(ignorable["UNSTABLE"]) == 3)
def test_get_classifications_similar_failures(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 109750)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, ignorable = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["FLAKY"]) == 1)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 1)
def test_get_classifications_unstable(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 104312)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
checks, pr.last_commit()["oid"], pr.get_merge_base(), [], []
pr.pr_num,
pr.project,
checks,
[],
)
workflow_name = "linux-bionic-cuda12.1-py3.10-gcc9-bazel-test"
job_name = "build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu, unstable)"
@ -706,19 +700,6 @@ class TestBypassFailures(TestCase):
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["UNSTABLE"]) == 1)
def test_get_classifications_pending_unstable(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 105998)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
checks, pr.last_commit()["oid"], pr.get_merge_base(), [], []
)
pending, failed, ignorable = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=1
)
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 3)
self.assertTrue(len(ignorable["UNSTABLE"]) == 3)
def test_get_classifications_broken_trunk(self, *args: Any) -> None:
# The mock merge base is the actual value returned by gh_fetch_merge_base
test_cases = [
@ -726,13 +707,13 @@ class TestBypassFailures(TestCase):
# This PR had one broken trunk failure but it was run on a different shard
# than the one on the base commit. This should still count as broken trunk
"pr_num": 104214,
"mock_merge_base": "436d035dc74db9c703297a62163b0cad0c546665",
"related_failure_count": 0,
"unrelated_failure_count": 1,
},
{
# This PR had one broken trunk failure and it used ghstack
"pr_num": 105145,
"mock_merge_base": "194fe1d12f9860734cc28ed21bdabda2fbb06336",
"related_failure_count": 0,
"unrelated_failure_count": 1,
},
{
@ -741,112 +722,81 @@ class TestBypassFailures(TestCase):
# keep the failure record from the merge base so that it can
# be used to detect broken trunk
"pr_num": 107160,
"mock_merge_base": "a5d841ef01e615e2a654fb12cf0cd08697d12ccf",
"related_failure_count": 0,
"unrelated_failure_count": 4,
},
{
# This PR used Dr.CI broken trunk classification
"pr_num": 111253,
"related_failure_count": 1,
"unrelated_failure_count": 2,
},
]
for case in test_cases:
pr_num = case["pr_num"]
mock_merge_base = case["mock_merge_base"]
related_failure_count = case["related_failure_count"]
unrelated_failure_count = case["unrelated_failure_count"]
pr = GitHubPR("pytorch", "pytorch", cast(int, pr_num))
with mock.patch(
"trymerge.gh_fetch_merge_base", return_value=mock_merge_base
) as mocked_gh_fetch_merge_base:
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
checks, pr.last_commit()["oid"], pr.get_merge_base(), [], []
)
pr = GitHubPR("pytorch", "pytorch", pr_num)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, _ = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
pending, failed, _ = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == related_failure_count)
# When the ok_failed_checks_threshold is set to 0, the broken trunk failure
# won't be ignored
pending, failed, _ = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=0
)
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == unrelated_failure_count)
# When the ok_failed_checks_threshold is set to 0, the broken trunk failure
# won't be ignored
pending, failed, _ = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=0
)
self.assertTrue(len(pending) == 0)
self.assertTrue(
len(failed) == unrelated_failure_count + related_failure_count
)
def test_ignore_current(self, *args: Any) -> None:
# Test various interactions of the failure classifier to ensure that ignore
# current checks takes place after other classifications: flaky, unstable,
# or broken trunk. Only actual new failures should be kept in the list of
# ignore current checks to use to record force merge with actual failures
flaky_rules = [
FlakyRule("distributed", ["##\\[error\\]The operation was canceled."])
]
flaky = (
"pull / linux-focal-py3.7-gcc7 / test (distributed, 1, 2, linux.2xlarge)"
)
flaky = "pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu)"
broken_trunk = (
"pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.4xlarge)"
"pull / linux-focal-py3.11-clang10 / test (dynamo, 1, 2, linux.2xlarge)"
)
pr = GitHubPR("pytorch", "pytorch", 92863)
pr = GitHubPR("pytorch", "pytorch", 109584)
checks = pr.get_checkrun_conclusions()
# No broken trunk or flaky rules, then all failures are ignored when ic is used
checks = get_classifications(
checks, pr.last_commit()["oid"], None, [], [broken_trunk, flaky]
)
self.assertTrue(checks[flaky].classification == "IGNORE_CURRENT_CHECK")
self.assertTrue(checks[broken_trunk].classification == "IGNORE_CURRENT_CHECK")
_, failed, ignorable = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=2
)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["IGNORE_CURRENT_CHECK"]) == 2)
self.assertTrue(len(ignorable["FLAKY"]) == 0)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 0)
# Known flaky failure takes precedence over ignore current (need to set the
# merge base here to get the results from Rockset, and that categorizes the
# broken trunk failure too
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
pr.last_commit()["oid"],
pr.get_merge_base(),
flaky_rules,
[broken_trunk, flaky],
)
self.assertTrue(checks[flaky].classification == "FLAKY")
self.assertTrue(checks[broken_trunk].classification == "BROKEN_TRUNK")
_, failed, ignorable = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=2
)
_, failed, ignorable = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["IGNORE_CURRENT_CHECK"]) == 0)
self.assertTrue(len(ignorable["FLAKY"]) == 1)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 1)
self.assertTrue(len(ignorable["FLAKY"]) == 4)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 2)
# Broken trunk takes precedence over ignore current (no flaky rule is set here)
checks = get_classifications(
checks,
pr.last_commit()["oid"],
pr.get_merge_base(),
[],
[broken_trunk, flaky],
)
self.assertTrue(checks[flaky].classification == "IGNORE_CURRENT_CHECK")
self.assertTrue(checks[broken_trunk].classification == "BROKEN_TRUNK")
_, failed, ignorable = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=2
)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["IGNORE_CURRENT_CHECK"]) == 1)
self.assertTrue(len(ignorable["FLAKY"]) == 0)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 1)
@mock.patch("trymerge.read_flaky_rules", side_effect=xla_is_flaky_rules)
@mock.patch("trymerge.read_merge_rules", side_effect=xla_merge_rules)
def test_dont_ignore_flaky_failures(self, *args: Any) -> None:
"""Regression test for https://github.com/pytorch/test-infra/issues/4126"""
pr = GitHubPR("pytorch", "pytorch", 100369)
"""
Regression test for https://github.com/pytorch/test-infra/issues/4126
"""
pr = GitHubPR("pytorch", "pytorch", 105312)
repo = DummyGitRepo()
# Check that failure is classified as flaky but still raises exception
with warnings.catch_warnings(record=True) as w, self.assertRaises(RuntimeError):
@ -861,14 +811,97 @@ class TestBypassFailures(TestCase):
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
class TestGitHubPRGhstackDependencies2(TestCase):
@mock.patch("trymerge.get_drci_classifications", return_value={})
class TestBypassFailuresOnSandCastle(TestCase):
def test_get_classifications(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 111467)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, ignorable = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["FLAKY"]) == 1)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 1)
def test_get_classifications_drci_checkrun_not_found(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 111467)
# No summary
checks = pr.get_checkrun_conclusions()
checks[DRCI_CHECKRUN_NAME] = JobCheckState(
DRCI_CHECKRUN_NAME,
"",
"NEUTRAL",
None,
1,
"",
None,
)
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, ignorable = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 2)
# Empty summary
checks = pr.get_checkrun_conclusions()
checks[DRCI_CHECKRUN_NAME] = JobCheckState(
DRCI_CHECKRUN_NAME,
"",
"NEUTRAL",
None,
1,
"",
"",
)
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, ignorable = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 2)
# No Dr.CI checkrun
checks = pr.get_checkrun_conclusions()
del checks[DRCI_CHECKRUN_NAME]
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, ignorable = categorize_checks(checks, list(checks.keys()))
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 2)
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch(
"trymerge.get_drci_classifications", side_effect=mocked_drci_classifications
)
class TestGitHubPRGhstackDependencies(TestCase):
def test_pr_dependencies(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 106068)
msg = pr.gen_commit_message(filter_ghstack=True)
assert msg == (
"[FSDP] Break up `_post_backward_hook` into smaller funcs (#106068)\n\n\nDifferential Revision: ["
"D47852461](https://our.internmc.facebook.com/intern/diff/D47852461)\nPull Request resolved: "
"https://github.com/pytorch/pytorch/pull/106068\nApproved by: \n"
self.assertEqual(
msg,
f"{pr.get_title()} (#106068)\n\n{RE_GHSTACK_DESC.sub('', pr.get_body())}\n"
"Pull Request resolved: https://github.com/pytorch/pytorch/pull/106068\n"
"Approved by: https://github.com/ezyang, https://github.com/fegin\n",
)
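The commit message is now derived from the live PR title and body, with the ghstack stack listing stripped via RE_GHSTACK_DESC. A usage sketch; the sample body is hypothetical, and the exact pattern lives in trymerge.py:

from trymerge import RE_GHSTACK_DESC

body = (
    "Stack from [ghstack](https://github.com/ezyang/ghstack):\n"
    "* #106068\n"
    "* #106034\n"
    "Actual description of the change."
)
# Whatever the regex matches (the stack listing block) is removed,
# leaving only the real description.
print(RE_GHSTACK_DESC.sub("", body))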
def test_pr_dependencies_ghstack(self, *args: Any) -> None:
@ -876,13 +909,13 @@ class TestGitHubPRGhstackDependencies2(TestCase):
pr1 = GitHubPR("pytorch", "pytorch", 106033)
pr2 = GitHubPR("pytorch", "pytorch", 106034)
pr = GitHubPR("pytorch", "pytorch", 106068)
msg = pr.gen_commit_message(filter_ghstack=True, ghstack_deps=[pr0, pr1, pr2])
assert msg == (
"[FSDP] Break up `_post_backward_hook` into smaller funcs (#106068)\n\n\nDifferential Revision: ["
"D47852461](https://our.internmc.facebook.com/intern/diff/D47852461)\nPull Request resolved: "
"https://github.com/pytorch/pytorch/pull/106068\nApproved by: \n"
"ghstack dependencies: #106032, #106033, #106034\n"
self.assertEqual(
msg,
f"{pr.get_title()} (#106068)\n\n{RE_GHSTACK_DESC.sub('', pr.get_body())}\n"
"Pull Request resolved: https://github.com/pytorch/pytorch/pull/106068\n"
"Approved by: https://github.com/ezyang, https://github.com/fegin\n"
"ghstack dependencies: #106032, #106033, #106034\n",
)
@skip(
@ -931,7 +964,7 @@ class TestGitHubPRGhstackDependencies2(TestCase):
mock_repo.cherry_pick.assert_any_call("rev2")
mock_repo.cherry_pick.assert_any_call("rev123")
assert mock.call("rev1") not in mock_repo.cherry_pick.call_args_list
self.assertTrue(mock.call("rev1") not in mock_repo.cherry_pick.call_args_list)
# Verify the first call
message = mock_repo.amend_commit_message.call_args_list[0].args[0]
@ -944,8 +977,8 @@ class TestGitHubPRGhstackDependencies2(TestCase):
"dependencies: #106032, #106033\n"
)
assert message.startswith(prefix)
assert message.endswith(suffix)
self.assertTrue(message.startswith(prefix))
self.assertTrue(message.endswith(suffix))
# Verify the second call
mock_repo.amend_commit_message.assert_any_call(

View File

@ -30,6 +30,7 @@ from github_utils import (
gh_fetch_url,
gh_post_commit_comment,
gh_post_pr_comment,
gh_update_pr_state,
GitHubComment,
)
@ -61,6 +62,7 @@ class JobCheckState(NamedTuple):
classification: Optional[str]
job_id: Optional[int]
title: Optional[str]
summary: Optional[str]
JobNameToStateDict = Dict[str, JobCheckState]
@ -74,29 +76,6 @@ class WorkflowCheckState:
self.jobs: JobNameToStateDict = {}
class FlakyRule:
def __init__(self, name: str, captures: List[str]):
self.name = re.compile(name)
self.captures = [re.compile(r) for r in captures]
def matches(self, job: Optional[Dict[str, Any]]) -> bool:
return (
job is not None
and self.name.search(job.get("name", "")) is not None
and job.get("failure_captures") is not None
and all(
any(
r.search(capture) is not None
for capture in job.get("failure_captures", [])
)
for r in self.captures
)
)
def __repr__(self) -> str:
return f"FlakyRule[name='{self.name}', captures={self.captures}]"
GH_PR_REVIEWS_FRAGMENT = """
fragment PRReviews on PullRequestReviewConnection {
nodes {
@ -141,6 +120,7 @@ fragment PRCheckSuites on CheckSuiteConnection {
detailsUrl
databaseId
title
summary
}
pageInfo {
endCursor
@ -332,6 +312,7 @@ query ($owner: String!, $name: String!, $number: Int!, $cs_cursor: String, $cr_c
detailsUrl
databaseId
title
summary
}
pageInfo {
endCursor
@ -456,6 +437,7 @@ MERGE_RULE_PATH = Path(".github") / "merge_rules.yaml"
ROCKSET_MERGES_COLLECTION = "merges"
ROCKSET_MERGES_WORKSPACE = "commons"
REMOTE_MAIN_BRANCH = "origin/main"
DRCI_CHECKRUN_NAME = "Dr.CI"
INTERNAL_CHANGES_CHECKRUN_NAME = "Meta Internal-Only Changes Check"
HAS_NO_CONNECTED_DIFF_TITLE = (
"There is no internal Diff connected, this can be merged now"
@ -569,6 +551,7 @@ def add_workflow_conclusions(
classification=None,
job_id=checkrun_node["databaseId"],
title=checkrun_node["title"],
summary=checkrun_node["summary"],
)
if bool(checkruns["pageInfo"]["hasNextPage"]):
@ -599,6 +582,7 @@ def add_workflow_conclusions(
classification=None,
job_id=None,
title=None,
summary=None,
)
for job_name, job in no_workflow_obj.jobs.items():
res[job_name] = job
@ -924,6 +908,7 @@ class GitHubPR:
classification=None,
job_id=None,
title=None,
summary=None,
)
return self.conclusions
@ -1261,13 +1246,6 @@ def read_merge_rules(
return [MergeRule(**x) for x in rc]
@lru_cache(maxsize=None)
def read_flaky_rules() -> List[FlakyRule]:
# NOTE: This is currently hardcoded, can be extended to do per repo rules
FLAKY_RULES_URL = "https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/flaky-rules.json"
return _get_flaky_rules(FLAKY_RULES_URL)
def find_matching_merge_rule(
pr: GitHubPR,
repo: Optional[GitRepo] = None,
@ -1298,25 +1276,15 @@ def find_matching_merge_rule(
reject_reason = f"No rule found to match PR. Please [report]{issue_link} this issue to DevX team."
rules = read_merge_rules(repo, pr.org, pr.project)
flaky_rules = read_flaky_rules()
if not rules:
reject_reason = f"Rejecting the merge as no rules are defined for the repository in {MERGE_RULE_PATH}"
raise RuntimeError(reject_reason)
checks = pr.get_checkrun_conclusions()
base_rev = None
try:
# is allowed to fail if git is not available
base_rev = pr.get_merge_base()
except Exception as e:
print(
f"Failed fetching base git revision for {pr.pr_num}. Skipping additional classifications.\n"
f"{type(e)}\n{e}"
)
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
pr.last_commit()["oid"],
base_rev,
flaky_rules,
ignore_current_checks=ignore_current_checks,
)
@ -1467,11 +1435,6 @@ def checks_to_markdown_bullets(
]
@retries_decorator(rc=[])
def _get_flaky_rules(url: str) -> List[FlakyRule]:
return [FlakyRule(**rule) for rule in gh_fetch_json_list(url)]
@retries_decorator()
def save_merge_record(
collection: str,
@ -1575,6 +1538,27 @@ where
return []
@retries_decorator()
def get_drci_classifications(pr_num: int, project: str = "pytorch") -> Any:
"""
Query HUD API to find similar failures to decide if they are flaky
"""
# NB: This doesn't work internally atm because this requires making an
# external API call to HUD
failures = gh_fetch_url(
f"https://hud.pytorch.org/api/drci/drci?prNumber={pr_num}",
data=f"repo={project}",
headers={
"Authorization": os.getenv("DRCI_BOT_KEY", ""),
"Accept": "application/vnd.github.v3+json",
},
method="POST",
reader=json.load,
)
return failures.get(str(pr_num), {}) if failures else {}
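The HUD response is keyed by PR number, and each entry is expected to carry lists of failing jobs grouped by classification, which the predicates further down consume. An illustrative, hypothetical payload shape:

# Hypothetical shape of get_drci_classifications(pr_num=109584, project="pytorch");
# only the "name" field is relied upon by the code below.
drci_classifications = {
    "FLAKY": [
        {"name": "trunk / win-vs2019-cpu-py3 / test (default, 2, 3, windows.4xlarge.nonephemeral)"}
    ],
    "BROKEN_TRUNK": [
        {"name": "pull / linux-focal-py3.11-clang10 / test (dynamo, 1, 2, linux.2xlarge)"}
    ],
    "FAILED": [],
}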
REMOVE_JOB_NAME_SUFFIX_REGEX = re.compile(r", [0-9]+, [0-9]+, .+\)$")
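The regex above strips the shard and runner suffix from a job name so that checks can be matched across shards. The one-line body below is an assumption consistent with the signature shown in the next hunk:

import re

REMOVE_JOB_NAME_SUFFIX_REGEX = re.compile(r", [0-9]+, [0-9]+, .+\)$")

def remove_job_name_suffix(name: str, replacement: str = ")") -> str:
    return re.sub(REMOVE_JOB_NAME_SUFFIX_REGEX, replacement, name)

assert (
    remove_job_name_suffix("test (default, 2, 3, windows.4xlarge.nonephemeral)")
    == "test (default)"
)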
@ -1583,78 +1567,86 @@ def remove_job_name_suffix(name: str, replacement: str = ")") -> str:
def is_broken_trunk(
head_job: Optional[Dict[str, Any]], base_jobs: Optional[Dict[str, Dict[str, Any]]]
name: str,
drci_classifications: Any,
) -> bool:
if not head_job or not base_jobs:
if not name or not drci_classifications:
return False
# Consult the list of broken trunk failures from Dr.CI
return any(
head_job["conclusion"] == base_job["conclusion"]
and head_job["failure_captures"] == base_job["failure_captures"]
for base_job in base_jobs.values()
name == broken_trunk["name"]
for broken_trunk in drci_classifications.get("BROKEN_TRUNK", [])
)
def is_flaky(
name: str,
drci_classifications: Any,
) -> bool:
if not name or not drci_classifications:
return False
# Consult the list of flaky failures from Dr.CI
return any(name == flaky["name"] for flaky in drci_classifications.get("FLAKY", []))
def is_invalid_cancel(
name: str,
conclusion: Optional[str],
drci_classifications: Any,
) -> bool:
"""
After https://github.com/pytorch/test-infra/pull/4579, invalid cancelled
signals have been removed from HUD and Dr.CI. The same needs to be done
here for consistency
"""
if (
not name
or not drci_classifications
or not conclusion
or conclusion.upper() != "CANCELLED"
):
return False
# If a job is cancelled and not listed as a failure by Dr.CI, it's an
# invalid signal and can be ignored
return all(
name != failure["name"] for failure in drci_classifications.get("FAILED", [])
)
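Taken together, the three predicates reduce to name lookups in the Dr.CI payload. A quick sketch against a hand-written classification dict:

drci = {
    "FLAKY": [{"name": "pull / a / test (distributed, 1, 2, linux.2xlarge)"}],
    "BROKEN_TRUNK": [{"name": "pull / b / test (default, 1, 1, linux.4xlarge)"}],
    "FAILED": [{"name": "pull / c / build"}],
}

assert is_flaky("pull / a / test (distributed, 1, 2, linux.2xlarge)", drci)
assert is_broken_trunk("pull / b / test (default, 1, 1, linux.4xlarge)", drci)
# A cancelled job that Dr.CI does not list as failed is an invalid signal.
assert is_invalid_cancel("pull / d / test", "CANCELLED", drci)
assert not is_invalid_cancel("pull / c / build", "CANCELLED", drci)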
def get_classifications(
pr_num: int,
project: str,
checks: Dict[str, JobCheckState],
head_sha: str,
merge_base: Optional[str],
flaky_rules: List[FlakyRule],
ignore_current_checks: Optional[List[str]],
) -> Dict[str, JobCheckState]:
# Group by job name without shard id and suffix to correctly identify broken
# trunk failures, i.e. linux-bionic-cuda12.1-py3.10-gcc9-sm86 / test (default)
head_sha_jobs: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(dict)
merge_base_jobs: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(dict)
# Get the failure classification from Dr.CI, which is the source of truth
# going forward. It's preferable to try calling Dr.CI API directly first
# to get the latest results as well as update Dr.CI PR comment
drci_classifications = get_drci_classifications(pr_num=pr_num, project=project)
print(f"From Dr.CI API: {json.dumps(drci_classifications)}")
if merge_base is not None:
def insert(
d: Dict[str, Dict[str, Dict[str, Any]]],
key: str,
val: Dict[str, Any],
overwrite_failed_run_attempt: bool,
) -> None:
key_no_suffix = remove_job_name_suffix(key)
if key not in d[key_no_suffix]:
d[key_no_suffix][key] = val
return
# When overwrite_failed_run_attempt is set to True, always overwrite
# the job with the result from the latest attempt. This option is for
# jobs from the pull request head_sha where the latest retry is used
# when merging
#
# When overwrite_failed_run_attempt is False, only overwrite the job
# with the result from the latest attempt if the latest retry failed.
# This option is for jobs from the merge_base where we want to record
# failures for broken trunk
if d[key_no_suffix][key]["id"] < val["id"] and (
overwrite_failed_run_attempt or not is_passing_status(val["conclusion"])
):
d[key_no_suffix][key] = val
rockset_results = get_rockset_results(head_sha, merge_base)
for rockset_result in rockset_results:
name = f"{rockset_result['workflow_name']} / {rockset_result['name']}"
if rockset_result["head_sha"] == head_sha:
insert(
head_sha_jobs,
name,
rockset_result,
overwrite_failed_run_attempt=True,
)
else:
insert(
merge_base_jobs,
name,
rockset_result,
overwrite_failed_run_attempt=False,
)
# NB: if the latest results from Dr.CI are not available, i.e. when calling from
# SandCastle, we fall back to any results we can find on the Dr.CI check run summary
if (
not drci_classifications
and DRCI_CHECKRUN_NAME in checks
and checks[DRCI_CHECKRUN_NAME]
and checks[DRCI_CHECKRUN_NAME].summary
):
drci_summary = checks[DRCI_CHECKRUN_NAME].summary
try:
print(f"From Dr.CI checkrun summary: {drci_summary}")
drci_classifications = json.loads(str(drci_summary))
except json.JSONDecodeError as error:
warn("Invalid Dr.CI checkrun summary")
drci_classifications = {}
checks_with_classifications = checks.copy()
for name, check in checks.items():
if check.status == "SUCCESS":
if check.status == "SUCCESS" or check.status == "NEUTRAL":
continue
if "unstable" in name:
@ -1665,13 +1657,13 @@ def get_classifications(
"UNSTABLE",
check.job_id,
check.title,
check.summary,
)
continue
name_no_suffix = remove_job_name_suffix(name)
head_sha_job = head_sha_jobs.get(name_no_suffix, {}).get(name)
if is_broken_trunk(head_sha_job, merge_base_jobs.get(name_no_suffix)):
# NB: It's important to note that when it comes to ghstack and broken trunk classification,
# Dr.CI uses the base of the whole stack
if is_broken_trunk(name, drci_classifications):
checks_with_classifications[name] = JobCheckState(
check.name,
check.url,
@ -1679,12 +1671,34 @@ def get_classifications(
"BROKEN_TRUNK",
check.job_id,
check.title,
check.summary,
)
continue
elif any(rule.matches(head_sha_job) for rule in flaky_rules):
elif is_flaky(name, drci_classifications):
checks_with_classifications[name] = JobCheckState(
check.name, check.url, check.status, "FLAKY", check.job_id, check.title
check.name,
check.url,
check.status,
"FLAKY",
check.job_id,
check.title,
check.summary,
)
continue
elif is_invalid_cancel(name, check.status, drci_classifications):
# NB: Create a new category here for invalid cancelled signals because
# there are usually many of them when they happen. So, they shouldn't
# be counted toward ignorable failures threshold
checks_with_classifications[name] = JobCheckState(
check.name,
check.url,
check.status,
"INVALID_CANCEL",
check.job_id,
check.title,
check.summary,
)
continue
@ -1696,6 +1710,7 @@ def get_classifications(
"IGNORE_CURRENT_CHECK",
check.job_id,
check.title,
check.summary,
)
return checks_with_classifications
@ -1789,6 +1804,7 @@ def try_revert(
if not dry_run:
pr.add_numbered_label("reverted")
gh_post_commit_comment(pr.org, pr.project, commit_sha, revert_msg)
gh_update_pr_state(pr.org, pr.project, pr.pr_num)
def prefix_with_github_url(suffix_str: str) -> str:
@ -1864,6 +1880,8 @@ def categorize_checks(
# ignored anyway. This is useful so we do not need to wait for scarce resources
# like ROCm, which is also frequently in unstable mode
pending_checks.append((checkname, url, job_id))
elif classification == "INVALID_CANCEL":
continue
elif not is_passing_status(check_runs[checkname].status):
target = (
ignorable_failed_checks[classification]
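A simplified model of the threshold semantics exercised by the tests above (not the real implementation): FLAKY and BROKEN_TRUNK failures keep their classification, but once their combined count exceeds ok_failed_checks_threshold they are surfaced as failed as well; the default of -1 ignores them all.

from typing import Dict, List, Tuple

def categorize_simplified(
    classifications: Dict[str, str],  # check name -> "FLAKY", "BROKEN_TRUNK", or ""
    ok_failed_checks_threshold: int = -1,
) -> Tuple[List[str], Dict[str, List[str]]]:
    ignorable: Dict[str, List[str]] = {"FLAKY": [], "BROKEN_TRUNK": []}
    failed: List[str] = []
    for name, cls in classifications.items():
        if cls in ignorable:
            ignorable[cls].append(name)
        else:
            failed.append(name)
    ok_count = sum(len(v) for v in ignorable.values())
    if 0 <= ok_failed_checks_threshold < ok_count:
        # Over the threshold: "ok" failures count as real failures too,
        # though they keep their classification.
        failed += [n for names in ignorable.values() for n in names]
    return failed, ignorable

failed, ignorable = categorize_simplified(
    {"a": "FLAKY", "b": "FLAKY", "c": "BROKEN_TRUNK"}, ok_failed_checks_threshold=1
)
assert len(failed) == 3 and len(ignorable["FLAKY"]) == 2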
@ -1909,7 +1927,8 @@ def merge(
ignore_current: bool = False,
) -> None:
initial_commit_sha = pr.last_commit()["oid"]
print(f"Attempting merge of {initial_commit_sha}")
pr_link = f"https://github.com/{pr.org}/{pr.project}/pull/{pr.pr_num}"
print(f"Attempting merge of {initial_commit_sha} ({pr_link})")
if MERGE_IN_PROGRESS_LABEL not in pr.get_labels():
gh_add_labels(pr.org, pr.project, pr.pr_num, [MERGE_IN_PROGRESS_LABEL])
@ -1974,7 +1993,6 @@ def merge(
start_time = time.time()
last_exception = ""
elapsed_time = 0.0
flaky_rules = read_flaky_rules()
ignore_current_checks = [
x[0] for x in ignore_current_checks_info
] # convert to List[str] for convenience
@ -2007,10 +2025,9 @@ def merge(
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
pr.last_commit()["oid"],
pr.get_merge_base(),
flaky_rules,
ignore_current_checks=ignore_current_checks,
)
pending, failing, _ = categorize_checks(

View File

@ -51,7 +51,7 @@ def post_already_uptodate(
def rebase_onto(
pr: GitHubPR, repo: GitRepo, onto_branch: str, dry_run: bool = False
) -> None:
) -> bool:
branch = f"pull/{pr.pr_num}/head"
remote_url = f"https://github.com/{pr.info['headRepository']['nameWithOwner']}.git"
refspec = f"{branch}:{pr.head_ref()}"
@ -68,6 +68,7 @@ def rebase_onto(
push_result = repo._run_git("push", "-f", remote_url, refspec)
if "Everything up-to-date" in push_result:
post_already_uptodate(pr, repo, onto_branch, dry_run)
return False
else:
gh_post_comment(
pr.org,
@ -78,18 +79,21 @@ def rebase_onto(
+ "git pull --rebase`)",
dry_run=dry_run,
)
return True
def rebase_ghstack_onto(
pr: GitHubPR, repo: GitRepo, onto_branch: str, dry_run: bool = False
) -> None:
) -> bool:
if (
subprocess.run(
[sys.executable, "-m", "ghstack", "--help"], capture_output=True
[sys.executable, "-m", "ghstack", "--help"],
capture_output=True,
check=False,
).returncode
!= 0
):
subprocess.run([sys.executable, "-m", "pip", "install", "ghstack"])
subprocess.run([sys.executable, "-m", "pip", "install", "ghstack"], check=True)
orig_ref = f"{re.sub(r'/head$', '/orig', pr.head_ref())}"
repo.fetch(orig_ref, orig_ref)
@ -115,8 +119,9 @@ def rebase_ghstack_onto(
if dry_run:
print("Don't know how to dry-run ghstack")
return False
else:
ghstack_result = subprocess.run(["ghstack"], capture_output=True)
ghstack_result = subprocess.run(["ghstack"], capture_output=True, check=True)
push_result = ghstack_result.stdout.decode("utf-8")
print(push_result)
if ghstack_result.returncode != 0:
@ -166,6 +171,8 @@ def rebase_ghstack_onto(
in push_result
):
post_already_uptodate(pr, repo, onto_branch, dry_run)
return False
return True
def additional_rebase_failure_info(e: Exception) -> str:
@ -222,9 +229,10 @@ def main() -> None:
try:
if pr.is_ghstack_pr():
with git_config_guard(repo):
rebase_ghstack_onto(pr, repo, onto_branch, dry_run=args.dry_run)
rc = rebase_ghstack_onto(pr, repo, onto_branch, dry_run=args.dry_run)
else:
rebase_onto(pr, repo, onto_branch, dry_run=args.dry_run)
rc = rebase_onto(pr, repo, onto_branch, dry_run=args.dry_run)
sys.exit(0 if rc else 1)
except Exception as e:
msg = f"Rebase failed due to {e}"

View File

@ -114,7 +114,8 @@ def main() -> None:
# query to see if a pr already exists
params = {
"q": f"is:pr is:open in:title author:pytorchmergebot repo:{OWNER}/{REPO} {args.repo_name} hash update"
"q": f"is:pr is:open in:title author:pytorchupdatebot repo:{OWNER}/{REPO} {args.repo_name} hash update",
"sort": "created",
}
response = git_api("/search/issues", params)
if response["total_count"] != 0:

View File

@ -36,9 +36,10 @@ concurrency:
{%- macro setup_ec2_windows() -%}
!{{ display_ec2_information() }}
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: seemethere/add-github-ssh-key@v1
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell

View File

@ -55,12 +55,14 @@ jobs:
uses: ./.github/workflows/_binary-build-linux.yml
with:!{{ upload.binary_env_as_input(config) }}
{%- if "aarch64" in build_environment %}
runs_on: linux.t4g.2xlarge
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "conda" in build_environment and config["gpu_arch_type"] == "cuda" %}
runs_on: linux.24xlarge
{%- endif %}
build_name: !{{ config["build_name"] }}
build_environment: !{{ build_environment }}
{%- if config.pytorch_extra_install_requirements is defined %}
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
{%- endif %}
secrets:
@ -74,7 +76,7 @@ jobs:
build_name: !{{ config["build_name"] }}
build_environment: !{{ build_environment }}
{%- if "aarch64" in build_environment %}
runs_on: linux.t4g.2xlarge
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif config["gpu_arch_type"] == "rocm" %}
runs_on: linux.rocm.gpu
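The added |d('')|length > 0 guard emits the variable only when pytorch_extra_install_requirements is both defined and non-empty; is defined alone would pass an empty string through. A quick check of that behavior with stock Jinja2 delimiters (the templates themselves use customized ones):

from jinja2 import Template

tmpl = Template(
    "{%- if reqs is defined and reqs|d('')|length > 0 %}"
    "PYTORCH_EXTRA_INSTALL_REQUIREMENTS: {{ reqs }}"
    "{%- endif %}"
)

assert tmpl.render() == ""                   # undefined: skipped
assert tmpl.render(reqs="") == ""            # defined but empty: skipped
assert "numpy" in tmpl.render(reqs="numpy")  # non-empty: emitted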

View File

@ -58,9 +58,12 @@ jobs:
{%- for config in build_configs %}
!{{ config["build_name"] }}-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: macos-12-xl
runs-on: !{{ macos_runner }}
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config, true) }}
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
{%- endif %}
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@ -69,11 +72,15 @@ jobs:
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-x86_64.sh
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
if [ -d "/Applications/Xcode_14.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_14.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
elif [ -d "/Applications/Xcode_13.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
fi
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
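The installer URL now keys on $(uname -m), so the same step works on Intel and Apple Silicon runners. The equivalent selection in Python, with the version string copied from the workflow above:

import platform

MINICONDA_VERSION = "Miniconda3-py310_23.5.2-0"

def miniconda_macos_url() -> str:
    # platform.machine() mirrors `uname -m`: "x86_64" on Intel, "arm64" on Apple Silicon
    return f"https://repo.anaconda.com/miniconda/{MINICONDA_VERSION}-MacOSX-{platform.machine()}.sh"

print(miniconda_macos_url())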

View File

@ -68,5 +68,6 @@
aws-pytorch-uploader-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }}
aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
{%- endmacro %}

View File

@ -59,6 +59,9 @@ jobs:
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config, True) }}
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
{%- endif %}
steps:
!{{ common.setup_ec2_windows() }}
!{{ set_runner_specific_vars() }}

View File

@ -29,6 +29,7 @@ env:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}

View File

@ -29,6 +29,7 @@ env:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
@ -157,7 +158,7 @@ jobs:
# run gradle buildRelease
(echo "./.circleci/scripts/build_android_gradle.sh" | docker exec \
-e BUILD_ENVIRONMENT="pytorch-linux-focal-py3-clang7-android-ndk-r19c-gradle-build" \
-e BUILD_ENVIRONMENT="pytorch-linux-focal-py3-clang9-android-ndk-r21e-gradle-build" \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \

View File

@ -33,6 +33,7 @@ env:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
@ -120,8 +121,7 @@ jobs:
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
REENABLED_ISSUES: ${{ needs.filter.outputs.reenabled-issues }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
@ -147,6 +147,7 @@ jobs:
-e GITHUB_JOB \
-e GITHUB_RUN_NUMBER \
-e GITHUB_RUN_ATTEMPT \
-e JOB_ID \
-e GIT_DEFAULT_BRANCH="$GIT_DEFAULT_BRANCH" \
-e SHARD_NUMBER \
-e NUM_TEST_SHARDS \
@ -157,8 +158,6 @@ jobs:
-e TORCH_CUDA_ARCH_LIST \
-e OUR_GITHUB_JOB_ID \
-e CUDA_VERSION \
-e PYTORCH_RETRY_TEST_CASES \
-e PYTORCH_OVERRIDE_FLAKY_SIGNAL \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
@ -184,7 +183,7 @@ jobs:
shell: bash
if: always() && steps.test.conclusion
run: |
cat test/**/*.log || true
cat test/**/*_toprint.log || true
- name: Chown workspace
uses: ./.github/actions/chown-workspace

View File

@ -15,7 +15,7 @@ on:
required: false
default: linux.12xlarge
type: string
description: Hardware to run this "build" job on, linux.12xlarge or linux.t4g.2xlarge.
description: Hardware to run this "build" job on, linux.12xlarge or linux.arm64.2xlarge.
ALPINE_IMAGE:
required: false
type: string
@ -140,6 +140,7 @@ jobs:
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}
@ -159,10 +160,12 @@ jobs:
- name: Clean workspace
shell: bash
run: |
set -eux
rm -rf "${GITHUB_WORKSPACE}"
mkdir "${GITHUB_WORKSPACE}"
if [[ inputs.build_environment == 'linux-aarch64-binary-manywheel' ]]; then
if [[ ${{ inputs.build_environment }} == 'linux-aarch64-binary-manywheel' ]]; then
rm -rf "${RUNNER_TEMP}/artifacts"
mkdir "${RUNNER_TEMP}/artifacts"
fi

View File

@ -62,7 +62,7 @@ on:
runs_on:
required: true
type: string
description: Hardware to run this job on. Valid values are linux.4xlarge, linux.4xlarge.nvidia.gpu, linux.t4g.2xlarge, and linux.rocm.gpu
description: Hardware to run this job on. Valid values are linux.4xlarge, linux.4xlarge.nvidia.gpu, linux.arm64.2xlarge, and linux.rocm.gpu
secrets:
github-token:
required: true
@ -128,6 +128,7 @@ jobs:
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}

Some files were not shown because too many files have changed in this diff.