pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Xu Han	b2553a6ec4	[AOTI] raise PyTorchStreamWriter open failed error code on windows (#162799 ) While debugging the AOTI UT `TestAOTInductorPackage_cpu::test_add`, I found that no verbose error code was output when PyTorchStreamWriter failed to open a file. This PR adds the verbose error code output for debugging. A local test shows that the error code is 32; the Windows error codes are listed at https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499- ``` ERROR_SHARING_VIOLATION 32 (0x20) The process cannot access the file because it is being used by another process. ``` The issue is caused by the file being opened by another process. I fixed the same issue for zip open in PR: https://github.com/pytorch/pytorch/pull/162617 but I still have no idea how to open a file with shared access in `std::ofstream`. I will continue researching it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162799 Approved by: https://github.com/jansel	2025-09-13 01:41:14 +00:00
cyy	3c2324c64a	[2/N] Fix cppcoreguidelines-init-variables suppression (#146237 ) This PR removes all `cppcoreguidelines-init-variables` suppressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146237 Approved by: https://github.com/ezyang	2025-06-19 23:26:42 +00:00
Xintong Hu	a6182903cd	Update PyTorchStreamReader API to take cpu allocator override (#150439 ) Summary: Add allocator param in getRecord Test Plan: newly added UT ``` buck test caffe2/caffe2/serialize:inline_container_test ``` Differential Revision: D72252585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150439 Approved by: https://github.com/albanD	2025-04-18 01:53:14 +00:00
Richard Barnes	536c0c7a47	[codemod][lowrisk] Remove unused exception parameter from caffe2/aten/src/ATen/cuda/CUDABlas.cpp (#149328 ) Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it. This: ``` try { ... } catch (exception& e) { // no use of e } ``` should instead be written as ``` } catch (exception&) { ``` If the code compiles, this is safe to land. Test Plan: Sandcastle Reviewed By: dtolnay Pull Request resolved: https://github.com/pytorch/pytorch/pull/149328 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-03-19 02:05:33 +00:00
Mikayla Gawarecki	be0ceee1c3	Make record/storage alignment in torch.save configurable (#147788 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147788 Approved by: https://github.com/albanD ghstack dependencies: #147786, #147787	2025-03-06 12:04:46 +00:00
Mikayla Gawarecki	001e355a56	Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 ) ## Background This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non-lexicographical. A `.format_version` entry was added to the zipfile and `calculate_storage_offsets` will only work on checkpoints with `.format_version`. When this is turned on, for `torch.load(mmap=True)`, offsets of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine this. The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases. `6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)` ## How does this work The format for the checkpoint is as such ``` archive_name/ |_ data.pkl |_.format_version |_byteorder |_data/ |_ 0 |_ 1 |_ 2 |_ ... |_ ``` Each `data/i` record represents a storage, where storages are written in the order that the Pickler encounters them. For each storage, our `persistent_load` logic saves the following metadata to the pickle file `dtype, numel, key, location` where `numel` is the number of bytes in the storage.
Note that we always use `miniz` writer in the zip64 mode per [here](`7796e308d0/caffe2/serialize/inline_container.cc (L701)`). A zipfile record written by miniz looks as such ``` ---------------- ----------------- ------------------- ---------------- --------- ------------------------------ | 30 byte header | n byte filename | zip64_extra_data | m byte padding | storage | 16 or 24 byte local dir footer | ---------------- ----------------- ------------------- ---------------- --------- ------------------------------ ``` - The header size (30) is given by [`MZ_ZIP_LOCAL_DIR_HEADER_SIZE`](https://github.com/pytorch/pytorch/blob/main/third_party/miniz-3.0.2/miniz.c?fbclid=IwZXh0bgNhZW0CMTEAAR2O8Vysd--UoSCxW70gabXIS1dbz733oHwuUQ5_Ff1hY2WU6PL2i6CSH4A_aem_J9oaU2HpDeWtJKOU9EnVqw#L3290) - filename will be `"{archive_name}/{filepath}"` - `zip64_extra_data` is determined by [`mz_zip_writer_create_zip64_extra_data`](`7796e308d0/third_party/miniz-3.0.2/miniz.c (L6202)`). Note that [we only create zip64_extra_data if storage_size >= 0xFFFFFFFF or the offset of the start of the header >= 0xFFFFFFFF](`7796e308d0/third_party/miniz-3.0.2/miniz.c (L6519-L6524)`) - `m` is determined by [`getPadding`](`7796e308d0/caffe2/serialize/inline_container.cc (L254)`), which accounts for filename and zip64_extra_data to determine `m` such that the start of `storage` is aligned to 64 bytes. The `m` bytes will always start with `F B padding_size` as the first 4 bytes - The local dir footer size is determined based on [this snippet](`7796e308d0/third_party/miniz-3.0.2/miniz.c (L6610-L6632)`): if the buffer size is 0 it is skipped. If the zip64_extra_data was created, it is 24, otherwise it is 16.
When `torch.utils.serialization.config.load.calculate_storage_offsets` is set we do the following - We keep track of where the "cursor" is in the file using `current_offset`, after each persistent_load call, it will be at the offset where the header for the next record starts - for the 0th storage, "data/0", we use the regular get_record_offset to determine the start of the storage - for any other storage, (where the storages will be in order encountered by the unpickler, 0, 1, 2, 3, ...) we use `get_record_offset_no_read`, which re-uses the `getPadding` logic to determine the offset of the storage - Note that `load_tensor` will only ever be called again with the same key if the storage's `._data_ptr()` is 0 [[pointer1](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1917-L1918)][[pointer2](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1936-L1937)], so we cache the offsets for this edge case - After each storage, if the storage is non-zero, we account for the local dir footer based on the logic described above ## Testing strategy The agreed upon testing strategy was as follows: - Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False) - This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested. Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880 Approved by: https://github.com/albanD ghstack dependencies: #143879	2025-01-31 17:09:20 +00:00
cyy	116af809eb	Use std::string_view (#145906 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906 Approved by: https://github.com/albanD	2025-01-30 03:14:27 +00:00
PyTorch MergeBot	9010649292	Revert "Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 )" This reverts commit db3685a35cdce32622ab89f6c92e09d52210ff53. Reverted https://github.com/pytorch/pytorch/pull/143880 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but either this PR or the base PR breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/143880#issuecomment-2617743403))	2025-01-28 03:07:17 +00:00
Mikayla Gawarecki	db3685a35c	Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 ) ## Background This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non-lexicographical. A `.format_version` entry was added to the zipfile and `calculate_storage_offsets` will only work on checkpoints with `.format_version`. When this is turned on, for `torch.load(mmap=True)`, offsets of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine this. The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases. `6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)` ## Testing strategy The agreed upon testing strategy was as follows: - Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False) - This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested. Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880 Approved by: https://github.com/albanD ghstack dependencies: #143879	2025-01-27 23:57:30 +00:00
cyy	bffaddf9ea	Format caffe2/serialize (#141850 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141850 Approved by: https://github.com/cpuhrsch	2024-12-04 01:14:24 +00:00
Richard Barnes	fca0f34b83	Switch c10::string_view to std::string_view (#139635 ) Shortens `string_view_starts_with` to `starts_with`. Adds some missing headers. Isolates `c10_string_view` to use with `get_fully_qualified_name`. Test Plan: Sandcastle Reviewed By: ezyang Differential Revision: D64833558 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139635 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-11-27 01:41:18 +00:00
FFFrog	300ca6368f	Remove deprecated alias macro(2/3) (#137559 ) Detailed Descriptions: - Remove AT_ASSERTM Macro Pull Request resolved: https://github.com/pytorch/pytorch/pull/137559 Approved by: https://github.com/ezyang	2024-11-01 06:17:57 +00:00
cyyever	8ace3e8023	Add sv starts/ends_with (#139261 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139261 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-11-01 01:17:42 +00:00
Wouter Devriendt	bae3426af7	reimport pr137735 due to merging check issues (#138959 ) This is a cherry-pick from #137735 by @mikaylagawarecki , that cannot be merged due to a (wrongly) failing check for codev @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959 Approved by: https://github.com/mikaylagawarecki	2024-10-27 16:31:34 +00:00
PyTorch MergeBot	dd32a32cb6	Revert "Expose option to disable CRC-32 computation during `torch.save` (#137735 )" This reverts commit 534fa96f2d9a4feb1dcdfaecb3d73990db60f819. Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))	2024-10-16 17:03:06 +00:00
Mikayla Gawarecki	534fa96f2d	Expose option to disable CRC-32 computation during `torch.save` (#137735 ) Option only works in open source, not internal Pull Request resolved: https://github.com/pytorch/pytorch/pull/137735 Approved by: https://github.com/albanD	2024-10-15 19:30:02 +00:00
PyTorch MergeBot	cd292908e5	Revert "Make c10::string_view an alias of std::string_view (#130417 )" This reverts commit c48fe8901114aa2b0a9c2d77f915a2ad8ab2098b. Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/clee2000 due to breaking some internal tests, probably usages of string_view that need to be changed? ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2414775064))	2024-10-15 18:55:09 +00:00
cyy	c48fe89011	Make c10::string_view an alias of std::string_view (#130417 ) In order to facilitate the migration from c10::string_view to std::string_view, the old c10::string_view was renamed to c10::string_view_ext. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417 Approved by: https://github.com/ezyang	2024-10-14 09:28:04 +00:00
PyTorch MergeBot	564d00f364	Revert "Fix clang-tidy warnings in Caffe2 code (#134935 )" This reverts commit 7cfd23636c8fa6fcbb8bf3ea34e15b847ec9ad9d. Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))	2024-09-13 16:42:37 +00:00
cyy	7cfd23636c	Fix clang-tidy warnings in Caffe2 code (#134935 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935 Approved by: https://github.com/ezyang	2024-09-12 03:27:09 +00:00
PyTorch MergeBot	7ce5b5767c	Revert "Make c10::string_view an alias of std::string_view (#130417 )" This reverts commit c9551a3f50efc8163d8508a3c2189536528577ac. Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/izaitsevfb due to depends on #130009 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2224212227))	2024-07-12 00:37:04 +00:00
cyy	c9551a3f50	Make c10::string_view an alias of std::string_view (#130417 ) Follows #130009 to further facilitate the migration from c10::string_view to std::string_view. The old c10::string_view was renamed to c10::string_view_ext. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417 Approved by: https://github.com/ezyang	2024-07-11 12:31:06 +00:00
cyy	e4c32d14a8	[3/N] Remove inclusion of c10/util/string_utils.h (#128504 ) Follows #128372 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128504 Approved by: https://github.com/malfet	2024-06-15 06:38:40 +00:00
Mikayla Gawarecki	bbdbfe3661	Reland add `write_record_metadata` to PyTorchFileWriter (#126087 ) Reland of https://github.com/pytorch/pytorch/pull/125184 with compiler warning fixed by extending `m_pWrite` rather than adding `m_pSeek` to miniz API Differential Revision: [D57287327](https://our.internmc.facebook.com/intern/diff/D57287327) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126087 Approved by: https://github.com/albanD	2024-05-14 21:48:44 +00:00
PyTorch MergeBot	ccbac091d2	Revert "Add `write_record_metadata` to PyTorchFileWriter (#125184 )" This reverts commit dd92637f445d2787f83829079276f71b1ad1fc7c. Reverted https://github.com/pytorch/pytorch/pull/125184 on behalf of https://github.com/izaitsevfb due to breaks internal builds, see D56962076 ([comment](https://github.com/pytorch/pytorch/pull/125184#issuecomment-2094976897))	2024-05-05 22:40:00 +00:00
Stefan-Alin Pahontu	bebefcf845	Driver folder check (#117548 ) Added extra check for driver folders for Libtorch, as stat struct does not recognize driver folders, so torch.save should work for them as well. (e.g. save model.pt directly under C: ) Fixes [#111121](https://github.com/pytorch/pytorch/issues/111121) and #105488 Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117548 Approved by: https://github.com/malfet	2024-05-03 09:10:11 +00:00
Mikayla Gawarecki	dd92637f44	Add `write_record_metadata` to PyTorchFileWriter (#125184 ) Add `PyTorchFileWriter.write_record_metadata(record_name, num_bytes)` that - writes the zipfile header/end of central directory metadata for an entry* - reserves `num_bytes` in the zipfile for the payload. *Since the payload is not provided, the CRC32 computation is skipped and 0s are written in the corresponding entry of the zipfile header Pull Request resolved: https://github.com/pytorch/pytorch/pull/125184 Approved by: https://github.com/albanD	2024-05-03 07:29:52 +00:00
Zhijing Li (Accelerator Enablement)	87082bd025	Reduce single reader check time for inline_container (#113328 ) Differential Revision: D51089711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113328 Approved by: https://github.com/jiayisuse	2023-11-09 22:02:28 +00:00
Ayham Tannous	be66d5e845	Add file name and size to the serialization metadata logging (#113077 ) Summary: To get more info on serialization/deserialization events, add these two fields to the metadata logging: - file_name - file_size Test Plan: buck2 test mode/dev caffe2/caffe2/serialize:inline_container_test Reviewed By: davidberard98 Differential Revision: D51040426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113077 Approved by: https://github.com/davidberard98	2023-11-09 11:14:24 +00:00
Zhijing Li (Accelerator Enablement)	55971c5c4e	Enable concurrent reader for getRecord function (#112818 ) Summary: Use multiple concurrent readers to access a record from different start indices. This can provide better performance when the data being accessed is large. bypass-github-pytorch-ci-checks Test Plan: ``` buck2 run @//mode/dev //caffe2/caffe2/serialize:inline_container_test ``` Reviewed By: YazhiGao Differential Revision: D50957607 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112818 Approved by: https://github.com/houseroad, https://github.com/huydhn	2023-11-03 22:55:27 +00:00
PyTorch MergeBot	2d5fec4d59	Revert "Enable concurrent reader for getRecord function (#111426 )" This reverts commit 12a6f5aa6bf3e11668293c36b436eead2f3b8614. Reverted https://github.com/pytorch/pytorch/pull/111426 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111426#issuecomment-1791733096))	2023-11-03 00:22:21 +00:00
Zhijing Li (Accelerator Enablement)	12a6f5aa6b	Enable concurrent reader for getRecord function (#111426 ) Summary: Zion-4s core has poor perf when reading a large tensor (e.g. 300G), whether downloading from manifold or reading from files. In this diff, I changed the getRecord function from single-threaded to multi-threaded by passing multiple readers to the getRecord function and accessing different chunks of the same record with different readers. We control the number of additional readers with the `sigrid_model_manager_additional_reader` flag. The default value is 0. When `additional_reader=2`, we allocate `2` extra read client threads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111426 Approved by: https://github.com/jiayisuse	2023-11-02 22:07:04 +00:00
Shiyan Deng	3acaf8564d	[easy] use number of param bytes as the chunk size if it's not provided (#111844 ) Summary: ATT Test Plan: CI Differential Revision: D50572228 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111844 Approved by: https://github.com/zyan0, https://github.com/houseroad	2023-10-24 23:56:33 +00:00
cyy	ac603bc2f8	[Reland] Eliminate invocations of c10::stoi,c10::stod,c10::stoull,c10::stoll (#109566 ) This is reland of #87603 with definitions of c10::stoXX kept for further investigation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109566 Approved by: https://github.com/huydhn	2023-09-19 07:15:25 +00:00
PyTorch MergeBot	4d44d8c00a	Revert "Eliminate c10::stoi,c10::stod,c10::stoull,c10::stoll (#109179 )" This reverts commit 852f1b8417e80b72a7d1c4a772f66af28da02913. Reverted https://github.com/pytorch/pytorch/pull/109179 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this is breaking periodic buck build, so please fix the issue and reland the change https://github.com/pytorch/pytorch/actions/runs/6207458526/job/16852695272 ([comment](https://github.com/pytorch/pytorch/pull/109179#issuecomment-1724168571))	2023-09-18 18:41:12 +00:00
cyy	852f1b8417	Eliminate c10::stoi,c10::stod,c10::stoull,c10::stoll (#109179 ) We can remove these functions in favor of std ones. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109179 Approved by: https://github.com/colesbury	2023-09-16 07:22:50 +00:00
Lujia Zhang	a6fadf643f	Re-do D48544397: [TGIF Inplace] [xlv2][1/n] Expose a couple APIs from inline_container that will be used for chunk read" (#109183 ) Summary: Original commit changeset: 4a5f31518ad0 Original Phabricator Diff: D48544397 fix easycla Differential Revision: D49221088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109183 Approved by: https://github.com/wqfish	2023-09-14 08:17:14 +00:00
Shiyan Deng	d471eaeb1d	fix inline_container.cc inplace loading (#108573 ) Summary: bypass-github-pytorch-ci-checks bypass-github-export-checks force-merge-on-github Differential Revision: D48971847 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108573 Approved by: https://github.com/wqfish	2023-09-06 00:02:42 +00:00
Lujia Zhang	b897c57d47	[TGIF][Inplace][Perf] Copy tensor to device with pinned memory & move copy weight sleep to getRecord (#106849 ) Summary: There are 2 changes in the diff that help optimize perf during inplace update: 1. Read data with pinned memory 2. Move the copy weight sleep from between copying the whole Tensor to between copying chunks Test Plan: Local Test ``` ./ai_infra/inference_platform/test_platform/script/run_sigrid_4card.sh --port 7451 --local_model_dir /home/lujia/script --cuda_devices 6 --bind_node 3 --model_id 962549778_514 --gflag_config_path sigrid/predictor/predictor_x_gflags_mrs_prospector_gpu_torchscript_fusedsolution_1card_opt_fm -- --enable_thrift_warmup=false --tgif_replicate_merge_by_tempfile=false --enable_inplace_snapshot_transition --model_version_config_path sigrid/predictor/models_version/lujia_test --inplace_update_max_retries 0 --submod_to_device="merge|cuda0" ``` Load test on job tsp_eag/smart/inference_platform_sp__sigrid_predictor_gpu_adhoc_realtimetest_m962549778_latest.s3 Before: (p99 latency) {F1066957232} (SR error rate) {F1066957650} After: (p99 latency) {F1066957141} (SR error rate) {F1066957376} Differential Revision: D48182533 Pull Request resolved: https://github.com/pytorch/pytorch/pull/106849 Approved by: https://github.com/842974287, https://github.com/kit1980	2023-08-13 07:37:46 +00:00
Aleksei Nikiforov	c42fd73cf9	Add functions to get and set default endianness in load() functions (#101973 ) By default interpret tensor data as native endian, but add an option to interpret data as little endian or big endian. Related to #101688 Pull Request resolved: https://github.com/pytorch/pytorch/pull/101973 Approved by: https://github.com/mikaylagawarecki	2023-07-06 20:12:56 +00:00
atannous	b469ed72d0	Integrating new API usage metadata logger (#101762 ) Summary: The new logger allows passing metadata into the api usage logger. The immediate use case is to pass the serialization_id to the save and load events to enable tracking of serialized models in API events. It could be extended to add more metadata in the future. Test Plan: ``` buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test ``` Reviewed By: davidberard98 Differential Revision: D45683697 Pull Request resolved: https://github.com/pytorch/pytorch/pull/101762 Approved by: https://github.com/davidberard98	2023-05-26 00:24:26 +00:00
atannous	149237415f	Using deterministic hashing instead of GUID for pytorch serialization id generation (#101964 ) Summary: serialization_id was added in a previous change and written as a random GUID generated each time a module is saved, for the purpose of tracking saved artifacts. In order not to disturb existing systems that rely on the serialized bytes being deterministic when serializing the same module, this change uses the combined hash of uncompressed content and file names instead of a GUID for the serialization id. This hashing reuses the same CRC32 that is already calculated for zip writing, so it doesn't incur additional computational overhead. Data descriptor is one of the file headers inside the zip format https://en.wikipedia.org/wiki/ZIP_(file_format)#Data_descriptor. It contains the CRC32 of the uncompressed data. By inspecting the written data in PyTorchStreamWriter, the CRC32 is found for each written record. In order to make serialization_id a unique and deterministic id for the serialized files without computation overhead, the updated `serialization_id` is computed based on all files written, and is composed of: 1) a combined hash of record name hashes 2) a combined crc32 of the record uncompressed data Example value: "15656915541136177431866432772" Test Plan: buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test Differential Revision: D46038973 Pull Request resolved: https://github.com/pytorch/pytorch/pull/101964 Approved by: https://github.com/davidberard98	2023-05-23 20:47:30 +00:00
atannous	3ed1569e86	Adding serialization ID to inline container (#100994 ) Summary: In order to better track models after serialization, this change writes a serialization_id as a UUID to the inline container. Having this ID enables traceability of the model in saving and loading events. serialization_id is generated as a new UUID every time serialization takes place. It can be thought of as a model snapshot identifier at the time of serialization. Test Plan: ``` buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test ``` Local tests: ``` buck2 run @//mode/opt //scripts/atannous:example_pytorch_package buck2 run @//mode/opt //scripts/atannous:example_pytorch buck2 run @//mode/opt //scripts/atannous:example_pytorch_script ``` ``` $ unzip -l output.pt Archive: output.pt Length Date Time Name --------- ---------- ----- ---- 36 00-00-1980 00:00 output/.data/serialization_id 358 00-00-1980 00:00 output/extra/producer_info.json 58 00-00-1980 00:00 output/data.pkl 261 00-00-1980 00:00 output/code/__torch__.py 326 00-00-1980 00:00 output/code/__torch__.py.debug_pkl 4 00-00-1980 00:00 output/constants.pkl 2 00-00-1980 00:00 output/version --------- ------- 1045 7 files ``` ``` unzip -p output.pt "output/.data/serialization_id" a9f903df-cbf6-40e3-8068-68086167ec60 ``` Differential Revision: D45683657 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100994 Approved by: https://github.com/davidberard98	2023-05-17 17:08:48 +00:00
Hongyi Jia	23a095ca5f	Chunked inplace weight loading API (#100615 ) Chunking inplace memory writing to save memory further Reviewed By: zyan0 Differential Revision: D45506186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100615 Approved by: https://github.com/davidberard98	2023-05-04 17:41:18 +00:00
Hongyi Jia	f558bb6f76	inplace PyTorchStreamReader getRecord() (#100418 ) Summary: Sometimes we want to getRecord into pre-allocated memory to save CPU memory. This adds a new API to support in-place memory writing. Test Plan: caffe2/serialize/inline_container_test Reviewed By: zyan0 Differential Revision: D45439517 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100418 Approved by: https://github.com/davidberard98, https://github.com/houseroad	2023-05-04 01:30:59 +00:00
mikey dagitses	531b8e8f1e	stop using caffe2/core/logging.h forwarding header in serialize lib (#98168 ) No need to create a library for this useless header. Differential Revision: [D44612668](https://our.internmc.facebook.com/intern/diff/D44612668/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98168 Approved by: https://github.com/PaliC	2023-04-06 21:27:07 +00:00
Nikita Shulga	a229e78544	[BE] Enforce sign-compare (#96723 ) A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds. Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase. The only tricky part in this PR is making sure that non-ASCII character detection works for both signed and unsigned chars here: `6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)` Exclude several files from sign-compare if flash attention is used, due to the violation in cutlass, to be fixed by https://github.com/NVIDIA/cutlass/pull/869 Do not try to fix sign-compare violations in the caffe2 codebase Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723 Approved by: https://github.com/albanD	2023-03-15 06:04:20 +00:00
Han Qi	b8ba4802fe	Add an option to skip loading of debug traces (#91430 ) Summary: Debug traces consume a lot of memory, especially for small models. Test Plan: Unit test Reviewers: Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/91430 Approved by: https://github.com/davidberard98	2022-12-29 22:53:17 +00:00
Nikita Shulga	caaf37a111	Fix `PyTorchStreamWriter` exception handling (#88128 ) Avoid a double exception in the destructor when attempting to serialize to a Python object that does not have a `write` method. Use the `Finalizer` class in `PyTorchStreamWriter::writeEndOfFile()` to always set the `finalized_` property even if an exception occurs (as there isn't much one can do at this point). Add an explicit check for the attribute to `_open_zipfile_writer_buffer` and add unit tests. Modernize the code a bit by using the Python 3 `super()` method. Fixes https://github.com/pytorch/pytorch/issues/87997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88128 Approved by: https://github.com/albanD	2022-10-31 23:38:03 +00:00
Daniil Kutz	9213751970	Add exception handler for stoull in caffe2 (#77557 ) Hi! I was playing with libfuzzer and found a bug when loading a model from a file via the `torch::jit::load` function. There is an unhandled exception in caffe2/serialize when calling the `stoull` function on an unsanitized version string. The bug can be reproduced with the `aot_model_compiler` binary: ``` aot_model_compiler --model=crash-stoull --model_name=name --model_version=1 --input_dims='1,3,224,224;2,2' --input_types='float;float' ``` Crash file is provided in [crash.zip](https://github.com/pytorch/pytorch/files/8701504/crash.zip). gdb output: ``` Temporary breakpoint 1, main (argc=6, argv=0x7ffcd160f9f8) at /pytorch_master/binaries/aot_model_compiler.cc:87 87 "Run NNC AOT compiler for pytorch model. Example usage:\n" (gdb) c Continuing. terminate called after throwing an instance of 'std::invalid_argument' what(): stoull Program received signal SIGABRT, Aborted. __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007fa637f16859 in __GI_abort () at abort.c:79 #2 0x00007fa6381c1911 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6 #3 0x00007fa6381cd38c in ?? 
() from /lib/x86_64-linux-gnu/libstdc++.so.6 #4 0x00007fa6381cd3f7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6 #5 0x00007fa6381cd6a9 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6 #6 0x00007fa6381c42ce in std::__throw_invalid_argument(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6 #7 0x000000000247d567 in __gnu_cxx::__stoa<unsigned long long, unsigned long long, char, int> (__str=0x7ffcd160f228 "ZZ", __idx=0x0, __base=10, __convf=<optimized out>, __name=<optimized out>) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/ext/string_conversions.h:83 #8 std::__cxx11::stoull (__str="ZZ", __idx=0x0, __base=10) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:6577 #9 caffe2::serialize::PyTorchStreamReader::init (this=this@entry=0x8c11ce0) at /pytorch_master/caffe2/serialize/inline_container.cc:145 #10 0x000000000247d9c7 in caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader (this=0x8c11ce0, in=std::shared_ptr<class caffe2::serialize::ReadAdapterInterface> (empty) = {...}) at /pytorch_master/caffe2/serialize/inline_container.cc:88 #11 0x00000000035b7ba4 in __gnu_cxx::new_allocator<caffe2::serialize::PyTorchStreamReader>::construct<caffe2::serialize::PyTorchStreamReader, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > ( __p=0x2, __args=..., this=<optimized out>) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/ext/new_allocator.h:150 #12 std::allocator_traits<std::allocator<caffe2::serialize::PyTorchStreamReader> >::construct<caffe2::serialize::PyTorchStreamReader, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (__a=..., __p=0x2, __p@entry=0x8c11ce0, __args=...) 
at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/alloc_traits.h:512 #13 0x00000000035b1988 in std::_Sp_counted_ptr_inplace<caffe2::serialize::PyTorchStreamReader, std::allocator<caffe2::serialize::PyTorchStreamReader>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (this=0x8c11cd0, __a=..., __args=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:551 #14 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<caffe2::serialize::PyTorchStreamReader, std::allocator<caffe2::serialize::PyTorchStreamReader>, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (this=0x7ffcd160f3a8, __p=@0x7ffcd160f3a0: 0x10, __args=..., __a=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:683 #15 std::__shared_ptr<caffe2::serialize::PyTorchStreamReader, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<caffe2::serialize::PyTorchStreamReader>, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (this=0x7ffcd160f3a0, __args=..., __tag=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1371 #16 std::shared_ptr<caffe2::serialize::PyTorchStreamReader>::shared_ptr<std::allocator<caffe2::serialize::PyTorchStreamReader>, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (this=0x7ffcd160f3a0, __args=..., __tag=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr.h:408 #17 std::allocate_shared<caffe2::serialize::PyTorchStreamReader, std::allocator<caffe2::serialize::PyTorchStreamReader>, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (__args=..., __a=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr.h:859 #18 std::make_shared<caffe2::serialize::PyTorchStreamReader, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (__args=...) 
at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr.h:875 #19 torch::jit::load (rai=std::shared_ptr<class caffe2::serialize::ReadAdapterInterface> (empty) = {...}, device=device@entry=..., Python Exception <class 'gdb.error'> No type named std::__detail::_Hash_node<struct std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, true>.: extra_files=std::unordered_map with 0 elements) at /pytorch_master/torch/csrc/jit/serialization/import.cpp:474 #20 0x00000000035b1ef6 in torch::jit::load (filename="crash-stoull", device=device@entry=..., Python Exception <class 'gdb.error'> No type named std::__detail::_Hash_node<struct std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, true>.: extra_files=std::unordered_map with 0 elements) at /pytorch_master/torch/csrc/jit/serialization/import.cpp:444 #21 0x00000000035b1d22 in torch::jit::load (filename="", device=device@entry=...) at /pytorch_master/torch/csrc/jit/serialization/import.cpp:424 #22 0x00000000008f9be3 in main (argc=1, argv=0x7ffcd160f9f8) at /pytorch_master/binaries/aot_model_compiler.cc:128 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/77557 Approved by: https://github.com/Gamrix	2022-08-10 23:56:15 +00:00
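The backtrace above shows `std::stoull` throwing `std::invalid_argument` on the string "ZZ" with nothing to catch it, so the process aborts. A minimal sketch of the kind of guard the commit title describes could look like the following. It is not the actual patch from #77557: the helper name `parse_version` and the choice to return `std::nullopt` rather than raise a project-specific error are assumptions made for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of PyTorch): converts a version string to an
// integer and reports failure instead of letting the exception escape.
std::optional<std::uint64_t> parse_version(const std::string& text) {
  try {
    // std::stoull throws std::invalid_argument when no digits can be parsed
    // and std::out_of_range when the value does not fit in the result type.
    return static_cast<std::uint64_t>(std::stoull(text));
  } catch (const std::invalid_argument&) {
    return std::nullopt;  // e.g. "ZZ", as in the fuzzer-generated crash file
  } catch (const std::out_of_range&) {
    return std::nullopt;  // e.g. a digit string too large for 64 bits
  }
}
```

A caller would check the returned optional and reject the archive with a readable error when it is empty, rather than relying on `std::terminate`.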

84 Commits