pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 13:44:15 +08:00

Author	SHA1	Message	Date
Gerard Goossen	c2c0a32155	Remove setting logger level in caffe2.python.checkpoint (#19803 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19803 There is no reason to set a specific logging level for this module. Removing it to just use the default logging level. Differential Revision: D15098834 fbshipit-source-id: 1654c04500c19690ddde03343f2e84b04bb0f1ef	2019-05-10 07:00:58 -07:00
Qinqing Zheng	17dac3e17f	Create class constant for string literal 'blob_names' Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10827 Reviewed By: boryiingsu Differential Revision: D9484567 fbshipit-source-id: 275eddc9406b5f427d72c0ab9b0da481b5e59ece	2018-08-24 22:11:43 -07:00
Shihao Xu	04b773ab87	Support Loading to GPU (#10710 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10710 Can't resume from checkpoint for workflows that use GPU. The problem is just we didn't leverage the already-provided GPU deserialization of Caffe2. `keep_device` arg of LoadOp. See https://fburl.com/y27ltaxw How a serialized BlobProto (contraining TensorProto) is loaded into GPU memory? - Load BlobProto from DB. https://fburl.com/pe1qaeyf - Deserialize the BlobProto into a Blob instance. https://fburl.com/5dirjuuh and https://fburl.com/stoho0x1 - Call Blob->Deserialized. https://fburl.com/bnureu32 - Deserializer Registration. https://fburl.com/wbu95ry7 https://fburl.com/ycetud8u - Create TensorCUDA Deserializer. https://fburl.com/2lirfuqj - Create Tensor on GPU and get TensorProto of BlobProto. https://fburl.com/7dre82zg - Copy TensorProto in CPU to Tensor on GPU. https://fburl.com/fr0qk2oe Cloned the GPU workflows for testing in D9125520. Reviewed By: mraway Differential Revision: D9372950 fbshipit-source-id: 2bf70747bd71e8da16239197f7d2761d63f09ff8	2018-08-21 13:57:36 -07:00
Bram Wasti	aa56a1211d	Update from facebook (#6871 ) * Track checkpoint performance in scuba As title. * [C2/CUDA]: fix cross entropy sigmoid with logits when adding log_d_trick, I forgot to add it to the cuda impl; this diff fixes it. * Back out "[caffe2] Unregister MKL fallbacks for NCHW conversions" Original commit changeset: 8918dd40205a Will land after @jongsoo's diff https://phabricator.intern.facebook.com/D7596315 lands * [Easy][C2] Don't add blob to external outputs from output_record if it's already external output As desc. * On Mobile phones, call GlobalInit with no arguments in predictor in case we need to perform initialization FACEBOOK: The QPL logger needs the initialization code. In the past, the initialization code is put in the pipeline calling Caffe2. However, those places become obsolete quickly, as the product teams change places to call Caffe2 from time to time. We also need to track which teams use Caffe2 so that we can put the initialization code there. With this diff, the initialization code is put in the predictor constructor, only enabled for mobile phones. This way, we can always enable QPL logging. Once we do this, we can check how many times Caffe2 inference is called in production, and which models are more popular in production. This way, we can prioritize our effort supporting those models. Will clean up the old code calling the init in the product in a separate diff. * add padding op for sparse length tensor to pad length-based sparse tensor with padding_value * Add conv_op with cudaconvnet engine Add conv_op with cudaconvnet engine * [numa] Fix simple NUMA copy benchmark Move XavierFill into init_net and also compute BW * call roundf (device function) instead of round (host function) * [caffe2_benchmark][observer] Make caffe2_benchmark use its own observer 1. Add ClearGlobalNetObservers() 2. Make caffe2_benchmark use its own observer and observer_reporter * [detectron] Use roundf instead of round in the detectron module ops * allow K larger than number of elements in top k op one use case is to use this op together with PackSegments for sparse tensors, where the number of elements in each slice is not statistically defined. * add ChannelShuffle DNNLOWP op * fixup math_cpu.cc break	2018-04-23 15:01:56 -07:00
Qinqing Zheng	90586d925f	[DT] [38/n] Rename add_stop_signal to add_stop_condition (#6825 ) att	2018-04-23 10:39:37 -07:00
Qinqing Zheng	66791f54d5	Update the compile function of Job (#6323 )	2018-04-09 22:44:23 -07:00
Qinqing Zheng	fd2e7cb487	Change JobRunner's __call__ function to train (#6205 )	2018-04-02 21:04:36 -07:00
Qinqing Zheng	365652229d	Back out "Revert D7372460: [DT] [28/n] Lift epoch_limiter" Original commit changeset: b0a986d16c3b	2018-03-30 21:00:44 -07:00
Andrey Malevich	f8eb8a66e2	Revert D7372460: [DT] [28/n] Lift epoch_limiter This reverts commit 05bd9bec10fad5ff9dc40be88836fd7274d50ce9 @bypass-lint An infra SEV is better than not reverting this diff. If you copy this password, see you in SEV Review! @cause_a_sev_many_files	2018-03-30 21:00:44 -07:00
Orion Reblitz-Richardson	1d5780d42c	Remove Apache headers from source. * LICENSE file contains details, so removing from individual source files.	2018-03-27 13:10:18 -07:00
Qinqing Zheng	1288c4fd79	refactor epoch_limiter (#2389 ) * refactor epoch_limiter * fix test	2018-03-22 20:32:13 -07:00
Qinqing Zheng	b3fdfa7bd6	[DT] [4/n] Make epoch_group explicit for JobRunner (#2018 )	2018-02-23 10:41:52 -08:00
Kittipat Virochsiri	6f533fd8b8	Only overwrite path_prefix & path_type when not None Summary: This breaks internal functionality Reviewed By: aartibasant Differential Revision: D6975222 fbshipit-source-id: ce751950b4b9217d8ea5de703690451e98642f00	2018-02-13 14:40:35 -08:00
Aarti Basant	28f42cc8e7	separating set_params and init() for checkpoint managers. Summary: separating set_params and init() for checkpoint managers. Reviewed By: anshulverma Differential Revision: D6852255 fbshipit-source-id: 061f16ce0c49953ca8a5fe9546af5c9945a3be48	2018-02-05 18:03:21 -08:00
Qinqing Zheng	90a3363f29	Return an empty TaskGroup if node managers exist in MultiNodeCheckpointManager Summary: Current MultiNodeCheckpointManager return None in this case, yet in JobRunner we assume this function returns a valid task group, i.e. we call session.run(self.checkpoint_manager.init(...)) directly. This will fail the case we use LocalHostScheduler and reuse a MultiNodeCheckpointManager Reviewed By: azzolini Differential Revision: D6843450 fbshipit-source-id: a7ec942cfe692f19e8751b0078ae6a6108f29e54	2018-01-30 19:20:50 -08:00
Aarti Basant	fc56e86c7d	Introduce init API for the optional Checkpoint Metadata Handler object Summary: Every call to the checkpoint_metadata_handler write() API requires us to pass all params like db_prefix, db_type etc. Introducing an init API in the checkpoint_metadata_handler so that such params can be saved and need not be passed in every API call Reviewed By: mraway, anshulverma Differential Revision: D6792651 fbshipit-source-id: 059fa4309e8fce1ee5ab009af3e0570573c24245	2018-01-24 15:19:55 -08:00
Wei Zhang	1d4e996b87	Separate parameter downloading tasks from training tasks and run them in a different group Summary: At the end of distributed training, trainer needs to download the parameters back from parameter servers for saving the model. Currently, this parameter downloading happens at the end of job's epoch task group, which creates several problems when checkpointing is enabled for distributed training: 1. When checkpointing is enabled, we run multiple training epochs. At the end of each epoch, the model download tasks will run to collect parameters, but we won't save the model until the true end of training, so there is a big waste of resource. 2. After trainer0 downloads the parameters, these parameters take a lot of memory, so trainer0 can easily run out of memory in the next epoch of training. Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group. Reviewed By: azzolini Differential Revision: D6765393 fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49	2018-01-22 14:04:12 -08:00
Aarti Basant	33d734fcf1	Generalize construction of db_name in checkpoint manager Summary: Instead of constructing db_name as a member of checkpoint_manager, generalize this function Reviewed By: anshulverma Differential Revision: D6671088 fbshipit-source-id: c528538def66933619f2fdf67820bca5d13571ea	2018-01-10 11:49:17 -08:00
Aarti Basant	8af9f0da99	Saving checkpoint failure should not cause job failure Summary: If we encounter failures while writing a checkpoint, ensure that the job does not fail. A job can make progress even if writing a checkpoint fails Reviewed By: anshulverma, boryiingsu Differential Revision: D6615163 fbshipit-source-id: 01f790422e1a81bab1fe73f86750eaf75a72bb77	2017-12-21 10:32:55 -08:00
Aarti Basant	5de880f3e1	Resume from epoch instead of re-starting a worklow from scratch when we retry Reviewed By: anshulverma Differential Revision: D6354076 fbshipit-source-id: d2bee93a1136fb07c46942649e90110d2e3ccb0e	2017-11-17 12:51:07 -08:00
Anshul Verma	4b8669b087	Write checkpoint info to XDB at the end of an epoch Summary: In this diff I am making sure that the checkpoint metadata is written out to the db for every epoch. This will allow us to automatically resume from a epoch if a workflow fails. Reviewed By: aartibasant Differential Revision: D6234832 fbshipit-source-id: f09a4de118f2eac25f663556476ac6313925fdf3	2017-11-09 11:13:24 -08:00
Lei Chen	58bcf76ba3	Have model downloading as a separate plan Summary: For distributed offline training, downloading parameters from trainer_0 is part of epoch plan. However for distributed realtime training, we publish model by a specific time interval, so we need run multiple iterations for epoch plan before publishing the model. In this diff, I split downloading parameters from epoch plan as a separate plan, so we can explicitly execute it before model publishing for distributed online training. Reviewed By: boryiingsu Differential Revision: D5995122 fbshipit-source-id: 47d61d7b8c57cfae156e79b7ec32068ef579d7c3	2017-10-16 16:03:48 -07:00
Dmytro Dzhulgakov	2972a6ca02	Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" Summary: This reverts commit 95c634872ac02be721257169e38c8fead04cd66b bypass-lint Differential Revision: D6026557 fbshipit-source-id: 663c28583ce3b01070ff5449115ed7e222f71776	2017-10-12 20:21:52 -07:00
Luke Yeager	75bece6ede	Fix "No handlers could be found for logger" Summary: Closes https://github.com/caffe2/caffe2/pull/1316 Differential Revision: D6026557 Pulled By: Yangqing fbshipit-source-id: 95c634872ac02be721257169e38c8fead04cd66b	2017-10-10 22:32:13 -07:00
Yangqing Jia	8286ce1e3a	Re-license to Apache Summary: Closes https://github.com/caffe2/caffe2/pull/1260 Differential Revision: D5906739 Pulled By: Yangqing fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902	2017-09-28 16:22:00 -07:00
Andrew Dye	2070467c57	Allow CheckpointManager init() and load() to use a different db type with path_prefix Summary: CheckpointManager already accepts a path_prefix override for init() and load(), but it assumes the same db_type passed in __init__(). This change adds an optional path_type for each call. Reviewed By: boryiingsu Differential Revision: D5888152 fbshipit-source-id: 21cd31a62a0188fe0e0b19b43c3b232c2342d0a8	2017-09-22 09:48:29 -07:00
Aarti Basant	77a02eaa7f	Enable reader checkpoint Summary: Reader checkpointing was disabled due to bug captured in T21143272 Now that we have resolved that issue, re-enabling reader checkpointing Reviewed By: boryiingsu, rayleichen Differential Revision: D5730545 fbshipit-source-id: 7fae48b03e07eaf530bfc9e8e8b6683d8ed4e206	2017-09-05 14:21:25 -07:00
Bor-Yiing Su	b3536a3a6d	Adds checkpoint taskgroups to the online trainer. Summary: 1. Uses the upload_builder in the offline training. 2. Adds the checkpoint taskgroups to the online trainer. 3. Changes the naming rules so that the model checkpoint has the format of <directory>/<entity_id>_<snapshot_id>.<node_name>.<snapshot_id> Reviewed By: rayleichen Differential Revision: D5665068 fbshipit-source-id: a8103aed2ca195a506174d2a1d50611d2f1d9c35	2017-08-19 04:09:47 -07:00
Bor-Yiing Su	1d70a2276d	Changes the checkpoint naming rules. Summary: So far the we format the epoch name with 6 digits, but this is constraining. In order to have consistent naming, we can simply append the epoch to the suffix. Then we will have consistent naming rules for small and for large epoch numbers. Reviewed By: azzolini Differential Revision: D5653871 fbshipit-source-id: acdf26a14b731347bb85fe2f33c1b89e2ba83bdd	2017-08-17 22:16:42 -07:00
Bor-Yiing Su	49ec942825	Temporarily disables the checkpoints for the readers. Summary: The hive reader checkpoints are broken because of D5582328. This breaks our offline simulator test as well. This is a temporary fix that disables the checkpoints for readers. Reviewed By: azzolini Differential Revision: D5637719 fbshipit-source-id: 4f31ae534cb7e981fcacbb721cbb2420249fad91	2017-08-15 19:36:11 -07:00
Bor-Yiing Su	404f8ee9b4	Extends the jobrunner to support uploading checkpoints. Summary: 1. Adds one more step in the JobRunner class to upload checkpoints. 2. Adds one function to return the name of the checkpoint given the name of the node. Reviewed By: andrewwdye Differential Revision: D5597130 fbshipit-source-id: 570a55785e6227859e1115326d6cab077f0e7f72	2017-08-11 14:17:17 -07:00
Bor-Yiing Su	81a55f441c	Adds interfaces to check the existence of a DB Summary: To evaluate on checkpoints, we often need to load from multiple checkpoints. However, it is inconvenient if we always need to check the existence of a checkpoint manually. Adds interfaces to check the existence of a DB so that we can find available checkpoints automatically. Reviewed By: azzolini Differential Revision: D4823876 fbshipit-source-id: e5a65b736ac2addd0447c4add81dbd0986f422e7	2017-04-11 14:07:49 -07:00
Bor-Yiing Su	8f9cd757db	Skips the initialization phase of the individual checkpoint objects. Summary: The initialization phase of each checkpoint object simply loads the nanmes of the blobs in the checkpoints. When we load from the checkpoints, the names of the blobs are given. We can skip this init step. Reviewed By: azzolini Differential Revision: D4808114 fbshipit-source-id: 4c740049c1014f3e93b4b87f43e3937afdefa25a	2017-03-31 10:10:56 -07:00
Bor-Yiing Su	0e6413f8ea	Fix flaky test Summary: Somehow the stress-runs flag does not work as what I expected. Now the test finally passes. Reviewed By: azzolini Differential Revision: D4797559 fbshipit-source-id: 1e46844e9ae55c331c2e265a59dc550983274213	2017-03-29 16:48:20 -07:00
Aaron Markham	58f7f2b441	doxygen python block added Summary: Closes https://github.com/caffe2/caffe2/pull/226 Differential Revision: D4793550 Pulled By: JoelMarcey fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e	2017-03-29 06:46:16 -07:00
Bor-Yiing Su	7fa4acab9b	Loads only the model blobs from the checkpoints. Summary: To evaluate from checkpoints, we need to load a model from the checkpoints. However, the checkpoints store way more blobs than the blobs needed by the model. This function enables the model builder to load only the blobs associated with the model to the workspace. After that, the model builder can evaluate the model from the populated workspace. Reviewed By: azzolini Differential Revision: D4751414 fbshipit-source-id: a7a420228d681fc2dcfd8573cf69a97b1abc2ef3	2017-03-27 10:02:11 -07:00
Alisson Gusatti Azzolini	6ff05fd49d	Fix issues pickling jobs Summary: We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session. This creates a concept of "compiled" Job, that pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling. Reviewed By: dzhulgakov Differential Revision: D4554799 fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984	2017-02-21 20:47:27 -08:00
Alisson Gusatti Azzolini	14a5b35805	Snapshot -> Checkpoint Summary: As per kennyhorror request. Reviewed By: kennyhorror Differential Revision: D4473177 fbshipit-source-id: 6cab6ccf247b09aab8f6f056c807bd3ed27ee6a5	2017-01-27 22:29:32 -08:00

38 Commits