fix typos (#3549)
@@ -26,7 +26,7 @@ You will also learn how to setup a few requirements needed for ensuring your env
 ## Configuring the Environment
-Before any training can be performed, a Accelerate config file must exist in the system. Usually this can be done by running the following in a terminal and answering the prompts:
+Before any training can be performed, an Accelerate config file must exist in the system. Usually this can be done by running the following in a terminal and answering the prompts:
 ```bash
 accelerate config
@@ -52,7 +52,7 @@ os._exit(00) # Restart the notebook
 ## Preparing the Dataset and Model
-Next you should prepare your dataset. As mentioned at earlier, great care should be taken when preparing the `DataLoaders` and model to make sure that **nothing** is put on *any* GPU.
+Next you should prepare your dataset. As mentioned earlier, great care should be taken when preparing the `DataLoaders` and model to make sure that **nothing** is put on *any* GPU.
 If you do, it is recommended to put that specific code into a function and call that from within the notebook launcher interface, which will be shown later.
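For context, a minimal sketch of the pattern this hunk documents: everything that touches a GPU goes inside a function handed to `notebook_launcher`. The function body, its arguments, and the process count are assumptions for a two-GPU machine.

```python
from accelerate import Accelerator, notebook_launcher

def training_loop(mixed_precision="fp16", seed=42):
    # Build the DataLoaders and model *inside* this function so that
    # nothing is placed on a GPU before the launcher spawns its processes.
    accelerator = Accelerator(mixed_precision=mixed_precision)
    accelerator.print(f"Launched on {accelerator.num_processes} processes (seed={seed})")

# num_processes=2 is an assumption for a 2-GPU machine.
notebook_launcher(training_loop, ("fp16", 42), num_processes=2)
```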
@@ -153,7 +153,7 @@ To use [`find_executable_batch_size`], restructure your training function to inc
 <Tip warning={true}>
-The inner function **must** take batch size as the first parameter, but we do not pass one to it when called. The wrapper will handles this for you. Any object (models, optimizers) that consumes device memory and is passed to the [`Accelerator`] also **must** be declared inside the inner function.
+The inner function **must** take batch size as the first parameter, but we do not pass one to it when called. The wrapper will handle this for you. Any object (models, optimizers) that consumes device memory and is passed to the [`Accelerator`] also **must** be declared inside the inner function.
 </Tip>
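A minimal sketch of the decorated inner-function pattern the tip refers to; the starting batch size and the `build_model` / `build_optimizer` / `train` helpers are hypothetical placeholders.

```python
from accelerate import Accelerator
from accelerate.utils import find_executable_batch_size

def training_function():
    accelerator = Accelerator()

    @find_executable_batch_size(starting_batch_size=128)
    def inner_training_loop(batch_size):
        nonlocal accelerator              # reuse the outer Accelerator
        accelerator.free_memory()         # release references from a failed (OOM) attempt
        # Declare everything that consumes device memory inside the inner function:
        model = build_model()                  # hypothetical helper
        optimizer = build_optimizer(model)     # hypothetical helper
        model, optimizer = accelerator.prepare(model, optimizer)
        train(model, optimizer, batch_size)    # hypothetical training loop

    inner_training_loop()  # called without a batch size; the wrapper supplies it
```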
@@ -109,7 +109,7 @@ While FSDP require an explicit `--fsdp_cpu_ram_efficient_loading true` to activa
 <Tip>
 For FSDP, whenever setting `--fsdp_cpu_ram_efficient_loading true`, `accelerate` will automatically set `sync_module_states` to true.
-For RAM efficient loading the weights will be loaded only in a singe rank, and thus requires `sync_module_states` to broadcast weights to other ranks.
+For RAM efficient loading the weights will be loaded only in a single rank, and thus requires `sync_module_states` to broadcast weights to other ranks.
 </Tip>
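A hedged sketch of the same pair of settings expressed through the Python API, assuming `FullyShardedDataParallelPlugin` exposes kwargs mirroring the CLI flags (verify against your `accelerate` version).

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Assumed kwargs mirroring --fsdp_cpu_ram_efficient_loading / sync_module_states.
fsdp_plugin = FullyShardedDataParallelPlugin(
    cpu_ram_efficient_loading=True,  # load weights on a single rank only
    sync_module_states=True,         # broadcast the loaded weights to the other ranks
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```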
@@ -125,7 +125,7 @@ FSDP requires an explicit `--fsdp_auto_wrap_policy` for the algorithm to decide
 ### Parameters Summoning
-FSDP requires an explicit `--fsdp_use_orig_params` flag if using `torch.compile`, see [the pytorch documenation](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp). For DeepSpeed this is transparent to the user.
+FSDP requires an explicit `--fsdp_use_orig_params` flag if using `torch.compile`, see [the pytorch documentation](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp). For DeepSpeed this is transparent to the user.
 <Tip>
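A small sketch of how the flag maps onto the Python API when combining FSDP with `torch.compile`; the model and the point at which compilation happens are assumptions.

```python
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(use_orig_params=True)  # needed for torch.compile
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

model = torch.nn.Linear(8, 8)          # placeholder model
model = accelerator.prepare(model)     # FSDP wrapping happens here
model = torch.compile(model)           # compile the wrapped model (assumed ordering)
```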
@@ -147,7 +147,7 @@ Deepspeed requires explicit `--gradient_accumulation_steps` and `--gradient_clip
 ## On Differences in Data Precision Handling
-To discuss the how data precision is handled in both FSDP and Deepspeed, it is instructive to first give an overview of how model parameters are handled in these frameworks. Before the model / optimizer parameters are distributed across GPUs, parameter preparation is involved to first "flatten" them to one-dimensional [`torch.Tensor`](https://pytorch.org/docs/stable/tensors.html#torch-tensor). The implementation of FSDP / DeepSpeed varies in the respect of the `dtype` in which these "flattened" parameters are stored, and there are ramifications with regards to how [`torch.Optimizer`](https://pytorch.org/docs/stable/optim.html#module-torch.optim) allocate their `dtype`s. The table below outlines the processes for both frameworks; the "Local" column indicates the process occurring at a per-gpu level, therefore any memory overheads by upcasting should be understood to be amortized by the number of gpus used.
+To discuss how data precision is handled in both FSDP and Deepspeed, it is instructive to first give an overview of how model parameters are handled in these frameworks. Before the model / optimizer parameters are distributed across GPUs, parameter preparation is involved to first "flatten" them to one-dimensional [`torch.Tensor`](https://pytorch.org/docs/stable/tensors.html#torch-tensor). The implementation of FSDP / DeepSpeed varies in the respect of the `dtype` in which these "flattened" parameters are stored, and there are ramifications with regards to how [`torch.Optimizer`](https://pytorch.org/docs/stable/optim.html#module-torch.optim) allocate their `dtype`s. The table below outlines the processes for both frameworks; the "Local" column indicates the process occurring at a per-gpu level, therefore any memory overheads by upcasting should be understood to be amortized by the number of gpus used.
 <Tip>
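To make the "flatten, then pick a storage dtype" step described above concrete, here is a toy illustration in plain PyTorch (not the actual FSDP/DeepSpeed code).

```python
import torch

# Two bf16 parameter tensors, as they might exist after a 16-bit model load.
params = [torch.randn(4, 4, dtype=torch.bfloat16), torch.randn(8, dtype=torch.bfloat16)]

# "Flatten" them into a single one-dimensional tensor ...
flat_param = torch.cat([p.reshape(-1) for p in params])

# ... and upcast a master copy to fp32 for the optimizer, which is where the
# per-GPU memory overhead discussed in the table comes from.
master_flat_param = flat_param.to(torch.float32).requires_grad_()
optimizer = torch.optim.AdamW([master_flat_param], lr=1e-3)
```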
@@ -166,7 +166,7 @@ Optimizer (Actual Step) | ✅ | FSDP<br>DeepSpeed | occurs in `torch_dtype` <br
 <Tip warning={true}>
-Therefore when using DeepSpeed a small number of GPUs, be aware of potentially significant memory overheads due to the upcasting during preperation.
+Therefore when using DeepSpeed a small number of GPUs, be aware of potentially significant memory overheads due to the upcasting during preparation.
 </Tip>
@@ -164,7 +164,7 @@ The following arguments are useful for selecting which training paradigm to use.
 The following arguments are only useful when `multi_gpu` is passed or multi-gpu training is configured through `accelerate config`:
-* `--gpu_ids` (`str`) -- What GPUs (by id) should be used for training on this machine as a comma-seperated list
+* `--gpu_ids` (`str`) -- What GPUs (by id) should be used for training on this machine as a comma-separated list
 * `--same_network` (`bool`) -- Whether all machines used for multinode training exist on the same local network.
 * `--machine_rank` (`int`) -- The rank of the machine on which this script is launched.
 * `--main_process_ip` (`str`) -- The IP address of the machine of rank 0.
@@ -19,7 +19,7 @@ rendered properly in your Markdown viewer.
 [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) enables training large transformer language models at scale.
 It provides efficient tensor, pipeline and sequence based model parallelism for pre-training transformer based
 Language Models such as [GPT](https://arxiv.org/abs/2005.14165) (Decoder Only), [BERT](https://arxiv.org/pdf/1810.04805.pdf) (Encoder Only) and [T5](https://arxiv.org/abs/1910.10683) (Encoder-Decoder).
-For detailed information and how things work behind the scene please refer the github [repo](https://github.com/NVIDIA/Megatron-LM).
+For detailed information and how things work behind the scene please refer to the github [repo](https://github.com/NVIDIA/Megatron-LM).
 ## What is integrated?
@@ -30,7 +30,7 @@ a. **Tensor Parallelism (TP)**: Reduces memory footprint without much additional
 Each tensor is split into multiple chunks with each shard residing on separate GPU. At each step, the same mini-batch of data is processed
 independently and in parallel by each shard followed by syncing across all GPUs (`all-reduce` operation).
 In a simple transformer layer, this leads to 2 `all-reduces` in the forward path and 2 in the backward path.
-For more details, please refer research paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using
+For more details, please refer to the research paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using
 Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) and
 this section of blogpost [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#tensor-parallelism).
@@ -45,7 +45,7 @@ this section of blogpost [The Technology Behind BLOOM Training](https://huggingf
 c. **Sequence Parallelism (SP)**: Reduces memory footprint without any additional communication. Only applicable when using TP.
 It reduces activation memory required as it prevents the same copies to be on the tensor parallel ranks
-post `all-reduce` by replacing then with `reduce-scatter` and `no-op` operation would be replaced by `all-gather`.
+post `all-reduce` by replacing them with `reduce-scatter` and `no-op` operation would be replaced by `all-gather`.
 As `all-reduce = reduce-scatter + all-gather`, this saves a ton of activation memory at no added communication cost.
 To put it simply, it shards the outputs of each transformer layer along sequence dimension, e.g.,
 if the sequence length is `1024` and the TP size is `4`, each GPU will have `256` tokens (1024/4) for each sample.
@@ -56,7 +56,7 @@ d. **Data Parallelism (DP)** via Distributed Optimizer: Reduces the memory footp
 (versus the traditional method of replicating the optimizer state across data parallel ranks).
 For example, when using Adam optimizer with mixed-precision training, each parameter accounts for 12 bytes of memory.
 This gets distributed equally across the GPUs, i.e., each parameter would account for 3 bytes (12/4) if we have 4 GPUs.
-For more details, please refer the research paper [ZeRO: Memory Optimizations Toward Training Trillion
+For more details, please refer to the research paper [ZeRO: Memory Optimizations Toward Training Trillion
 Parameter Models](https://arxiv.org/pdf/1910.02054.pdf) and following section of blog
 [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#zero-data-parallelism).
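The 12-bytes-per-parameter arithmetic quoted above can be written out as a quick back-of-the-envelope helper; illustrative only, using only the figures given in the text.

```python
def optimizer_state_bytes_per_gpu(num_params: int, dp_size: int, bytes_per_param: int = 12) -> float:
    """Adam with mixed precision keeps ~12 bytes of optimizer state per parameter;
    a distributed optimizer shards that state across the data-parallel ranks."""
    return num_params * bytes_per_param / dp_size

# A 1B-parameter model on 4 data-parallel GPUs -> ~3 GB of optimizer state per GPU.
print(optimizer_state_bytes_per_gpu(1_000_000_000, 4) / 1e9)
```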
@@ -66,7 +66,7 @@ For example, for GPT-3, this leads to 70% reduction in required memory for activ
 only 2.7% FLOPs overhead for recomputation of activations. For more details, please refer to the research paper
 [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/pdf/2205.05198.pdf).
 f. **Fused Kernels**: Fused Softmax, Mixed Precision Fused Layer Norm and Fused gradient accumulation to weight gradient computation of linear layer.
 PyTorch JIT compiled Fused GeLU and Fused Bias+Dropout+Residual addition.
 g. **Support for Indexed datasets**: Efficient binary format of datasets for large scale training. Support for the `mmap`, `cached` index file and the `lazy` loader format.
@@ -445,7 +445,7 @@ python checkpoint_utils/megatgron_gpt2/checkpoint_reshaping_and_interoperability
 ## Megatron-LM GPT models support returning logits and `megatron_generate` function for text generation
 1. Returning logits require setting `require_logits=True` in MegatronLMPlugin as shown below.
-These would be available on the in the last stage of pipeline.
+These would be available in the last stage of pipeline.
 ```python
 megatron_lm_plugin = MegatronLMPlugin(return_logits=True)
 ```
@@ -569,7 +569,7 @@ setting is synonymous with gradient accumulation.
 7. When using Megatron-LM, use `accelerator.save_state` and `accelerator.load_state` for saving and loading checkpoints.
-8. Below are the mapping from Megatron-LM model architectures to the the equivalent transformers model architectures.
+8. Below are the mapping from Megatron-LM model architectures to the equivalent transformers model architectures.
 Only these transformers model architectures are supported.
 a. Megatron-LM [BertModel](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/bert_model.py) :
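For point 7 above, a minimal sketch of the checkpointing calls; the directory name is hypothetical.

```python
from accelerate import Accelerator

accelerator = Accelerator()
# ... prepare model/optimizer/dataloaders and train ...
accelerator.save_state("checkpoints/step_1000")   # hypothetical path

# Later, resume from the same location:
accelerator.load_state("checkpoints/step_1000")
```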
@@ -225,7 +225,7 @@ In [/slurm/submit_multinode.sh](./slurm/submit_multinode.sh) we must specify the
 In [/slurm/submit_multicpu.sh](./slurm/submit_multicpu.sh) we must specify the number of nodes that will be part of the training (`--num_machines`), how many CPU processes we will use in total (`--num_processes`), the [`backend`](https://pytorch.org/docs/stable/elastic/run.html#note-on-rendezvous-backend), `--main_process_ip` which will be the address the master node and the `--main_process_port`. `mpirun_hostfile` specifies to run the job using MPIRun.
-In both scripts, we run `activateEnviroment.sh` at the beginning. This script should contain the necessary instructions to initialize the environment for execution. Below, we show an example that loads the necessary libraries ([Environment modules](https://github.com/cea-hpc/modules)), activates the Python environment, and sets up various environment variables, most of them to run the scripts in offline mode in case we don't have internet connection from the cluster.
+In both scripts, we run `activateEnvironment.sh` at the beginning. This script should contain the necessary instructions to initialize the environment for execution. Below, we show an example that loads the necessary libraries ([Environment modules](https://github.com/cea-hpc/modules)), activates the Python environment, and sets up various environment variables, most of them to run the scripts in offline mode in case we don't have internet connection from the cluster.
 ```bash
 # activateEnvironment.sh
@@ -8,7 +8,7 @@
 #SBATCH --error=E-%x.%j
 ######################
-### Set enviroment ###
+### Set environment ###
 ######################
 source activateEnvironment.sh
@@ -11,7 +11,7 @@
 #SBATCH --time=01:59:00 # maximum execution time (HH:MM:SS)
 ######################
-### Set enviroment ###
+### Set environment ###
 ######################
 source activateEnvironment.sh
 export GPUS_PER_NODE=4
@@ -11,7 +11,7 @@
 #SBATCH --time=01:59:00 # maximum execution time (HH:MM:SS)
 ######################
-### Set enviroment ###
+### Set environment ###
 ######################
 source activateEnvironment.sh
 export GPUS_PER_NODE=4
@@ -11,7 +11,7 @@
 #SBATCH --time=01:59:00 # maximum execution time (HH:MM:SS)
 ######################
-### Set enviroment ###
+### Set environment ###
 ######################
 source activateEnvironment.sh
 export GPUS_PER_NODE=4
@@ -1675,24 +1675,24 @@ class Accelerator:
 # * this attribute will always set by init_utils.init_core_state so its always not None.
 # * mixed_precision.param_dtype only regards _fwd_bwd_param_dtype
 # * if model is loaded in 16bit, and even if mixed_precision.param_dtype is None,
-#   we sill want to upcast the flat_param.
+#   we still want to upcast the flat_param.
 if self.mixed_precision != "no": # if mixed precision is set
 upcasted_log = []
 for module in FSDP.fsdp_modules(model):
 # Referencing DeepSpeed Zero3
 # - in Init, params are converted to 16bit while partitioning.
-# - in accelerator.prepare, deepspeed.initalize is called to:
-#   * creates the DeepSpeeedEngine.
+# - in accelerator.prepare, deepspeed.initialize is called to:
+#   * creates the DeepSpeedEngine.
 #   * since zero_optimization() is True , calls engine._configure_zero_optimizer.
 #
-# Inside the DeepSpeed Zero3 optimizer configuration, which initalizes
+# Inside the DeepSpeed Zero3 optimizer configuration, which initializes
 # DeepSpeedZeroOptimizer_Stage3, during which:
 #   * trainable_param_groups are obtained from the attached optimizer
 #     (already partitioned in 16bit).
 #   * then _setup_for_real_optimizer -> _create_fp32_partitions
 #     which performs the fp32 upcasting.
-# To mimick DeepSeepds's casting in FSDP, we look at the (single) FlatParameter held
+# To mimic DeepSeepds's casting in FSDP, we look at the (single) FlatParameter held
 # within an FSDP wrapper. This FlatParameter will be seen by the optimizer.
 #  - even though there is a torch.device('meta') guard below, we
 #    expect _init_utils._init_param_handle_from_module to already
@@ -3194,7 +3194,7 @@ class Accelerator:
 If a `ProjectConfiguration` was passed to the `Accelerator` object with `automatic_checkpoint_naming` enabled
 then checkpoints will be saved to `self.project_dir/checkpoints`. If the number of current saves is greater
-than `total_limit` then the oldest save is deleted. Each checkpoint is saved in seperate folders named
+than `total_limit` then the oldest save is deleted. Each checkpoint is saved in separate folders named
 `checkpoint_<iteration>`.
 Otherwise they are just saved to `output_dir`.
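A short sketch of the automatic-naming behaviour described in this docstring; the project directory and limit are assumed values.

```python
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

project_config = ProjectConfiguration(
    project_dir="runs/my_experiment",     # assumed directory
    automatic_checkpoint_naming=True,     # saves go to runs/my_experiment/checkpoints/checkpoint_<iteration>
    total_limit=5,                        # keep at most 5 saves; the oldest is deleted first
)
accelerator = Accelerator(project_config=project_config)
accelerator.save_state()                  # no explicit output_dir needed with automatic naming
```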
@@ -639,7 +639,7 @@ def get_cluster_input():
 else:
 machine_type = "GPU(s)"
 gpu_ids = _ask_field(
-f"What {machine_type} (by id) should be used for training on this machine as a comma-seperated list? [all]:",
+f"What {machine_type} (by id) should be used for training on this machine as a comma-separated list? [all]:",
 default="all",
 )
@@ -703,7 +703,7 @@ def get_cluster_input():
 )
 tpu_command_file = os.path.abspath(tpu_command_file)
 else:
-print("Please enter each command seperately you wish to run on startup in each pod.")
+print("Please enter each command separately you wish to run on startup in each pod.")
 tpu_commands = []
 another_command = True
 while another_command:
@@ -721,11 +721,11 @@ def get_cluster_input():
 error_message="Please enter yes or no.",
 )
 tpu_vm = _ask_field(
-"If not using an instance group, what are the names of the Compute VM instances to be used, seperated by a comma: ",
+"If not using an instance group, what are the names of the Compute VM instances to be used, separated by a comma: ",
 default="",
 ).split(",")
 tpu_env = _ask_field(
-"What environment variables do you wish to set in each pod, seperated by a comma: ",
+"What environment variables do you wish to set in each pod, separated by a comma: ",
 default="",
 ).split(",")
@@ -43,7 +43,7 @@ def write_basic_config(mixed_precision="no", save_location: str = default_json_c
 Mixed Precision to use. Should be one of "no", "fp16", or "bf16"
 save_location (`str`, *optional*, defaults to `default_json_config_file`):
 Optional custom save location. Should be passed to `--config_file` when using `accelerate launch`. Default
-location is inside the huggingface cache folder (`~/.cache/huggingface`) but can be overriden by setting
+location is inside the huggingface cache folder (`~/.cache/huggingface`) but can be overridden by setting
 the `HF_HOME` environmental variable, followed by `accelerate/default_config.yaml`.
 """
 path = Path(save_location)
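A minimal usage sketch of the helper documented in this hunk; the mixed-precision choice is an example value.

```python
from accelerate.utils import write_basic_config

# Writes a default single-machine config; pass save_location to control where it goes,
# otherwise it lands in the huggingface cache folder described above.
write_basic_config(mixed_precision="fp16")
```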
@@ -282,7 +282,7 @@ def launch_command_parser(subparsers=None):
 distributed_args.add_argument(
 "--gpu_ids",
 default=None,
-help="What GPUs (by id) should be used for training on this machine as a comma-seperated list",
+help="What GPUs (by id) should be used for training on this machine as a comma-separated list",
 )
 distributed_args.add_argument(
 "--same_network",
@@ -707,7 +707,7 @@ def launch_command_parser(subparsers=None):
 "--fp8_override_linear_precision",
 type=lambda x: tuple(map(str_to_bool, x.split(","))),
 default=(False, False, False),
-help="Whether or not to execute `fprop`, `dgrad`, and `wgrad` GEMMS in higher precision. Should be passed in a comma-seperated string of booleans (useful only when `--fp8_backend=te` is passed).",
+help="Whether or not to execute `fprop`, `dgrad`, and `wgrad` GEMMS in higher precision. Should be passed in a comma-separated string of booleans (useful only when `--fp8_backend=te` is passed).",
 )
 fp8_args.add_argument(
 "--fp8_opt_level",
@@ -28,7 +28,7 @@ def get_function_contents_by_name(lines: list[str], name: str):
 Args:
 lines (`List[str]`):
-Source code of a script seperated by line.
+Source code of a script separated by line.
 name (`str`):
 The name of the function to extract. Should be either `training_function` or `main`
 """
@@ -54,7 +54,7 @@ def clean_lines(lines: list[str]):
 Args:
 lines (`List[str]`):
-Source code of a script seperated by line.
+Source code of a script separated by line.
 """
 return [line for line in lines if not line.lstrip().startswith("#") and line != "\n"]
@@ -771,7 +771,7 @@ class SubprocessCallException(Exception):
 def run_command(command: list[str], return_stdout=False, env=None):
 """
 Runs `command` with `subprocess.check_output` and will potentially return the `stdout`. Will also properly capture
-if an error occured while running `command`
+if an error occurred while running `command`
 """
 # Cast every path in `command` to a string
 for i, c in enumerate(command):
@@ -370,10 +370,10 @@ class MegatronLMOptimizerWrapper(AcceleratedOptimizer):
 super().__init__(optimizer, device_placement=False, scaler=None)
 def zero_grad(self, set_to_none=None):
-pass # `model(**batch)` is doing that automatically. Therefore, it's implementation is not needed
+pass # `model(**batch)` is doing that automatically. Therefore, its implementation is not needed
 def step(self):
-pass # `model(**batch)` is doing that automatically. Therefore, it's implementation is not needed
+pass # `model(**batch)` is doing that automatically. Therefore, its implementation is not needed
 @property
 def step_was_skipped(self):
@@ -416,7 +416,7 @@ class MegatronLMSchedulerWrapper(AcceleratedScheduler):
 super().__init__(scheduler, optimizers)
 def step(self, *args, **kwargs):
-return # `model(**batch)` is doing that automatically. Therefore, it's implementation is not needed
+return # `model(**batch)` is doing that automatically. Therefore, its implementation is not needed
 def prepare_scheduler(accelerator, optimizer, scheduler):
@@ -630,7 +630,7 @@ class GPTTrainStep(AbstractTrainStep):
 labels = tokens_[:, 1:].contiguous()
 tokens = tokens_[:, :-1].contiguous()
-# Get the masks and postition ids.
+# Get the masks and position ids.
 attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
 tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, self.eod_mask_loss
 )
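The `labels` / `tokens` shift in this hunk is the standard next-token-prediction setup; a toy illustration with made-up token ids:

```python
import torch

tokens_ = torch.tensor([[11, 12, 13, 14]])   # a batch of token ids (toy values)
labels = tokens_[:, 1:].contiguous()         # [[12, 13, 14]] -> targets are the *next* tokens
tokens = tokens_[:, :-1].contiguous()        # [[11, 12, 13]] -> inputs exclude the final token
```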
@@ -647,7 +647,7 @@ class GPTTrainStep(AbstractTrainStep):
 tokens_ = torch.concat([tokens_, padding], dim=1)
 labels = tokens_[:, 1:].contiguous()
 tokens = tokens_[:, :-1].contiguous()
-# Get the masks and postition ids.
+# Get the masks and position ids.
 attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
 tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, True
 )
@@ -1348,7 +1348,7 @@ class MegatronEngine(torch.nn.Module):
 sizes_list = [
 prompts_tokens_tensor.size(0), # Batch size
 prompts_tokens_tensor.size(1),
-] # Sequence lenght
+] # Sequence length
 # First, broadcast the sizes.
 sizes_tensor = broadcast_int_list(2, int_list=sizes_list, rank=0)