fix typos (#3549)
@@ -26,7 +26,7 @@ You will also learn how to setup a few requirements needed for ensuring your env
 ## Configuring the Environment
-Before any training can be performed, a Accelerate config file must exist in the system. Usually this can be done by running the following in a terminal and answering the prompts:
+Before any training can be performed, an Accelerate config file must exist in the system. Usually this can be done by running the following in a terminal and answering the prompts:
 ```bash
 accelerate config
@@ -52,7 +52,7 @@ os._exit(00) # Restart the notebook
 ## Preparing the Dataset and Model
-Next you should prepare your dataset. As mentioned at earlier, great care should be taken when preparing the `DataLoaders` and model to make sure that **nothing** is put on *any* GPU.
+Next you should prepare your dataset. As mentioned earlier, great care should be taken when preparing the `DataLoaders` and model to make sure that **nothing** is put on *any* GPU.
 If you do, it is recommended to put that specific code into a function and call that from within the notebook launcher interface, which will be shown later.
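For context, a minimal sketch of the pattern this hunk documents: everything that touches a GPU goes inside a function handed to `notebook_launcher`. The function body, its arguments, and the process count are assumptions for a two-GPU machine.

```python
from accelerate import Accelerator, notebook_launcher

def training_loop(mixed_precision="fp16", seed=42):
    # Build the DataLoaders and model *inside* this function so that
    # nothing is placed on a GPU before the launcher spawns its processes.
    accelerator = Accelerator(mixed_precision=mixed_precision)
    accelerator.print(f"Launched on {accelerator.num_processes} processes (seed={seed})")

# num_processes=2 is an assumption for a 2-GPU machine.
notebook_launcher(training_loop, ("fp16", 42), num_processes=2)
```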
@@ -153,7 +153,7 @@ To use [`find_executable_batch_size`], restructure your training function to inc
 <Tip warning={true}>
-The inner function **must** take batch size as the first parameter, but we do not pass one to it when called. The wrapper will handles this for you. Any object (models, optimizers) that consumes device memory and is passed to the [`Accelerator`] also **must** be declared inside the inner function.
+The inner function **must** take batch size as the first parameter, but we do not pass one to it when called. The wrapper will handle this for you. Any object (models, optimizers) that consumes device memory and is passed to the [`Accelerator`] also **must** be declared inside the inner function.
 </Tip>
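A minimal sketch of the decorated inner-function pattern the tip refers to; the starting batch size and the `build_model` / `build_optimizer` / `train` helpers are hypothetical placeholders.

```python
from accelerate import Accelerator
from accelerate.utils import find_executable_batch_size

def training_function():
    accelerator = Accelerator()

    @find_executable_batch_size(starting_batch_size=128)
    def inner_training_loop(batch_size):
        nonlocal accelerator              # reuse the outer Accelerator
        accelerator.free_memory()         # release references from a failed (OOM) attempt
        # Declare everything that consumes device memory inside the inner function:
        model = build_model()                  # hypothetical helper
        optimizer = build_optimizer(model)     # hypothetical helper
        model, optimizer = accelerator.prepare(model, optimizer)
        train(model, optimizer, batch_size)    # hypothetical training loop

    inner_training_loop()  # called without a batch size; the wrapper supplies it
```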
@@ -109,7 +109,7 @@ While FSDP require an explicit `--fsdp_cpu_ram_efficient_loading true` to activa
 <Tip>
 For FSDP, whenever setting `--fsdp_cpu_ram_efficient_loading true`, `accelerate` will automatically set `sync_module_states` to true.
-For RAM efficient loading the weights will be loaded only in a singe rank, and thus requires `sync_module_states` to broadcast weights to other ranks.
+For RAM efficient loading the weights will be loaded only in a single rank, and thus requires `sync_module_states` to broadcast weights to other ranks.
 </Tip>
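A hedged sketch of the same pair of settings expressed through the Python API, assuming `FullyShardedDataParallelPlugin` exposes kwargs mirroring the CLI flags (verify against your `accelerate` version).

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Assumed kwargs mirroring --fsdp_cpu_ram_efficient_loading / sync_module_states.
fsdp_plugin = FullyShardedDataParallelPlugin(
    cpu_ram_efficient_loading=True,  # load weights on a single rank only
    sync_module_states=True,         # broadcast the loaded weights to the other ranks
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```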
@@ -125,7 +125,7 @@ FSDP requires an explicit `--fsdp_auto_wrap_policy` for the algorithm to decide
 ### Parameters Summoning
-FSDP requires an explicit `--fsdp_use_orig_params` flag if using `torch.compile`, see [the pytorch documenation](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp). For DeepSpeed this is transparent to the user.
+FSDP requires an explicit `--fsdp_use_orig_params` flag if using `torch.compile`, see [the pytorch documentation](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp). For DeepSpeed this is transparent to the user.
 <Tip>
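A small sketch of how the flag maps onto the Python API when combining FSDP with `torch.compile`; the model and the point at which compilation happens are assumptions.

```python
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(use_orig_params=True)  # needed for torch.compile
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

model = torch.nn.Linear(8, 8)          # placeholder model
model = accelerator.prepare(model)     # FSDP wrapping happens here
model = torch.compile(model)           # compile the wrapped model (assumed ordering)
```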
@@ -147,7 +147,7 @@ Deepspeed requires explicit `--gradient_accumulation_steps` and `--gradient_clip
 ## On Differences in Data Precision Handling
-To discuss the how data precision is handled in both FSDP and Deepspeed, it is instructive to first give an overview of how model parameters are handled in these frameworks. Before the model / optimizer parameters are distributed across GPUs, parameter preparation is involved to first "flatten" them to one-dimensional [`torch.Tensor`](https://pytorch.org/docs/stable/tensors.html#torch-tensor). The implementation of FSDP / DeepSpeed varies in the respect of the `dtype` in which these "flattened" parameters are stored, and there are ramifications with regards to how [`torch.Optimizer`](https://pytorch.org/docs/stable/optim.html#module-torch.optim) allocate their `dtype`s. The table below outlines the processes for both frameworks; the "Local" column indicates the process occurring at a per-gpu level, therefore any memory overheads by upcasting should be understood to be amortized by the number of gpus used.
+To discuss how data precision is handled in both FSDP and Deepspeed, it is instructive to first give an overview of how model parameters are handled in these frameworks. Before the model / optimizer parameters are distributed across GPUs, parameter preparation is involved to first "flatten" them to one-dimensional [`torch.Tensor`](https://pytorch.org/docs/stable/tensors.html#torch-tensor). The implementation of FSDP / DeepSpeed varies in the respect of the `dtype` in which these "flattened" parameters are stored, and there are ramifications with regards to how [`torch.Optimizer`](https://pytorch.org/docs/stable/optim.html#module-torch.optim) allocate their `dtype`s. The table below outlines the processes for both frameworks; the "Local" column indicates the process occurring at a per-gpu level, therefore any memory overheads by upcasting should be understood to be amortized by the number of gpus used.
 <Tip>
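To make the "flatten, then pick a storage dtype" step described above concrete, here is a toy illustration in plain PyTorch (not the actual FSDP/DeepSpeed code).

```python
import torch

# Two bf16 parameter tensors, as they might exist after a 16-bit model load.
params = [torch.randn(4, 4, dtype=torch.bfloat16), torch.randn(8, dtype=torch.bfloat16)]

# "Flatten" them into a single one-dimensional tensor ...
flat_param = torch.cat([p.reshape(-1) for p in params])

# ... and upcast a master copy to fp32 for the optimizer, which is where the
# per-GPU memory overhead discussed in the table comes from.
master_flat_param = flat_param.to(torch.float32).requires_grad_()
optimizer = torch.optim.AdamW([master_flat_param], lr=1e-3)
```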
@@ -166,7 +166,7 @@ Optimizer (Actual Step) | ✅ | FSDP<br>DeepSpeed | occurs in `torch_dtype` <br
 <Tip warning={true}>
-Therefore when using DeepSpeed a small number of GPUs, be aware of potentially significant memory overheads due to the upcasting during preperation.
+Therefore when using DeepSpeed a small number of GPUs, be aware of potentially significant memory overheads due to the upcasting during preparation.
 </Tip>
@@ -164,7 +164,7 @@ The following arguments are useful for selecting which training paradigm to use.
 The following arguments are only useful when `multi_gpu` is passed or multi-gpu training is configured through `accelerate config`:
-* `--gpu_ids` (`str`) -- What GPUs (by id) should be used for training on this machine as a comma-seperated list
+* `--gpu_ids` (`str`) -- What GPUs (by id) should be used for training on this machine as a comma-separated list
 * `--same_network` (`bool`) -- Whether all machines used for multinode training exist on the same local network.
 * `--machine_rank` (`int`) -- The rank of the machine on which this script is launched.
 * `--main_process_ip` (`str`) -- The IP address of the machine of rank 0.
@@ -19,7 +19,7 @@ rendered properly in your Markdown viewer.
 [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) enables training large transformer language models at scale.
 It provides efficient tensor, pipeline and sequence based model parallelism for pre-training transformer based
 Language Models such as [GPT](https://arxiv.org/abs/2005.14165) (Decoder Only), [BERT](https://arxiv.org/pdf/1810.04805.pdf) (Encoder Only) and [T5](https://arxiv.org/abs/1910.10683) (Encoder-Decoder).
-For detailed information and how things work behind the scene please refer the github [repo](https://github.com/NVIDIA/Megatron-LM).
+For detailed information and how things work behind the scene please refer to the github [repo](https://github.com/NVIDIA/Megatron-LM).
 ## What is integrated?
@@ -30,7 +30,7 @@ a. **Tensor Parallelism (TP)**: Reduces memory footprint without much additional
 Each tensor is split into multiple chunks with each shard residing on separate GPU. At each step, the same mini-batch of data is processed
 independently and in parallel by each shard followed by syncing across all GPUs (`all-reduce` operation).
 In a simple transformer layer, this leads to 2 `all-reduces` in the forward path and 2 in the backward path.
-For more details, please refer research paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using
+For more details, please refer to the research paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using
 Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) and
 this section of blogpost [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#tensor-parallelism).
@@ -45,7 +45,7 @@ this section of blogpost [The Technology Behind BLOOM Training](https://huggingf
 c. **Sequence Parallelism (SP)**: Reduces memory footprint without any additional communication. Only applicable when using TP.
 It reduces activation memory required as it prevents the same copies to be on the tensor parallel ranks
-post `all-reduce` by replacing then with `reduce-scatter` and `no-op` operation would be replaced by `all-gather`.
+post `all-reduce` by replacing them with `reduce-scatter` and `no-op` operation would be replaced by `all-gather`.
 As `all-reduce = reduce-scatter + all-gather`, this saves a ton of activation memory at no added communication cost.
 To put it simply, it shards the outputs of each transformer layer along sequence dimension, e.g.,
 if the sequence length is `1024` and the TP size is `4`, each GPU will have `256` tokens (1024/4) for each sample.
@@ -56,7 +56,7 @@ d. **Data Parallelism (DP)** via Distributed Optimizer: Reduces the memory footp
 (versus the traditional method of replicating the optimizer state across data parallel ranks).
 For example, when using Adam optimizer with mixed-precision training, each parameter accounts for 12 bytes of memory.
 This gets distributed equally across the GPUs, i.e., each parameter would account for 3 bytes (12/4) if we have 4 GPUs.
-For more details, please refer the research paper [ZeRO: Memory Optimizations Toward Training Trillion
+For more details, please refer to the research paper [ZeRO: Memory Optimizations Toward Training Trillion
 Parameter Models](https://arxiv.org/pdf/1910.02054.pdf) and following section of blog
 [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed#zero-data-parallelism).
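The 12-bytes-per-parameter arithmetic quoted above can be written out as a quick back-of-the-envelope helper; illustrative only, using only the figures given in the text.

```python
def optimizer_state_bytes_per_gpu(num_params: int, dp_size: int, bytes_per_param: int = 12) -> float:
    """Adam with mixed precision keeps ~12 bytes of optimizer state per parameter;
    a distributed optimizer shards that state across the data-parallel ranks."""
    return num_params * bytes_per_param / dp_size

# A 1B-parameter model on 4 data-parallel GPUs -> ~3 GB of optimizer state per GPU.
print(optimizer_state_bytes_per_gpu(1_000_000_000, 4) / 1e9)
```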
@@ -66,7 +66,7 @@ For example, for GPT-3, this leads to 70% reduction in required memory for activ
 only 2.7% FLOPs overhead for recomputation of activations. For more details, please refer to the research paper
 [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/pdf/2205.05198.pdf).
 f. **Fused Kernels**: Fused Softmax, Mixed Precision Fused Layer Norm and Fused gradient accumulation to weight gradient computation of linear layer.
 PyTorch JIT compiled Fused GeLU and Fused Bias+Dropout+Residual addition.
 g. **Support for Indexed datasets**: Efficient binary format of datasets for large scale training. Support for the `mmap`, `cached` index file and the `lazy` loader format.
@@ -445,7 +445,7 @@ python checkpoint_utils/megatgron_gpt2/checkpoint_reshaping_and_interoperability
 ## Megatron-LM GPT models support returning logits and `megatron_generate` function for text generation
 1. Returning logits require setting `require_logits=True` in MegatronLMPlugin as shown below.
-These would be available on the in the last stage of pipeline.
+These would be available in the last stage of pipeline.
 ```python
 megatron_lm_plugin = MegatronLMPlugin(return_logits=True)
 ```
@@ -569,7 +569,7 @@ setting is synonymous with gradient accumulation.
 7. When using Megatron-LM, use `accelerator.save_state` and `accelerator.load_state` for saving and loading checkpoints.
-8. Below are the mapping from Megatron-LM model architectures to the the equivalent transformers model architectures.
+8. Below are the mapping from Megatron-LM model architectures to the equivalent transformers model architectures.
 Only these transformers model architectures are supported.
 a. Megatron-LM [BertModel](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/bert_model.py) :
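For point 7 above, a minimal sketch of the checkpointing calls; the directory name is hypothetical.

```python
from accelerate import Accelerator

accelerator = Accelerator()
# ... prepare model/optimizer/dataloaders and train ...
accelerator.save_state("checkpoints/step_1000")   # hypothetical path

# Later, resume from the same location:
accelerator.load_state("checkpoints/step_1000")
```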
@@ -225,7 +225,7 @@ In [/slurm/submit_multinode.sh](./slurm/submit_multinode.sh) we must specify the
 In [/slurm/submit_multicpu.sh](./slurm/submit_multicpu.sh) we must specify the number of nodes that will be part of the training (`--num_machines`), how many CPU processes we will use in total (`--num_processes`), the [`backend`](https://pytorch.org/docs/stable/elastic/run.html#note-on-rendezvous-backend), `--main_process_ip` which will be the address the master node and the `--main_process_port`. `mpirun_hostfile` specifies to run the job using MPIRun.
-In both scripts, we run `activateEnviroment.sh` at the beginning. This script should contain the necessary instructions to initialize the environment for execution. Below, we show an example that loads the necessary libraries ([Environment modules](https://github.com/cea-hpc/modules)), activates the Python environment, and sets up various environment variables, most of them to run the scripts in offline mode in case we don't have internet connection from the cluster.
+In both scripts, we run `activateEnvironment.sh` at the beginning. This script should contain the necessary instructions to initialize the environment for execution. Below, we show an example that loads the necessary libraries ([Environment modules](https://github.com/cea-hpc/modules)), activates the Python environment, and sets up various environment variables, most of them to run the scripts in offline mode in case we don't have internet connection from the cluster.
 ```bash
 # activateEnvironment.sh
@@ -8,7 +8,7 @@
 #SBATCH --error=E-%x.%j
 ######################
-### Set enviroment ###
+### Set environment ###
 ######################
 source activateEnvironment.sh
@@ -11,7 +11,7 @@
 #SBATCH --time=01:59:00 # maximum execution time (HH:MM:SS)
 ######################
-### Set enviroment ###
+### Set environment ###
 ######################
 source activateEnvironment.sh
 export GPUS_PER_NODE=4
@@ -11,7 +11,7 @@
 #SBATCH --time=01:59:00 # maximum execution time (HH:MM:SS)
 ######################
-### Set enviroment ###
+### Set environment ###
 ######################
 source activateEnvironment.sh
 export GPUS_PER_NODE=4
@@ -11,7 +11,7 @@
 #SBATCH --time=01:59:00 # maximum execution time (HH:MM:SS)
 ######################
-### Set enviroment ###
+### Set environment ###
 ######################
 source activateEnvironment.sh
 export GPUS_PER_NODE=4
@@ -1675,24 +1675,24 @@ class Accelerator:
 # * this attribute will always set by init_utils.init_core_state so its always not None.
 # * mixed_precision.param_dtype only regards _fwd_bwd_param_dtype
 # * if model is loaded in 16bit, and even if mixed_precision.param_dtype is None,
-#   we sill want to upcast the flat_param.
+#   we still want to upcast the flat_param.
 if self.mixed_precision != "no": # if mixed precision is set
 upcasted_log = []
 for module in FSDP.fsdp_modules(model):
 # Referencing DeepSpeed Zero3
 # - in Init, params are converted to 16bit while partitioning.
-# - in accelerator.prepare, deepspeed.initalize is called to:
-#   * creates the DeepSpeeedEngine.
+# - in accelerator.prepare, deepspeed.initialize is called to:
+#   * creates the DeepSpeedEngine.
 #   * since zero_optimization() is True , calls engine._configure_zero_optimizer.
 #
-# Inside the DeepSpeed Zero3 optimizer configuration, which initalizes
+# Inside the DeepSpeed Zero3 optimizer configuration, which initializes
 # DeepSpeedZeroOptimizer_Stage3, during which:
 #   * trainable_param_groups are obtained from the attached optimizer
 #     (already partitioned in 16bit).
 #   * then _setup_for_real_optimizer -> _create_fp32_partitions
 #     which performs the fp32 upcasting.
-# To mimick DeepSeepds's casting in FSDP, we look at the (single) FlatParameter held
+# To mimic DeepSeepds's casting in FSDP, we look at the (single) FlatParameter held
 # within an FSDP wrapper. This FlatParameter will be seen by the optimizer.
 #  - even though there is a torch.device('meta') guard below, we
 #    expect _init_utils._init_param_handle_from_module to already
@@ -3194,7 +3194,7 @@ class Accelerator:
 If a `ProjectConfiguration` was passed to the `Accelerator` object with `automatic_checkpoint_naming` enabled
 then checkpoints will be saved to `self.project_dir/checkpoints`. If the number of current saves is greater
-than `total_limit` then the oldest save is deleted. Each checkpoint is saved in seperate folders named
+than `total_limit` then the oldest save is deleted. Each checkpoint is saved in separate folders named
 `checkpoint_<iteration>`.
 Otherwise they are just saved to `output_dir`.
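A short sketch of the automatic-naming behaviour described in this docstring; the project directory and limit are assumed values.

```python
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

project_config = ProjectConfiguration(
    project_dir="runs/my_experiment",     # assumed directory
    automatic_checkpoint_naming=True,     # saves go to runs/my_experiment/checkpoints/checkpoint_<iteration>
    total_limit=5,                        # keep at most 5 saves; the oldest is deleted first
)
accelerator = Accelerator(project_config=project_config)
accelerator.save_state()                  # no explicit output_dir needed with automatic naming
```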
@@ -639,7 +639,7 @@ def get_cluster_input():
 else:
 machine_type = "GPU(s)"
 gpu_ids = _ask_field(
-f"What {machine_type} (by id) should be used for training on this machine as a comma-seperated list? [all]:",
+f"What {machine_type} (by id) should be used for training on this machine as a comma-separated list? [all]:",
 default="all",
 )
@@ -703,7 +703,7 @@ def get_cluster_input():
 )
 tpu_command_file = os.path.abspath(tpu_command_file)
 else:
-print("Please enter each command seperately you wish to run on startup in each pod.")
+print("Please enter each command separately you wish to run on startup in each pod.")
 tpu_commands = []
 another_command = True
 while another_command:
@@ -721,11 +721,11 @@ def get_cluster_input():
 error_message="Please enter yes or no.",
 )
 tpu_vm = _ask_field(
-"If not using an instance group, what are the names of the Compute VM instances to be used, seperated by a comma: ",
+"If not using an instance group, what are the names of the Compute VM instances to be used, separated by a comma: ",
 default="",
 ).split(",")
 tpu_env = _ask_field(
-"What environment variables do you wish to set in each pod, seperated by a comma: ",
+"What environment variables do you wish to set in each pod, separated by a comma: ",
 default="",
 ).split(",")
@@ -43,7 +43,7 @@ def write_basic_config(mixed_precision="no", save_location: str = default_json_c
 Mixed Precision to use. Should be one of "no", "fp16", or "bf16"
 save_location (`str`, *optional*, defaults to `default_json_config_file`):
 Optional custom save location. Should be passed to `--config_file` when using `accelerate launch`. Default
-location is inside the huggingface cache folder (`~/.cache/huggingface`) but can be overriden by setting
+location is inside the huggingface cache folder (`~/.cache/huggingface`) but can be overridden by setting
 the `HF_HOME` environmental variable, followed by `accelerate/default_config.yaml`.
 """
 path = Path(save_location)
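A minimal usage sketch of the helper documented in this hunk; the mixed-precision choice is an example value.

```python
from accelerate.utils import write_basic_config

# Writes a default single-machine config; pass save_location to control where it goes,
# otherwise it lands in the huggingface cache folder described above.
write_basic_config(mixed_precision="fp16")
```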
@@ -282,7 +282,7 @@ def launch_command_parser(subparsers=None):
 distributed_args.add_argument(
 "--gpu_ids",
 default=None,
-help="What GPUs (by id) should be used for training on this machine as a comma-seperated list",
+help="What GPUs (by id) should be used for training on this machine as a comma-separated list",
 )
 distributed_args.add_argument(
 "--same_network",
@@ -707,7 +707,7 @@ def launch_command_parser(subparsers=None):
 "--fp8_override_linear_precision",
 type=lambda x: tuple(map(str_to_bool, x.split(","))),
 default=(False, False, False),
-help="Whether or not to execute `fprop`, `dgrad`, and `wgrad` GEMMS in higher precision. Should be passed in a comma-seperated string of booleans (useful only when `--fp8_backend=te` is passed).",
+help="Whether or not to execute `fprop`, `dgrad`, and `wgrad` GEMMS in higher precision. Should be passed in a comma-separated string of booleans (useful only when `--fp8_backend=te` is passed).",
 )
 fp8_args.add_argument(
 "--fp8_opt_level",
@@ -28,7 +28,7 @@ def get_function_contents_by_name(lines: list[str], name: str):
 Args:
 lines (`List[str]`):
-Source code of a script seperated by line.
+Source code of a script separated by line.
 name (`str`):
 The name of the function to extract. Should be either `training_function` or `main`
 """
@@ -54,7 +54,7 @@ def clean_lines(lines: list[str]):
 Args:
 lines (`List[str]`):
-Source code of a script seperated by line.
+Source code of a script separated by line.
 """
 return [line for line in lines if not line.lstrip().startswith("#") and line != "\n"]
@@ -771,7 +771,7 @@ class SubprocessCallException(Exception):
 def run_command(command: list[str], return_stdout=False, env=None):
 """
 Runs `command` with `subprocess.check_output` and will potentially return the `stdout`. Will also properly capture
-if an error occured while running `command`
+if an error occurred while running `command`
 """
 # Cast every path in `command` to a string
 for i, c in enumerate(command):
@@ -370,10 +370,10 @@ class MegatronLMOptimizerWrapper(AcceleratedOptimizer):
 super().__init__(optimizer, device_placement=False, scaler=None)
 def zero_grad(self, set_to_none=None):
-pass # `model(**batch)` is doing that automatically. Therefore, it's implementation is not needed
+pass # `model(**batch)` is doing that automatically. Therefore, its implementation is not needed
 def step(self):
-pass # `model(**batch)` is doing that automatically. Therefore, it's implementation is not needed
+pass # `model(**batch)` is doing that automatically. Therefore, its implementation is not needed
 @property
 def step_was_skipped(self):
@@ -416,7 +416,7 @@ class MegatronLMSchedulerWrapper(AcceleratedScheduler):
 super().__init__(scheduler, optimizers)
 def step(self, *args, **kwargs):
-return # `model(**batch)` is doing that automatically. Therefore, it's implementation is not needed
+return # `model(**batch)` is doing that automatically. Therefore, its implementation is not needed
 def prepare_scheduler(accelerator, optimizer, scheduler):
@@ -630,7 +630,7 @@ class GPTTrainStep(AbstractTrainStep):
 labels = tokens_[:, 1:].contiguous()
 tokens = tokens_[:, :-1].contiguous()
-# Get the masks and postition ids.
+# Get the masks and position ids.
 attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
 tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, self.eod_mask_loss
 )
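The `labels` / `tokens` shift in this hunk is the standard next-token-prediction setup; a toy illustration with made-up token ids:

```python
import torch

tokens_ = torch.tensor([[11, 12, 13, 14]])   # a batch of token ids (toy values)
labels = tokens_[:, 1:].contiguous()         # [[12, 13, 14]] -> targets are the *next* tokens
tokens = tokens_[:, :-1].contiguous()        # [[11, 12, 13]] -> inputs exclude the final token
```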
@@ -647,7 +647,7 @@ class GPTTrainStep(AbstractTrainStep):
 tokens_ = torch.concat([tokens_, padding], dim=1)
 labels = tokens_[:, 1:].contiguous()
 tokens = tokens_[:, :-1].contiguous()
-# Get the masks and postition ids.
+# Get the masks and position ids.
 attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
 tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, True
 )
@@ -1348,7 +1348,7 @@ class MegatronEngine(torch.nn.Module):
 sizes_list = [
 prompts_tokens_tensor.size(0), # Batch size
 prompts_tokens_tensor.size(1),
-] # Sequence lenght
+] # Sequence length
 # First, broadcast the sizes.
 sizes_tensor = broadcast_int_list(2, int_list=sizes_list, rank=0)