From 3bc69cc08d45ec4357ad109e0f1d00dd2c9c9956 Mon Sep 17 00:00:00 2001
From: Jane Xu
Date: Mon, 5 May 2025 09:22:07 -0700
Subject: [PATCH] Document that dampening is skipped in SGD momentum first step
 (#152833)

Pointed out by https://x.com/hi_tysam/status/1917318692276174977/photo/2.

It would be BC-breaking to change this behavior 7 years after it was decided, so we are documenting it first at the very least.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152833
Approved by: https://github.com/albanD
---
 torch/optim/sgd.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/torch/optim/sgd.py b/torch/optim/sgd.py
index ddf247ef6559..8002df8b308d 100644
--- a/torch/optim/sgd.py
+++ b/torch/optim/sgd.py
@@ -238,7 +238,9 @@ SGD.__doc__ = (
     Moreover, the initial value of the momentum buffer is set to the
     gradient value at the first step. This is in contrast to some other
-    frameworks that initialize it to all zeros.
+    frameworks that initialize it to all zeros. One notable side effect
+    of this decision is that the first momentum value will not be scaled
+    by dampening. Dampening will be applied starting at the second step.
 """
 )
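The documented behavior can be illustrated with a minimal sketch of the momentum-buffer update. This is a simplified standalone function, not the actual `torch/optim/sgd.py` code; the function name `momentum_buffer_step` and the specific hyperparameter values are illustrative assumptions.

```python
def momentum_buffer_step(buf, grad, momentum=0.9, dampening=0.5):
    """Sketch of how SGD's momentum buffer evolves per step.

    buf is None on the first step (no buffer exists yet).
    """
    if buf is None:
        # First step: the buffer is initialized to the raw gradient.
        # Dampening is NOT applied here -- this is the documented quirk.
        return grad
    # Subsequent steps: dampening scales the incoming gradient.
    return momentum * buf + (1 - dampening) * grad

grad = 1.0
buf = momentum_buffer_step(None, grad)  # first step: 1.0, not (1 - 0.5) * 1.0
buf = momentum_buffer_step(buf, grad)   # second step: 0.9 * 1.0 + 0.5 * 1.0 = 1.4
```

With dampening = 0.5, a zero-initialized buffer (as in some other frameworks) would instead give 0.5 after the first step; PyTorch's gradient initialization yields the full 1.0.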