Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-20 12:54:11 +08:00)
Updates to the release notes scripts and documentation (#94560)
# Summary

This PR makes some significant changes to the release notes scripts. At a high level:

- Turned the quips into docs and updated links.
- Updated the common.categories list in the hope of making it the source of truth for releases. This is hard since the release_notes labels can be changed at will; an alternative would be to poll the GitHub API, but I think that is overkill. The notebook does a set compare and will show you new categories. I think we want this to stay manual so that the release notes engineer decides how to categorize.
- Created category groups after speaking with folks on distributed and AO, who told me these release categories can be merged.
- I am the newest person to Core and don't use ghstack, so I made token retrieval a little more generic.
- Added a classifier.py file. This file will train a commit categorizer for you, hopefully with decent accuracy; I was able to achieve 75% accuracy. I drop the highest-frequency class, "skip", since this creates a more useful categorizer.
- Updated the categorize.py script so that the prompt defaults to what the classifier predicts, gated by a flag.
- Added a README that will hopefully help future release notes engineers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94560 Approved by: https://github.com/albanD
Committed by: PyTorch MergeBot
Parent: 731bb6e61b
Commit: dcafe3f271

220 scripts/release_notes/READEME.md (new file)
@ -0,0 +1,220 @@
# Summary

This is a collection of scripts for getting the list of commits between releases, plus scripts for automatically generating labels for those commits.

The release notes Runbook and other supporting docs can be found here: [Release Notes Supporting Docs](https://drive.google.com/drive/folders/1J0Uwz8oE7TrdcP95zc-id1gdSBPnMKOR?usp=sharing)

An example of generated docs for submodule owners: [2.0 release notes submodule docs](https://drive.google.com/drive/folders/1zQtmF_ak7BkpGEM58YgJfnpNXTnFl25q?usp=share_link)

### Authentication:

First run the `test_release_notes.py` script to make sure you have the correct authentication set up. This script will try to access the GitHub API and will fail if you are not authenticated.
- If you have enabled ghstack then authentication should already be set up correctly.
- Otherwise go to `https://github.com/settings/tokens` and create a token. You can either follow the steps to set up ghstack or set the environment variable `GITHUB_TOKEN`, as in the sketch below.
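For reference, this is roughly how these scripts pick up the token (a minimal sketch based on `get_token` in `common.py`; the real code falls back to parsing your ghstack config instead of raising):

```python
# Minimal sketch of the token lookup used by these scripts (see get_token in common.py).
import os

token = os.environ.get("GITHUB_TOKEN")
if token is None:
    # common.py falls back to the oauth token from your ghstack config here;
    # this sketch just fails loudly instead.
    raise RuntimeError("Set GITHUB_TOKEN or configure ghstack before running the scripts")
print("Token found, the GitHub API calls should authenticate.")
```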
## Steps:

### Part 1: getting a list of commits

You are going to get a list of commits since the last release in CSV format. Usage is as follows.

Assuming `tags/v1.13.1` is the last released version, run the following from this directory:

`python commitlist.py --create_new tags/v1.13.1 <commit_hash>`

This saves a commit list to `results/commitlist.csv`. As a sanity check, please confirm visually that the oldest commits weren't included in the branch cut for the last release (a quick way to eyeball this is sketched below).
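A quick, optional way to eyeball the result (a sketch; it only assumes `results/commitlist.csv` exists and pandas is installed, both of which the later steps rely on anyway):

```python
# Peek at the freshly generated commit list as a sanity check.
import pandas as pd

df = pd.read_csv("results/commitlist.csv")
print(len(df), "commits")
print(df.head())   # oldest commits: confirm these were not part of the previous branch cut
print(df.tail())   # newest commits, up to the hash you passed to --create_new
```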
NB: the commit list contains commits from the merge-base of `tags/<most_recent_release_tag>` and whatever commit hash you give it, so it may contain commits that were cherry-picked to `<most_recent_release_tag>`!

* Go through the list of commits that were cherry-picked into the last release and delete them from `results/commitlist.csv`.
* This is done manually:
  * Look for all the PRs that were merged into the release branch with a GitHub query like: https://github.com/pytorch/pytorch/pulls?q=is%3Apr+base%3Arelease%2F<most_recent_release_tag>+is%3Amerged
  * Look at the commit history https://github.com/pytorch/pytorch/commits/release/<most_recent_release_tag> to find all the direct pushes to the release branch (usually reverts).
* If you prefer to script the deletion, see the sketch below.
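A minimal, hypothetical sketch of scripting that deletion (the hash placeholders are whatever you collected from the two queries above, and they must match the hash format stored in the CSV; commit the CSV first so you can revert):

```python
# Hypothetical helper: drop commits that were already cherry-picked into the last release.
import pandas as pd

cherry_picked_hashes = {"<hash1>", "<hash2>"}  # filled in by hand from the GitHub queries above
df = pd.read_csv("results/commitlist.csv")
df = df[~df.commit_hash.isin(cherry_picked_hashes)]
df.to_csv("results/commitlist.csv", index=False)
```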
If you already have a commit list and want to update it, use the following command. This is helpful if there are cherry-picks to the release branch, or if you're categorizing commits throughout the three months leading up to a release. Warning: this is not very well tested. Make sure that you're on the same branch (e.g., release/<upcoming_release_tag>) as the last time you ran this command, and always *commit* your CSV before running it to avoid losing work.

`python commitlist.py --update_to <commit_hash>`
### Part 2: categorizing commits

#### Exploration and cleanup

This folder contains an IPython notebook (`explore.ipynb`) that I used for exploration and for finding relevant commits. For example, the commit list attempts to categorize commits based on the `release notes:` label, but PyTorch users often add new release notes labels; the notebook has a cell that can help you identify new labels.

The list of all known categories is defined in `common.py`. It also has designations for types of categories, such as the `_frontend` suffix.

The `categorize` function in `commitlist.py` does an adequate job of assigning the appropriate categories. Since new categories may be created for your release, you may find it helpful to add new heuristics based on the files changed, along the lines of the sketch below.
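As a sketch of what such a heuristic might look like (the real checks live in `CommitList.categorize` in `commitlist.py` and use `keywordInFile`; the helper below and its rule table are hypothetical, though the paths and category names come from this PR):

```python
# Hypothetical standalone sketch of a file-path heuristic for categorization.
from typing import List, Optional

def path_based_category(files_changed: List[str]) -> Optional[str]:
    # (path prefix, category) pairs; extend this for categories new to your release.
    rules = [
        ("torch/_dynamo", "dynamo"),
        ("torch/_inductor", "inductor"),
        ("torch/fx", "fx"),
    ]
    for file in files_changed:
        for prefix, category in rules:
            if file.startswith(prefix):
                return category
    return None

print(path_based_category(["torch/_dynamo/eval_frame.py"]))  # -> "dynamo"
```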
If you update the automatic categorization, you can run the following to update the commit list:

`python commitlist.py --rerun_with_new_filters`

Note that this only updates the commits in the commit list that have a category of "Uncategorized".

Once you have dug through the commits and done as much automated categorization as you can, run the commands below for an interface to categorize any remaining commits.
#### Training a commit classifier

I added scripts to train a commit classifier from the set of labeled commits in `commitlist.csv`. It uses the title, author, and files-changed features of each commit. The script requires torchtext and tqdm. I had to install torchtext from source, but if you are also a PyTorch developer it is likely already installed.

- There should already be a `results/` directory from gathering `commitlist.csv`. Next, create the classifier directory: `mkdir results/classifier`
- Run `python classifier.py --train`. This trains the model and saves it for inference.
- Run `python categorize.py --use_classifier`. This pre-populates the prompt with the most likely category; pressing enter confirms the selection (see the sketch after this list).
- Or run `python categorize.py` to label without the classifier.
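For reference, this is roughly what `categorize.py --use_classifier` does to produce a suggestion for each commit (a sketch assembled from the code in this PR; the example commit at the end is made up):

```python
# Sketch of the suggestion path used by categorize.py --use_classifier.
import torch
from pathlib import Path

import common
from classifier import (CommitClassifier, CategoryConfig, XLMR_BASE,
                        get_author_map, get_file_map, CommitClassifierInputs)

device = "cuda" if torch.cuda.is_available() else "cpu"
config = CategoryConfig(common.categories)
author_map = get_author_map(Path("results/classifier"), regen_data=False, assert_stored=True)
file_map = get_file_map(Path("results/classifier"), regen_data=False, assert_stored=True)

classifier = CommitClassifier(XLMR_BASE, author_map, file_map, config).to(device)
classifier.load_state_dict(torch.load(Path("results/classifier/commit_classifier.pt")))
classifier.eval()

# A made-up commit: title, space-separated changed files, and author.
example = CommitClassifierInputs(
    title=["Fix dtype promotion in torch.add"],
    files=["aten/src/ATen/native/BinaryOps.cpp test/test_binary_ufuncs.py"],
    author=["some-contributor"],
)
print(classifier.get_most_likely_category_name(example)[0])
```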
The interface modifies `results/commitlist.csv`. If you want to take a coffee break, you can CTRL-C out of it (`results/commitlist.csv` gets written on each categorization) and then commit and push `results/commitlist.csv` to a branch for safekeeping.

If you want to revert a change you just made, you can edit `results/commitlist.csv` directly.

For each commit, after choosing the category, you can also choose a topic. For the frontend category, you should take the time to do this, as it saves time in the next step. For other categories, only do it if you are 100% sure, as it is confusing for submodule owners otherwise.

The categories are as follows (be sure to update this list if you add a new category to `common.py`):
* jit: Everything related to the JIT (including tensorexpr)
* quantization: Everything related to the quantization mode/passes/operators
* mobile: Everything related to the mobile build/ops/features
* onnx: Everything related to ONNX
* caffe2: Everything that happens in the caffe2 folder. No need to add any topics here as these are ignored (they don't make it into the final release notes)
* distributed: Everything related to distributed training and RPC
* visualization: Everything related to TensorBoard and visualization in general
* releng: Everything related to release engineering (CircleCI, Docker images, etc.)
* amd: Everything related to ROCm and AMD hardware
* cuda: Everything related to the CUDA backend
* benchmark: Everything related to the opbench folder and the utils.benchmark submodule
* package: Everything related to torch.package
* performance as a product: All changes that improve performance
* profiler: Everything related to the profiler
* composability: Everything related to the dispatcher and ATen native binding
* fx: Everything related to torch.fx
* code_coverage: Everything related to the code coverage tool
* vulkan: Everything related to Vulkan support (mobile GPU backend)
* skip: Everything that is not end-user or dev facing, like code refactoring or internal implementation changes
* frontend: To ease your future work, we split things here (may be merged in the final document)
  * python_api
  * cpp_api
  * complex
  * vmap
  * autograd
  * build
  * memory_format
  * foreach
  * dataloader
  * nestedtensor
  * sparse
  * mps
The topics are as follows:

* bc_breaking: All commits marked as BC-breaking (the script should highlight them). If any other commit looks like it could be BC-breaking, add it here as well!
* deprecation: All commits introducing deprecations. These should be clear from the commit message.
* new_features: All commits introducing a new feature (new functions, new submodule, new supported platform, etc.)
* improvements: All commits providing improvements to existing features (new backend for a function, new argument, better numerical stability)
* bug fixes: All commits that fix bugs and behaviors that do not match the documentation
* performance: All commits that are here mainly for performance (we separate this from improvements above to make it easier for users to look for it)
* documentation: All commits that add/update documentation
* devs: All commits that are not end-user facing but still impact people who compile from source, develop PyTorch, extend PyTorch, write C++ extensions, etc.
* unknown
### Part 3: export categories to markdown

`python commitlist.py --export_markdown`

The above exports `results/commitlist.csv` to markdown by listing every commit under its respective category.
It creates one file per category in the `results/export/` folder.

This part is a little tedious, but it seems to work. You may want to explore using pandoc to convert the markdown to Google Doc format.
1. Make sure you are using the light theme of VSCode.
2. Open a preview of the markdown file and copy the preview.
3. Paste it into the correct Google Doc, making sure to paste WITH formatting.
4. You can now send these Google Docs to the relevant submodule owners for review.
5. Install the Google Docs extension [docs to markdown](https://github.com/evbacher/gd2md-html).
6. Compile these markdown files back down into a single markdown file (see the sketch after this list).

`TODO`: This is by far the most manual process and is ripe for automation. If the next person up would like to investigate the Google Docs APIs, there is some room for improvement here.
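For step 6, a minimal sketch of stitching the per-category exports back together (this assumes the `result_<category>.md` naming used in the cherry-pick example later in this README; adjust the glob to match what `--export_markdown` actually writes for you):

```python
# Hypothetical helper: concatenate the per-category markdown exports into one file.
from pathlib import Path

export_dir = Path("results/export")
parts = sorted(export_dir.glob("result_*.md"))
combined = "\n\n".join(p.read_text() for p in parts)
(export_dir / "combined.md").write_text(combined)
print(f"Wrote {len(parts)} sections to {export_dir / 'combined.md'}")
```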
### Part 4: Cherry Picks

You will likely have started this process prior to the branch cut being finalized. This means cherry picks.
This was my process for keeping track. I use a notes app to log my progress as I periodically incorporate the new cherry picks.
I will have initially run something like:

```Bash
python commitlist.py --create_new tags/v1.13.1 <commit-hash>
```

I keep track of that commit hash. Once there are some cherry picks that I would like to incorporate, I rebase the release branch onto upstream
and run:

```Bash
python commitlist.py --update_to <latest-cherry-pick-hash>
```
I then run:

```Python
import pandas as pd

commit_list_df = pd.read_csv("results/commitlist.csv")
last_known_good_hash = "<the most recent hash>"

previous_index = commit_list_df[commit_list_df.commit_hash == last_known_good_hash].index.values[0]
cherry_pick_df = commit_list_df.iloc[previous_index + 1:]
path = "<your_path>/cherry_picks.csv"
cherry_pick_df.to_csv(path, index=False)


from commitlist import CommitList, to_markdown

cherry_pick_commit_list = CommitList.from_existing(path)

import os

categories = list(cherry_pick_commit_list.stat().keys())
for category in categories:
    print(f"Exporting {category}...")
    lines = to_markdown(cherry_pick_commit_list, category)
    filename = f'/tmp/cherry_pick/results/result_{category}.md'
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, 'w') as f:
        f.writelines(lines)
```
This creates new markdown files from only the cherry-picked commits. I manually copied and pasted these into the submodule Google Docs and added comments so that
the submodule owners would see the new commits.


### Part 5: Pulling all the submodules into one

I pretty much followed the runbook here. One thing I did was use the [markdown-all-in-one](https://marketplace.visualstudio.com/items?itemName=yzhang.markdown-all-in-one)
extension to create a table of contents, which was really helpful for jumping to sections and copying and pasting the appropriate commits.

You will then create a release at [PyTorch Releases](https://github.com/pytorch/pytorch/releases); if you save it as a draft you can see how it will be rendered.
#### Tidbits

You will probably have a release note that doesn't fit into GitHub's character limit. I used the following regex:
`\[#(\d+)\]\(https://github.com/pytorch/pytorch/pull/\d+\)` to replace the full links with `(#<pull-request-number>)`, for example with the sketch below.
This gets formatted correctly in the GitHub UI and can be checked when creating a draft release.
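A minimal sketch of that replacement (the input/output path is hypothetical; run it over whatever markdown file holds the final notes):

```python
# Shorten full PR links like [#12345](https://github.com/pytorch/pytorch/pull/12345) to (#12345).
import re

PATTERN = r"\[#(\d+)\]\(https://github.com/pytorch/pytorch/pull/\d+\)"

with open("release_notes.md") as f:          # hypothetical file name
    text = f.read()
text = re.sub(PATTERN, r"(#\1)", text)
with open("release_notes.md", "w") as f:
    f.write(text)
```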
The following markdown code is helpful for creating side-by-side tables of BC-breaking / deprecated code:

``` Markdown
<table>
<tr>
<th>PRIOR RELEASE NUM</th>
<th>NEW RELEASE NUM</th>
</tr>
<tr>
<td>

```Python
# Code Snippet 1
```

</td>
<td>

```Python
# Code Snippet 2
```

</td>
</tr>
</table>
```
|
@ -1,14 +1,30 @@
|
||||
import argparse
|
||||
import os
|
||||
import textwrap
|
||||
from common import categories, topics, get_commit_data_cache
|
||||
from common import topics, get_commit_data_cache
|
||||
from commitlist import CommitList
|
||||
|
||||
# Imports for working with the classifier
|
||||
from classifier import CommitClassifier, CategoryConfig, XLMR_BASE, get_author_map, get_file_map, CommitClassifierInputs
|
||||
import common
|
||||
import torch
|
||||
from pathlib import Path
|
||||
|
||||
class Categorizer:
|
||||
def __init__(self, path, category='Uncategorized'):
|
||||
def __init__(self, path, category='Uncategorized', use_classifier:bool = False):
|
||||
self.cache = get_commit_data_cache()
|
||||
self.commits = CommitList.from_existing(path)
|
||||
|
||||
if use_classifier:
|
||||
print("Using a classifier to aid with categorization.")
|
||||
device = "cuda" if torch.cuda.is_available() else "cpu"
|
||||
classifier_config = CategoryConfig(common.categories)
|
||||
author_map = get_author_map(Path("results/classifier"), regen_data=False, assert_stored=True)
|
||||
file_map = get_file_map(Path("results/classifier"), regen_data=False, assert_stored=True)
|
||||
self.classifier = CommitClassifier(XLMR_BASE, author_map, file_map, classifier_config).to(device)
|
||||
self.classifier.load_state_dict(torch.load(Path("results/classifier/commit_classifier.pt")))
|
||||
self.classifier.eval()
|
||||
else:
|
||||
self.classifier = None
|
||||
# Special categories: 'Uncategorized'
|
||||
# All other categories must be real
|
||||
self.category = category
|
||||
@ -64,6 +80,15 @@ class Categorizer:
|
||||
potential_reverts = ""
|
||||
|
||||
features = self.features(commit)
|
||||
if self.classifier is not None:
|
||||
# Some commits don't have authors:
|
||||
author = features.author if features.author else "Unknown"
|
||||
files = ' '.join(features.files_changed)
|
||||
classifier_input = CommitClassifierInputs(title=[features.title], files=[files], author=[author])
|
||||
classifier_category = self.classifier.get_most_likely_category_name(classifier_input)[0]
|
||||
|
||||
else:
|
||||
classifier_category = commit.category
|
||||
|
||||
breaking_alarm = ""
|
||||
if 'module: bc-breaking' in features.labels:
|
||||
@ -88,17 +113,19 @@ Labels: {features.labels}
|
||||
|
||||
Current category: {commit.category}
|
||||
|
||||
Select from: {', '.join(categories)}
|
||||
Select from: {', '.join(common.categories)}
|
||||
|
||||
''')
|
||||
print(view)
|
||||
cat_choice = None
|
||||
while cat_choice is None:
|
||||
value = input('category> ').strip()
|
||||
print("Enter category: ")
|
||||
value = input(f'{classifier_category} ').strip()
|
||||
if len(value) == 0:
|
||||
cat_choice = commit.category
|
||||
# The user just pressed enter and likes the default value
|
||||
cat_choice = classifier_category
|
||||
continue
|
||||
choices = [cat for cat in categories
|
||||
choices = [cat for cat in common.categories
|
||||
if cat.startswith(value)]
|
||||
if len(choices) != 1:
|
||||
print(f'Possible matches: {choices}, try again')
|
||||
@ -124,7 +151,7 @@ Select from: {', '.join(categories)}
|
||||
return None
|
||||
|
||||
def update_commit(self, commit, category, topic):
|
||||
assert category in categories
|
||||
assert category in common.categories
|
||||
assert topic in topics
|
||||
commit.category = category
|
||||
commit.topic = topic
|
||||
@ -136,9 +163,10 @@ def main():
|
||||
help='Which category to filter by. "Uncategorized", None, or a category name')
|
||||
parser.add_argument('--file', help='The location of the commits CSV',
|
||||
default='results/commitlist.csv')
|
||||
parser.add_argument('--use_classifier', action='store_true', help="Whether or not to use a classifier to aid in categorization.")
|
||||
|
||||
args = parser.parse_args()
|
||||
categorizer = Categorizer(args.file, args.category)
|
||||
categorizer = Categorizer(args.file, args.category, args.use_classifier)
|
||||
categorizer.categorize()
|
||||
|
||||
|
||||
|
357 scripts/release_notes/classifier.py (new file)
@ -0,0 +1,357 @@
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
import torch
|
||||
import torchtext
|
||||
from torchtext.functional import to_tensor
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
from typing import List, Dict
|
||||
import pandas as pd
|
||||
from dataclasses import dataclass
|
||||
import math
|
||||
import pickle
|
||||
import random
|
||||
from tqdm import tqdm
|
||||
from itertools import chain
|
||||
|
||||
import common
|
||||
|
||||
|
||||
XLMR_BASE = torchtext.models.XLMR_BASE_ENCODER
|
||||
# This should not be here but it works for now
|
||||
device = "cuda" if torch.cuda.is_available() else "cpu"
|
||||
|
||||
HAS_IMBLEARN = False
|
||||
try:
|
||||
import imblearn
|
||||
HAS_IMBLEARN = True
|
||||
except ImportError:
|
||||
HAS_IMBLEARN = False
|
||||
|
||||
# 94% of all files are captured at len 5, good hyperparameter to play around with.
|
||||
MAX_LEN_FILE = 6
|
||||
|
||||
UNKNOWN_TOKEN = "<Unknown>"
|
||||
|
||||
# Utilities for working with a truncated file graph
|
||||
|
||||
|
||||
def truncate_file(file: Path, max_len: int = 5):
|
||||
return ('/').join(file.parts[:max_len])
|
||||
|
||||
|
||||
def build_file_set(all_files: List[Path], max_len: int):
|
||||
truncated_files = [truncate_file(file, max_len) for file in all_files]
|
||||
return set(truncated_files)
|
||||
@dataclass
|
||||
class CommitClassifierInputs:
|
||||
title: List[str]
|
||||
files: List[str]
|
||||
author: List[str]
|
||||
|
||||
|
||||
@dataclass
|
||||
class CategoryConfig:
|
||||
categories: List[str]
|
||||
input_dim: int = 768
|
||||
inner_dim: int = 128
|
||||
dropout: float = 0.1
|
||||
activation = nn.ReLU
|
||||
embedding_dim: int = 8
|
||||
file_embedding_dim: int = 32
|
||||
|
||||
|
||||
class CommitClassifier(nn.Module):
|
||||
def __init__(self, encoder_base: torchtext.models.XLMR_BASE_ENCODER, author_map: Dict[str, int], file_map: Dict[str, int], config: CategoryConfig):
|
||||
super().__init__()
|
||||
self.encoder = encoder_base.get_model().requires_grad_(False)
|
||||
self.transform = encoder_base.transform()
|
||||
self.author_map = author_map
|
||||
self.file_map = file_map
|
||||
self.categories = config.categories
|
||||
self.num_authors = len(author_map)
|
||||
self.num_files = len(file_map)
|
||||
self.embedding_table = nn.Embedding(self.num_authors, config.embedding_dim)
|
||||
self.file_embedding_bag = nn.EmbeddingBag(self.num_files, config.file_embedding_dim, mode='sum')
|
||||
self.dense_title = nn.Linear(config.input_dim, config.inner_dim)
|
||||
self.dense_files = nn.Linear(config.file_embedding_dim, config.inner_dim)
|
||||
self.dense_author = nn.Linear(config.embedding_dim, config.inner_dim)
|
||||
self.dropout = nn.Dropout(config.dropout)
|
||||
self.out_proj_title = nn.Linear(config.inner_dim, len(self.categories))
|
||||
self.out_proj_files = nn.Linear(config.inner_dim, len(self.categories))
|
||||
self.out_proj_author = nn.Linear(config.inner_dim, len(self.categories))
|
||||
self.activation_fn = config.activation()
|
||||
|
||||
def forward(self, input_batch: CommitClassifierInputs):
|
||||
# Encode input title
|
||||
title: List[str] = input_batch.title
|
||||
model_input = to_tensor(self.transform(title), padding_value=1).to(device)
|
||||
title_features = self.encoder(model_input)
|
||||
title_embed = title_features[:, 0, :]
|
||||
title_embed = self.dropout(title_embed)
|
||||
title_embed = self.dense_title(title_embed)
|
||||
title_embed = self.activation_fn(title_embed)
|
||||
title_embed = self.dropout(title_embed)
|
||||
title_embed = self.out_proj_title(title_embed)
|
||||
|
||||
files: list[str] = input_batch.files
|
||||
batch_file_indexes = []
|
||||
for file in files:
|
||||
paths = [truncate_file(Path(file_part), MAX_LEN_FILE) for file_part in file.split(" ")]
|
||||
batch_file_indexes.append([self.file_map.get(file, self.file_map[UNKNOWN_TOKEN]) for file in paths])
|
||||
|
||||
flat_indexes = torch.tensor(list(chain.from_iterable(batch_file_indexes)), dtype=torch.long, device=device)
|
||||
offsets = [0]
|
||||
offsets.extend(len(files) for files in batch_file_indexes[:-1])
|
||||
offsets = torch.tensor(offsets, dtype=torch.long, device=device)
|
||||
offsets = offsets.cumsum(dim=0)
|
||||
|
||||
files_embed = self.file_embedding_bag(flat_indexes, offsets)
|
||||
files_embed = self.dense_files(files_embed)
|
||||
files_embed = self.activation_fn(files_embed)
|
||||
files_embed = self.dropout(files_embed)
|
||||
files_embed = self.out_proj_files(files_embed)
|
||||
|
||||
# Add author embedding
|
||||
authors: List[str] = input_batch.author
|
||||
author_ids = [self.author_map.get(author, self.author_map[UNKNOWN_TOKEN]) for author in authors]
|
||||
author_ids = torch.tensor(author_ids).to(device)
|
||||
author_embed = self.embedding_table(author_ids)
|
||||
author_embed = self.dense_author(author_embed)
|
||||
author_embed = self.activation_fn(author_embed)
|
||||
author_embed = self.dropout(author_embed)
|
||||
author_embed = self.out_proj_author(author_embed)
|
||||
|
||||
return title_embed + files_embed + author_embed
|
||||
|
||||
def convert_index_to_category_name(self, most_likely_index):
|
||||
if isinstance(most_likely_index, int):
|
||||
return self.categories[most_likely_index]
|
||||
elif isinstance(most_likely_index, torch.Tensor):
|
||||
return [self.categories[i] for i in most_likely_index]
|
||||
|
||||
def get_most_likely_category_name(self, inpt):
|
||||
# Input will be a dict with title and author keys
|
||||
logits = self.forward(inpt)
|
||||
most_likely_index = torch.argmax(logits, dim=1)
|
||||
return self.convert_index_to_category_name(most_likely_index)
|
||||
|
||||
|
||||
def get_train_val_data(data_folder: Path, regen_data: bool, train_percentage=0.95):
|
||||
if not regen_data and Path(data_folder / "train_df.csv").exists() and Path(data_folder / "val_df.csv").exists():
|
||||
train_data = pd.read_csv(data_folder / "train_df.csv")
|
||||
val_data = pd.read_csv(data_folder / "val_df.csv")
|
||||
return train_data, val_data
|
||||
else:
|
||||
print("Train, Val, Test Split not found generating from scratch.")
|
||||
commit_list_df = pd.read_csv(data_folder / "commitlist.csv")
|
||||
test_df = commit_list_df[commit_list_df['category'] == 'Uncategorized']
|
||||
all_train_df = commit_list_df[commit_list_df['category'] != 'Uncategorized']
|
||||
# We are going to drop skip from training set since it is so imbalanced
|
||||
print("We are removing skip categories, YOU MIGHT WANT TO CHANGE THIS, BUT THIS IS A MORE HELPFUL CLASSIFIER FOR LABELING.")
|
||||
all_train_df = all_train_df[all_train_df['category'] != 'skip']
|
||||
all_train_df = all_train_df.sample(frac=1).reset_index(drop=True)
|
||||
split_index = math.floor(train_percentage * len(all_train_df))
|
||||
train_df = all_train_df[:split_index]
|
||||
val_df = all_train_df[split_index:]
|
||||
print("Train data size: ", len(train_df))
|
||||
print("Val data size: ", len(val_df))
|
||||
|
||||
test_df.to_csv(data_folder / "test_df.csv", index=False)
|
||||
train_df.to_csv(data_folder / "train_df.csv", index=False)
|
||||
val_df.to_csv(data_folder / "val_df.csv", index=False)
|
||||
return train_df, val_df
|
||||
|
||||
|
||||
def get_author_map(data_folder: Path, regen_data, assert_stored=False):
|
||||
if not regen_data and Path(data_folder / "author_map.pkl").exists():
|
||||
with open(data_folder / "author_map.pkl", 'rb') as f:
|
||||
return pickle.load(f)
|
||||
else:
|
||||
if assert_stored:
|
||||
raise FileNotFoundError(
|
||||
"Author map not found, you are loading for inference you need to have an author map!")
|
||||
print("Regenerating Author Map")
|
||||
all_data = pd.read_csv(data_folder / "commitlist.csv")
|
||||
authors = all_data.author.unique().tolist()
|
||||
authors.append(UNKNOWN_TOKEN)
|
||||
author_map = {author: i for i, author in enumerate(authors)}
|
||||
with open(data_folder / "author_map.pkl", 'wb') as f:
|
||||
pickle.dump(author_map, f)
|
||||
return author_map
|
||||
|
||||
|
||||
|
||||
def get_file_map(data_folder: Path, regen_data, assert_stored=False):
|
||||
if not regen_data and Path(data_folder / "file_map.pkl").exists():
|
||||
with open(data_folder / "file_map.pkl", 'rb') as f:
|
||||
return pickle.load(f)
|
||||
else:
|
||||
if assert_stored:
|
||||
raise FileNotFoundError("File map not found, you are loading for inference you need to have a file map!")
|
||||
print("Regenerating File Map")
|
||||
all_data = pd.read_csv(data_folder / "commitlist.csv")
|
||||
# Lets explore files
|
||||
files = all_data.files_changed.to_list()
|
||||
|
||||
all_files = []
|
||||
for file in files:
|
||||
paths = [Path(file_part) for file_part in file.split(" ")]
|
||||
all_files.extend(paths)
|
||||
all_files.append(Path(UNKNOWN_TOKEN))
|
||||
file_set = build_file_set(all_files, MAX_LEN_FILE)
|
||||
file_map = {file: i for i, file in enumerate(file_set)}
|
||||
with open(data_folder / "file_map.pkl", 'wb') as f:
|
||||
pickle.dump(file_map, f)
|
||||
return file_map
|
||||
|
||||
# Generate a dataset for training
|
||||
|
||||
|
||||
def get_title_files_author_categories_zip_list(dataframe: pd.DataFrame):
|
||||
title = dataframe.title.to_list()
|
||||
files_str = dataframe.files_changed.to_list()
|
||||
author = dataframe.author.fillna(UNKNOWN_TOKEN).to_list()
|
||||
category = dataframe.category.to_list()
|
||||
return list(zip(title, files_str, author, category))
|
||||
|
||||
|
||||
def generate_batch(batch):
|
||||
title, files, author, category = zip(*batch)
|
||||
title = list(title)
|
||||
files = list(files)
|
||||
author = list(author)
|
||||
category = list(category)
|
||||
targets = torch.tensor([common.categories.index(cat) for cat in category]).to(device)
|
||||
return CommitClassifierInputs(title, files, author), targets
|
||||
|
||||
|
||||
def train_step(batch, model, optimizer, loss):
|
||||
inpt, targets = batch
|
||||
optimizer.zero_grad()
|
||||
output = model(inpt)
|
||||
l = loss(output, targets)
|
||||
l.backward()
|
||||
optimizer.step()
|
||||
return l
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def eval_step(batch, model, loss):
|
||||
inpt, targets = batch
|
||||
output = model(inpt)
|
||||
l = loss(output, targets)
|
||||
return l
|
||||
|
||||
|
||||
def balance_dataset(dataset: List):
|
||||
if not HAS_IMBLEARN:
|
||||
return dataset
|
||||
title, files, author, category = zip(*dataset)
|
||||
category = [common.categories.index(cat) for cat in category]
|
||||
inpt_data = list(zip(title, files, author))
|
||||
from imblearn.over_sampling import RandomOverSampler
|
||||
# from imblearn.under_sampling import RandomUnderSampler
|
||||
rus = RandomOverSampler(random_state=42)
|
||||
X, y = rus.fit_resample(inpt_data, category)
|
||||
merged = list(zip(X, y))
|
||||
merged = random.sample(merged, k=2 * len(dataset))
|
||||
X, y = zip(*merged)
|
||||
rebuilt_dataset = []
|
||||
for i in range(len(X)):
|
||||
rebuilt_dataset.append((*X[i], common.categories[y[i]]))
|
||||
return rebuilt_dataset
|
||||
|
||||
|
||||
def gen_class_weights(dataset: List):
|
||||
from collections import Counter
|
||||
epsilon = 1e-1
|
||||
title, files, author, category = zip(*dataset)
|
||||
category = [common.categories.index(cat) for cat in category]
|
||||
counter = Counter(category)
|
||||
percentile_33 = len(category) // 3
|
||||
most_common = counter.most_common(percentile_33)
|
||||
least_common = counter.most_common()[-percentile_33:]
|
||||
smoothed_top = sum(i[1] + epsilon for i in most_common) / len(most_common)
|
||||
smoothed_bottom = sum(i[1] + epsilon for i in least_common) / len(least_common) // 3
|
||||
class_weights = torch.tensor([1.0 / (min(max(counter[i], smoothed_bottom), smoothed_top) + epsilon)
|
||||
for i in range(len(common.categories))], device=device)
|
||||
return class_weights
|
||||
|
||||
|
||||
def train(save_path: Path, data_folder: Path, regen_data: bool, resample: bool):
|
||||
train_data, val_data = get_train_val_data(data_folder, regen_data)
|
||||
train_zip_list = get_title_files_author_categories_zip_list(train_data)
|
||||
val_zip_list = get_title_files_author_categories_zip_list(val_data)
|
||||
|
||||
classifier_config = CategoryConfig(common.categories)
|
||||
author_map = get_author_map(data_folder, regen_data)
|
||||
file_map = get_file_map(data_folder, regen_data)
|
||||
commit_classifier = CommitClassifier(XLMR_BASE, author_map, file_map, classifier_config).to(device)
|
||||
|
||||
# Lets train this bag of bits
|
||||
class_weights = gen_class_weights(train_zip_list)
|
||||
loss = torch.nn.CrossEntropyLoss(weight=class_weights)
|
||||
optimizer = torch.optim.Adam(commit_classifier.parameters(), lr=3e-3)
|
||||
|
||||
num_epochs = 25
|
||||
batch_size = 256
|
||||
|
||||
if resample:
|
||||
# Lets not use this
|
||||
train_zip_list = balance_dataset(train_zip_list)
|
||||
data_size = len(train_zip_list)
|
||||
|
||||
print(f"Training on {data_size} examples.")
|
||||
# We can fit all of val into one batch
|
||||
val_batch = generate_batch(val_zip_list)
|
||||
|
||||
for i in tqdm(range(num_epochs), desc="Epochs"):
|
||||
start = 0
|
||||
random.shuffle(train_zip_list)
|
||||
while start < data_size:
|
||||
end = start + batch_size
|
||||
# make the last batch bigger if needed
|
||||
if end > data_size:
|
||||
end = data_size
|
||||
train_batch = train_zip_list[start:end]
|
||||
train_batch = generate_batch(train_batch)
|
||||
l = train_step(train_batch, commit_classifier, optimizer, loss)
|
||||
start = end
|
||||
|
||||
val_l = eval_step(val_batch, commit_classifier, loss)
|
||||
tqdm.write(f"Finished epoch {i} with a train loss of: {l.item()} and a val_loss of: {val_l.item()}")
|
||||
|
||||
with torch.no_grad():
|
||||
commit_classifier.eval()
|
||||
val_inpts, val_targets = val_batch
|
||||
val_output = commit_classifier(val_inpts)
|
||||
val_preds = torch.argmax(val_output, dim=1)
|
||||
val_acc = torch.sum(val_preds == val_targets).item() / len(val_preds)
|
||||
print(f"Final Validation accuracy is {val_acc}")
|
||||
|
||||
print(f"Jobs done! Saving to {save_path}")
|
||||
torch.save(commit_classifier.state_dict(), save_path)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='Tool to create a classifier for helping to categorize commits')
|
||||
|
||||
parser.add_argument('--train', action='store_true', help='Train a new classifier')
|
||||
parser.add_argument("--commit_data_folder", default="results/classifier/")
|
||||
parser.add_argument('--save_path', default='results/classifier/commit_classifier.pt')
|
||||
parser.add_argument('--regen_data', action='store_true',
|
||||
help="Regenerate the training data, helps if labeld more examples and want to re-train.")
|
||||
parser.add_argument('--resample', action='store_true',
|
||||
help="Resample the training data to be balanced. (Only works if imblearn is installed.)")
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.train:
|
||||
train(Path(args.save_path), Path(args.commit_data_folder), args.regen_data, args.resample)
|
||||
return
|
||||
|
||||
print("Currently this file only trains a new classifier please pass in --train to train a new classifier")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
@ -1,10 +1,11 @@
|
||||
import argparse
|
||||
from common import run, topics, get_features
|
||||
from common import run, topics, get_features, frontend_categories
|
||||
from collections import defaultdict
|
||||
import os
|
||||
from pathlib import Path
|
||||
import csv
|
||||
import pprint
|
||||
import common
|
||||
from common import get_commit_data_cache, features_to_dict
|
||||
import re
|
||||
import dataclasses
|
||||
@ -30,6 +31,7 @@ class Commit:
|
||||
category: str
|
||||
topic: str
|
||||
title: str
|
||||
files_changed: str
|
||||
pr_link: str
|
||||
author: str
|
||||
|
||||
@ -86,7 +88,7 @@ class CommitList:
|
||||
writer.writerow(commit_fields)
|
||||
for commit in commit_list:
|
||||
writer.writerow(dataclasses.astuple(commit))
|
||||
|
||||
@staticmethod
|
||||
def keywordInFile(file, keywords):
|
||||
for key in keywords:
|
||||
if key in file:
|
||||
@ -103,8 +105,56 @@ class CommitList:
|
||||
pr_link = f"https://github.com/pytorch/pytorch/pull/{features['pr_number']}"
|
||||
else:
|
||||
pr_link = None
|
||||
files_changed_str = ' '.join(features['files_changed'])
|
||||
return Commit(commit_hash, category, topic, features["title"], files_changed_str, pr_link, features["author"], a1, a2, a3)
|
||||
|
||||
return Commit(commit_hash, category, topic, features["title"], pr_link, features["author"], a1, a2, a3)
|
||||
@staticmethod
|
||||
def category_remapper(category: str) -> str:
|
||||
if category in frontend_categories:
|
||||
category = category + '_frontend'
|
||||
return category
|
||||
if category == 'Meta API':
|
||||
category = 'composability'
|
||||
return category
|
||||
if category in common.quantization.categories:
|
||||
category = common.quantization.name
|
||||
return category
|
||||
if category in common.distributed.categories:
|
||||
category = common.distributed.name
|
||||
return category
|
||||
return category
|
||||
|
||||
@staticmethod
|
||||
def bracket_category_matcher(title: str):
|
||||
"""Categorize a commit based on the presence of a bracketed category in the title.
|
||||
|
||||
Args:
|
||||
title (str): title to search
|
||||
|
||||
Returns:
|
||||
optional[str]
|
||||
"""
|
||||
pairs = [
|
||||
('[dynamo]', 'dynamo'),
|
||||
('[torchdynamo]', 'dynamo'),
|
||||
('[torchinductor]', 'inductor'),
|
||||
('[inductor]', 'inductor'),
|
||||
('[codemod', 'skip'),
|
||||
('[profiler]', 'profiler'),
|
||||
('[functorch]', 'functorch'),
|
||||
('[autograd]', 'autograd_frontend'),
|
||||
('[quantization]', 'quantization'),
|
||||
('[nn]', 'nn_frontend'),
|
||||
('[complex]', 'complex_frontend'),
|
||||
('[mps]', 'mps'),
|
||||
('[optimizer]', 'optimizer_frontend'),
|
||||
('[xla]', 'xla'),
|
||||
]
|
||||
title_lower = title.lower()
|
||||
for bracket, category in pairs:
|
||||
if bracket in title_lower:
|
||||
return category
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def categorize(features):
|
||||
@ -113,6 +163,10 @@ class CommitList:
|
||||
category = 'Uncategorized'
|
||||
topic = 'Untopiced'
|
||||
|
||||
# Revert commits are merged directly to master with no associated PR number
|
||||
if features['pr_number'] is None:
|
||||
if title.startswith("Revert"):
|
||||
return 'skip', topic
|
||||
|
||||
# We ask contributors to label their PR's appropriately
|
||||
# when they're first landed.
|
||||
@ -121,6 +175,7 @@ class CommitList:
|
||||
for label in labels:
|
||||
if label.startswith('release notes: '):
|
||||
category = label.split('release notes: ', 1)[1]
|
||||
category = CommitList.category_remapper(category)
|
||||
already_categorized = True
|
||||
if label.startswith('topic: '):
|
||||
topic = label.split('topic: ', 1)[1]
|
||||
@ -131,8 +186,6 @@ class CommitList:
|
||||
# update this to check if each file starts with caffe2
|
||||
if 'caffe2' in title:
|
||||
return 'caffe2', topic
|
||||
if '[codemod]' in title.lower():
|
||||
return 'skip', topic
|
||||
if 'Reverted' in labels:
|
||||
return 'skip', topic
|
||||
if 'bc_breaking' in labels:
|
||||
@ -140,6 +193,10 @@ class CommitList:
|
||||
if 'module: deprecation' in labels:
|
||||
topic = 'deprecation'
|
||||
|
||||
found_bracket_category = CommitList.bracket_category_matcher(title)
|
||||
if found_bracket_category:
|
||||
return found_bracket_category, topic
|
||||
|
||||
files_changed = features['files_changed']
|
||||
for file in files_changed:
|
||||
file_lowercase = file.lower()
|
||||
@ -169,11 +226,11 @@ class CommitList:
|
||||
category = 'fx'
|
||||
break
|
||||
if CommitList.keywordInFile(file, ['torch/ao', 'test/ao']):
|
||||
category = 'ao'
|
||||
category = common.quantization.name
|
||||
break
|
||||
# torch/quantization, test/quantization, aten/src/ATen/native/quantized, torch/nn/{quantized, quantizable}
|
||||
if CommitList.keywordInFile(file, ['torch/quantization', 'test/quantization', 'aten/src/ATen/native/quantized', 'torch/nn/quantiz']):
|
||||
category = 'quantization'
|
||||
category = common.quantization.name
|
||||
break
|
||||
if CommitList.keywordInFile(file, ['torch/package', 'test/package']):
|
||||
category = 'package'
|
||||
@ -196,6 +253,15 @@ class CommitList:
|
||||
if CommitList.keywordInFile(file, ['torch/csrc/jit', 'torch/jit']):
|
||||
category = 'jit'
|
||||
break
|
||||
if CommitList.keywordInFile(file, ['torch/_meta_registrations.py', 'torch/_decomp', 'torch/_prims', 'torch/_refs']):
|
||||
category = 'composability'
|
||||
break
|
||||
if CommitList.keywordInFile(file, ['torch/_dynamo']):
|
||||
category = 'dynamo'
|
||||
break
|
||||
if CommitList.keywordInFile(file, ['torch/_inductor']):
|
||||
category = 'inductor'
|
||||
break
|
||||
else:
|
||||
# Below are some extra quick checks that aren't necessarily file-path related,
|
||||
# but I found that to catch a decent number of extra commits.
|
||||
@ -210,6 +276,9 @@ class CommitList:
|
||||
# individual torch_docs changes are usually for python ops
|
||||
category = 'python_frontend'
|
||||
|
||||
# If we couldn't find a category but the topic is not user facing we can skip these:
|
||||
if category == "Uncategorized" and topic == "not user facing":
|
||||
category = "skip"
|
||||
|
||||
return category, topic
|
||||
|
||||
@ -260,13 +329,13 @@ def update_existing(path, new_version):
|
||||
|
||||
def rerun_with_new_filters(path):
|
||||
current_commits = CommitList.from_existing(path)
|
||||
for i in range(len(current_commits.commits)):
|
||||
c = current_commits.commits[i]
|
||||
if 'Uncategorized' in str(c):
|
||||
feature_item = get_commit_data_cache().get(c.commit_hash)
|
||||
for i, commit in enumerate(current_commits.commits):
|
||||
current_category = commit.category
|
||||
if current_category == 'Uncategorized' or current_category not in common.categories:
|
||||
feature_item = get_commit_data_cache().get(commit.commit_hash)
|
||||
features = features_to_dict(feature_item)
|
||||
category, topic = CommitList.categorize(features)
|
||||
current_commits[i] = dataclasses.replace(c, category=category, topic=topic)
|
||||
current_commits.commits[i] = dataclasses.replace(commit, category=category, topic=topic)
|
||||
current_commits.write_result()
|
||||
|
||||
def get_hash_or_pr_url(commit: Commit):
|
||||
@ -318,14 +387,14 @@ def get_markdown_header(category):
|
||||
|
||||
The main goal of this process is to rephrase all the commit messages below to make them clear and easy to read by the end user. You should follow the following instructions to do so:
|
||||
|
||||
* **Please cleanup, and format commit titles to be readable by the general pytorch user.** [Detailed intructions here](https://fb.quip.com/OCRoAbEvrRD9#HdaACARZZvo)
|
||||
* **Please cleanup, and format commit titles to be readable by the general pytorch user.** [Detailed instructions here](https://docs.google.com/document/d/14OmgGBr1w6gl1VO47GGGdwrIaUNr92DFhQbY_NEk8mQ/edit)
|
||||
* Please sort commits into the following categories (you should not rename the categories!), I tried to pre-sort these to ease your work, feel free to move commits around if the current categorization is not good.
|
||||
* Please drop any commits that are not user-facing.
|
||||
* If anything is from another domain, leave it in the UNTOPICED section at the end and I'll come and take care of it.
|
||||
|
||||
The categories below are as follows:
|
||||
|
||||
* BC breaking: All commits that are BC-breaking. These are the most important commits. If any pre-sorted commit is actually BC-breaking, do move it to this section. Each commit should contain a paragraph explaining the rational behind the change as well as an example for how to update user code (guidelines here: https://quip.com/OCRoAbEvrRD9)
|
||||
* BC breaking: All commits that are BC-breaking. These are the most important commits. If any pre-sorted commit is actually BC-breaking, do move it to this section. Each commit should contain a paragraph explaining the rationale behind the change as well as an example for how to update user code [BC-Guidelines](https://docs.google.com/document/d/14OmgGBr1w6gl1VO47GGGdwrIaUNr92DFhQbY_NEk8mQ/edit#heading=h.a9htwgvvec1m).
|
||||
* Deprecations: All commits introducing deprecation. Each commit should include a small example explaining what should be done to update user code.
|
||||
* new_features: All commits introducing a new feature (new functions, new submodule, new supported platform etc)
|
||||
* improvements: All commits providing improvements to existing feature should be here (new backend for a function, new argument, better numerical stability)
|
||||
@ -357,6 +426,7 @@ def main():
|
||||
|
||||
if args.create_new:
|
||||
create_new(args.path, args.create_new[0], args.create_new[1])
|
||||
print("Finished creating new commit list. Results have been saved to results/commitlist.csv")
|
||||
return
|
||||
if args.update_to:
|
||||
update_existing(args.path, args.update_to)
|
||||
|
@ -6,10 +6,61 @@ import re
|
||||
import requests
|
||||
import os
|
||||
import json
|
||||
from dataclasses import dataclass
|
||||
|
||||
@dataclass
|
||||
class CategoryGroup:
|
||||
name: str
|
||||
categories: list
|
||||
|
||||
frontend_categories = [
|
||||
'meta',
|
||||
'nn',
|
||||
'linalg',
|
||||
'cpp',
|
||||
'python',
|
||||
'complex',
|
||||
'vmap',
|
||||
'autograd',
|
||||
'build',
|
||||
'memory_format',
|
||||
'foreach',
|
||||
'dataloader',
|
||||
'sparse',
|
||||
'nested tensor',
|
||||
'optimizer'
|
||||
]
|
||||
|
||||
pytorch_2_categories = [
|
||||
'dynamo',
|
||||
'inductor',
|
||||
]
|
||||
|
||||
# These will all get mapped to quantization
|
||||
quantization = CategoryGroup(
|
||||
name="quantization",
|
||||
categories=[
|
||||
'quantization',
|
||||
'AO frontend',
|
||||
'AO Pruning', ]
|
||||
)
|
||||
|
||||
# Distributed has a number of release note labels we want to map to one
|
||||
distributed = CategoryGroup(
|
||||
name="distributed",
|
||||
categories=[
|
||||
'distributed',
|
||||
'distributed (c10d)',
|
||||
'distributed (composable)',
|
||||
'distributed (ddp)',
|
||||
'distributed (fsdp)',
|
||||
'distributed (rpc)',
|
||||
'distributed (sharded)',
|
||||
]
|
||||
)
|
||||
|
||||
categories = [
|
||||
'Uncategorized',
|
||||
'distributed',
|
||||
'lazy',
|
||||
'hub',
|
||||
'mobile',
|
||||
@ -17,11 +68,12 @@ categories = [
|
||||
'visualization',
|
||||
'onnx',
|
||||
'caffe2',
|
||||
'quantization',
|
||||
'amd',
|
||||
'rocm',
|
||||
'cuda',
|
||||
'cpu',
|
||||
'cudnn',
|
||||
'xla',
|
||||
'benchmark',
|
||||
'profiler',
|
||||
'performance_as_product',
|
||||
@ -33,20 +85,15 @@ categories = [
|
||||
'vulkan',
|
||||
'skip',
|
||||
'composability',
|
||||
'meta_frontend',
|
||||
'nn_frontend',
|
||||
'linalg_frontend',
|
||||
'cpp_frontend',
|
||||
'python_frontend',
|
||||
'complex_frontend',
|
||||
'vmap_frontend',
|
||||
'autograd_frontend',
|
||||
'build_frontend',
|
||||
'memory_format_frontend',
|
||||
'foreach_frontend',
|
||||
'dataloader_frontend',
|
||||
'sparse_frontend'
|
||||
]
|
||||
# 2.0 release
|
||||
'mps',
|
||||
'intel',
|
||||
'functorch',
|
||||
'gnn',
|
||||
'distributions',
|
||||
'serialization',
|
||||
] + [f'{category}_frontend' for category in frontend_categories] + pytorch_2_categories + [quantization.name] + [distributed.name]
|
||||
|
||||
|
||||
topics = [
|
||||
'bc_breaking',
|
||||
@ -141,7 +188,16 @@ def get_ghstack_token():
|
||||
raise RuntimeError("Can't find a github oauth token")
|
||||
return matches[0]
|
||||
|
||||
token = get_ghstack_token()
|
||||
def get_token():
|
||||
env_token = os.environ.get("GITHUB_TOKEN")
|
||||
if env_token is not None:
|
||||
print("using GITHUB_TOKEN from environment variable")
|
||||
return env_token
|
||||
else:
|
||||
return get_ghstack_token()
|
||||
|
||||
token = get_token()
|
||||
|
||||
headers = {"Authorization": f"token {token}"}
|
||||
|
||||
def run_query(query):
|
||||
@ -149,7 +205,7 @@ def run_query(query):
|
||||
if request.status_code == 200:
|
||||
return request.json()
|
||||
else:
|
||||
raise Exception("Query failed to run by returning code of {}. {}".format(request.status_code, query))
|
||||
raise Exception("Query failed to run by returning code of {}. {}".format(request.status_code, request.json()))
|
||||
|
||||
|
||||
def github_data(pr_number):
|
||||
@ -179,7 +235,8 @@ def github_data(pr_number):
|
||||
}
|
||||
""" % pr_number
|
||||
query = run_query(query)
|
||||
|
||||
if query.get('errors'):
|
||||
raise Exception(query['errors'])
|
||||
edges = query['data']['repository']['pullRequest']['labels']['edges']
|
||||
labels = [edge['node']['name'] for edge in edges]
|
||||
author = query['data']['repository']['pullRequest']['author']['login']
|
||||
|
110 scripts/release_notes/explore.ipynb (new file)
@ -0,0 +1,110 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np\n",
|
||||
"from pprint import pprint\n",
|
||||
"from collections import Counter\n",
|
||||
"import common\n",
|
||||
"import math"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"commit_list_df = pd.read_csv(\"results/classifier/commitlist.csv\")\n",
|
||||
"mean_authors=commit_list_df.query(\"category == 'Uncategorized' & topic != 'not user facing'\").author.to_list()\n",
|
||||
"counts = Counter(mean_authors)\n",
|
||||
"commit_list_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"commit_list_df.category.describe()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# The number un categorized and no topic commits\n",
|
||||
"no_category = commit_list_df.query(\"category == 'Uncategorized' & topic != 'not user facing'\")\n",
|
||||
"print(len(no_category))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# check for cherry-picked commits\n",
|
||||
"example_sha = '55c76baf579cb6593f87d1a23e9a49afeb55f15a'\n",
|
||||
"commit_hashes = set(commit_list_df.commit_hash.to_list())\n",
|
||||
"\n",
|
||||
"example_sha[:11] in commit_hashes"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get the difference between known categories and categories from commits\n",
|
||||
"\n",
|
||||
"diff_categories = set(commit_list_df.category.to_list()) - set(common.categories)\n",
|
||||
"print(len(diff_categories))\n",
|
||||
"pprint(diff_categories)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Counts of categories\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "a867c59af434d7534e61ccb37014830daefd5fcd3816cab68d595dde5e446f52"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
@ -1 +1,2 @@
|
||||
PyGithub
|
||||
tqdm
|