Replace the db download script with a faster one and add GCP post-processing scripts

PiperOrigin-RevId: 697631760
2025-10-20 13:23:47 +08:00 · 2024-11-18 16:12:16 +00:00
parent 2ffe43f574
commit 01b87c245c
7 changed files with 263 additions and 165 deletions
--- a/docs/installation.md
+++ b/docs/installation.md
@ -186,7 +186,7 @@ git clone https://github.com/google-deepmind/alphafold3.git

 ## Obtaining Genetic Databases

-This step requires `curl` and `zstd` to be installed on your machine.
+This step requires `wget` and `zstd` to be installed on your machine.

 AlphaFold 3 needs multiple genetic (sequence) protein and RNA databases to run:

@ -200,20 +200,20 @@ AlphaFold 3 needs multiple genetic (sequence) protein and RNA databases to run:
 *   [RFam](https://rfam.org/)
 *   [RNACentral](https://rnacentral.org/)

-We provide a Python program `fetch_databases.py` that can be used to download
-and set up all of these databases. This process takes around 45 minutes when not
+We provide a bash script `fetch_databases.sh` that can be used to download and
+set up all of these databases. This process takes around 45 minutes when not
 installing on local SSD. We recommend running the following in a `screen` or
 `tmux` session as downloading and decompressing the databases takes some time.

 ```sh
 cd alphafold3  # Navigate to the directory with cloned AlphaFold 3 repository.
-python3 fetch_databases.py --download_destination=<DATABASES_DIR>
+./fetch_databases.sh <DB_DIR>
 ```

 This script downloads the databases from a mirror hosted on GCS, with all
 versions being the same as used in the AlphaFold 3 paper.

-:ledger: **Note: The download directory `<DATABASES_DIR>` should *not* be a
+:ledger: **Note: The download directory `<DB_DIR>` should *not* be a
 subdirectory in the AlphaFold 3 repository directory.** If it is, the Docker
 build will be slow as the large databases will be copied during the image
 creation.
@ -221,17 +221,17 @@ creation.
 :ledger: **Note: The total download size for the full databases is around 252 GB
 and the total size when unzipped is 630 GB. Please make sure you have sufficient
 hard drive space, bandwidth, and time to download. We recommend using an SSD for
-better genetic search performance, and faster runtime of `fetch_databases.py`.**
+better genetic search performance.**

 :ledger: **Note: If the download directory and datasets don't have full read and
 write permissions, it can cause errors with the MSA tools, with opaque
 (external) error messages. Please ensure the required permissions are applied,
-e.g. with the `sudo chmod 755 --recursive <DATABASES_DIR>` command.**
+e.g. with the `sudo chmod 755 --recursive <DB_DIR>` command.**

 Once the script has finished, you should have the following directory structure:

 ```sh
-pdb_2022_09_28_mmcif_files.tar  # ~200k PDB mmCIF files in this tar.
+mmcif_files/  # Directory containing ~200k PDB mmCIF files.
 bfd-first_non_consensus_sequences.fasta
 mgy_clusters_2022_05.fa
 nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta
@ -242,6 +242,18 @@ uniprot_all_2021_04.fa
 uniref90_2022_05.fa
 ```

+Optionally, after the script finishes, you may want copy databases to an SSD.
+You can use theses two scripts:
+
+*   `src/scripts/gcp_mount_ssd.sh <SSD_MOUNT_PATH>` Mounts and formats an
+    unmounted GCP SSD drive. It will skip the either step if the disk is either
+    already formatted or already mounted. The default `<SSD_MOUNT_PATH>` is
+    `/mnt/disks/ssd`.
+*   `src/scripts/copy_to_ssd.sh <DB_DIR> <SSD_DB_DIR>` this will copy as many
+    files that it can fit on to the SSD. The default `<DATABASE_DIR>` is
+    `$HOME/public_databases` and the default `<SSD_DB_DIR>` is
+    `/mnt/disks/ssd/public_databases`.
+
 ## Obtaining Model Parameters

 To request access to the AlphaFold 3 model parameters, please complete
@ -267,7 +279,7 @@ docker run -it \
    --volume $HOME/af_input:/root/af_input \
    --volume $HOME/af_output:/root/af_output \
    --volume <MODEL_PARAMETERS_DIR>:/root/models \
-    --volume <DATABASES_DIR>:/root/public_databases \
+    --volume <DB_DIR>:/root/public_databases \
    --gpus all \
    alphafold3 \
    python run_alphafold.py \
@ -280,6 +292,27 @@ docker run -it \
 persistent disk, which is slow.** If you want better genetic and template search
 performance, make sure all databases are placed on a local SSD.

+If you have databases on SSD in `<SSD_DB_DIR>` you can use uses it as the
+location to look for databases but allowing for a multiple fallbacks with
+`--db_dir` which can be specified multiple times.
+
+```
+docker run -it \
+    --volume $HOME/af_input:/root/af_input \
+    --volume $HOME/af_output:/root/af_output \
+    --volume <MODEL_PARAMETERS_DIR>:/root/models \
+    --volume <SSD_DB_DIR>:/root/public_databases \
+    --volume <DB_DIR>:/root/public_databases_fallback \
+    --gpus all \
+    alphafold3 \
+    python run_alphafold.py \
+    --json_path=/root/af_input/fold_input.json \
+    --model_dir=/root/models \
+    --db_dir=/root/public_databases \
+    --db_dir=/root/public_databases_fallback \
+    --output_dir=/root/af_output
+```
+
 If you get an error like the following, make sure the models and data are in the
 paths (flags named `--volume` above) in the correct locations.

@ -346,7 +379,7 @@ singularity exec \
     --bind $HOME/af_input:/root/af_input \
     --bind $HOME/af_output:/root/af_output \
     --bind <MODEL_PARAMETERS_DIR>:/root/models \
-     --bind <DATABASES_DIR>:/root/public_databases \
+     --bind <DB_DIR>:/root/public_databases \
     alphafold3.sif \
     python alphafold3/run_alphafold.py \
     --json_path=/root/af_input/fold_input.json \
@ -354,3 +387,21 @@ singularity exec \
     --db_dir=/root/public_databases \
     --output_dir=/root/af_output
 ```
+
+Or with some databases on SSD in location `<SSD_DB_DIR>`:
+
+```sh
+singularity exec \
+     --nv alphafold3.simg \
+     --bind $HOME/af_input:/root/af_input \
+     --bind $HOME/af_output:/root/af_output \
+     --bind <MODEL_PARAMETERS_DIR>:/root/models \
+     --bind <SSD_DB_DIR>:/root/public_databases \
+     --bind <DB_DIR>:/root/public_databases_fallback \
+     python alphafold3/run_alphafold.py \
+     --json_path=/root/af_input/fold_input.json \
+     --model_dir=/root/models \
+     --db_dir=/root/public_databases \
+     --db_dir=/root/public_databases_fallback \
+     --output_dir=/root/af_output
+```